This blog post presents the latest list of papers retrieved from Arxiv.org on 2024-11-13. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: Paper data is fetched from Arxiv.org daily and updated automatically at around 12:00 each day.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2024-11-13)

A total of 383 papers were updated today, including:

  • Natural Language Processing: 50 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 113 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 87 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 107 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Language Models as Causal Effect Generators

【Quick Read】: This paper addresses the problem of using large language models (LLMs) to generate data with controllable causal structure. The key to the solution is a framework based on sequence-driven structural causal models (SD-SCMs), which can turn any language model and any directed acyclic graph (DAG) into a causal model with user-defined structure and LLM-defined structural equations. Under this framework, the researchers can sample from observational, interventional, and counterfactual distributions and generate individual-level counterfactual data without manually specifying functional relationships between variables. The same procedure also supports testing for causal effects that may be latent in an LLM, providing a basis for auditing LLMs for misinformation, discrimination, or other undesirable behavior.

Link: https://arxiv.org/abs/2411.08019
Authors: Lucius E.J. Bynum, Kyunghyun Cho
Keywords-EN: large language model, based data generation, present a framework, framework for large, language model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:We present a framework for large language model (LLM) based data generation with controllable causal structure. In particular, we define a procedure for turning any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM). Broadly speaking, an SD-SCM is a causal model with user-defined structure and LLM-defined structural equations. We characterize how an SD-SCM allows sampling from observational, interventional, and counterfactual distributions according to the desired causal structure. We then leverage this procedure to propose a new type of benchmark for causal inference methods, generating individual-level counterfactual data without needing to manually specify functional relationships between variables. We create an example benchmark consisting of thousands of datasets, and test a suite of popular estimation methods on these datasets for average, conditional average, and individual treatment effect estimation, both with and without hidden confounding. Apart from generating data, the same procedure also allows us to test for the presence of a causal effect that might be encoded in an LLM. This procedure can underpin auditing LLMs for misinformation, discrimination, or otherwise undesirable behavior. We believe SD-SCMs can serve as a useful tool in any application that would benefit from sequential data with controllable causal structure.
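
To make the sampling procedure concrete, here is a minimal sketch of how an SD-SCM-style ancestral sampler could look, assuming each structural equation is "ask a language model for this variable given its sampled parents." The `lm_generate` helper, the prompt format, and the example DAG are hypothetical illustrations, not the paper's implementation.

```python
import networkx as nx

def lm_generate(prompt: str) -> str:
    """Hypothetical call into any language model (an API or a local model)."""
    raise NotImplementedError

def sample_sd_scm(dag: nx.DiGraph, interventions: dict[str, str] | None = None) -> dict[str, str]:
    """Ancestral sampling: visit nodes in topological order; each structural
    equation is 'ask the LM for this variable given its sampled parents'."""
    interventions = interventions or {}
    values: dict[str, str] = {}
    for node in nx.topological_sort(dag):
        if node in interventions:          # do(X = x): clamp the node's value
            values[node] = interventions[node]
            continue
        parent_ctx = "; ".join(f"{p} = {values[p]}" for p in dag.predecessors(node))
        prompt = f"Given {parent_ctx or 'no prior information'}, generate a value for {node}:"
        values[node] = lm_generate(prompt)
    return values

# Observational vs. interventional samples from the same SD-SCM:
# dag = nx.DiGraph([("smoking", "tar"), ("tar", "cancer")])
# obs = sample_sd_scm(dag)
# itv = sample_sd_scm(dag, interventions={"smoking": "never smoked"})
```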

[NLP-1] ExpressivityArena: Can LLMs Express Information Implicitly?

【Quick Read】: This paper addresses the problem of evaluating the ability of large language models (LLMs) to express the implicit linguistic cues that humans rely on for effective communication. The key to the solution is ExpressivityArena, a Python library that provides a comprehensive framework for measuring the expressivity of arbitrary LLMs. By defining and refining how "expressivity" is measured, and running experiments on a set of creative and logical tasks (such as poetry, coding, and emotion-based responses), the paper shows how ExpressivityArena can be used to evaluate LLM expressivity. Experimental outputs are scored by an automated grading system, validating the practicality of the approach for testing expressivity. These experiments deepen our understanding of LLM expressivity, expose its limitations in generating and understanding expressive content, and offer guidance for developing and deploying more expressive LLMs in the future.

Link: https://arxiv.org/abs/2411.08010
Authors: Joshua Tint, Som Sagar, Aditya Taparia, Kelly Raines, Bimsara Pathiraja, Caleb Liu, Ransalu Senanayake
Keywords-EN: Large Language Models, Language Models, express implicit language, implicit language cues, Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 22 figures

Abstract:While Large Language Models (LLMs) have demonstrated remarkable performance in certain dimensions, their ability to express implicit language cues that humans use for effective communication remains unclear. This paper presents ExpressivityArena, a Python library for measuring the implicit communication abilities of LLMs. We provide a comprehensive framework to evaluate expressivity of arbitrary LLMs and explore its practical implications. To this end, we refine the definition and measurements of "expressivity," and use our framework in a set of small experiments. These experiments test LLMs in creative and logical tasks such as poetry, coding, and emotion-based responses. They are then evaluated by an automated grader, through ExpressivityArena, which we verify to be the most pragmatic for testing expressivity. Building on these experiments, we deepen our understanding of the expressivity of LLMs by assessing their ability to remain expressive in conversations. Our findings indicate that LLMs are capable of generating and understanding expressive content, however, with some limitations. These insights will inform the future development and deployment of expressive LLMs. We provide the code for ExpressivityArena alongside our paper.

[NLP-2] Can adversarial attacks by large language models be attributed?

【Quick Read】: This paper addresses the problem of attributing the outputs of large language models (LLMs) in adversarial settings such as cyberattacks and disinformation. The key to the solution is to use formal language theory, specifically language identification in the limit as introduced by Gold and extended by Angluin, to model LLM outputs as formal languages and analyze whether finite text samples can uniquely pinpoint the originating model. The results show that, due to the non-identifiability of certain language classes, and under some mild assumptions about overlapping outputs from fine-tuned models, it is theoretically impossible to attribute outputs to specific LLMs with certainty. This holds even when the expressivity limitations of Transformer architectures are taken into account; even with direct model access or comprehensive monitoring, significant computational hurdles impede attribution. These findings highlight the need for proactive measures to mitigate the risks of adversarial LLM use as their influence continues to expand.

Link: https://arxiv.org/abs/2411.08003
Authors: Manuel Cebrian, Jan Arne Telle
Keywords-EN: Large Language Models, disinformation-presents significant challenges, Large Language, Attributing outputs, grow in importance
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Formal Languages and Automata Theory (cs.FL)
Comments: 7 pages, 1 figure

Abstract:Attributing outputs from Large Language Models (LLMs) in adversarial settings-such as cyberattacks and disinformation-presents significant challenges that are likely to grow in importance. We investigate this attribution problem using formal language theory, specifically language identification in the limit as introduced by Gold and extended by Angluin. By modeling LLM outputs as formal languages, we analyze whether finite text samples can uniquely pinpoint the originating model. Our results show that due to the non-identifiability of certain language classes, under some mild assumptions about overlapping outputs from fine-tuned models it is theoretically impossible to attribute outputs to specific LLMs with certainty. This holds also when accounting for expressivity limitations of Transformer architectures. Even with direct model access or comprehensive monitoring, significant computational hurdles impede attribution efforts. These findings highlight an urgent need for proactive measures to mitigate risks posed by adversarial LLM use as their influence continues to expand.
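
As background, Gold's criterion of identification in the limit, which the paper builds on, can be stated roughly as follows (a textbook formulation, paraphrased here rather than taken from the paper):

```latex
% A text for a language L is an enumeration t_1, t_2, \dots of exactly the
% strings of L. A learner M identifies L in the limit if, on every text of L,
% its conjectures converge to a correct hypothesis:
\exists N \;\; \forall n \ge N :\quad
M(t_1, \dots, t_n) = M(t_1, \dots, t_N)
\quad \text{and} \quad
\mathcal{L}\bigl(M(t_1, \dots, t_N)\bigr) = L .
% Gold's theorem: any class containing all finite languages plus at least one
% infinite language is not identifiable in the limit -- the kind of
% non-identifiability the attribution result leans on.
```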

[NLP-3] Derivational Morphology Reveals Analogical Generalization in Large Language Models

【Quick Read】: This paper asks whether linguistic generalization in large language models (LLMs) can be explained by analogical processes rather than purely rule-based mechanisms. The key to the solution is a new method that fits cognitive models to the training data of GPT-J: the predictions of rule-based and analogical learning models are compared with GPT-J's own predictions on English adjective nominalization, a phenomenon with notable variability. The study finds that for nominalization patterns with variability, the analogical model provides a much better match, and that GPT-J's behavior is sensitive to individual word frequencies, which is consistent with an analogical mechanism but not a rule-based one. These findings refute the hypothesis that GPT-J's linguistic generalization on adjective nominalization involves rules, pointing to similarity operations over stored exemplars as the underlying mechanism and supporting the view that analogical processes play a larger role in the linguistic generalization of LLMs than previously thought.

Link: https://arxiv.org/abs/2411.07990
Authors: Valentin Hofmann, Leonie Weissweiler, David Mortensen, Hinrich Schütze, Janet Pierrehumbert
Keywords-EN: linguistic generalization, underlie linguistic generalization, large language models, LLMs, analogical
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. As of yet, it is not known whether linguistic generalization in LLMs could equally well be explained as the result of analogical processes, which can be formalized as similarity operations on stored exemplars. A key shortcoming of prior research is its focus on linguistic phenomena with a high degree of regularity, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a new method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J’s behavior is sensitive to the individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account of regular forms but not a rule-based one. These findings refute the hypothesis that GPT-J’s linguistic generalization on adjective nominalization involves rules, suggesting similarity operations on stored exemplars as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.

[NLP-4] JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

【Quick Read】: This paper addresses the problem of unifying image understanding and generation within a single model. The key to the solution is a framework called JanusFlow, which couples an autoregressive language model with rectified flow, enabling rectified flow to be trained directly within the large language model framework and avoiding complex architectural modifications. Decoupling the understanding and generation encoders and aligning their representations during unified training further improves performance. Experiments show that JanusFlow significantly outperforms existing unified approaches on standard benchmarks and matches or exceeds specialized models in their respective domains.

Link: https://arxiv.org/abs/2411.07975
Authors: Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan
Keywords-EN: unifies image understanding, unifies image, present JanusFlow, powerful framework, image understanding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
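
For readers unfamiliar with rectified flow, the training objective it refers to is simple to state: regress a velocity field on straight-line interpolations between noise and data. Below is a generic sketch of that standard loss, not the JanusFlow training code; `model` is assumed to be any network that predicts velocity from the noisy input, timestep, and conditioning.

```python
import torch

def rectified_flow_loss(model, x1, cond):
    """x1: clean data batch; cond: conditioning (e.g., LLM text features).
    Straight path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I);
    the regression target is the constant velocity x1 - x0."""
    x0 = torch.randn_like(x1)
    # One timestep per sample, broadcastable over the remaining dimensions.
    t = torch.rand(x1.size(0), *[1] * (x1.dim() - 1), device=x1.device)
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(xt, t.flatten(), cond)
    return torch.nn.functional.mse_loss(v_pred, v_target)
```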

[NLP-5] From General to Specific: Utilizing General Hallucination to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents

【Quick Read】: This paper addresses limitations of existing benchmarks for role-playing agents (RPAs), including poor generalizability, implicit and inaccurate judgments, and excessive context length. The key to the solution is an automatic, scalable, and generalizable paradigm: the authors construct a benchmark by extracting relations from a general knowledge graph, exploit the RPA's inherent hallucination properties to prompt it to interact across roles, use ChatGPT for stance detection, and define relationship hallucination along with three related metrics. Experiments validate the effectiveness and stability of these metrics, explore the factors that influence them, and discuss the trade-off between relationship hallucination and factuality.

Link: https://arxiv.org/abs/2411.07965
Authors: Chuyi Kong, Ziyang Luo, Hongzhan Lin, Zhiyuan Fan, Yaxin Fan, Yuxi Sun, Jing Ma
Keywords-EN: Large Language Models, developing Role-Playing Agents, advanced role-playing capabilities, Language Models, Large Language
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The advanced role-playing capabilities of Large Language Models (LLMs) have paved the way for developing Role-Playing Agents (RPAs). However, existing benchmarks, such as HPD, which incorporates manually scored character relationships into the context for LLMs to sort coherence, and SocialBench, which uses specific profiles generated by LLMs in the context of multiple-choice tasks to assess character preferences, face limitations like poor generalizability, implicit and inaccurate judgments, and excessive context length. To address the above issues, we propose an automatic, scalable, and generalizable paradigm. Specifically, we construct a benchmark by extracting relations from a general knowledge graph and leverage RPA’s inherent hallucination properties to prompt it to interact across roles, employing ChatGPT for stance detection and defining relationship hallucination along with three related metrics. Extensive experiments validate the effectiveness and stability of our metrics. Our findings further explore factors influencing these metrics and discuss the trade-off between relationship hallucination and factuality.

[NLP-6] CryptoLLM: Unleashing the Power of Prompted LLMs for SmartQnA and Classification of Crypto Posts

【Quick Read】: This paper addresses how to accurately classify cryptocurrency-related social media posts and how to identify, from a set of posts, the answers most relevant to a given question. The key to the solution is leveraging advanced language models (LLMs): prompt-based techniques are used to classify Reddit and Twitter posts, and a 64-shot technique combined with the GPT-4-Turbo model is used to judge whether an answer is relevant to a question. These methods aim to improve the understanding and filtering of cryptocurrency discourse, supporting more informed decision-making in this volatile sector.

Link: https://arxiv.org/abs/2411.07917
Authors: Aniket Deroy, Subhankar Maity
Keywords-EN: user-generated content, social media posts, cryptocurrency-related social media, social media, rapid growth
Subjects: Computation and Language (cs.CL)
Comments: Accepted at FIRE 2024 (Track: Opinion Extraction and Question Answering from CryptoCurrency-Related Tweets and Reddit posts (CryptOQA))

Abstract:The rapid growth of social media has resulted in a large volume of user-generated content, particularly in niche domains such as cryptocurrency. This task focuses on developing robust classification models to accurately categorize cryptocurrency-related social media posts into predefined classes, including but not limited to objective, positive, negative, etc. Additionally, the task requires participants to identify the most relevant answers from a set of posts in response to specific questions. By leveraging advanced LLMs, this research aims to enhance the understanding and filtering of cryptocurrency discourse, thereby facilitating more informed decision-making in this volatile sector. We have used a prompt-based technique to solve the classification task for Reddit posts and Twitter posts. Also, we have used a 64-shot technique along with prompts on the GPT-4-Turbo model to determine whether an answer is relevant to a question or not.
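
A minimal sketch of what the prompt-based classification and few-shot relevance judgment could look like with the OpenAI client. The label set, prompt wording, and demonstration format below are illustrative assumptions; the paper's actual prompts and 64 demonstrations are not reproduced here.

```python
from openai import OpenAI

client = OpenAI()
LABELS = ["objective", "positive", "negative"]  # assumed label set

def classify_post(post: str) -> str:
    """Zero-shot, prompt-based classification of a crypto-related post."""
    msg = (
        f"Classify the following cryptocurrency-related post as one of {LABELS}. "
        f"Answer with the label only.\n\nPost: {post}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": msg}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

def is_relevant(question: str, answer: str, shots: list[tuple[str, str, str]]) -> str:
    """Few-shot (e.g., 64-shot) relevance judgment: each shot is a
    (question, answer, 'relevant'/'irrelevant') demonstration."""
    demo = "\n\n".join(f"Q: {q}\nA: {a}\nJudgment: {y}" for q, a, y in shots)
    msg = f"{demo}\n\nQ: {question}\nA: {answer}\nJudgment:"
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": msg}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```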

[NLP-7] Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus

【Quick Read】: This paper addresses the problem that large-scale computational analysis of the podcast ecosystem has been hindered by limited data. The key to the solution is a massive dataset of over 1.1M podcast transcripts, largely comprehensive of all English-language podcasts available through public RSS feeds from May and June 2020. The data is not limited to text: it includes audio features and speaker turns for a subset of 370K episodes, plus speaker role inferences and other metadata for all 1.1M episodes. With this dataset, the authors conduct a foundational analysis of the content, structure, and responsiveness of the ecosystem, opening the door to continued computational research.

Link: https://arxiv.org/abs/2411.07892
Authors: Benjamin Litterer, David Jurgens, Dallas Card
Keywords-EN: unique on-demand modality, provide highly diverse, Podcasts provide highly, massive listener base, highly diverse content
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 9 pages, 3 figures

Abstract:Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.

[NLP-8] Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders

【Quick Read】: This paper addresses the incompleteness, lack of verification, and hallucination that can affect content generated by large language models (LLMs) when they are adapted to specialized domains. The key to the solution is twofold: 1) a post-processing algorithm that uses knowledge triplets from the retrieval-augmented generation (RAG) context to correct hallucinations; and 2) a dual-decoder model that fuses the RAG context to guide the generation process, improving the correctness and groundedness of the generated content.

Link: https://arxiv.org/abs/2411.07870
Authors: Xiaofeng Zhu, Jaya Krishna Mandivarapu
Keywords-EN: content generation skills, large language models, people are impressed, skills of large, large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although people are impressed by the content generation skills of large language models, the use of LLMs, such as ChatGPT, is limited by the domain grounding of the content. The correctness and groundedness of the generated content need to be based on a verified context, such as results from Retrieval-Augmented Generation (RAG). One important issue when adapting LLMs to a customized domain is that the generated responses are often incomplete, or the additions are not verified and may even be hallucinated. Prior studies on hallucination detection have focused on evaluation metrics, which are not easily adaptable to dynamic domains and can be vulnerable to attacks like jail-breaking. In this work, we propose 1) a post-processing algorithm that leverages knowledge triplets in RAG context to correct hallucinations and 2) a dual-decoder model that fuses RAG context to guide the generation process.
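
The post-processing idea can be illustrated with a deliberately naive sketch: extract claims from the draft answer as (subject, relation, object) triplets, keep those supported by triplets from the RAG context, and patch the rest. The matching logic below is a placeholder assumption; the paper's algorithm and its dual-decoder model are more involved.

```python
def verify_against_triplets(claims, triplets):
    """claims: (subject, relation, object) tuples extracted from the draft answer;
    triplets: (subject, relation, object) tuples from the retrieved RAG context.
    Keep claims grounded in the context; flag the rest as candidate hallucinations."""
    context = {(s.lower(), r.lower(), o.lower()) for s, r, o in triplets}
    grounded, hallucinated = [], []
    for s, r, o in claims:
        key = (s.lower(), r.lower(), o.lower())
        (grounded if key in context else hallucinated).append((s, r, o))
    return grounded, hallucinated

def correct(draft: str, hallucinated, triplets) -> str:
    """Patch each unsupported claim with the first context triplet sharing its
    subject and relation (a crude stand-in for the paper's correction step)."""
    for s, r, o in hallucinated:
        match = next(((cs, cr, co) for cs, cr, co in triplets
                      if cs.lower() == s.lower() and cr.lower() == r.lower()), None)
        draft = draft.replace(o, match[2]) if match else draft
    return draft
```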

[NLP-9] Verbosity ≠ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models

【Quick Read】: This paper addresses "Verbosity Compensation" (VC), a pervasive behavior of large language models (LLMs) in which a model that is unsure of an answer produces an overly long response, confusing users, lowering efficiency, and raising service costs. The key to the solution is a simple cascade algorithm that replaces verbose responses with concise responses generated by other models, effectively reducing the frequency of VC. Experiments show that this approach lowers the VC frequency of the Mistral model on the Qasper dataset from 63.81% to 16.16%.

Link: https://arxiv.org/abs/2411.07858
Authors: Yusen Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang
Keywords-EN: Verbosity Compensation, humans often respond, hoping that part, define Verbosity Compensation, Verbosity
Subjects: Computation and Language (cs.CL)
Comments: 19 pages, 6 figures

Abstract:When unsure about an answer, humans often respond with more words than necessary, hoping that part of the response will be correct. We observe a similar behavior in large language models (LLMs), which we term “Verbosity Compensation” (VC). VC is harmful because it confuses the user understanding, leading to low efficiency, and influences the LLM services by increasing the latency and cost of generating useless tokens. In this paper, we present the first work that defines and analyzes Verbosity Compensation, explores its causes, and proposes a simple mitigating approach. We define Verbosity Compensation as the behavior of generating responses that can be compressed without information loss when prompted to write concisely. Our experiments, conducted on five datasets of knowledge and reasoning-based QA tasks with 14 newly developed LLMs, reveal three conclusions. 1) We reveal a pervasive presence of verbosity compensation across all models and all datasets. Notably, GPT-4 exhibits a VC frequency of 50.40%. 2) We reveal the large performance gap between verbose and concise responses, with a notable difference of 27.61% on the Qasper dataset. We also demonstrate that this difference does not naturally diminish as LLM capability increases. Both 1) and 2) highlight the urgent need to mitigate the frequency of VC behavior and disentangle verbosity with veracity. We propose a simple yet effective cascade algorithm that replaces the verbose responses with the other model-generated responses. The results show that our approach effectively alleviates the VC of the Mistral model from 63.81% to 16.16% on the Qasper dataset. 3) We also find that verbose responses exhibit higher uncertainty across all five datasets, suggesting a strong connection between verbosity and model uncertainty. Our dataset and code are available at this https URL.
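
Schematically, the proposed cascade looks like the following sketch: if one model's answer appears verbose, fall back to another model's response. The length-based verbosity test here is a hypothetical stand-in for the paper's detection of compressible responses.

```python
def is_verbose(response: str, max_tokens: int = 30) -> bool:
    """Hypothetical verbosity test: flag answers that greatly exceed the
    concise-answer budget the prompt asked for."""
    return len(response.split()) > max_tokens

def cascade_answer(question: str, models: list) -> str:
    """Try models in order; return the first non-verbose response, else fall
    back to the last one. Each model is a callable str -> str."""
    response = ""
    for model in models:
        response = model(f"Answer concisely: {question}")
        if not is_verbose(response):
            return response
    return response
```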

[NLP-10] Tucano: Advancing Neural Text Generation for Portuguese

【Quick Read】: This paper addresses the high data and compute barriers of current deep-learning language modeling, and in particular the performance and autonomy gap for lower-resource languages such as Portuguese. The key to the solution is GigaVerbo, a deduplicated Portuguese text corpus of 200 billion tokens, on which a series of decoder transformers named Tucano was trained. These models perform on par with or better than other Portuguese and multilingual models of similar size on several Portuguese benchmarks, while also revealing the limitations of many benchmarks currently used by the Portuguese NLP community for assessing generative language models.

Link: https://arxiv.org/abs/2411.07854
Authors: Nicholas Kluge Corrêa, Aniket Sen, Sophia Falk, Shiza Fatimah
Keywords-EN: natural language processing, Significant advances, recent years, made in natural, processing in recent
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Significant advances have been made in natural language processing in recent years. However, our current deep learning approach to language modeling requires substantial resources in terms of data and computation. One of the side effects of this data-hungry paradigm is the current schism between languages, separating those considered high-resource, where most of the development happens and resources are available, and the low-resource ones, which struggle to attain the same level of performance and autonomy. This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese. In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. Via this corpus, we trained a series of decoder-transformers named Tucano. Our models perform equal or superior to other Portuguese and multilingual language models of similar size in several Portuguese benchmarks. The evaluation of our models also reveals that model performance on many currently available benchmarks used by the Portuguese NLP community has little to no correlation with the scaling of token ingestion during training, highlighting the limitations of such evaluations when it comes to the assessment of Portuguese generative language models. All derivatives of our study are openly released on GitHub and Hugging Face. See this https URL

[NLP-11] IAE: Irony-based Adversarial Examples for Sentiment Analysis Systems

【Quick Read】: This paper addresses adversarial examples for text, specifically generating adversarial text by exploiting irony as a rhetorical device. The key to the solution is an Irony-based Adversarial Examples (IAE) method that transforms straightforward sentences into ironic ones, inducing errors in deep learning models on tasks such as sentiment analysis. The core challenges IAE tackles are accurately locating evaluation words, substituting them with appropriate collocations, and expanding the text with suitable ironic elements while preserving semantic coherence. Because the method does not rely on a pre-existing irony corpus, it is broadly applicable.

Link: https://arxiv.org/abs/2411.07850
Authors: Xiaoyin Yi, Jiacheng Huang
Keywords-EN: inputs deliberately perturbed, induce model errors, deep neural networks, neural networks, inputs deliberately
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Adversarial examples, which are inputs deliberately perturbed with imperceptible changes to induce model errors, have raised serious concerns for the reliability and security of deep neural networks (DNNs). While adversarial attacks have been extensively studied in continuous data domains such as images, the discrete nature of text presents unique challenges. In this paper, we propose Irony-based Adversarial Examples (IAE), a method that transforms straightforward sentences into ironic ones to create adversarial text. This approach exploits the rhetorical device of irony, where the intended meaning is opposite to the literal interpretation, requiring a deeper understanding of context to detect. The IAE method is particularly challenging due to the need to accurately locate evaluation words, substitute them with appropriate collocations, and expand the text with suitable ironic elements while maintaining semantic coherence. Our research makes the following key contributions: (1) We introduce IAE, a strategy for generating textual adversarial examples using irony. This method does not rely on pre-existing irony corpora, making it a versatile tool for creating adversarial text in various NLP tasks. (2) We demonstrate that the performance of several state-of-the-art deep learning models on sentiment analysis tasks significantly deteriorates when subjected to IAE attacks. This finding underscores the susceptibility of current NLP systems to adversarial manipulation through irony. (3) We compare the impact of IAE on human judgment versus NLP systems, revealing that humans are less susceptible to the effects of irony in text.

[NLP-12] Ethical Concern Identification in NLP: A Corpus of ACL Anthology Ethics Statements

【Quick Read】: This paper examines the ethical concerns raised in large language model (LLM) research and introduces EthiCon, a corpus of 1,580 ethical concern statements extracted from papers in the ACL Anthology. The key to the solution is extracting ethical-concern keywords from these statements to automate the concern-identification process, comparing the corpus's concerns with those of the general public and field professionals through a survey, and contrasting them with existing taxonomies to reveal gaps and future research directions.

Link: https://arxiv.org/abs/2411.07845
Authors: Antonia Karamolegkou, Sandrine Schiller Hansen, Ariadni Christopoulou, Filippos Stamatiou, Anne Lauscher, Anders Søgaard
Keywords-EN: LLM researchers, ACL Anthology, ethical concerns, ethical, ethical concern statements
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:What ethical concerns, if any, do LLM researchers have? We introduce EthiCon, a corpus of 1,580 ethical concern statements extracted from scientific papers published in the ACL Anthology. We extract ethical concern keywords from the statements and show promising results in automating the concern identification process. Through a survey, we compare the ethical concerns of the corpus to the concerns listed by the general public and professionals in the field. Finally, we compare our retrieved ethical concerns with existing taxonomies pointing to gaps and future research directions.

[NLP-13] Chain Association-based Attacking and Shielding Natural Language Processing Systems

【Quick Read】: This paper addresses the vulnerability of natural language processing (NLP) systems to chain-association-based adversarial attacks. The key to the solution is exploiting the comprehension gap between humans and machines: a chain association graph for Chinese characters is generated from an association paradigm to build the search space of potential adversarial examples, and a discrete particle swarm optimization algorithm searches for the optimal adversarial examples. Experiments show that even advanced large language models are vulnerable to this attack, while humans remain good at understanding the perturbed text. The paper also explores adversarial training and association-graph-based recovery as defenses against the attack.

Link: https://arxiv.org/abs/2411.07843
Authors: Jiacheng Huang, Long Chen
Keywords-EN: completely straightforward words, gift enables people, gift enables, completely straightforward, straightforward words
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Association as a gift enables people not to have to mention something in completely straightforward words while allowing others to understand what they intend to refer to. In this paper, we propose a chain association-based adversarial attack against natural language processing systems, utilizing the comprehension gap between humans and machines. We first generate a chain association graph for Chinese characters based on the association paradigm for building search space of potential adversarial examples. Then, we introduce a discrete particle swarm optimization algorithm to search for the optimal adversarial examples. We conduct comprehensive experiments and show that advanced natural language processing models and applications, including large language models, are vulnerable to our attack, while humans appear good at understanding the perturbed text. We also explore two methods, including adversarial training and associative graph-based recovery, to shield systems from chain association-based attack. Since a few examples use some derogatory terms, this paper contains materials that may be offensive or upsetting to some people.
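
As a rough illustration of the search component, here is a plain-vanilla discrete particle swarm optimization skeleton over per-position substitution choices, where `fitness` would score how strongly a perturbed text flips the victim model. The update rule below is a generic textbook-style variant, not the paper's exact algorithm.

```python
import random

def discrete_pso(fitness, n_positions, n_candidates, swarm=20, iters=50, p_follow=0.5):
    """Generic discrete PSO skeleton: a particle is a vector of candidate
    indices (one association choice per character position)."""
    particles = [[random.randrange(n_candidates) for _ in range(n_positions)]
                 for _ in range(swarm)]
    pbest = [list(p) for p in particles]           # personal bests
    gbest = max(particles, key=fitness)            # global best
    for _ in range(iters):
        for i in range(swarm):
            # Each dimension moves toward the global/personal best with some
            # probability, otherwise it is re-randomized (exploration).
            new = [gbest[d] if random.random() < p_follow
                   else (pbest[i][d] if random.random() < p_follow
                         else random.randrange(n_candidates))
                   for d in range(n_positions)]
            particles[i] = new
            if fitness(new) > fitness(pbest[i]):
                pbest[i] = new
        gbest = max(pbest, key=fitness)
    return gbest
```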

[NLP-14] Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language Models

【Quick Read】: This paper addresses the pre-retrieval information gap in retrieval-augmented generation (RAG) systems. The key to the solution is the Extract-Refine-Retrieve-Read (ERRR) framework, which bridges this gap through query optimization tailored to the specific knowledge requirements of large language models (LLMs). ERRR first extracts parametric knowledge from the LLM and then refines the queries with a specialized query optimizer, ensuring that only the most relevant information needed to generate an accurate response is retrieved. For greater flexibility and lower computational cost, the paper proposes a trainable scheme that uses a smaller, tunable model as the query optimizer, refined via knowledge distillation from a larger teacher model. Experiments across multiple question-answering (QA) datasets and retrieval systems show that ERRR consistently outperforms existing baselines, making it a versatile and cost-effective module for improving the utility and accuracy of RAG systems.

Link: https://arxiv.org/abs/2411.07820
Authors: Youan Cong, Cheng Wang, Pritom Saha Akash, Kevin Chen-Chuan Chang
Keywords-EN: Large Language Models, Large Language, specific knowledge requirements, Retrieval-Augmented Generation, pre-retrieval information gap
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:We introduce the Extract-Refine-Retrieve-Read (ERRR) framework, a novel approach designed to bridge the pre-retrieval information gap in Retrieval-Augmented Generation (RAG) systems through query optimization tailored to meet the specific knowledge requirements of Large Language Models (LLMs). Unlike conventional query optimization techniques used in RAG, the ERRR framework begins by extracting parametric knowledge from LLMs, followed by using a specialized query optimizer for refining these queries. This process ensures the retrieval of only the most pertinent information essential for generating accurate responses. Moreover, to enhance flexibility and reduce computational costs, we propose a trainable scheme for our pipeline that utilizes a smaller, tunable model as the query optimizer, which is refined through knowledge distillation from a larger teacher model. Our evaluations on various question-answering (QA) datasets and with different retrieval systems show that ERRR consistently outperforms existing baselines, proving to be a versatile and cost-effective module for improving the utility and accuracy of RAG systems.
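
The four ERRR stages can be sketched as a simple pipeline; all prompts below are hypothetical illustrations of the stage boundaries, not the paper's prompts, and `llm`, `optimizer_llm`, and `retriever` are assumed callables.

```python
def errr_answer(question: str, llm, optimizer_llm, retriever) -> str:
    """Extract-Refine-Retrieve-Read, sketched with hypothetical prompts.
    llm / optimizer_llm: callables str -> str; retriever: str -> list[str]."""
    # 1) Extract: elicit the LLM's parametric knowledge about the question.
    parametric = llm(f"Write down everything you know that is relevant to: {question}")
    # 2) Refine: a smaller, tunable optimizer turns that into targeted queries.
    queries = optimizer_llm(
        f"Question: {question}\nKnown: {parametric}\n"
        "List search queries for only the missing or uncertain facts, one per line."
    ).splitlines()
    # 3) Retrieve: gather evidence for the refined queries.
    docs = [d for q in queries if q.strip() for d in retriever(q)]
    # 4) Read: answer grounded in the retrieved evidence.
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nAnswer the question: {question}")
```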

[NLP-15] Likelihood as a Performance Gauge for Retrieval-Augmented Generation NAACL2025

【Quick Read】: This paper starts from the observation that retrieval-augmented generation with large language models is sensitive to the order of retrieved documents in the context, and asks how this phenomenon can be exploited for effective prompt engineering. The key to the solution is using question likelihood as a gauge of model performance: the paper analyzes the correlation between question likelihood and answer accuracy at both the corpus level and the instance level, and finds that question likelihood can also indicate where the task-relevant information sits in the context. Building on these findings, it proposes two methods that use question likelihood to select and construct prompts that lead to better performance. The approach is efficient because it only requires computing the likelihood of the input, needing far fewer model passes than heuristic prompt-engineering methods that must generate many responses.

Link: https://arxiv.org/abs/2411.07773
Authors: Tianyu Liu, Jirui Qi, Paul He, Arianna Bisazza, Mrinmaya Sachan, Ryan Cotterell
Keywords-EN: Recent work finds, Recent work, large language models, retrieval-augmented generation, generation with large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review at NAACL 2025. Code is available at this https URL

Abstract:Recent work finds that retrieval-augmented generation with large language models is prone to be influenced by the order of retrieved documents in the context. However, the lack of in-depth analysis limits the use of this phenomenon for prompt engineering in practice. In this study, we posit that likelihoods serve as an effective gauge for language model performance. Through experiments on two question-answering datasets with a variety of state-of-the-art language models, we reveal correlations between answer accuracy and the likelihood of the question at both the corpus level and the instance level. In addition, we find that question likelihood can also indicate the position of the task-relevant information in the context. Based on these findings, we propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance. We demonstrate their effectiveness with experiments. In addition, our likelihood-based methods are efficient, as they only need to compute the likelihood of the input, requiring much fewer language model passes than heuristic prompt engineering methods that require generating responses. Our analysis deepens our understanding of how input prompts affect model performance and provides a promising direction for efficient prompt optimization.
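
The core measurement is cheap to reproduce: the average token log-likelihood of the input under a causal language model, obtained from a single forward pass. A minimal sketch with Hugging Face Transformers, with a toy prompt-selection helper on top (the candidate prompt formats are made up for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def log_likelihood(text: str) -> float:
    """Average token log-likelihood of `text` under the model; a single
    forward pass, no generation required."""
    ids = tok(text, return_tensors="pt").input_ids
    out = model(ids, labels=ids)   # HF shifts labels and averages cross-entropy
    return -out.loss.item()

def pick_prompt(question: str, candidate_prompts: list[str]) -> str:
    """Select the prompt format under which the question is most likely."""
    return max(candidate_prompts, key=lambda p: log_likelihood(p.format(q=question)))

# pick_prompt("Who wrote Dune?", ["Q: {q}\nA:", "Question: {q}\nAnswer:"])
```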

[NLP-16] Automatic Album Sequencing

【Quick Read】: This paper addresses the high technical barrier of album sequencing: existing data-driven narrative-essence approaches are hard for non-technical users to apply. The key to the solution is a user-friendly web-based tool that lets users upload music tracks, run the sequencing algorithm in one click, and view the result in a clean visualization. To increase the number of available templates and address shortcomings of prior work, the paper also introduces a new direct transformer-based album sequencing method. Although this method does not match the performance of the narrative-essence approach, it outperforms a random baseline, and both methods are integrated into the web interface, making the tool more complete and practical.

Link: https://arxiv.org/abs/2411.07772
Authors: Vincent Herrmann, Dylan R. Ashley, Jürgen Schmidhuber
Keywords-EN: album production process, production process, critical part, Album sequencing, album production
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
Comments: presented as a late breaking demo in the 25th International Society for Music Information Retrieval Conference; 3 pages in main text, 3 figures in main text; source code available at this https URL

Abstract:Album sequencing is a critical part of the album production process. Recently, a data-driven approach was proposed that sequences general collections of independent media by extracting the narrative essence of the items in the collections. While this approach implies an album sequencing technique, it is not widely accessible to a less technical audience, requiring advanced knowledge of machine learning techniques to use. To address this, we introduce a new user-friendly web-based tool that allows a less technical audience to upload music tracks, execute this technique in one click, and subsequently presents the result in a clean visualization to the user. To both increase the number of templates available to the user and address shortcomings of previous work, we also introduce a new direct transformer-based album sequencing method. We find that our more direct method outperforms a random baseline but does not reach the same performance as the narrative essence approach. Both methods are included in our web-based user interface, and this – alongside a full copy of our implementation – is publicly available at this https URL

[NLP-17] Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

【Quick Read】: This paper addresses the complexity of generating SQL from text in real enterprise settings. The key to the solution is the Spider 2.0 evaluation framework, comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 come from real data applications, often contain over 1,000 columns, and are hosted in local or cloud database systems such as BigQuery and Snowflake. Solving Spider 2.0 problems typically requires understanding and searching database metadata, dialect documentation, and even project-level codebases, demanding that models handle extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, far beyond traditional text-to-SQL challenges. The paper's evaluation shows that current language models perform poorly on such complex tasks and need significant improvement for real enterprise use.

Link: https://arxiv.org/abs/2411.07763
Authors: Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, Tao Yu
Keywords-EN: Spider, multiple SQL queries, transformation to analytics, multiple SQL, SQL queries
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Abstract:Real-world enterprise text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations from data transformation to analytics. We introduce Spider 2.0, an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges. Our evaluations indicate that based on o1-preview, our code agent framework successfully solves only 17.0% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation – especially in prior text-to-SQL benchmarks – they require significant improvement in order to achieve adequate performance for real-world enterprise usage. Progress on Spider 2.0 represents crucial steps towards developing intelligent, autonomous, code agents for real-world enterprise settings. Our code, baseline models, and data are available at this https URL.

[NLP-18] Mitigating Bias in Queer Representation within Large Language Models : A Collaborative Agent Approach NEURIPS2024

【Quick Read】: This paper addresses biased pronoun usage in large language model (LLM) outputs, particularly the inappropriate use of traditionally gendered pronouns ("he," "she") when inclusive language is needed to accurately represent all identities. The key to the solution is a collaborative agent pipeline that analyzes and optimizes pronoun usage for inclusivity. The multi-agent framework includes specialized agents for bias detection and correction. Experimental evaluations show that, compared with GPT-4o, the approach achieves a 32.6 percentage point absolute improvement in correctly disagreeing with inappropriate traditionally gendered pronouns, underscoring the potential of agent-driven frameworks for improving fairness and inclusivity in AI-generated content.

Link: https://arxiv.org/abs/2411.07656
Authors: Tianyi Huang (1), Arya Somasundaram (1) ((1) App-In Club)
Keywords-EN: Large Language Models, Large Language, Language Models, leading to misrepresentation, queer individuals
Subjects: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: NeurIPS 2024 Queer in AI Workshop

Abstract:Large Language Models (LLMs) often perpetuate biases in pronoun usage, leading to misrepresentation or exclusion of queer individuals. This paper addresses the specific problem of biased pronoun usage in LLM outputs, particularly the inappropriate use of traditionally gendered pronouns ("he," "she") when inclusive language is needed to accurately represent all identities. We introduce a collaborative agent pipeline designed to mitigate these biases by analyzing and optimizing pronoun usage for inclusivity. Our multi-agent framework includes specialized agents for both bias detection and correction. Experimental evaluations using the Tango dataset-a benchmark focused on gender pronoun usage-demonstrate that our approach significantly improves inclusive pronoun classification, achieving a 32.6 percentage point increase over GPT-4o in correctly disagreeing with inappropriate traditionally gendered pronouns ($\chi^2 = 38.57$, $p < 0.0001$). These results accentuate the potential of agent-driven frameworks in enhancing fairness and inclusivity in AI-generated content, demonstrating their efficacy in reducing biases and promoting socially responsible AI.

[NLP-19] Annotating Constructions with UD: the experience of the Italian Constructicon

【Quick Read】: This paper describes a first attempt at linking the Italian constructicon to Universal Dependencies (UD) resources. The key lies in establishing a systematic way of connecting Italian constructions with the dependency model of the UD framework, enabling a more precise description and analysis of Italian syntax and semantics.

Link: https://arxiv.org/abs/2411.07623
Authors: Ludovica Pannitto, Beatrice Bernasconi, Lucia Busso, Flavio Pisciotta, Giulia Rambelli, Francesca Masini
Keywords-EN: linking the Italian, Italian constructicon, paper descirbes, attempt of linking
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The paper describes a first attempt at linking the Italian constructicon to UD resources.

[NLP-20] Direct Preference Optimization Using Sparse Feature-Level Constraints

【Quick Read】: This paper addresses computational efficiency and training stability in aligning large language models (LLMs) with human preferences. The key to the solution is Feature-level constrained Preference Optimization (FPO), a new method that leverages pre-trained sparse autoencoders (SAEs) and introduces feature-level constraints to achieve efficient and stable alignment. Specifically, FPO uses the sparse features activated in a well-trained sparse autoencoder and optimizes a sequential KL divergence with a feature-level offline reference, preserving computational efficiency while markedly improving alignment quality. Experiments on benchmark datasets show that FPO achieves a 5.08% absolute improvement in win rate over state-of-the-art baselines at a much lower computational cost.

Link: https://arxiv.org/abs/2411.07618
Authors: Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
Keywords-EN: large language models, Direct Preference Optimization, human preferences remains, Preference Optimization, constrained Preference Optimization
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments.

[NLP-21] Multimodal Clinical Reasoning through Knowledge-augmented Rationale Generation

【Quick Read】: This paper addresses the fact that existing disease-diagnosis models rely predominantly on discriminative methods and neglect the generation of supportive rationales. The key to the solution is ClinRaGen, a smaller language model (SLM) optimized for multimodal rationale generation. ClinRaGen fuses domain knowledge with time-series electronic health record (EHR) data through a unique knowledge-augmented attention mechanism, and uses a stepwise rationale distillation strategy to produce both text-based and time-series-based clinical rationales. This markedly improves the SLM's ability to interpret multimodal EHR data and generate accurate clinical rationales, supporting more reliable disease diagnosis, advancing LLM applications in healthcare, and narrowing the performance gap between LLMs and SLMs.

Link: https://arxiv.org/abs/2411.07611
Authors: Shuai Niu, Jing Ma, Liang Bai, Zhihua Wang, Yida Xu, Yunya Song, Xian Yang
Keywords-EN: generating supportive rationales, disease diagnosis, play a pivotal, pivotal role, predominantly use discriminative
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures

Abstract:Clinical rationales play a pivotal role in accurate disease diagnosis; however, many models predominantly use discriminative methods and overlook the importance of generating supportive rationales. Rationale distillation is a process that transfers knowledge from large language models (LLMs) to smaller language models (SLMs), thereby enhancing the latter’s ability to break down complex tasks. Despite its benefits, rationale distillation alone is inadequate for addressing domain knowledge limitations in tasks requiring specialized expertise, such as disease diagnosis. Effectively embedding domain knowledge in SLMs poses a significant challenge. While current LLMs are primarily geared toward processing textual data, multimodal LLMs that incorporate time series data, especially electronic health records (EHRs), are still evolving. To tackle these limitations, we introduce ClinRaGen, an SLM optimized for multimodal rationale generation in disease diagnosis. ClinRaGen incorporates a unique knowledge-augmented attention mechanism to merge domain knowledge with time series EHR data, utilizing a stepwise rationale distillation strategy to produce both textual and time series-based clinical rationales. Our evaluations show that ClinRaGen markedly improves the SLM’s capability to interpret multimodal EHR data and generate accurate clinical rationales, supporting more reliable disease diagnosis, advancing LLM applications in healthcare, and narrowing the performance divide between LLMs and SLMs.

[NLP-22] Circuit Complexity Bounds for RoPE-based Transformer Architecture

【Quick Read】: This paper studies the expressive power of Transformer architectures with rotary position embedding (RoPE), in particular their complexity limits. The key contribution is a tighter circuit complexity bound: unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a RoPE-based Transformer with polynomial precision, a constant number of layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic problem or the Boolean formula value problem. This reveals a fundamental limitation in the expressivity of the RoPE-based Transformer architecture despite its great empirical success. The theoretical framework not only establishes tighter complexity bounds but may also guide future research on RoPE-based Transformers.

Link: https://arxiv.org/abs/2411.07602
Authors: Bo Chen, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song
Keywords-EN: Characterizing the express, based Transformer, Rotary Position Embedding, based Transformer architectures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
Comments:

Abstract:Characterizing the expressive power of the Transformer architecture is critical to understanding its capacity limits and scaling law. Recent works provide the circuit complexity bounds to Transformer-like architecture. On the other hand, Rotary Position Embedding ($\mathsf{RoPE}$) has emerged as a crucial technique in modern large language models, offering superior performance in capturing positional information compared to traditional position embeddings, which shows great potential in application prospects, particularly for the long context scenario. Empirical evidence also suggests that $\mathsf{RoPE}$-based Transformer architectures demonstrate greater generalization capabilities compared to conventional Transformer models. In this work, we establish a tighter circuit complexity bound for Transformers with $\mathsf{RoPE}$ attention. Our key contribution is that we show that unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic problem or the Boolean formula value problem. This result significantly demonstrates the fundamental limitation of the expressivity of the $\mathsf{RoPE}$-based Transformer architecture, although it achieves giant empirical success. Our theoretical framework not only establishes tighter complexity bounds but also may instruct further work on the $\mathsf{RoPE}$-based Transformer.
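
For reference, here is a standard implementation of the RoPE mechanism the paper analyzes (one common "rotate-half" variant, not code from the paper, which is purely theoretical):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding applied to x of shape (batch, seq_len, dim),
    dim even: rotate each 2D feature pair by a position- and
    frequency-dependent angle."""
    b, n, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)        # (half,)
    angles = torch.arange(n, dtype=x.dtype)[:, None] * freqs[None, :]  # (n, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Pairs (x1_i, x2_i) are rotated by angle position * freq_i.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```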

[NLP-23] Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations EMNLP2024

【Quick Read】: This paper addresses how to effectively link and segment open-ended conversations (such as tutoring lessons or business meetings) with respect to pre-defined reference materials (such as worksheets or meeting bullets). The key to the solution is a new framework called Problem-Oriented Segmentation & Retrieval (POSR), a joint task that breaks a conversation into segments and links each segment to the relevant reference item. Applying POSR to education, where structuring lessons around problems is critical yet difficult, the paper evaluates several joint and independent approaches (including segmentation methods such as TextTiling, retrieval methods such as ColBERT, and large language model methods). The results show that modeling POSR as a single joint task is essential: POSR methods outperform independent segmentation-and-retrieval pipelines by up to +76% on joint metrics and traditional segmentation methods by up to +78% on segmentation metrics.

Link: https://arxiv.org/abs/2411.07598
Authors: Rose E. Wang, Pawan Wirawarn, Kenny Lam, Omar Khattab, Dorottya Demszky
Keywords-EN: pre-defined reference materials, meeting bullets, revolve around pre-defined, business meetings, POSR
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2024 Findings. Our code and dataset are open-sourced at this https URL

Abstract:Many open-ended conversations (e.g., tutoring lessons or business meetings) revolve around pre-defined reference materials, like worksheets or meeting bullets. To provide a framework for studying such conversation structure, we introduce Problem-Oriented Segmentation & Retrieval (POSR), the task of jointly breaking down conversations into segments and linking each segment to the relevant reference item. As a case study, we apply POSR to education where effectively structuring lessons around problems is critical yet difficult. We present LessonLink, the first dataset of real-world tutoring lessons, featuring 3,500 segments, spanning 24,300 minutes of instruction and linked to 116 SAT math problems. We define and evaluate several joint and independent approaches for POSR, including segmentation (e.g., TextTiling), retrieval (e.g., ColBERT), and large language models (LLMs) methods. Our results highlight that modeling POSR as one joint task is essential: POSR methods outperform independent segmentation and retrieval pipelines by up to +76% on joint metrics and surpass traditional segmentation methods by up to +78% on segmentation metrics. We demonstrate POSR's practical impact on downstream education applications, deriving new insights on the language and time use in real-world lesson structures.

[NLP-24] Entropy Controllable Direct Preference Optimization

【Quick Read】: This paper addresses a weakness of Direct Preference Optimization (DPO) in the post-training of large language models (LLMs): minimizing the reverse KL divergence can fail to capture a mode of the reference distribution, which may hurt policy performance. The key to the solution is H-DPO, a simple modification of DPO that allows control over the entropy of the resulting policy, sharpening the distribution and thereby enabling mode-seeking fitting more effectively. In experiments, H-DPO outperforms DPO across a range of tasks, with notably better results in pass@k evaluations on mathematical tasks. It is also simple to implement, requiring only minor changes to the DPO loss computation, which makes it practical and promising for wide use in LLM training.

链接: https://arxiv.org/abs/2411.07595
作者: Motoki Omura,Yasuhiro Fujita,Toshiki Kataoka
关键词-EN: Reinforcement Learning, Human Feedback, achieve generation aligned, large language models, Direct Preference Optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we indicate that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy’s performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution’s sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@k evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
摘要:在大语言模型 (LLM) 的训练后阶段,基于人类反馈的强化学习 (RLHF) 是一种有效的方法,以实现与人类偏好一致的生成。直接偏好优化 (DPO) 允许使用简单的二元交叉熵损失进行策略训练,而无需奖励模型。DPO 的目标通过反向 KL 散度进行正则化,鼓励对参考策略的模式寻求拟合。然而,我们指出,最小化反向 KL 散度可能无法捕捉到参考分布的一个模式,这可能会损害策略的性能。基于这一观察,我们提出了一种简单的 DPO 修改,称为 H-DPO,它允许对生成的策略的熵进行控制,增强分布的锐度,从而更有效地实现模式寻求拟合。在我们的实验中,我们展示了 H-DPO 在各种任务中优于 DPO,在数学任务的 pass@k 评估中表现出更优异的结果。此外,H-DPO 易于实现,仅需对 DPO 的损失计算进行微小修改,这使其在 LLM 训练中具有广泛的实际应用前景。
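
按摘要的描述,H-DPO 通过一个熵系数 α 控制生成策略的熵,α=1 时退化为标准 DPO。下面是损失函数的一个 PyTorch 草图;其中 α 作用于策略对数概率一侧,是笔者依据论文思路(将反向 KL 分解为交叉熵与熵并对熵项加权)的推导性解读,具体形式请以原文为准。

```python
import torch
import torch.nn.functional as F

def h_dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
               beta: float = 0.1, alpha: float = 0.8):
    """pi_logp_* / ref_logp_*: 策略与参考模型对 chosen/rejected 回复的
    序列对数概率,形状 [batch]。alpha=1 时即标准 DPO;alpha<1 使分布更尖锐。"""
    logits_w = alpha * pi_logp_w - ref_logp_w   # 假设:alpha 作用于策略一侧
    logits_l = alpha * pi_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (logits_w - logits_l)).mean()

# 用法示意(实际应传入可反向传播的策略对数概率)
b = 4
loss = h_dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
```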

[NLP-25] Contrastive Language Prompting to Ease False Positives in Medical Anomaly Detection

【速读】: 该论文试图解决在医疗应用中使用视觉-语言模型(如CLIP)进行异常检测时存在的误报问题。解决方案的关键在于引入了一种对比语言提示(Contrastive LAnguage Prompting, CLAP)方法,该方法通过同时利用正向和负向文本提示来减少误报。具体来说,正向提示用于识别潜在的病变区域,而负向提示则用于减弱对正常区域的注意力,从而提高异常检测的准确性。实验结果表明,CLAP方法在BMAD数据集和六个生物医学基准测试中显著提升了异常检测的性能。

链接: https://arxiv.org/abs/2411.07546
作者: YeongHyeon Park,Myung Jin Kim,Hyeong Seok Kim
关键词-EN: pre-trained visual-language model, contrastive language-image pre-training, visual-language model, language-image pre-training, successfully accomplishes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 4 pages, 3 figures, 2 tables

点击查看摘要

Abstract:A pre-trained visual-language model, contrastive language-image pre-training (CLIP), successfully accomplishes various downstream tasks with text prompts, such as finding images or localizing regions within the image. Despite CLIP’s strong multi-modal data capabilities, it remains limited in specialized environments, such as medical applications. For this purpose, many CLIP variants-i.e., BioMedCLIP, and MedCLIP-SAMv2-have emerged, but false positives related to normal regions persist. Thus, we aim to present a simple yet important goal of reducing false positives in medical anomaly detection. We introduce a Contrastive LAnguage Prompting (CLAP) method that leverages both positive and negative text prompts. This straightforward approach identifies potential lesion regions by visual attention to the positive prompts in the given image. To reduce false positives, we attenuate attention on normal regions using negative prompts. Extensive experiments with the BMAD dataset, including six biomedical benchmarks, demonstrate that CLAP method enhances anomaly detection performance. Our future plans include developing an automated fine prompting method for more practical usage.
摘要:预训练的视觉-语言模型,即对比语言-图像预训练 (CLIP),通过文本提示成功完成了多种下游任务,如查找图像或定位图像中的区域。尽管 CLIP 在多模态数据处理方面表现出色,但在医疗应用等专业环境中仍存在局限性。为此,出现了许多 CLIP 的变体,如 BioMedCLIP 和 MedCLIP-SAMv2,但这些模型在正常区域仍存在误报问题。因此,我们旨在提出一个简单但重要的目标,即减少医学异常检测中的误报。我们引入了一种对比语言提示 (CLAP) 方法,该方法利用正向和负向文本提示。这种直接的方法通过视觉注意力集中在给定图像中的正向提示来识别潜在的病变区域。为了减少误报,我们使用负向提示来减弱对正常区域的注意力。在 BMAD 数据集上的广泛实验,包括六个生物医学基准测试,表明 CLAP 方法提升了异常检测的性能。我们未来的计划包括开发一种自动化的精细提示方法,以实现更实际的应用。
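
CLAP 的核心思想(正向提示找病变、负向提示压制正常区域)可以用几行 CLIP 代码示意:异常分数取图像与正向提示的相似度,减去加权后的与负向提示的相似度。提示词与权重 λ 均为示意性假设,并非论文原始实现。

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clap_score(image: Image.Image, lam: float = 0.5) -> float:
    pos = "a photo of a lesion"          # 正向提示(示意)
    neg = "a photo of healthy tissue"    # 负向提示(示意)
    inputs = processor(text=[pos, neg], images=image,
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sim = (img @ txt.T).squeeze(0)       # [与正向提示相似度, 与负向提示相似度]
    return (sim[0] - lam * sim[1]).item()  # 负向项削弱正常区域的响应、降低误报
```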

[NLP-26] Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning

【速读】: 该论文试图解决大语言模型(LLMs)在符号(形式)和所指(意义)理解上的评估问题,特别是传统心理语言学评估方法可能存在的统计偏差。解决方案的关键在于引入神经语言学方法,通过结合最小对和诊断探针的新颖方法,分析模型各层的激活模式,从而详细考察LLMs如何表示形式和意义,并验证这些表示是否在不同语言间一致。该方法不仅揭示了神经语言学和心理语言学评估方法在LLM评估中的不同模式,还展示了LLMs在形式理解上的优势及其与意义理解的相关性,并提供了新的中文(COMPS-ZH)和德语(COMPS-DE)概念最小对数据集。

链接: https://arxiv.org/abs/2411.07533
作者: Linyang He,Ercong Nie,Helmut Schmid,Hinrich Schütze,Nima Mesgarani,Jonathan Brennan
关键词-EN: Large Language Models, understanding of Large, LLM evaluation paradigms, Large Language, study investigates
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM evaluation paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical biases that may misrepresent LLMs’ true linguistic capabilities. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. Our contributions are three-fold: (1) We compare neurolinguistic and psycholinguistic methods, revealing distinct patterns in LLM assessment; (2) We demonstrate that LLMs exhibit higher competence in form compared to meaning, with the latter largely correlated to the former; (3) We present new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.
摘要:本研究通过区分心理语言学和神经语言学两种大语言模型(LLM)评估范式,探讨了LLM对符号(形式)和所指(意义)的语言理解。传统的心理语言学评估往往反映出统计偏差,可能误导LLM真实语言能力的评估。我们引入了一种神经语言学方法,采用了一种结合最小对和诊断探针的新方法,分析模型各层中的激活模式。这种方法能够详细考察LLM如何表示形式和意义,以及这些表示是否在不同语言间保持一致。我们的贡献有三方面:(1)我们比较了神经语言学和心理语言学方法,揭示了LLM评估中的不同模式;(2)我们证明了LLM在形式上的表现优于意义,后者在很大程度上与前者相关联;(3)我们提出了新的中文(COMPS-ZH)和德语(COMPS-DE)概念最小对数据集,补充了现有的英语数据集。
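
“最小对 + 诊断探针”的流程可以简化为:取模型某一层对句子的平均激活,用一个线性探针判断该层是否线性编码了目标句法/语义特征。以下草图中的模型与层号仅为示意。

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

@torch.no_grad()
def layer_repr(sent: str, layer: int = 8):
    out = model(**tok(sent, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[layer][0].mean(0).numpy()  # 该层 token 平均激活

def probe_accuracy(minimal_pairs, layer: int = 8) -> float:
    # minimal_pairs: [(可接受句, 不可接受句), ...];探针标签 1/0
    X = [layer_repr(s, layer) for pair in minimal_pairs for s in pair]
    y = [1, 0] * len(minimal_pairs)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    return clf.score(X, y)  # 实际评估应划分训练/测试集并逐层比较
```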

[NLP-27] SecEncoder: Logs are All You Need in Security

【速读】: 该论文试图解决通用大型语言模型(Large Language Models, LMs)在处理特定领域任务(如安全领域)时的不足问题。解决方案的关键在于引入了一个专门针对安全日志数据预训练的小型语言模型——SecEncoder。SecEncoder通过专注于安全日志中的独特语言和模式,显著提升了在安全相关任务中的表现,如事件优先级排序和威胁情报文档检索,超越了主要基于自然语言预训练的模型(如BERT-large、DeBERTa-v3-large和OpenAI的text-embedding-ada-002)。这一研究为未来开发针对特定领域的语言模型及其在安全领域的应用提供了新的方向。

链接: https://arxiv.org/abs/2411.07528
作者: Muhammed Fatih Bulut,Yingqi Liu,Naveed Ahmad,Maximilian Turner,Sami Ait Ouahmane,Cameron Andrews,Lloyd Greenwald
关键词-EN: Book Corpus, publicly accessible platforms, Large and Small, volumes of text, web scraping
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large and Small Language Models (LMs) are typically pretrained using extensive volumes of text, which are sourced from publicly accessible platforms such as Wikipedia, Book Corpus, or through web scraping. These models, due to their exposure to a wide range of language data, exhibit impressive generalization capabilities and can perform a multitude of tasks simultaneously. However, they often fall short when it comes to domain-specific tasks due to their broad training data. This paper introduces SecEncoder, a specialized small language model that is pretrained using security logs. SecEncoder is designed to address the domain-specific limitations of general LMs by focusing on the unique language and patterns found in security logs. Experimental results indicate that SecEncoder outperforms other LMs, such as BERT-large, DeBERTa-v3-large and OpenAI’s Embedding (text-embedding-ada-002) models, which are pretrained mainly on natural language, across various tasks. Furthermore, although SecEncoder is primarily pretrained on log data, it outperforms models pretrained on natural language for a range of tasks beyond log analysis, such as incident prioritization and threat intelligence document retrieval. This suggests that domain specific pretraining with logs can significantly enhance the performance of LMs in security. These findings pave the way for future research into security-specific LMs and their potential applications.
摘要:大语言模型和小语言模型 (LMs) 通常使用大量文本进行预训练,这些文本来源于公开可访问的平台,如维基百科、书籍语料库或通过网络爬虫获取。由于接触到广泛的语言数据,这些模型展现出令人印象深刻的泛化能力,并能同时执行多种任务。然而,在面对特定领域的任务时,由于其广泛的训练数据,它们往往表现不佳。本文介绍了 SecEncoder,一种专门针对安全日志进行预训练的小语言模型。SecEncoder 旨在通过专注于安全日志中独特的语言和模式,来解决通用语言模型在特定领域中的局限性。实验结果表明,在各种任务中,SecEncoder 的表现优于其他语言模型,如 BERT-large、DeBERTa-v3-large 和 OpenAI 的 Embedding (text-embedding-ada-002) 模型,这些模型主要基于自然语言进行预训练。此外,尽管 SecEncoder 主要基于日志数据进行预训练,但在日志分析之外的一系列任务中,如事件优先级排序和威胁情报文档检索,其表现也优于基于自然语言预训练的模型。这表明,使用日志进行特定领域的预训练可以显著提升语言模型在安全领域的性能。这些发现为未来研究安全专用语言模型及其潜在应用铺平了道路。

[NLP-28] Prompt-enhanced Network for Hateful Meme Classification

【速读】: 该论文试图解决社交媒体平台上仇恨表情包(hateful memes)的高效识别和移除问题。解决方案的关键在于开发了一种基于提示学习(prompt learning)的提示增强网络框架(Pen)。具体来说,Pen通过提示方法构建序列并使用语言模型进行编码,然后进行区域信息的全局提取以实现多视角感知。此外,引入提示感知对比学习(prompt-aware contrastive learning)以增强模型在特征空间中的推理能力,从而提高样本特征分布的质量。通过这些方法,Pen显著提升了模型在仇恨表情包分类任务中的准确性和泛化能力。

链接: https://arxiv.org/abs/2411.07527
作者: Junxi Liu,Yanyan Feng,Jiehai Chen,Yun Xue,Fenghuan Li
关键词-EN: hateful meme classification, media platforms, social media, multimodal hateful meme, hateful meme
类目: Computation and Language (cs.CL)
备注: Published in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence Main Track. Pages 6397-6405

点击查看摘要

Abstract:The dynamic expansion of social media has led to an inundation of hateful memes on media platforms, accentuating the growing need for efficient identification and removal. Acknowledging the constraints of conventional multimodal hateful meme classification, which heavily depends on external knowledge and poses the risk of including irrelevant or redundant content, we developed Pen – a prompt-enhanced network framework based on the prompt learning approach. Specifically, after constructing the sequence through the prompt method and encoding it with a language model, we performed region information global extraction on the encoded sequence for multi-view perception. By capturing global information about inference instances and demonstrations, Pen facilitates category selection by fully leveraging sequence information. This approach significantly improves model classification accuracy. Additionally, to bolster the model’s reasoning capabilities in the feature space, we introduced prompt-aware contrastive learning into the framework to improve the quality of sample feature distributions. Through extensive ablation experiments on two public datasets, we evaluate the effectiveness of the Pen framework, concurrently comparing it with state-of-the-art model baselines. Our research findings highlight that Pen surpasses manual prompt methods, showcasing superior generalization and classification accuracy in hateful meme classification tasks. Our code is available at this https URL.
摘要:社交媒体的动态扩展导致了媒体平台上仇恨表情包的泛滥,加剧了对高效识别和移除的迫切需求。鉴于传统多模态仇恨表情包分类方法对外部知识的严重依赖,并存在包含无关或冗余内容的潜在风险,我们开发了 Pen——一种基于提示学习方法的提示增强网络框架。具体而言,通过提示方法构建序列并使用语言模型进行编码后,我们对编码序列进行区域信息全局提取,以实现多视角感知。通过捕捉推理实例和示范的全局信息,Pen 充分利用序列信息促进类别选择,显著提高了模型分类准确性。此外,为了增强模型在特征空间中的推理能力,我们将提示感知对比学习引入框架,以提升样本特征分布的质量。通过在两个公开数据集上的广泛消融实验,我们评估了 Pen 框架的有效性,并同时将其与最先进的模型基线进行比较。研究结果表明,Pen 在仇恨表情包分类任务中超越了手动提示方法,展示了卓越的泛化能力和分类准确性。我们的代码可在以下链接获取:https URL。

[NLP-29] Fair Summarization: Bridging Quality and Diversity in Extractive Summaries NEURIPS2024

【速读】: 该论文试图解决多文档摘要中用户生成内容的公平性问题,即现有摘要方法在不同社会群体间未能确保公平代表性,导致输出偏见。解决方案的关键在于提出了两种新颖的公平抽取式摘要方法:基于聚类的FairExtract和利用GPT-3.5-turbo并加入公平性约束的FairGPT。这两种方法通过在Divsumm数据集上进行评估,并与相关基线方法对比,展示了在保持摘要质量的同时显著提升公平性的能力。论文还引入了综合质量与公平性的复合评价指标(如SUPERT+F, BLANC+F),为理解和平衡这两个目标提供了更细致的框架。

链接: https://arxiv.org/abs/2411.07521
作者: Sina Bagheri Nezhad,Sayan Bandyapadhyay,Ameeta Agrawal
关键词-EN: natural language processing, user-generated content remains, language processing, user-generated content, content remains
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at Algorithmic Fairness through the Lens of Metrics and Evaluation Workshop @ NeurIPS 2024

点击查看摘要

Abstract:Fairness in multi-document summarization of user-generated content remains a critical challenge in natural language processing (NLP). Existing summarization methods often fail to ensure equitable representation across different social groups, leading to biased outputs. In this paper, we introduce two novel methods for fair extractive summarization: FairExtract, a clustering-based approach, and FairGPT, which leverages GPT-3.5-turbo with fairness constraints. We evaluate these methods using Divsumm summarization dataset of White-aligned, Hispanic, and African-American dialect tweets and compare them against relevant baselines. The results obtained using a comprehensive set of summarization quality metrics such as SUPERT, BLANC, SummaQA, BARTScore, and UniEval, as well as a fairness metric F, demonstrate that FairExtract and FairGPT achieve superior fairness while maintaining competitive summarization quality. Additionally, we introduce composite metrics (e.g., SUPERT+F, BLANC+F) that integrate quality and fairness into a single evaluation framework, offering a more nuanced understanding of the trade-offs between these objectives. This work highlights the importance of fairness in summarization and sets a benchmark for future research in fairness-aware NLP models.
摘要:用户生成内容的多文档摘要中的公平性仍然是自然语言处理 (NLP) 中的一个关键挑战。现有的摘要方法往往无法确保不同社会群体之间的公平代表性,导致输出结果存在偏见。本文中,我们提出了两种新的公平抽取式摘要方法:基于聚类的 FairExtract 和利用 GPT-3.5-turbo 并加入公平性约束的 FairGPT。我们使用包含白人、西班牙裔和非裔美国人方言推文的 Divsumm 摘要数据集对这些方法进行评估,并与相关基线方法进行比较。通过使用一系列全面的摘要质量指标(如 SUPERT、BLANC、SummaQA、BARTScore 和 UniEval)以及公平性指标 F 进行评估,结果表明 FairExtract 和 FairGPT 在保持竞争性摘要质量的同时,实现了更高的公平性。此外,我们引入了综合指标(如 SUPERT+F、BLANC+F),将质量和公平性整合到一个单一的评估框架中,提供了对这些目标之间权衡的更细致理解。这项工作强调了摘要中公平性的重要性,并为未来公平感知 NLP 模型的研究设定了基准。

[NLP-30] SparrowVQE: Visual Question Explanation for Course Content Understanding

【速读】: 该论文试图解决视觉问答(Visual Question Answering, VQA)方法通常只能提供过于简单和简短答案的问题。解决方案的关键在于引入视觉问题解释(Visual Question Explanation, VQE),通过增强VQA系统提供详细解释的能力,以满足与视觉内容更复杂交互的需求。具体实现包括创建MLVQE数据集,提出SparrowVQE模型,并采用三阶段训练机制:多模态预训练(图像和文本特征对齐)、指令调优(使用文本和QA对调优预训练模型)和领域微调(微调图像和QA对)。最终,SparrowVQE结合SigLIP模型和Phi-2语言模型,通过MLP适配器实现对视觉信息和文本的理解与连接,实验结果表明其在MLVQE数据集和其他五个基准VQA数据集上均表现优异。

链接: https://arxiv.org/abs/2411.07516
作者: Jialu Li,Manish Kumar Thota,Ruslan Gokhman,Radek Holik,Youshan Zhang
关键词-EN: Visual Question Answering, yield overly simplistic, Question Answering, Visual Question Explanation, Visual Question
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Visual Question Answering (VQA) research seeks to create AI systems to answer natural language questions in images, yet VQA methods often yield overly simplistic and short answers. This paper aims to advance the field by introducing Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations rather than brief responses and address the need for more complex interaction with visual content. We first created an MLVQE dataset from a 14-week streamed video machine learning course, including 885 slide images, 110,407 words of transcripts, and 9,416 designed question-answer (QA) pairs. Next, we proposed a novel SparrowVQE, a small 3 billion parameters multimodal model. We trained our model with a three-stage training mechanism consisting of multimodal pre-training (slide images and transcripts feature alignment), instruction tuning (tuning the pre-trained model with transcripts and QA pairs), and domain fine-tuning (fine-tuning slide image and QA pairs). Eventually, our SparrowVQE can understand and connect visual information using the SigLIP model with transcripts using the Phi-2 language model with an MLP adapter. Experimental results demonstrate that our SparrowVQE achieves better performance in our developed MLVQE dataset and outperforms state-of-the-art methods in the other five benchmark VQA datasets. The source code is available at this https URL.
摘要:视觉问答 (Visual Question Answering, VQA) 研究旨在创建能够回答图像中自然语言问题的 AI 系统,然而现有的 VQA 方法往往只能提供过于简单且简短的答案。本文旨在通过引入视觉问答解释 (Visual Question Explanation, VQE) 来推动该领域的发展,VQE 增强了 VQA 提供详细解释而非简短回答的能力,并满足了与视觉内容进行更复杂交互的需求。我们首先从一门为期 14 周的流媒体视频机器学习课程中创建了 MLVQE 数据集,该数据集包括 885 张幻灯片图像、110,407 字的转录文本以及 9,416 对设计好的问答 (QA) 对。接着,我们提出了一种新颖的 SparrowVQE,这是一个拥有 30 亿参数的多模态模型。我们采用三阶段训练机制对模型进行了训练,包括多模态预训练(幻灯片图像与转录文本特征对齐)、指令微调(使用转录文本和 QA 对微调预训练模型)以及领域微调(微调幻灯片图像和 QA 对)。最终,我们的 SparrowVQE 能够使用 SigLIP 模型理解并连接视觉信息,并通过带有 MLP 适配器的 Phi-2 语言模型处理转录文本。实验结果表明,SparrowVQE 在我们开发的 MLVQE 数据集上表现优异,并在其他五个基准 VQA 数据集上超越了当前最先进的方法。源代码可在 this https URL 获取。

[NLP-31] Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

【速读】: 该论文试图解决大语言模型 (LLMs) 在面对滥用时的安全性问题。解决方案的关键在于开发快速响应技术,通过观察少数攻击实例后迅速阻止整个类别的越狱攻击。论文提出了RapidResponseBench基准,用于评估防御措施在适应少量观察到的攻击实例后对各种越狱策略的鲁棒性。研究的核心方法是通过自动生成与观察到的攻击相似的额外越狱攻击(jailbreak proliferation),并微调输入分类器以阻止这些生成的攻击。实验结果显示,这种方法在观察到每个越狱策略的一个实例后,能够显著降低攻击成功率,特别是在分布内和分布外的越狱攻击上分别降低了240倍和15倍。此外,研究还表明,生成模型的质量和生成的示例数量对防御效果有重要影响。总体而言,论文强调了快速响应新越狱攻击以限制LLM滥用的潜力。

链接: https://arxiv.org/abs/2411.07494
作者: Alwin Peng,Julian Michael,Henry Sleight,Ethan Perez,Mrinank Sharma
关键词-EN: large language models, grow more powerful, ensuring their safety, large language, jailbreaks
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques to look to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench, a benchmark that measures a defense’s robustness against various jailbreak strategies after adapting to a few observed examples. We evaluate five rapid response methods, all of which use jailbreak proliferation, where we automatically generate additional jailbreaks similar to the examples observed. Our strongest method, which fine-tunes an input classifier to block proliferated jailbreaks, reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set, having observed just one example of each jailbreaking strategy. Moreover, further studies suggest that the quality of proliferation model and number of proliferated examples play an key role in the effectiveness of this defense. Overall, our results highlight the potential of responding rapidly to novel jailbreaks to limit LLM misuse.
摘要:随着大语言模型 (LLM) 的能力不断增强,确保其免受滥用变得至关重要。尽管研究人员致力于开发强大的防御措施,但目前尚无方法能够完全抵御所有攻击。我们提出了一种替代方法:与其追求完美的对抗性鲁棒性,我们开发了快速响应技术,旨在仅通过观察少数攻击实例后,便能阻止整个类别的越狱攻击。为了研究这一场景,我们开发了 RapidResponseBench,这是一个基准测试,用于衡量防御措施在适应少数观察到的示例后,对各种越狱策略的鲁棒性。我们评估了五种快速响应方法,这些方法均利用了越狱攻击的扩散,即自动生成与观察到的示例类似的额外越狱攻击。我们最强大的方法是通过微调输入分类器来阻止扩散的越狱攻击,在分布内越狱攻击集上将攻击成功率降低了超过 240 倍,在分布外越狱攻击集上降低了超过 15 倍,仅观察到每种越狱策略的一个示例。此外,进一步的研究表明,扩散模型的质量和扩散示例的数量在防御效果中起着关键作用。总体而言,我们的研究结果突显了快速响应新型越狱攻击以限制大语言模型滥用的潜力。
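
“越狱扩增 + 输入分类器”这一最强方法的骨架可以如下示意:用生成模型对观察到的单个越狱样本做同策略改写(此处以占位函数表示),再训练一个轻量分类器拦截同类输入。分类器用 TF-IDF + 逻辑回归只是为了自洽可运行,论文中微调的是更强的输入分类器。

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def proliferate(jailbreak: str, n: int = 50) -> list[str]:
    """占位函数(假设):调用任意生成模型,产生与 jailbreak 同策略的改写。"""
    raise NotImplementedError("接入改写/生成模型")

def build_input_guard(observed_jailbreaks, benign_prompts):
    attacks = []
    for jb in observed_jailbreaks:           # 每种越狱策略只需观察到一例
        attacks += [jb] + proliferate(jb)
    X = list(attacks) + list(benign_prompts)
    y = [1] * len(attacks) + [0] * len(benign_prompts)
    guard = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    return guard.fit(X, y)   # guard.predict([prompt]) == 1 时拒绝该请求
```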

[NLP-32] Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models

【速读】: 该论文试图解决的问题是评估语言模型(LMs)在低资源语言中的句法泛化能力,特别是针对巴斯克语、印地语和斯瓦希里语这三种低资源语言。解决方案的关键在于开发针对这些低资源语言的句法评估测试,并使用这些测试来评估五个开放访问的多语言Transformer LMs。通过这些测试,研究揭示了不同句法任务的难度差异,例如在巴斯克语中处理包含间接宾语的句子的一致性问题,以及在斯瓦希里语中处理介词短语跨度的一致性问题。此外,研究还发现了公开可用Transformer模型中的一些问题,如多语言BERT在印地语中对习惯性方面的偏见,以及XGLM-4.5B在性能上不如类似大小的模型。

链接: https://arxiv.org/abs/2411.07474
作者: Daria Kryvosheieva,Roger Levy
关键词-EN: human-like syntactic knowledge, capable of acquiring, acquiring elements, elements of human-like, Targeted syntactic evaluation
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models (LMs) are capable of acquiring elements of human-like syntactic knowledge. Targeted syntactic evaluation tests have been employed to measure how well they form generalizations about syntactic phenomena in high-resource languages such as English. However, we still lack a thorough understanding of LMs’ capacity for syntactic generalizations in low-resource languages, which are responsible for much of the diversity of syntactic patterns worldwide. In this study, we develop targeted syntactic evaluation tests for three low-resource languages (Basque, Hindi, and Swahili) and use them to evaluate five families of open-access multilingual Transformer LMs. We find that some syntactic tasks prove relatively easy for LMs while others (agreement in sentences containing indirect objects in Basque, agreement across a prepositional phrase in Swahili) are challenging. We additionally uncover issues with publicly available Transformers, including a bias toward the habitual aspect in Hindi in multilingual BERT and underperformance compared to similar-sized models in XGLM-4.5B.
摘要:语言模型(Language Models, LMs)能够获取类似人类的句法知识元素。针对高资源语言(如英语)的句法现象,已经采用了专门的句法评估测试来衡量这些模型在这些语言中的泛化能力。然而,对于低资源语言中LMs的句法泛化能力,我们仍然缺乏全面的理解,而这些低资源语言正是全球句法模式多样性的主要来源。在本研究中,我们为三种低资源语言(巴斯克语、印地语和斯瓦希里语)开发了专门的句法评估测试,并使用这些测试来评估五个系列的开放访问多语言Transformer LMs。我们发现,某些句法任务对LMs来说相对容易,而其他任务(如巴斯克语中包含间接宾语的句子的一致性,以及斯瓦希里语中介词短语跨度的一致性)则具有挑战性。此外,我们还发现了公开可用Transformer模型的一些问题,包括多语言BERT在印地语中对习惯性方面的偏见,以及XGLM-4.5B在性能上不如类似规模的模型。
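
针对性句法评估的打分方式本质上只有一步:比较语言模型赋予最小对中“合法句”与“违例句”的总对数概率,前者更高记为答对。下面以 GPT-2 为例给出可运行的示意(论文评估的是多语言模型,此处仅演示方法)。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def sent_logprob(sent: str) -> float:
    ids = tok(sent, return_tensors="pt").input_ids
    out = lm(ids, labels=ids)                      # loss 为平均负对数似然
    return -out.loss.item() * (ids.shape[1] - 1)   # 还原为总对数概率

def prefers_grammatical(good: str, bad: str) -> bool:
    return sent_logprob(good) > sent_logprob(bad)

# 一致性最小对示例(示意)
print(prefers_grammatical("The keys to the cabinet are here.",
                          "The keys to the cabinet is here."))
```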

[NLP-33] IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark

【速读】: 该论文试图解决当前大语言模型(LLMs)在指代消解(coreference resolution)评估中,传统输出格式和评估指标未能充分捕捉模型对指代理解的问题。解决方案的关键在于引入了一个新的基准测试——IdentifyMe,该基准采用多项选择题(MCQ)格式,通过长篇叙述和排除易于识别的指代,创建了一个更具挑战性的任务。IdentifyMe还包含了不同类型的指代和对应实体的混合,允许对模型性能进行细粒度分析。通过在IdentifyMe上评估闭源和开源LLMs,研究发现,当前最先进的10亿参数以下的开源模型与闭源模型之间存在显著的性能差距(20-30%),并揭示了模型在处理代词指代和嵌套结构中指代重叠时的困难。

链接: https://arxiv.org/abs/2411.07466
作者: Kawshik Manikantan,Makarand Tapaswi,Vineet Gandhi,Shubham Toshniwal
关键词-EN: Recent evaluations, traditional output formats, models’ referential understanding, evaluation metrics, revealed that traditional
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Recent evaluations of LLMs on coreference resolution have revealed that traditional output formats and evaluation metrics do not fully capture the models’ referential understanding. To address this, we introduce IdentifyMe, a new benchmark for mention resolution presented in a multiple-choice question (MCQ) format, commonly used for evaluating LLMs. IdentifyMe features long narratives and employs heuristics to exclude easily identifiable mentions, creating a more challenging task. The benchmark also consists of a curated mixture of different mention types and corresponding entities, allowing for a fine-grained analysis of model performance. We evaluate both closed- and open source LLMs on IdentifyMe and observe a significant performance gap (20-30%) between the state-of-the-art sub-10B open models vs. closed ones. We observe that pronominal mentions, which have limited surface information, are typically much harder for models to resolve than nominal mentions. Additionally, we find that LLMs often confuse entities when their mentions overlap in nested structures. The highest-scoring model, GPT-4o, achieves 81.9% accuracy, highlighting the strong referential capabilities of state-of-the-art LLMs while also indicating room for further improvement.
摘要:最近对大语言模型 (LLM) 在指代消解 (coreference resolution) 方面的评估显示,传统的输出格式和评估指标未能充分捕捉模型的指代理解能力。为此,我们引入了 IdentifyMe,这是一个新的指代消解基准测试,采用多选题 (MCQ) 格式,这种格式常用于评估大语言模型。IdentifyMe 包含长篇叙述,并运用启发式方法排除易于识别的指代,从而创建更具挑战性的任务。该基准测试还包括精心挑选的不同指代类型及其对应实体的混合体,允许对模型性能进行细致分析。我们在 IdentifyMe 上评估了闭源和开源的大语言模型,并观察到最先进的子 10B 开源模型与闭源模型之间存在显著的性能差距(20-30%)。我们发现,代词指代(具有有限的表面信息)通常比名词指代更难被模型解析。此外,我们还发现,当指代在嵌套结构中重叠时,大语言模型常常混淆实体。得分最高的模型 GPT-4o 达到了 81.9% 的准确率,突显了最先进大语言模型在指代能力方面的强大表现,同时也表明仍有进一步改进的空间。

[NLP-34] BudgetMLAgent : A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks

【速读】: 该论文试图解决在复杂机器学习任务中,大型语言模型(LLMs)如GPT-4在生成代码时成本高且效果不佳的问题。解决方案的关键在于提出了一种基于多智能体(Multi-Agent)的系统,该系统通过组合专家模型、利用性能分析、高效检索过往观察、LLM级联和专家咨询调用等策略,显著降低了成本并提高了任务成功率。具体来说,该系统以低成本模型Gemini作为基础LLM,结合GPT-4进行级联和专家咨询,实现了在MLAgentBench基准测试中平均成功率从22.72%提升至32.95%,同时成本降低了94.2%。

链接: https://arxiv.org/abs/2411.07464
作者: Shubham Gandhi,Manasi Patwardhan,Lovekesh Vig,Gautam Shroff
关键词-EN: complex Machine Learning, Large Language Models, Large Language, Machine Learning, diverse applications including
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Presented at AIMLSystems '24

点击查看摘要

Abstract:Large Language Models (LLMs) excel in diverse applications including generation of code snippets, but often struggle with generating code for complex Machine Learning (ML) tasks. Although existing LLM single-agent based systems give varying performance depending on the task complexity, they purely rely on larger and expensive models such as GPT-4. Our investigation reveals that no-cost and low-cost models such as Gemini-Pro, Mixtral and CodeLlama perform far worse than GPT-4 in a single-agent setting. With the motivation of developing a cost-efficient LLM based solution for solving ML tasks, we propose an LLM Multi-Agent based system which leverages combination of experts using profiling, efficient retrieval of past observations, LLM cascades, and ask-the-expert calls. Through empirical analysis on ML engineering tasks in the MLAgentBench benchmark, we demonstrate the effectiveness of our system, using no-cost models, namely Gemini as the base LLM, paired with GPT-4 in cascade and expert to serve occasional ask-the-expert calls for planning. With 94.2% reduction in the cost (from $0.931 per run cost averaged over all tasks for GPT-4 single agent system to $0.054), our system is able to yield better average success rate of 32.95% as compared to GPT-4 single-agent system yielding 22.72% success rate averaged over all the tasks of MLAgentBench.
摘要:大语言模型(LLMs)在多种应用中表现出色,包括代码片段的生成,但在处理复杂机器学习(ML)任务时往往表现不佳。尽管现有的基于单一智能体的LLM系统在不同任务复杂度下表现各异,但它们完全依赖于GPT-4等更大、更昂贵的模型。我们的研究揭示,在单一智能体环境下,如Gemini-Pro、Mixtral和CodeLlama等无成本或低成本模型,其性能远不及GPT-4。基于开发一种成本效益高的LLM解决方案来解决ML任务的动机,我们提出了一种基于多智能体的LLM系统,该系统通过专家组合、过往观察的高效检索、LLM级联以及专家咨询调用等手段来实现。通过在MLAgentBench基准上的ML工程任务的实证分析,我们展示了该系统的有效性,使用无成本模型Gemini作为基础LLM,并与GPT-4级联和专家配对,以应对偶尔的专家咨询调用进行规划。与GPT-4单一智能体系统相比,我们的系统在成本上减少了94.2%(从每运行成本平均为0.931美元降至0.054美元),并且在MLAgentBench的所有任务中,平均成功率从22.72%提升至32.95%。
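
LLM 级联的控制逻辑本身很简单:先调用低成本模型,只有在自评置信度不足时才升级到昂贵模型,从而压低均摊成本。下面是一个示意草图,call_cheap / call_expensive / confidence 均为占位函数(假设),需接入实际模型。

```python
def call_cheap(prompt: str) -> str:
    raise NotImplementedError("接入低成本模型,例如 Gemini(假设)")

def call_expensive(prompt: str) -> str:
    raise NotImplementedError("接入高成本模型,例如 GPT-4(假设)")

def confidence(prompt: str, answer: str) -> float:
    raise NotImplementedError("自评置信度,可由低成本模型打分(假设)")

def cascade(prompt: str, threshold: float = 0.7) -> str:
    answer = call_cheap(prompt)
    if confidence(prompt, answer) >= threshold:
        return answer                 # 绝大多数请求在低成本层解决
    return call_expensive(prompt)     # 少数疑难请求升级,类似论文中的专家咨询
```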

[NLP-35] DecoPrompt : Decoding Prompts Reduces Hallucinations when Large Language Models Meet False Premises

【速读】: 该论文试图解决大型语言模型(LLMs)在面对错误前提(false premises)时产生的幻觉输出问题。解决方案的关键在于提出了一种名为DecoPrompt的新提示算法,该算法通过利用LLMs“解码”错误前提提示,而不实际引发幻觉输出来缓解幻觉现象。DecoPrompt的核心在于其能够有效减少不同LLMs输出的幻觉,并展示了跨模型的可迁移性,从而适用于大型LLMs或无法获取模型逻辑的场景。

链接: https://arxiv.org/abs/2411.07457
作者: Nan Xu,Xuezhe Ma
关键词-EN: demonstrated increasing power, factually correct statements, increasing power, correct statements, demonstrated increasing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have demonstrated increasing power, they have also called upon studies on their hallucinated outputs that deviate from factually correct statements. In this paper, we focus on one important scenario of false premises, where LLMs are distracted by misaligned claims although the model possesses the required factual knowledge to answer original questions accurately. Inspired by the observation that entropy of the false-premise prompt is closely related to its likelihood to elicit hallucination generation, we propose a new prompting algorithm, named DecoPrompt, to mitigate hallucination. DecoPrompt leverages LLMs to “decode” the false-premise prompts without really eliciting hallucination output from LLMs. We perform experiments on two datasets, demonstrating that DecoPrompt can reduce hallucinations effectively on outputs from different LLMs. Moreover, DecoPrompt exhibits cross-model transferability, which facilitates its applications to scenarios such as LLMs of large sizes or unavailable model logits.
摘要:尽管大语言模型 (Large Language Models, LLMs) 展示了日益增强的能力,但它们也引发了关于其输出偏离事实正确陈述的幻觉现象的研究。本文聚焦于一个重要的错误前提场景,即尽管模型具备准确回答原始问题所需的事实知识,但仍被误导性陈述所干扰。受观察到错误前提提示的熵与其引发幻觉生成的可能性密切相关的启发,我们提出了一种新的提示算法,名为 DecoPrompt,以减轻幻觉现象。DecoPrompt 利用大语言模型在不真正引发幻觉输出的情况下“解码”错误前提提示。我们在两个数据集上进行了实验,结果表明 DecoPrompt 能够有效减少不同大语言模型输出的幻觉现象。此外,DecoPrompt 展示了跨模型的可转移性,这有助于其在大型大语言模型或不可用模型对数等场景中的应用。
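
DecoPrompt 的出发点是“错误前提提示的熵与其诱发幻觉的可能性密切相关”。下面给出计算提示熵的一个最小草图:让语言模型“解码”提示本身,取逐 token 预测分布的香农熵的平均值;模型选择仅为示意。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def prompt_entropy(prompt: str) -> float:
    ids = tok(prompt, return_tensors="pt").input_ids
    logits = lm(ids).logits[0, :-1]           # 每个位置对下一 token 的预测分布
    logp = torch.log_softmax(logits, dim=-1)
    ent = -(logp.exp() * logp).sum(-1)        # 逐位置香农熵
    return ent.mean().item()

# 按论文观察,熵越高的错误前提提示越可能诱发幻觉(示例为虚构的错误前提)
print(prompt_entropy("The Eiffel Tower in Rome was built in 1889."))
```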

[NLP-36] Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection

【速读】: 该论文试图解决自动提示工程中,现有方法仅利用当前步骤的反馈信息,而忽视历史和未选择的反馈信息的问题,以及示例选择仅考虑一般语义关系,可能与优化后的提示不匹配的问题。解决方案的关键在于提出了一个带有记忆机制的示例引导反思方法(Exemplar-Guided Reflection with Memory mechanism, ERM)。具体来说,该方法设计了一个示例引导的反思机制,其中反馈生成不仅依赖于当前信息,还受到生成的示例的引导。此外,构建了两种记忆机制,以充分利用历史反馈信息,并支持更有效的示例检索,从而实现更高效和准确的提示优化。

链接: https://arxiv.org/abs/2411.07446
作者: Cilin Yan,Jingyun Wang,Lin Zhang,Ruihui Zhao,Xiaopu Wu,Kai Xiong,Qingsong Liu,Guoliang Kang,Yangyang Kang
关键词-EN: Automatic prompt engineering, large language models, prompt engineering aims, Automatic prompt, language models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automatic prompt engineering aims to enhance the generation quality of large language models (LLMs). Recent works utilize feedbacks generated from erroneous cases to guide the prompt optimization. During inference, they may further retrieve several semantically-related exemplars and concatenate them to the optimized prompts to improve the performance. However, those works only utilize the feedback at the current step, ignoring historical and unselected feedbacks which are potentially beneficial. Moreover, the selection of exemplars only considers the general semantic relationship and may not be optimal in terms of task performance and matching with the optimized prompt. In this work, we propose an Exemplar-Guided Reflection with Memory mechanism (ERM) to realize more efficient and accurate prompt optimization. Specifically, we design an exemplar-guided reflection mechanism where the feedback generation is additionally guided by the generated exemplars. We further build two kinds of memory to fully utilize the historical feedback information and support more effective exemplar retrieval. Empirical evaluations show our method surpasses previous state-of-the-arts with fewer optimization steps, i.e., improving F1 score by 10.1 on LIAR dataset, and reducing half of the optimization steps on ProTeGi.
摘要:自动提示工程旨在提升大语言模型(LLM)的生成质量。近期的工作利用从错误案例中生成的反馈来指导提示优化。在推理过程中,它们可能会进一步检索若干语义相关的示例,并将这些示例与优化后的提示连接起来,以提高性能。然而,这些工作仅利用当前步骤的反馈,忽略了历史和未被选中的反馈,这些反馈可能具有潜在的益处。此外,示例的选择仅考虑了一般的语义关系,可能在任务性能和与优化提示的匹配度方面并非最优。在本研究中,我们提出了一种带有记忆机制的示例引导反思方法(ERM),以实现更高效和准确的提示优化。具体而言,我们设计了一种示例引导的反思机制,其中反馈生成额外受到生成示例的引导。我们进一步构建了两种记忆机制,以充分利用历史反馈信息,并支持更有效的示例检索。实证评估显示,我们的方法在优化步骤较少的情况下超越了以往的最先进技术,例如在LIAR数据集上将F1分数提高了10.1,并在ProTeGi上减少了优化步骤的一半。

[NLP-37] Untangling Hate Speech Definitions: A Semantic Componential Analysis Across Cultures and Domains

【速读】: 该论文试图解决跨文化和跨领域中仇恨言论定义的多样性和复杂性问题。解决方案的关键在于提出了一个语义成分分析框架(Semantic Componential Analysis, SCA),通过从五个不同领域(在线词典、研究论文、维基百科文章、立法和在线平台)收集仇恨言论定义,并将其分解为语义成分进行分析。研究发现,不同领域的定义存在差异,但许多领域在定义仇恨言论时并未充分考虑目标文化的影响。论文还通过零样本模型实验,使用三种流行的开源大型语言模型(LLMs)来评估不同定义对仇恨言论检测的影响,结果表明LLMs对定义的复杂性敏感,检测结果会随定义的复杂性而变化。

链接: https://arxiv.org/abs/2411.07417
作者: Katerina Korre,Arianna Muti,Federico Ruggeri,Alberto Barrón-Cedeño
关键词-EN: varying individual interpretations, speech relies heavily, Hate speech relies, Hate speech, Semantic Componential Analysis
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hate speech relies heavily on cultural influences, leading to varying individual interpretations. For that reason, we propose a Semantic Componential Analysis (SCA) framework for a cross-cultural and cross-domain analysis of hate speech definitions. We create the first dataset of definitions derived from five domains: online dictionaries, research papers, Wikipedia articles, legislation, and online platforms, which are later analyzed into semantic components. Our analysis reveals that the components differ from definition to definition, yet many domains borrow definitions from one another without taking into account the target culture. We conduct zero-shot model experiments using our proposed dataset, employing three popular open-sourced LLMs to understand the impact of different definitions on hate speech detection. Our findings indicate that LLMs are sensitive to definitions: responses for hate speech detection change according to the complexity of definitions used in the prompt.
摘要:仇恨言论严重依赖于文化影响,导致个体对其解释存在差异。为此,我们提出了一个语义成分分析 (Semantic Componential Analysis, SCA) 框架,用于跨文化和跨领域的仇恨言论定义分析。我们创建了首个包含五个领域定义的数据集:在线词典、研究论文、维基百科文章、法律法规和在线平台,这些定义随后被分析为语义成分。我们的分析发现,这些成分在不同定义之间存在差异,然而许多领域在不考虑目标文化的情况下相互借鉴定义。我们使用所提出的数据集进行了零样本模型实验,采用三种流行的开源大语言模型 (LLM) 来理解不同定义对仇恨言论检测的影响。研究结果表明,大语言模型对定义敏感:仇恨言论检测的响应会根据提示中使用的定义复杂性而变化。

[NLP-38] Using Generative AI and Multi-Agents to Provide Automatic Feedback

【速读】: 该论文试图解决生成式 AI (Generative AI) 在教育评估中提供自动反馈时常见的过度表扬 (over-praise) 和过度推断 (over-inference) 问题。解决方案的关键在于开发了一种名为 AutoFeedback 的多智能体系统 (multi-agent system),该系统由两个 AI 智能体组成:一个负责生成反馈,另一个负责验证和优化反馈。通过这种双智能体协作机制,AutoFeedback 显著减少了单一大型语言模型 (LLM) 中常见的错误,提供了更准确且教育学上合理的反馈。这一研究结果表明,多智能体系统在教育环境中提供自动化反馈方面具有更高的可靠性和潜力,为个性化学习支持提供了可扩展的解决方案。

链接: https://arxiv.org/abs/2411.07407
作者: Shuchen Guo,Ehsan Latif,Yifan Zhou,Xuan Huang,Xiaoming Zhai
关键词-EN: provide automatic feedback, multi-agent systems, provide automatic, student constructed responses, feedback
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study investigates the use of generative AI and multi-agent systems to provide automatic feedback in educational contexts, particularly for student constructed responses in science assessments. The research addresses a key gap in the field by exploring how multi-agent systems, called AutoFeedback, can improve the quality of GenAI-generated feedback, overcoming known issues such as over-praise and over-inference that are common in single-agent large language models (LLMs). The study developed a multi-agent system consisting of two AI agents: one for generating feedback and another for validating and refining it. The system was tested on a dataset of 240 student responses, and its performance was compared to that of a single-agent LLM. Results showed that AutoFeedback significantly reduced the occurrence of over-praise and over-inference errors, providing more accurate and pedagogically sound feedback. The findings suggest that multi-agent systems can offer a more reliable solution for generating automated feedback in educational settings, highlighting their potential for scalable and personalized learning support. These results have important implications for educators and researchers seeking to leverage AI in formative assessments, offering a pathway to more effective feedback mechanisms that enhance student learning outcomes.
摘要:本研究探讨了在教育环境中使用生成式 AI (Generative AI) 和多智能体系统 (multi-agent systems) 提供自动反馈,特别是在科学评估中学生构建的回答。该研究通过探索名为 AutoFeedback 的多智能体系统如何提高生成式 AI 生成的反馈质量,填补了该领域的一个关键空白,克服了单智能体大语言模型 (LLM) 中常见的过度表扬和过度推断等问题。研究开发了一个由两个 AI 智能体组成的多智能体系统:一个用于生成反馈,另一个用于验证和优化反馈。该系统在一个包含 240 份学生回答的数据集上进行了测试,并与单智能体大语言模型的表现进行了比较。结果显示,AutoFeedback 显著减少了过度表扬和过度推断错误的发生,提供了更准确且符合教学原则的反馈。研究结果表明,多智能体系统可以为教育环境中生成自动反馈提供更可靠的解决方案,突显了其在可扩展和个性化学习支持方面的潜力。这些结果对寻求利用 AI 进行形成性评估的教育工作者和研究人员具有重要意义,为提升学生学习成果的有效反馈机制提供了途径。
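
“生成—校验”双智能体结构可以用一个简单循环示意:一个智能体起草反馈,另一个对照评分标准检查是否过度表扬或过度推断,不通过则要求重写。call_llm 为占位函数(假设),提示词也仅为示意。

```python
def call_llm(system: str, user: str) -> str:
    raise NotImplementedError("接入任意大语言模型 API(假设)")

def auto_feedback(response: str, rubric: str, max_rounds: int = 3) -> str:
    feedback = call_llm("你是反馈生成智能体,请依据评分标准给出反馈。",
                        f"评分标准:{rubric}\n学生回答:{response}")
    for _ in range(max_rounds):
        verdict = call_llm(
            "你是反馈校验智能体,检查反馈是否存在过度表扬或过度推断;"
            "合格时仅回复 PASS,否则给出修改意见。",
            f"评分标准:{rubric}\n学生回答:{response}\n待审反馈:{feedback}")
        if verdict.strip() == "PASS":
            break
        feedback = call_llm("你是反馈生成智能体,请按修改意见重写反馈。",
                            f"原反馈:{feedback}\n修改意见:{verdict}")
    return feedback
```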

[NLP-39] Controllable Context Sensitivity and the Knob Behind It

【速读】: 该论文试图解决语言模型在预测时如何平衡依赖上下文与先验知识的问题。解决方案的关键在于寻找一个控制模型对上下文敏感度的“旋钮”,即一个能够决定模型是基于上下文还是先验知识来回答问题的机制。通过设计一个可控上下文敏感度的任务,论文成功地在多个模型(如Llama-3.1, Mistral-v0.3, Gemma-2)中找到了一个1维子空间,该子空间在单层中编码了模型遵循上下文或先验知识的倾向。这一发现不仅适用于微调后的模型,还适用于未微调的指令模型和基础模型,表明这一机制具有普遍性。最终,研究显示模型性能与该子空间中上下文一致与忽略答案的区分度之间存在强相关性,暗示了这一简单机制在控制模型选择上下文或先验知识中的核心作用。

链接: https://arxiv.org/abs/2411.07404
作者: Julian Minder,Kevin Du,Niklas Stoehr,Giovanni Monea,Chris Wendler,Robert West,Ryan Cotterell
关键词-EN: model, context, prior knowledge, making predictions, prior
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When making predictions, a language model must trade off how much it relies on its context vs. its prior knowledge. Choosing how sensitive the model is to its context is a fundamental functionality, as it enables the model to excel at tasks like retrieval-augmented generation and question-answering. In this paper, we search for a knob which controls this sensitivity, determining whether language models answer from the context or their prior knowledge. To guide this search, we design a task for controllable context sensitivity. In this task, we first feed the model a context (Paris is in England) and a question (Where is Paris?); we then instruct the model to either use its prior or contextual knowledge and evaluate whether it generates the correct answer for both intents (either France or England). When fine-tuned on this task, instruction-tuned versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear time algorithm. Then, in each model, we identify a 1-D subspace in a single layer that encodes whether the model follows context or prior knowledge. Interestingly, while we identify this subspace in a fine-tuned model, we find that the exact same subspace serves as an effective knob in not only that model but also non-fine-tuned instruct and base models of that model family. Finally, we show a strong correlation between a model’s performance and how distinctly it separates context-agreeing from context-ignoring answers in this subspace. These results suggest a single subspace facilitates how the model chooses between context and prior knowledge, hinting at a simple fundamental mechanism that controls this behavior.
摘要:在做出预测时,语言模型必须在依赖上下文与其先验知识之间做出权衡。选择模型对上下文的敏感度是一个基本功能,因为它使模型能够在诸如检索增强生成和问答等任务中表现出色。本文中,我们寻找一个控制这种敏感度的旋钮,以确定语言模型是根据上下文还是先验知识来回答问题。为了指导这一搜索,我们设计了一个可控上下文敏感度的任务。在该任务中,我们首先向模型输入一个上下文(巴黎在英格兰)和一个问题(巴黎在哪里?);然后指示模型使用其先验知识或上下文知识,并评估其是否为两种意图(法国或英格兰)生成正确的答案。经过对此任务的微调,Llama-3.1、Mistral-v0.3 和 Gemma-2 的指令微调版本能够以高准确率(85-95%)解决该任务。通过分析这些高性能模型,我们使用一种新颖的线性时间算法,缩小了可能对上下文敏感度重要的层。然后,在每个模型中,我们识别出一个单层中的一维子空间,该子空间编码了模型是遵循上下文还是先验知识。有趣的是,尽管我们在微调模型中识别出这个子空间,但我们发现,同一子空间不仅在该模型中,而且在该模型家族的非微调指令和基础模型中,都同样有效地作为控制旋钮。最后,我们展示了模型性能与在该子空间中区分上下文一致答案和上下文忽略答案的明显程度之间的强相关性。这些结果表明,单一子空间有助于模型在上下文和先验知识之间做出选择,暗示了一个控制这种行为的简单基本机制。
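
“单层一维子空间作为旋钮”对应的干预手段,可以用前向钩子(forward hook)示意:在指定层的隐状态上加 α·v,α 的符号与大小决定模型偏向上下文还是先验知识。下述方向向量 v 用随机向量占位,实际应按论文方法从模型中估计;层号与 α 取值亦为示意。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

LAYER = 6
v = torch.randn(lm.config.n_embd)     # 占位:实际为学到的一维子空间方向
v = v / v.norm()

def steer(alpha: float):
    def hook(module, inputs, output):
        return (output[0] + alpha * v,) + output[1:]  # 干预该层隐状态
    return lm.transformer.h[LAYER].register_forward_hook(hook)

prompt = "Context: Paris is in England. Question: Where is Paris? Answer:"
for alpha in (-8.0, 0.0, 8.0):        # 正负方向分别推向先验或上下文(示意)
    h = steer(alpha)
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=5, do_sample=False)
    h.remove()
    print(alpha, tok.decode(out[0, ids.shape[1]:]))
```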

[NLP-40] Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews

【速读】: 该论文试图解决从移动应用评论中自动提取与伦理问题(如隐私、安全等)相关的评论的挑战。解决方案的关键在于结合自然语言推理(NLI)和大型语言模型(LLM),特别是使用DeBERTa-v3-base-mnli-fever-anli NLI模型和Llama3.1-8B-Instruct LLM,通过深度理解语言细微差别和大规模分类能力,有效提取出之前基于关键词方法未能识别的新隐私相关评论。

链接: https://arxiv.org/abs/2411.07398
作者: Aakash Sorathiya,Gouri Ginde
关键词-EN: concerns surrounding ethics, app reviews, reviews, everyday experiences, surged significantly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:With the increasing proliferation of mobile applications in our everyday experiences, the concerns surrounding ethics have surged significantly. Users generally communicate their feedback, report issues, and suggest new functionalities in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and are expressed using a more varied vocabulary, thus making automated ethical concern-related app review extraction a challenging and time-consuming effort. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Utilizing 43,647 app reviews from the mental health domain, the proposed methodology 1) Evaluates four NLI models to extract potential privacy reviews and compares the results of domain-specific privacy hypotheses with generic privacy hypotheses; 2) Evaluates four LLMs for classifying app reviews to privacy concerns; and 3) Uses the best NLI and LLM models further to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and Llama3.1-8B-Instruct LLM performs best in the classification of app reviews. Then, using NLI+LLM, an additional 1,008 new privacy-related reviews were extracted that were not identified through the keyword-based approach in previous research, thus demonstrating the effectiveness of the proposed approach.
摘要:随着移动应用在我们日常生活中的普及,围绕伦理问题的关注度显著增加。用户通常在应用(app)评论中表达他们的反馈、报告问题并建议新功能,这些评论经常强调安全、隐私和责任问题。整合这些评论对于开发成功的产品至关重要。然而,与伦理问题相关的应用评论通常使用特定领域的语言,并采用更为多样的词汇表达,这使得自动提取与伦理问题相关的应用评论成为一个具有挑战性和耗时的任务。本研究提出了一种基于自然语言处理(Natural Language Processing, NLP)的新方法,该方法结合了自然语言推理(Natural Language Inference, NLI),以深入理解语言细微差别,以及一个仅解码器(类似于LLaMA)的大语言模型(Large Language Model, LLM),以大规模提取与伦理问题相关的应用评论。利用来自心理健康领域的43,647条应用评论,所提出的方法1)评估了四种NLI模型以提取潜在的隐私评论,并比较了特定领域隐私假设与通用隐私假设的结果;2)评估了四种LLM用于分类应用评论以识别隐私问题;3)使用最佳的NLI和LLM模型进一步从数据集中提取新的隐私评论。结果显示,DeBERTa-v3-base-mnli-fever-anli NLI模型在特定领域假设下表现最佳,而Llama3.1-8B-Instruct LLM在应用评论分类中表现最佳。随后,使用NLI+LLM方法,额外提取了1,008条新的隐私相关评论,这些评论在前人研究中基于关键词的方法未能识别,从而证明了所提出方法的有效性。
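
用 NLI 做隐私评论抽取,核心只是一次蕴含判断:把评论作为前提、领域特定的隐私假设句作为假设,取“蕴含”概率作为分数。以下草图中的模型名取自 Hugging Face 上公开的 DeBERTa-v3 NLI 权重,假设句为示意;标签顺序请以模型卡为准。

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli"
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name)

@torch.no_grad()
def privacy_score(review: str) -> float:
    hypothesis = "This review is about privacy of user data."  # 领域特定假设(示意)
    inputs = tok(review, hypothesis, return_tensors="pt", truncation=True)
    probs = nli(**inputs).logits.softmax(-1)[0]
    return probs[0].item()   # 该权重标签序为 [entailment, neutral, contradiction]

print(privacy_score("The app shares my location with third parties without asking."))
```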

[NLP-41] oward Optimal Search and Retrieval for RAG NEURIPS2024

【速读】: 该论文试图解决大语言模型(LLMs)在记忆相关挑战中的问题,特别是通过检索增强生成(RAG)方法来优化问答(QA)任务的性能。解决方案的关键在于理解检索器在RAG管道中的作用及其对下游任务性能的影响。论文通过实验揭示了检索与RAG性能之间的关系,并提出了一些优化策略,例如降低搜索准确性对RAG性能影响较小,但可能提高检索速度和内存效率,这对开发高性能RAG管道具有实际意义。

链接: https://arxiv.org/abs/2411.07396
作者: Alexandria Leto,Cecilia Aguerrebere,Ishwar Bhati,Ted Willke,Mariano Tepper,Vy Ai Vo
关键词-EN: Large Language Models, Language Models, Large Language, Retrieval-augmented generation, RAG
类目: Computation and Language (cs.CL)
备注: Accepted to NeurIPS 2024 Workshop ATTRIB

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is a promising method for addressing some of the memory-related challenges associated with Large Language Models (LLMs). Two separate systems form the RAG pipeline, the retriever and the reader, and the impact of each on downstream task performance is not well-understood. Here, we work towards the goal of understanding how retrievers can be optimized for RAG pipelines for common tasks such as Question Answering (QA). We conduct experiments focused on the relationship between retrieval and RAG performance on QA and attributed QA and unveil a number of insights useful to practitioners developing high-performance RAG pipelines. For example, lowering search accuracy has minor implications for RAG performance while potentially increasing retrieval speed and memory efficiency.
摘要:检索增强生成 (Retrieval-augmented Generation, RAG) 是一种有前景的方法,用于解决与大语言模型 (Large Language Models, LLMs) 相关的部分记忆挑战。RAG 管道由两个独立的系统组成,即检索器和阅读器,但每个系统对下游任务性能的影响尚未得到充分理解。在此,我们的目标是理解如何针对常见任务(如问答 (Question Answering, QA))优化检索器以提升 RAG 管道的性能。我们进行了实验,重点研究了检索与 RAG 在 QA 和属性化 QA 上的性能之间的关系,并揭示了一些对开发高性能 RAG 管道有用的见解。例如,降低搜索准确性对 RAG 性能的影响较小,同时可能提高检索速度和内存效率。

[NLP-42] Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

【速读】: 该论文试图解决端到端语音翻译(End-to-end Speech Translation, ST)中翻译长度与源语音时长不匹配的问题,特别是在考虑语音和停顿段落的等时性(isochrony)时。解决方案的关键在于改进序列到序列ST模型的时长对齐组件,通过在翻译过程中预测语音和停顿的时长,并将这些时间信息提供给解码器,使其在生成翻译时能够跟踪剩余的语音和停顿时长,从而实现对翻译长度的精确控制。

链接: https://arxiv.org/abs/2411.07387
作者: Midia Yousefi,Yao Qian,Junkun Chen,Gang Wang,Yanqing Liu,Dongmei Wang,Xiaofei Wang,Jian Xue
关键词-EN: garnered significant attention, target language text, translates source language, language speech directly, source language speech
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years. Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio, including both speech and pause segments. Previous methods often controlled the number of words or characters generated by the Machine Translation model to approximate the source sentence’s length without considering the isochrony of pauses and speech segments, as duration can vary between languages. To address this, we present improvements to the duration alignment component of our sequence-to-sequence ST model. Our method controls translation length by predicting the duration of speech and pauses in conjunction with the translation process. This is achieved by providing timing information to the decoder, ensuring it tracks the remaining duration for speech and pauses while generating the translation. The evaluation on the Zh-En test set of CoVoST 2, demonstrates that the proposed Isochrony-Controlled ST achieves 0.92 speech overlap and 8.9 BLEU, which has only a 1.4 BLEU drop compared to the ST baseline.
摘要:端到端语音翻译(ST)近年来引起了广泛关注,它能够直接将源语言的语音转换为目标语言的文本。许多ST应用需要严格的长度控制,以确保翻译的时长与源音频的时长相匹配,包括语音和停顿段。以往的方法通常通过控制机器翻译模型生成的单词或字符数量来近似源句子的长度,而未考虑停顿和语音段的时间同步性,因为不同语言的时长可能有所不同。为此,我们对序列到序列ST模型的时长对齐组件进行了改进。我们的方法通过在翻译过程中预测语音和停顿的时长来控制翻译长度。具体实现是通过向解码器提供时间信息,使其在生成翻译时跟踪剩余的语音和停顿时长。在CoVoST 2的Zh-En测试集上的评估结果显示,提出的时间同步控制ST方法实现了0.92的语音重叠率和8.9的BLEU分数,相较于ST基线仅下降了1.4个BLEU分数。

[NLP-43] BeeManc at the PLABA Track of TAC-2024: RoBERTa for task 1 and LLaMA3.1 and GPT-4o for task 2

【速读】: 该论文旨在解决生物医学摘要的平实语言改编问题,特别是针对PLABA 2024共享任务中的两个子任务。解决方案的关键在于:在任务一中,通过微调的RoBERTa-Base模型识别和分类生物医学摘要中的难懂术语、行话和缩略词,并报告了F1分数;在任务二中,利用Llama3.1-70B-Instruct和GPT-4o模型结合单样本(one-shot)提示完成摘要改编,并报告了BLEU、SARI、BERTScore、LENS和SALSA等指标的评分。尽管由于时间限制未完成替换任务,但微调的RoBERTa-Base模型在任务1A和1B中分别排名第3和第2,并在9个评估系统中平均F1分数排名第1。

链接: https://arxiv.org/abs/2411.07381
作者: Zhidong Ling,Zihao Li,Pablo Romeo,Lifeng Han,Goran Nenadic
关键词-EN: Plain Language Adaptation, task Plain Language, shared task Plain, Plain Language, Language Adaptation
类目: Computation and Language (cs.CL)
备注: ongoing work - system report

点击查看摘要

Abstract:This report is the system description of the BeeManc team for shared task Plain Language Adaptation of Biomedical Abstracts (PLABA) 2024. This report contains two sections corresponding to the two sub-tasks in PLABA 2024. In task one, we applied fine-tuned RoBERTa-Base models to identify and classify the difficult terms, jargon and acronyms in the biomedical abstracts and reported the F1 score. Due to time constraints, we didn’t finish the replacement task. In task two, we leveraged Llama3.1-70B-Instruct and GPT-4o with the one-shot prompts to complete the abstract adaptation and reported the scores in BLEU, SARI, BERTScore, LENS, and SALSA. From the official Evaluation from PLABA-2024 on Task 1A and 1B, our much smaller fine-tuned RoBERTa-Base model ranked 3rd and 2nd respectively on the two sub-tasks, and 1st on averaged F1 scores across the two tasks from 9 evaluated systems. We share our fine-tuned models and related resources at this https URL
摘要:本报告为 BeeManc 团队针对 2024 年共享任务“生物医学摘要的平实语言改编 (PLABA)”的系统描述。报告包含两个部分,分别对应 PLABA 2024 中的两个子任务。在任务一中,我们应用了微调后的 RoBERTa-Base 模型来识别和分类生物医学摘要中的难懂术语、专业术语和缩略词,并报告了 F1 分数。由于时间限制,我们未能完成替换任务。在任务二中,我们利用 Llama3.1-70B-Instruct 和 GPT-4o 结合单样本 (one-shot) 提示完成了摘要改编,并报告了 BLEU、SARI、BERTScore、LENS 和 SALSA 的分数。根据 PLABA-2024 对任务 1A 和 1B 的官方评估,我们微调后的 RoBERTa-Base 模型在两个子任务中分别排名第 3 和第 2,并且在 9 个评估系统中,平均 F1 分数排名第 1。我们在此分享微调模型及相关资源,链接为 this https URL。

[NLP-44] Multi-head Span-based Detector for AI-generated Fragments in Scientific Papers

【速读】: 该论文旨在解决在DAGPap24竞赛中区分AI生成的和人类撰写的科学文献片段的问题。解决方案的关键在于采用了一种多任务学习架构,该架构包含两个头部,能够处理连续数百个字符的类别跨度。通过使用不同的编码器变体来获取序列中每个token的状态向量,并调整片段分割成token的方式,进一步输入到基于transformer的编码器中,从而实现了相对于基线解决方案9%的质量提升(从0.86到0.95的平均宏F1-score),并在竞赛的封闭测试数据集上达到了0.96的分数。

链接: https://arxiv.org/abs/2411.07343
作者: German Gritsai,Ildar Khabutdinov,Andrey Grabovoy
关键词-EN: Scientific Document Processing, human-written scientific excerpts, Fourth Workshop, Document Processing, paper describes
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper describes a system designed to distinguish between AI-generated and human-written scientific excerpts in the DAGPap24 competition hosted within the Fourth Workshop on Scientific Document Processing. In this competition the task is to find artificially generated token-level text fragments in documents of a scientific domain. Our work focuses on the use of a multi-task learning architecture with two heads. The application of this approach is justified by the specificity of the task, where class spans are continuous over several hundred characters. We considered different encoder variations to obtain a state vector for each token in the sequence, as well as a variation in splitting fragments into tokens to further feed into the input of a transformer-based encoder. This approach allows us to achieve a 9% quality improvement relative to the baseline solution score on the development set (from 0.86 to 0.95) using the average macro F1-score, as well as a score of 0.96 on a closed test part of the dataset from the competition.
摘要:本文介绍了一个系统,该系统旨在区分由 AI 生成的和人类撰写的科学摘录,该系统在第四届科学文档处理研讨会内的 DAGPap24 竞赛中进行了测试。在此次竞赛中,任务是在科学领域的文档中识别出人工生成的 Token 级文本片段。我们的工作重点是采用一种多任务学习架构,该架构包含两个头部。这种方法的应用是基于任务的特殊性,即类别跨度在数百个字符上连续。我们考虑了不同的编码器变体,以获取序列中每个 Token 的状态向量,同时还考虑了将片段分割成 Token 的不同方式,以便进一步输入到基于 Transformer 的编码器中。这种方法使我们能够在开发集上相对于基线解决方案的得分实现了 9% 的质量提升(从 0.86 提升到 0.95),使用的是平均宏 F1 分数,同时在竞赛数据集的封闭测试部分获得了 0.96 的分数。
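
“双头 token 级检测器”的骨架可以用几十行 PyTorch 示意:共享编码器上接两个分类头并联合优化。两个头的具体分工(此处设为 token 真伪分类与片段边界预测)是笔者的示意性假设,摘要未给出细节。

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class TwoHeadSpanDetector(nn.Module):
    def __init__(self, backbone: str = "microsoft/deberta-v3-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        h = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(h, 2)   # 头1:token 为 AI/人类所写(示意)
        self.bnd_head = nn.Linear(h, 2)   # 头2:token 是否为片段边界(示意)

    def forward(self, input_ids, attention_mask, cls_labels=None, bnd_labels=None):
        hid = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        cls_logits, bnd_logits = self.cls_head(hid), self.bnd_head(hid)
        if cls_labels is None:
            return cls_logits, bnd_logits
        ce = nn.CrossEntropyLoss()
        return (ce(cls_logits.flatten(0, 1), cls_labels.flatten())
                + ce(bnd_logits.flatten(0, 1), bnd_labels.flatten()))
```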

[NLP-45] SetLexSem Challenge: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models NEURIPS2024

【速读】: 该论文试图解决的问题是评估大型语言模型(LLMs)在处理集合操作时的鲁棒性,特别是在集合成员的词汇和语义变化下的表现。解决方案的关键在于提出了SetLexSem挑战,这是一个合成基准测试,通过系统地采样集合成员在词汇和语义维度上的变化,来评估LLMs在集合操作中的指令遵循能力和鲁棒性。研究结果表明,LLMs在面对操作和操作数的变化时表现出较差的鲁棒性,并且在特定的“欺骗性”集合分组中显示出独特的失败模式。

链接: https://arxiv.org/abs/2411.07336
作者: Bardiya Akhbari,Manish Gawali,Nicholas A. Dronen
关键词-EN: theory is foundational, foundational to mathematics, set operations, Set, Set theory
类目: Computation and Language (cs.CL)
备注: 10 pages, 8 figures, NeurIPS 2024 Datasets and Benchmarks track

点击查看摘要

Abstract:Set theory is foundational to mathematics and, when sets are finite, to reasoning about the world. An intelligent system should perform set operations consistently, regardless of superficial variations in the operands. Initially designed for semantically-oriented NLP tasks, large language models (LLMs) are now being evaluated on algorithmic tasks. Because sets are comprised of arbitrary symbols (e.g. numbers, words), they provide an opportunity to test, systematically, the invariance of LLMs’ algorithmic abilities under simple lexical or semantic variations. To this end, we present the SetLexSem Challenge, a synthetic benchmark that evaluates the performance of LLMs on set operations. SetLexSem assesses the robustness of LLMs’ instruction-following abilities under various conditions, focusing on the set operations and the nature and construction of the set members. Evaluating seven LLMs with SetLexSem, we find that they exhibit poor robustness to variation in both operation and operands. We show – via the framework’s systematic sampling of set members along lexical and semantic dimensions – that LLMs are not only not robust to variation along these dimensions but demonstrate unique failure modes in particular, easy-to-create semantic groupings of “deceptive” sets. We find that rigorously measuring language model robustness to variation in frequency and length is challenging and present an analysis that measures them independently. The code for reproducing the results of this paper, and for generating the SetLexSem Challenge dataset, is available at this https URL.
摘要:集合论是数学的基础,并且在集合为有限时,也是推理世界的基础。一个智能系统应当能够一致地执行集合操作,无论操作数在表面上如何变化。最初为语义导向的自然语言处理(NLP)任务设计的大语言模型(LLMs),现在正被评估其在算法任务上的表现。由于集合由任意符号(如数字、词语)组成,它们提供了一个系统测试LLMs算法能力在简单词汇或语义变化下不变性的机会。为此,我们提出了SetLexSem挑战,这是一个评估LLMs在集合操作上表现的合成基准。SetLexSem评估了LLMs在各种条件下遵循指令的能力的鲁棒性,重点在于集合操作以及集合成员的性质和构造。通过对七个LLMs进行SetLexSem评估,我们发现它们在操作和操作数的变化下表现出较差的鲁棒性。我们通过框架对集合成员在词汇和语义维度上的系统采样,展示了LLMs不仅在这些维度上不具备鲁棒性,而且在特定易于创建的“欺骗性”集合的语义分组中表现出独特的失败模式。我们发现,严格测量语言模型对频率和长度变化的鲁棒性是具有挑战性的,并提出了一个独立测量它们的分析方法。本文结果的复现代码以及生成SetLexSem挑战数据集的代码可在 this https URL 获取。
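
SetLexSem 式的题目构造可以用一个小生成器示意:随机采样集合成员(沿“数字 vs 词”这一词汇维度变化),生成集合运算问题,并用 Python 内建集合运算得到标准答案来核对模型输出。词表与提示模板为示意。

```python
import random

OPS = {
    "union": lambda a, b: a | b,
    "intersection": lambda a, b: a & b,
    "difference": lambda a, b: a - b,
}
VOCAB = ["apple", "river", "piano", "tiger", "cloud", "stone", "lamp", "wheat"]

def make_item(op: str, k: int = 4, numeric: bool = False):
    pool = list(range(100)) if numeric else VOCAB   # 词汇维度:数字 vs 词
    a, b = set(random.sample(pool, k)), set(random.sample(pool, k))
    prompt = (f"Given A = {sorted(a)} and B = {sorted(b)}, "
              f"what is the {op} of A and B? Answer with a Python set.")
    return prompt, OPS[op](a, b)

prompt, gold = make_item("intersection")
print(prompt)
print("gold:", gold)   # 将模型输出解析为集合后与 gold 比对即可计分
```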

[NLP-46] Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations

【速读】: 该论文试图解决大型语言模型在处理地理知识相关任务时可能存在的偏见问题,特别是针对不同经济发展水平国家的地理知识编码差异及其对实际应用的影响。解决方案的关键在于通过两个常见场景(旅行推荐和地理锚定的故事生成)对四个流行的大型语言模型进行评估,分析其在处理来自不同经济水平国家的数据时的表现差异。研究发现,与较富裕国家相比,针对较贫穷国家的旅行推荐更缺乏独特性和地点参考,而生成的故事更多地传达了困难和悲伤的情感。

链接: https://arxiv.org/abs/2411.07320
作者: Kirti Bhagat,Kinshuk Vasisht,Danish Pruthi
关键词-EN: large language models, inspects language models, occupation and religion, language models, work inspects language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Submitted to ARR - October 2024

点击查看摘要

Abstract:While a large body of work inspects language models for biases concerning gender, race, occupation and religion, biases of geographical nature are relatively less explored. Some recent studies benchmark the degree to which large language models encode geospatial knowledge. However, the impact of the encoded geographical knowledge (or lack thereof) on real-world applications has not been documented. In this work, we examine large language models for two common scenarios that require geographical knowledge: (a) travel recommendations and (b) geo-anchored story generation. Specifically, we study four popular language models, and across about 100K travel requests, and 200K story generations, we observe that travel recommendations corresponding to poorer countries are less unique with fewer location references, and stories from these regions more often convey emotions of hardship and sadness compared to those from wealthier nations.
摘要:尽管大量研究探讨了大语言模型在性别、种族、职业和宗教方面的偏见,但地理性质的偏见相对较少被探索。一些近期研究评估了大语言模型编码地理空间知识的程度。然而,编码的地理知识(或缺乏)对实际应用的影响尚未被记录。在本研究中,我们考察了大语言模型在两种常见需要地理知识的场景中的表现:(a) 旅行推荐和 (b) 地理定位的故事生成。具体而言,我们研究了四个流行的大语言模型,并在约 100,000 个旅行请求和 200,000 个故事生成中观察到,与较富裕国家相比,针对较贫穷国家的旅行推荐较少独特且地点引用较少,而这些地区的故事更常传达出艰辛和悲伤的情感。

[NLP-47] he Surprising Effectiveness of Test-Time Training for Abstract Reasoning

【速读】: 该论文试图解决语言模型在面对需要复杂推理的新问题时表现不佳的问题。解决方案的关键在于引入测试时训练(Test-Time Training, TTT),即在推理过程中临时更新模型参数,使用基于输入数据的损失函数。通过系统实验,论文确定了成功实施TTT的三个关键组件:(1) 在类似任务上的初始微调;(2) 辅助任务格式和数据增强;(3) 针对每个实例的训练。TTT显著提升了模型在Abstraction and Reasoning Corpus (ARC)任务上的表现,相较于基础微调模型,准确率提高了多达6倍。在8B参数的语言模型上应用TTT,论文在ARC的公开验证集上达到了53%的准确率,将公开和纯神经方法的最新技术水平提高了近25%。通过将该方法与最近的程序生成方法集成,论文达到了61.9%的公开验证准确率,接近人类平均得分。研究结果表明,显式符号搜索并非提升神经语言模型抽象推理能力的唯一途径,测试时对少量样本的持续训练同样非常有效。

链接: https://arxiv.org/abs/2411.07279
作者: Ekin Akyürek,Mehul Damani,Linlu Qiu,Han Guo,Yoon Kim,Jacob Andreas
关键词-EN: problems requiring complex, requiring complex reasoning, shown impressive performance, shown impressive, problems requiring
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Language models have shown impressive performance on tasks within their training distribution, but often struggle with novel problems requiring complex reasoning. We investigate the effectiveness of test-time training (TTT) – updating model parameters temporarily during inference using a loss derived from input data – as a mechanism for improving models’ reasoning capabilities, using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through systematic experimentation, we identify three crucial components for successful TTT: (1) initial finetuning on similar tasks (2) auxiliary task format and augmentations (3) per-instance training. TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models; applying TTT to an 8B-parameter language model, we achieve 53% accuracy on the ARC’s public validation set, improving the state-of-the-art by nearly 25% for public and purely neural approaches. By ensembling our method with recent program generation approaches, we get SoTA public validation accuracy of 61.9%, matching the average human score. Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models; additional test-time compute applied to continued training on few-shot examples can also be extremely effective.
摘要:语言模型在其训练分布内的任务上表现出色,但在需要复杂推理的新问题上往往表现不佳。我们研究了测试时训练(Test-Time Training, TTT)——在推理过程中使用基于输入数据的损失临时更新模型参数——作为提升模型推理能力的机制,以抽象与推理语料库(Abstraction and Reasoning Corpus, ARC)作为基准。通过系统的实验,我们确定了三个关键的TTT成功要素:(1)在相似任务上的初始微调;(2)辅助任务格式与增强;(3)实例级别的训练。TTT显著提升了ARC任务的性能,相较于基础微调模型,准确率提升高达6倍;将TTT应用于一个80亿参数的大语言模型,我们在ARC的公开验证集上达到了53%的准确率,将公开和纯神经方法的最新技术水平提升了近25%。通过将我们的方法与最近的程序生成方法集成,我们获得了61.9%的公开验证准确率,达到了人类平均得分。我们的研究结果表明,显式的符号搜索并非提升神经语言模型抽象推理能力的唯一途径;在少样本示例上持续进行测试时训练同样极为有效。
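
为直观理解测试时训练(TTT)的核心流程,下面给出一个基于 PyTorch 的极简示意:推理前先在该实例的少样本示例上临时微调,预测后再恢复原始参数。其中的模型、数据与超参数均为假设,并非论文官方实现:

```python
import copy
import torch

def test_time_train_and_predict(model, loss_fn, few_shot_examples, query,
                                steps=10, lr=1e-4):
    """对单个实例做测试时训练(TTT)的极简示意:
    1) 备份参数;2) 在该实例的少样本示例上临时微调;
    3) 预测;4) 恢复参数,保证逐实例独立训练。"""
    original_state = copy.deepcopy(model.state_dict())  # 备份原始参数
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    model.train()
    for _ in range(steps):
        for x, y in few_shot_examples:  # 论文中还会对示例做数据增强
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

    model.eval()
    with torch.no_grad():
        prediction = model(query)

    model.load_state_dict(original_state)  # 恢复,避免影响后续实例
    return prediction
```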

[NLP-48] Target-driven Attack for Large Language Models

【速读】: 该论文试图解决当前大型语言模型(LLM)在面对用户通过界面注入的对抗性文本或指令时,可能导致的模型安全挑战,特别是模型无法给出正确答案的问题。解决方案的关键在于提出了目标驱动的黑盒攻击方法,通过最大化干净文本与攻击文本之间的条件概率的KL散度(KL divergence)来重新定义攻击目标。该方法将距离最大化问题转化为两个凸优化问题,分别用于求解攻击文本和估计协方差,并通过投影梯度下降算法求解对应的攻击文本向量。论文提出的目标驱动黑盒攻击方法包括两种攻击策略:token操纵和错误信息攻击,并在多个大型语言模型和数据集上验证了其有效性。

链接: https://arxiv.org/abs/2411.07268
作者: Chong Zhang,Mingyu Jin,Dong Shu,Taowen Wang,Dongfang Liu,Xiaobo Jin
关键词-EN: Current large language, natural language tasks, large language models, user-oriented natural language, large-scale user-oriented natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:2404.07234

点击查看摘要

Abstract:Current large language models (LLM) provide a strong foundation for large-scale user-oriented natural language tasks. Many users can easily inject adversarial text or instructions through the user interface, thus causing LLM model security challenges like the language model not giving the correct answer. Although there is currently a large amount of research on black-box attacks, most of these black-box attacks use random and heuristic strategies. It is unclear how these strategies relate to the success rate of attacks and thus effectively improve model robustness. To solve this problem, we propose our target-driven black-box attack method to maximize the KL divergence between the conditional probabilities of the clean text and the attack text to redefine the attack’s goal. We transform the distance maximization problem into two convex optimization problems based on the attack goal to solve the attack text and estimate the covariance. Furthermore, the projected gradient descent algorithm solves the vector corresponding to the attack text. Our target-driven black-box attack approach includes two attack strategies: token manipulation and misinformation attack. Experimental results on multiple Large Language Models and datasets demonstrate the effectiveness of our attack method.
摘要:当前的大语言模型(Large Language Model, LLM)为面向用户的大规模自然语言任务提供了坚实的基础。然而,许多用户可以通过用户界面轻松注入对抗性文本或指令,从而引发LLM模型的安全挑战,例如语言模型无法给出正确答案。尽管目前已有大量关于黑盒攻击的研究,但大多数黑盒攻击采用随机和启发式策略。这些策略与攻击成功率之间的关系尚不明确,因此难以有效提升模型的鲁棒性。为解决这一问题,我们提出了一种目标驱动的黑盒攻击方法,旨在最大化干净文本与攻击文本的条件概率之间的KL散度,从而重新定义攻击目标。我们将距离最大化问题转化为基于攻击目标的两个凸优化问题,分别用于求解攻击文本和估计协方差。此外,通过投影梯度下降算法求解与攻击文本对应的向量。我们的目标驱动黑盒攻击方法包括两种攻击策略:Token操作和错误信息攻击。在多个大语言模型和数据集上的实验结果表明,我们的攻击方法具有显著的有效性。
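
为说明“最大化干净文本与攻击文本条件分布之间的 KL 散度”这一攻击目标,下面给出一个假设性的单步投影梯度上升示意。代码假设模型为 HuggingFace 风格接口(接受 input_ids 或 inputs_embeds 并返回 logits),投影方式也为简化假设,并非论文的完整凸优化求解流程:

```python
import torch
import torch.nn.functional as F

def kl_attack_step(model, clean_ids, adv_embeds, epsilon=1.0, lr=0.1):
    """目标驱动攻击目标的极简示意(假设性实现,非论文代码):
    最大化干净输入与攻击输入的下一词条件分布之间的 KL 散度,
    并把攻击嵌入投影回半径为 epsilon 的球内(投影梯度上升)。"""
    with torch.no_grad():
        p_clean = F.softmax(model(input_ids=clean_ids).logits[:, -1], dim=-1)

    adv_embeds = adv_embeds.clone().requires_grad_(True)
    logits_adv = model(inputs_embeds=adv_embeds).logits[:, -1]
    log_q = F.log_softmax(logits_adv, dim=-1)

    # KL(p_clean || q_adv),对攻击嵌入做梯度上升以最大化散度
    kl = F.kl_div(log_q, p_clean, reduction="batchmean")
    kl.backward()

    with torch.no_grad():
        adv_embeds = adv_embeds + lr * adv_embeds.grad  # 上升一步
        norm = adv_embeds.norm()
        if norm > epsilon:                               # 投影回可行域
            adv_embeds = adv_embeds * (epsilon / norm)
    return adv_embeds.detach(), kl.item()
```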

[NLP-49] Multi-Document Financial Question Answering using LLMs

【速读】: 该论文试图解决多文档金融问答中的复杂问题,特别是那些答案不明显且难以回答的问题。解决方案的关键在于提出了两种新方法:第一种是基于语义标签和索引查询的RAG_SEM方法,第二种是基于知识图谱(Knowledge Graph, KG_RAG)的方法,该方法通过语义标签从图数据库中检索知识图谱的三元组作为上下文。KG_RAG方法利用了通过知识蒸馏(knowledge distillation)微调的小模型构建的知识图谱,这些图谱由大型教师模型指导生成。这两种方法在多个评估指标上显著优于传统的RAG方法,其中KG_RAG在九个评估指标中的四个上表现更优。

链接: https://arxiv.org/abs/2411.07264
作者: Shalin Shah,Srikanth Ryali,Ramasubbu Venkatesh
关键词-EN: financial question answering, multi-document financial question, RAG, multi-document financial, semantic tagging
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose two new methods for multi-document financial question answering. First, a method that uses semantic tagging, and then, queries the index to get the context (RAG_SEM). And second, a Knowledge Graph (KG_RAG) based method that uses semantic tagging, and, retrieves knowledge graph triples from a graph database, as context. KG_RAG uses knowledge graphs constructed using a small model that is fine-tuned using knowledge distillation using a large teacher model. The data consists of 18 10K reports of Apple, Microsoft, Alphabet, NVIDIA, Amazon and Tesla for the years 2021, 2022 and 2023. The list of questions in the data consists of 111 complex questions including many esoteric questions that are difficult to answer and the answers are not completely obvious. As evaluation metrics, we use overall scores as well as segmented scores for measurement including the faithfulness, relevance, correctness, similarity, an LLM based overall score and the rouge scores as well as a similarity of embeddings. We find that both methods outperform plain RAG significantly. KG_RAG outperforms RAG_SEM in four out of nine metrics.
摘要:我们提出了两种新的多文档金融问答方法。第一种方法使用语义标签,然后查询索引以获取上下文(RAG_SEM)。第二种是基于知识图谱(KG_RAG)的方法,该方法同样使用语义标签,并从图数据库中检索知识图谱三元组作为上下文。KG_RAG 所用的知识图谱由一个小模型构建,该小模型借助大教师模型通过知识蒸馏进行微调。数据包括苹果、微软、Alphabet、NVIDIA、亚马逊和特斯拉在 2021、2022 和 2023 年的 18 份 10-K 报告。数据中的问题列表包含 111 个复杂问题,其中包括许多难以回答且答案不完全明显的冷僻问题。作为评估指标,我们使用总体分数以及分段分数,包括忠实度、相关性、正确性、相似性、基于大语言模型的总体分数、ROUGE 分数和嵌入相似度。我们发现,这两种方法都显著优于普通的 RAG;在九项指标中,KG_RAG 有四项优于 RAG_SEM。
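
下面用一个极简示意说明 KG_RAG 中“按语义标签检索三元组作为上下文”的思路。其中三元组的数据结构与打分方式均为本文假设,仅作演示:

```python
import numpy as np

def retrieve_kg_context(question_vec, triples, tags, top_k=5):
    """知识图谱三元组检索的极简示意(假设性实现):
    先按语义标签过滤三元组,再按问题向量与三元组向量的
    余弦相似度排序,取前 top_k 条拼成提示词上下文。
    triples: [(head, relation, tail, tag, embedding), ...]"""
    candidates = [t for t in triples if t[3] in tags]  # 语义标签过滤

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(candidates,
                    key=lambda t: cosine(question_vec, t[4]), reverse=True)
    # 把三元组拼成可直接放入提示词的上下文字符串
    return "\n".join(f"({h}, {r}, {t})" for h, r, t, _, _ in ranked[:top_k])
```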

人工智能

[AI-0] Scaling Properties of Diffusion Models for Perceptual Tasks

链接: https://arxiv.org/abs/2411.08034
作者: Rahul Ravishankar,Zeeshan Patel,Jathushan Rajasegaran,Jitendra Malik
关键词-EN: visual perception tasks, perception tasks, argue that iterative, iterative computation, offers a powerful
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perception tasks. Through a careful analysis of these scaling behaviors, we present various techniques to efficiently train diffusion models for visual perception tasks. Our models achieve improved or comparable performance to state-of-the-art methods using significantly less data and compute. To use our code and models, see this https URL .

[AI-1] GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation

链接: https://arxiv.org/abs/2411.08033
作者: Yushi Lan,Shangchen Zhou,Zhaoyang Lyu,Fangzhou Hong,Shuai Yang,Bo Dai,Xingang Pan,Chen Change Loy
关键词-EN: latent space design, latent space, Cloud-structured Latent space, Point Cloud-structured Latent, advanced significantly
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注: project page: this https URL

点击查看摘要

Abstract:While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.

[AI-2] Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data

链接: https://arxiv.org/abs/2411.08028
作者: Juanhui Li,Sreyashi Nag,Hui Liu,Xianfeng Tang,Sheikh Sarwar,Limeng Cui,Hansu Gu,Suhang Wang,Qi He,Jiliang Tang
关键词-EN: real-world NLP applications, Large Language Models, offer promising solutions, promising solutions due, Large Language
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. However, the large size and high computation demands of LLMs limit their practicality in many applications, especially when further fine-tuning is required. To address these limitations, smaller models are typically preferred for deployment. However, their training is hindered by the scarcity of labeled data. In contrast, unlabeled data is often readily available and can be leveraged by using LLMs to generate pseudo-labels for training smaller models. This enables the smaller models (student) to acquire knowledge from LLMs (teacher) while reducing computational costs. This process introduces challenges, such as potential noisy pseudo-labels. Selecting high-quality and informative data is therefore critical to enhance model performance while improving the efficiency of data utilization. To address this, we propose LLKD that enables Learning with Less computational resources and less data for Knowledge Distillation from LLMs. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Specifically, it prioritizes samples where the teacher demonstrates high confidence in its labeling, indicating reliable labels, and where the student exhibits a high information need, identifying challenging samples that require further learning. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
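
下面给出 LLKD 式自适应样本选择的一个极简示意:教师置信度高(伪标签可靠)且学生预测熵高(信息需求大)的样本优先入选。评分方式(两信号相乘)为本文假设,并非论文的精确公式:

```python
import numpy as np

def select_samples(teacher_probs, student_probs, k):
    """自适应样本选择的极简示意(评分方式为本文假设):
    teacher_probs / student_probs 为 [样本数, 类别数] 的 softmax 输出。"""
    teacher_conf = teacher_probs.max(axis=1)                        # 教师置信度
    student_entropy = -(student_probs *
                        np.log(student_probs + 1e-12)).sum(axis=1)  # 学生不确定性
    score = teacher_conf * student_entropy                          # 合并两种信号
    return np.argsort(-score)[:k]                                   # 得分最高的 k 个索引

# 用法示意:选出最值得用于蒸馏的样本
teacher = np.array([[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]])
student = np.array([[0.5, 0.5], [0.6, 0.4], [0.9, 0.1]])
print(select_samples(teacher, student, k=1))
```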

[AI-3] LLM Phy: Complex Physical Reasoning Using Large Language Models and World Models

链接: https://arxiv.org/abs/2411.08027
作者: Anoop Cherian,Radu Corcodel,Siddarth Jain,Diego Romeres
关键词-EN: important skill needed, physical reasoning task, Physical reasoning, complex physical reasoning, reasoning task
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Physical reasoning is an important skill needed for robotic agents when operating in the real world. However, solving such reasoning problems often involves hypothesizing and reflecting over complex multi-body interactions under the effect of a multitude of physical forces and thus learning all such interactions poses a significant hurdle for state-of-the-art machine learning frameworks, including large language models (LLMs). To study this problem, we propose a new physical reasoning task and a dataset, dubbed TraySim. Our task involves predicting the dynamics of several objects on a tray that is given an external impact – the domino effect of the ensued object interactions and their dynamics thus offering a challenging yet controlled setup, with the goal of reasoning being to infer the stability of the objects after the impact. To solve this complex physical reasoning task, we present LLMPhy, a zero-shot black-box optimization framework that leverages the physics knowledge and program synthesis abilities of LLMs, and synergizes these abilities with the world models built into modern physics engines. Specifically, LLMPhy uses an LLM to generate code to iteratively estimate the physical hyperparameters of the system (friction, damping, layout, etc.) via an implicit analysis-by-synthesis approach using a (non-differentiable) simulator in the loop and uses the inferred parameters to imagine the dynamics of the scene towards solving the reasoning task. To show the effectiveness of LLMPhy, we present experiments on our TraySim dataset to predict the steady-state poses of the objects. Our results show that the combination of the LLM and the physics engine leads to state-of-the-art zero-shot physical reasoning performance, while demonstrating superior convergence against standard black-box optimization methods and better estimation of the physical parameters.

[AI-4] Leonardo vindicated: Pythagorean trees for minimal reconstruction of the natural branching structures

链接: https://arxiv.org/abs/2411.08024
作者: Dymitr Ruta,Corrado Mio,Ernesto Damiani
关键词-EN: engineering masterpieces optimal, fractal trees, natural tree branching, Pythagorean-like fractal trees, Vinci tree branching
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 22 pages, lots of hi res figures I had to reduce quality of, submitting as a requirement to the Theory of Computing Journal

点击查看摘要

Abstract:Trees continue to fascinate with their natural beauty and as engineering masterpieces optimal with respect to several independent criteria. Pythagorean tree is a well-known fractal design that realistically mimics the natural tree branching structures. We study various types of Pythagorean-like fractal trees with different shapes of the base, branching angles and relaxed scales in an attempt to identify and explain which variants are the closest match to the branching structures commonly observed in the natural world. Pursuing simultaneously the realism and minimalism of the fractal tree model, we have developed a flexibly parameterised and fast algorithm to grow and visually examine deep Pythagorean-inspired fractal trees with the capability to orderly over- or underestimate the Leonardo da Vinci’s tree branching rule as well as control various imbalances and branching angles. We tested the realism of the generated fractal tree images by means of the classification accuracy of detecting natural tree with the transfer-trained deep Convolutional Neural Networks (CNNs). Having empirically established the parameters of the fractal trees that maximize the CNN’s natural tree class classification accuracy we have translated them back to the scales and angles of branches and came to the interesting conclusions that support the da Vinci branching rule and golden ratio based scaling for both the shape of the branch and imbalance between the child branches, and claim the flexibly parameterized fractal trees can be used to generate artificial examples to train robust detectors of different species of trees.
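
下面用一段简短的 Python 递归示意毕达哥拉斯式分形树的生成:每层按比例缩短枝条、按固定角度左右分叉。其中比例、角度与深度均为示例参数,并非论文筛选出的最优参数:

```python
import math

def grow_tree(x, y, angle, length, scale=0.8, branch_angle=35.0, depth=8):
    """毕达哥拉斯式分形树的极简示意:返回所有枝条线段的端点坐标。
    可在左右子枝处引入不平衡比例,以逼近达·芬奇分枝规则。"""
    if depth == 0 or length < 1e-3:
        return []
    x2 = x + length * math.cos(math.radians(angle))
    y2 = y + length * math.sin(math.radians(angle))
    segment = [((x, y), (x2, y2))]
    left = grow_tree(x2, y2, angle + branch_angle, length * scale,
                     scale, branch_angle, depth - 1)
    right = grow_tree(x2, y2, angle - branch_angle, length * scale,
                      scale, branch_angle, depth - 1)
    return segment + left + right

segments = grow_tree(0.0, 0.0, 90.0, 10.0)
print(len(segments))  # 深度为 8 时共 2^8 - 1 = 255 条线段
```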

[AI-5] Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings

链接: https://arxiv.org/abs/2411.08017
作者: Aditya Sanghi,Aliasghar Khani,Pradyumna Reddy,Arianna Rampini,Derek Cheung,Kamal Rahimi Malekshan,Kanika Madan,Hooman Shayani
关键词-EN: capturing fine details, models require substantial, require substantial computational, substantial computational resources, Wavelet Latent Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high resolutions. We attribute this limitation to the inefficiency of current representations, which lack the compactness required to model the generative models effectively. To address this, we introduce a novel approach called Wavelet Latent Diffusion, or WaLa, that encodes 3D shapes into wavelet-based, compact latent encodings. Specifically, we compress a 256^3 signed distance field into a 12^3 × 4 latent grid, achieving an impressive 2427x compression ratio with minimal loss of detail. This high level of compression allows our method to efficiently train large-scale generative networks without increasing the inference time. Our models, both conditional and unconditional, contain approximately one billion parameters and successfully generate high-quality 3D shapes at 256^3 resolution. Moreover, WaLa offers rapid inference, producing shapes within two to four seconds depending on the condition, despite the model’s scale. We demonstrate state-of-the-art performance across multiple datasets, with significant improvements in generation quality, diversity, and computational efficiency. We open-source our code and, to the best of our knowledge, release the largest pretrained 3D generative models across different modalities.

[AI-6] Investigating the Effectiveness of Explainability Methods in Parkinsons Detection from Speech

链接: https://arxiv.org/abs/2411.08013
作者: Eleonora Mancini,Francesco Paissan,Paolo Torroni,Cem Subakan,Mirco Ravanelli
关键词-EN: significant early indicators, Parkinson disease, impairments in Parkinson, provide significant early, significant early
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: The first two authors contributed equally to this research: author order is alphabetical

点击查看摘要

Abstract:Speech impairments in Parkinson’s disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.

[AI-7] Gini Coefficient as a Unified Metric for Evaluating Many-versus-Many Similarity in Vector Spaces

链接: https://arxiv.org/abs/2411.07983
作者: Ben Fauber
关键词-EN: Gini coefficients, Gini, metrics to evaluate, similarity in vector, vector spaces
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We demonstrate that Gini coefficients can be used as unified metrics to evaluate many-versus-many (all-to-all) similarity in vector spaces. Our analysis of various image datasets shows that images with the highest Gini coefficients tend to be the most similar to one another, while images with the lowest Gini coefficients are the least similar. We also show that this relationship holds true for vectorized text embeddings from various corpuses, highlighting the consistency of our method and its broad applicability across different types of data. Additionally, we demonstrate that selecting machine learning training samples that closely match the distribution of the testing dataset is far more important than ensuring data diversity. Selection of exemplary and iconic training samples with higher Gini coefficients leads to significantly better model performance compared to simply having a diverse training set with lower Gini coefficients. Thus, Gini coefficients can serve as effective criteria for selecting machine learning training samples, with our selection method outperforming random sampling methods in very sparse information settings.
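
Gini 系数本身的计算是标准的;下面的示意先计算向量两两的余弦相似度,再对每个向量的相似度分布求 Gini,用于多对多相似度评估。把相似度平移到非负区间等细节为本文假设:

```python
import numpy as np

def gini(values):
    """对非负数值计算 Gini 系数(标准的排序加权求和公式)。"""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    if v.sum() == 0:
        return 0.0
    index = np.arange(1, n + 1)
    return float((2 * index - n - 1).dot(v) / (n * v.sum()))

def many_vs_many_gini(X):
    """多对多(all-to-all)相似度评估的示意(具体用法为本文假设):
    对每个向量与其余向量的余弦相似度分布求 Gini 系数。"""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                       # 余弦相似度矩阵
    sim = (sim + 1) / 2                   # 平移到 [0, 1],保证非负
    return np.array([gini(np.delete(sim[i], i)) for i in range(len(X))])

X = np.random.default_rng(0).normal(size=(5, 16))
print(many_vs_many_gini(X).round(3))
```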

[AI-8] Exact Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization NEURIPS2024

链接: https://arxiv.org/abs/2411.07979
作者: Davide Buffelli,Jamie McGowan,Wangkun Xu,Alexandru Cioba,Da-shan Shiu,Guillaume Hennequin,Alberto Bernacchia
关键词-EN: http URL, yielding faster progress, deep neural networks, deep learning applications, shown to accelerate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Second-order optimization has been shown to accelerate the training of deep neural networks in many applications, often yielding faster progress per iteration on the training loss compared to first-order methods. However, the generalization properties of second-order methods are still being debated. Theoretical investigations have proved difficult to carry out outside the tractable settings of heavily simplified model classes – thus, the relevance of existing theories to practical deep learning applications remains unclear. Similarly, empirical studies in large-scale models and real datasets are significantly confounded by the necessity to approximate second-order updates in practice. It is often unclear whether the observed generalization behaviour arises specifically from the second-order nature of the parameter updates, or instead reflects the specific structured (e.g. Kronecker) approximations used or any damping-based interpolation towards first-order updates. Here, we show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep reversible architectures that are sufficiently expressive to be meaningfully applied to common benchmark datasets. We exploit this novel setting to study the training and generalization properties of the GN optimizer. We find that exact GN generalizes poorly. In the mini-batch training setting, this manifests as rapidly saturating progress even on the training loss, with parameter updates found to overfit each mini-batch without producing the features that would support generalization to other mini-batches. We show that our experiments run in the “lazy” regime, in which the neural tangent kernel (NTK) changes very little during the course of training. This behaviour is associated with having no significant changes in neural representations, explaining the lack of generalization.

[AI-9] How To Discover Short Shorter and the Shortest Proofs of Unsatisfiability: A Branch-and-Bound Approach for Resolution Proof Length Minimization

链接: https://arxiv.org/abs/2411.07955
作者: Konstantin Sidorov,Koos van der Linden,Gonçalo Homem de Almeida Correia,Mathijs de Weerdt,Emir Demirović
关键词-EN: automated reasoning toolkit, propositional satisfiability problems, powerful automated reasoning, Modern software, modern SAT solvers
类目: Artificial Intelligence (cs.AI)
*备注: 42 pages, 16 figures, 8 tables, submitted to Journal of Artificial Intelligence Research

点击查看摘要

Abstract:Modern software for propositional satisfiability problems gives a powerful automated reasoning toolkit, capable of outputting not only a satisfiable/unsatisfiable signal but also a justification of unsatisfiability in the form of resolution proof (or a more expressive proof), which is commonly used for verification purposes. Empirically, modern SAT solvers produce relatively short proofs, however, there are no inherent guarantees that these proofs cannot be significantly reduced. This paper proposes a novel branch-and-bound algorithm for finding the shortest resolution proofs; to this end, we introduce a layer list representation of proofs that groups clauses by their level of indirection. As we show, this representation breaks all permutational symmetries, thereby improving upon the state-of-the-art symmetry-breaking and informing the design of a novel workflow for proof minimization. In addition to that, we design pruning procedures that reason on proof length lower bound, clause subsumption, and dominance. Our experiments suggest that the proofs from state-of-the-art solvers could be shortened by 30-60% on the instances from SAT Competition 2002 and by 25-50% on small synthetic formulas. When treated as an algorithm for finding the shortest proof, our approach solves twice as many instances as the previous work based on SAT solving and reduces the time to optimality by orders of magnitude for the instances solved by both approaches.

[AI-10] Towards Low-bit Communication for Tensor Parallel LLM Inference

链接: https://arxiv.org/abs/2411.07942
作者: Harry Dong,Tyler Johnson,Minsik Cho,Emad Soroush
关键词-EN: large language model, additional communication cost, increase server large, server large language, communication cost
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be distributed across more devices, magnifying the communication cost. One way to approach this problem is with quantization, but current methods for LLMs tend to avoid quantizing the features that tensor parallelism needs to communicate. Taking advantage of consistent outliers in communicated features, we introduce a quantization method that reduces communicated values on average from 16 bits to 4.2 bits while preserving nearly all of the original performance. For instance, our method maintains around 98.0% and 99.5% of Gemma 2 27B’s and Llama 2 13B’s original performance, respectively, averaged across all tasks we evaluated on.
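
下面给出“保留离群值 + 低比特量化其余数值”这一思路的极简示意。离群值比例、对称均匀量化等具体选择为本文假设,并非论文方法的精确实现:

```python
import torch

def quantize_with_outliers(x, bits=4, outlier_ratio=0.01):
    """通信特征低比特量化的极简示意(流程与阈值为本文假设):
    绝对值最大的少量离群值保留 16 位原值,其余做对称均匀 4 位量化。"""
    flat = x.flatten()
    k = max(1, int(outlier_ratio * flat.numel()))
    outlier_idx = flat.abs().topk(k).indices          # 定位离群值
    outlier_val = flat[outlier_idx].clone()           # 离群值原样保留

    body = flat.clone()
    body[outlier_idx] = 0
    qmax = 2 ** (bits - 1) - 1                        # 4 位对称量化:[-7, 7]
    scale = body.abs().max() / qmax if body.abs().max() > 0 else 1.0
    q = torch.clamp((body / scale).round(), -qmax, qmax)

    deq = q * scale                                   # 反量化
    deq[outlier_idx] = outlier_val                    # 放回离群值
    return deq.view_as(x)

x = torch.randn(4, 8) * torch.tensor([1.0] * 7 + [20.0])  # 人为制造离群通道
print((quantize_with_outliers(x) - x).abs().max())         # 重建误差很小
```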

[AI-11] Automatic dataset shift identification to support root cause analysis of AI performance drift

链接: https://arxiv.org/abs/2411.07940
作者: Mélanie Roschewitz,Raghav Mehta,Charles Jones,Ben Glocker
关键词-EN: shift, substantially harm, harm the performance, performance of clinical, Shifts
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Code available at this https URL

点击查看摘要

Abstract:Shifts in data distribution can substantially harm the performance of clinical AI models. Hence, various methods have been developed to detect the presence of such shifts at deployment time. However, root causes of dataset shifts are varied, and the choice of shift mitigation strategies is highly dependent on the precise type of shift encountered at test time. As such, detecting test-time dataset shift is not sufficient: precisely identifying which type of shift has occurred is critical. In this work, we propose the first unsupervised dataset shift identification framework, effectively distinguishing between prevalence shift (caused by a change in the label distribution), covariate shift (caused by a change in input characteristics) and mixed shifts (simultaneous prevalence and covariate shifts). We discuss the importance of self-supervised encoders for detecting subtle covariate shifts and propose a novel shift detector leveraging both self-supervised encoders and task model outputs for improved shift detection. We report promising results for the proposed shift identification framework across three different imaging modalities (chest radiography, digital mammography, and retinal fundus images) on five types of real-world dataset shifts, using four large publicly available datasets.
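
下面用两个经典统计检验粗略示意“区分患病率偏移与协变量偏移”的思路:预测标签分布用卡方检验,编码器特征分布用逐维 KS 检验。检验方式与阈值均为本文假设,并非论文框架本身:

```python
import numpy as np
from scipy import stats

def identify_shift(ref_feats, test_feats, ref_preds, test_preds, alpha=0.01):
    """偏移类型识别的极简示意(假设性实现):
    标签分布与特征分布同时显著变化则判为混合偏移。"""
    # 患病率偏移:对预测类别计数做卡方检验
    classes = np.union1d(ref_preds, test_preds)
    ref_counts = np.array([(ref_preds == c).sum() for c in classes])
    test_counts = np.array([(test_preds == c).sum() for c in classes])
    expected = ref_counts / ref_counts.sum() * test_counts.sum()
    p_label = stats.chisquare(test_counts, expected).pvalue

    # 协变量偏移:逐维 KS 检验后取最小 p 值(Bonferroni 校正)
    p_feat = min(stats.ks_2samp(ref_feats[:, d], test_feats[:, d]).pvalue
                 for d in range(ref_feats.shape[1])) * ref_feats.shape[1]

    prevalence, covariate = p_label < alpha, p_feat < alpha
    if prevalence and covariate:
        return "mixed shift"
    if prevalence:
        return "prevalence shift"
    if covariate:
        return "covariate shift"
    return "no shift detected"
```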

[AI-12] Doubly Mild Generalization for Offline Reinforcement Learning NEURIPS2024

链接: https://arxiv.org/abs/2411.07934
作者: Yixiu Mao,Qi Wang,Yun Qu,Yuhang Jiang,Xiangyang Ji
关键词-EN: Offline Reinforcement Learning, Offline Reinforcement, Reinforcement Learning, generalization, extrapolation error
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to NeurIPS 2024. arXiv admin note: substantial text overlap with arXiv:2410.19400

点击查看摘要

Abstract:Offline Reinforcement Learning (RL) suffers from the extrapolation error and value overestimation. From a generalization perspective, this issue can be attributed to the over-generalization of value functions or policies towards out-of-distribution (OOD) actions. Significant efforts have been devoted to mitigating such generalization, and recent in-sample learning approaches have further succeeded in entirely eschewing it. Nevertheless, we show that mild generalization beyond the dataset can be trusted and leveraged to improve performance under certain conditions. To appropriately exploit generalization in offline RL, we propose Doubly Mild Generalization (DMG), comprising (i) mild action generalization and (ii) mild generalization propagation. The former refers to selecting actions in a close neighborhood of the dataset to maximize the Q values. Even so, the potential erroneous generalization can still be propagated, accumulated, and exacerbated by bootstrapping. In light of this, the latter concept is introduced to mitigate the generalization propagation without impeding the propagation of RL learning signals. Theoretically, DMG guarantees better performance than the in-sample optimal policy in the oracle generalization scenario. Even under worst-case generalization, DMG can still control value overestimation at a certain level and lower bound the performance. Empirically, DMG achieves state-of-the-art performance across Gym-MuJoCo locomotion tasks and challenging AntMaze tasks. Moreover, benefiting from its flexibility in both generalization aspects, DMG enjoys a seamless transition from offline to online learning and attains strong online fine-tuning performance.
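
下面给出 DMG 中“温和动作泛化”的极简示意:只在数据集动作的小邻域内采样候选,再从中选取 Q 值最大者,避免向远离数据分布的动作过度泛化。采样方式与参数为本文假设:

```python
import numpy as np

def mild_action_generalization(q_func, state, dataset_actions, sigma=0.05, n=32):
    """温和动作泛化的极简示意(假设性实现):
    以数据集动作为中心加小幅高斯扰动生成候选,取 Q 值最大者。"""
    rng = np.random.default_rng(0)
    base = dataset_actions[rng.integers(0, len(dataset_actions), size=n)]
    candidates = base + rng.normal(scale=sigma, size=base.shape)
    q_values = np.array([q_func(state, a) for a in candidates])
    return candidates[int(np.argmax(q_values))]

# 用法示意:q_func 可以是任意 (state, action) -> 标量 的评分函数
dataset_actions = np.random.default_rng(1).uniform(-1, 1, size=(100, 2))
best = mild_action_generalization(lambda s, a: -np.sum(a ** 2), state=None,
                                  dataset_actions=dataset_actions)
print(best)
```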

[AI-13] INTRABENCH: Interactive Radiological Benchmark

链接: https://arxiv.org/abs/2411.07885
作者: Constantin Ulrich,Tassilo Wald,Emily Tempus,Maximilian Rokuss,Paul F. Jaeger,Klaus Maier-Hein
关键词-EN: META Segment, achieved notable advancements, Current interactive segmentation, real clinical scenarios, success of META
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Undergoing Peer-Review

点击查看摘要

Abstract:Current interactive segmentation approaches, inspired by the success of META’s Segment Anything model, have achieved notable advancements, however, they come with substantial limitations that hinder their practical application in real clinical scenarios. These include unrealistic human interaction requirements, such as slice-by-slice operations for 2D models on 3D data, a lack of iterative refinement, and insufficient evaluation experiments. These shortcomings prevent accurate assessment of model performance and lead to inconsistent outcomes across studies. IntRaBench overcomes these challenges by offering a comprehensive and reproducible framework for evaluating interactive segmentation methods in realistic, clinically relevant scenarios. It includes diverse datasets, target structures, and segmentation models, and provides a flexible codebase that allows seamless integration of new models and prompting strategies. Additionally, we introduce advanced techniques to minimize clinician interaction, ensuring fair comparisons between 2D and 3D models. By open-sourcing IntRaBench, we invite the research community to integrate their models and prompting techniques, ensuring continuous and transparent evaluation of interactive segmentation models in 3D medical imaging.

[AI-14] Diverse capability and scaling of diffusion and auto-regressive models when learning abstract rules NEURIPS2024

链接: https://arxiv.org/abs/2411.07873
作者: Binxu Wang,Jiaqi Shang,Haim Sompolinsky
关键词-EN: discovering regular structures, applying inferred rules, Raven Progressive Matrices, Humans excel, models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注: 12 pages, 5 figures. Accepted to NeurIPS2024 Workshop on System 2 Reasoning At Scale as long paper

点击查看摘要

Abstract:Humans excel at discovering regular structures from limited samples and applying inferred rules to novel settings. We investigate whether modern generative models can similarly learn underlying rules from finite samples and perform reasoning through conditional sampling. Inspired by Raven’s Progressive Matrices task, we designed GenRAVEN dataset, where each sample consists of three rows, and one of 40 relational rules governing the object position, number, or attributes applies to all rows. We trained generative models to learn the data distribution, where samples are encoded as integer arrays to focus on rule learning. We compared two generative model families: diffusion (EDM, DiT, SiT) and autoregressive models (GPT2, Mamba). We evaluated their ability to generate structurally consistent samples and perform panel completion via unconditional and conditional sampling. We found diffusion models excel at unconditional generation, producing more novel and consistent samples from scratch and memorizing less, but performing less well in panel completion, even with advanced conditional sampling methods. Conversely, autoregressive models excel at completing missing panels in a rule-consistent manner but generate less consistent samples unconditionally. We observe diverse data scaling behaviors: for both model families, rule learning emerges at a certain dataset size - around 1000s examples per rule. With more training data, diffusion models improve both their unconditional and conditional generation capabilities. However, for autoregressive models, while panel completion improves with more training data, unconditional generation consistency declines. Our findings highlight complementary capabilities and limitations of diffusion and autoregressive models in rule learning and reasoning tasks, suggesting avenues for further research into their mechanisms and potential for human-like reasoning.

[AI-15] Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimers Disease

链接: https://arxiv.org/abs/2411.07871
作者: Francesco Chiumento,Mingming Liu
关键词-EN: Large Language Models, Large Language, advancements in Large, X-rays are paired, shown great potential
类目: Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: The paper has been accepted by the conference: “2024 International Conference on Big Data (IEEE Big Data 2024)”

点击查看摘要

Abstract:The rapid advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown great potential in medical diagnostics, particularly in radiology, where datasets such as X-rays are paired with human-generated diagnostic reports. However, a significant research gap exists in the neuroimaging field, especially for conditions such as Alzheimer’s disease, due to the lack of comprehensive diagnostic reports that can be utilized for model fine-tuning. This paper addresses this gap by generating synthetic diagnostic reports using GPT-4o-mini on structured data from the OASIS-4 dataset, which comprises 663 patients. Using the synthetic reports as ground truth for training and validation, we then generated neurological reports directly from the images in the dataset leveraging the pre-trained BiomedCLIP and T5 models. Our proposed method achieved a BLEU-4 score of 0.1827, ROUGE-L score of 0.3719, and METEOR score of 0.4163, revealing its potential in generating clinically relevant and accurate diagnostic reports.

[AI-16] Federated Learning for Discrete Optimal Transport with Large Population under Incomplete Information

链接: https://arxiv.org/abs/2411.07841
作者: Navpreet Kaur,Juntao Chen,Yingdong Lu
关键词-EN: Optimal transport, powerful framework, optimal transport framework, Optimal, heterogeneous target populations
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Optimal transport is a powerful framework for the efficient allocation of resources between sources and targets. However, traditional models often struggle to scale effectively in the presence of large and heterogeneous populations. In this work, we introduce a discrete optimal transport framework designed to handle large-scale, heterogeneous target populations, characterized by type distributions. We address two scenarios: one where the type distribution of targets is known, and one where it is unknown. For the known distribution, we propose a fully distributed algorithm to achieve optimal resource allocation. In the case of unknown distribution, we develop a federated learning-based approach that enables efficient computation of the optimal transport scheme while preserving privacy. Case studies are provided to evaluate the performance of our learning algorithm.
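
作为背景,下面给出离散最优传输的一个极简求解示意,采用熵正则化的 Sinkhorn 迭代(这是常用的近似解法,并非论文中的分布式或联邦算法):

```python
import numpy as np

def sinkhorn(cost, source_mass, target_mass, reg=0.05, iters=500):
    """熵正则化离散最优传输的极简示意:
    交替归一化行与列,直到近似满足两侧边际约束。"""
    K = np.exp(-cost / reg)                 # Gibbs 核
    u = np.ones_like(source_mass)
    for _ in range(iters):
        v = target_mass / (K.T @ u)         # 满足列边际
        u = source_mass / (K @ v)           # 满足行边际
    return u[:, None] * K * v[None, :]      # 传输方案矩阵

cost = np.array([[0.0, 1.0], [1.0, 0.0]])
plan = sinkhorn(cost, np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(plan.round(3))  # 近似对角:资源主要沿低成本边分配
```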

[AI-17] Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices

链接: https://arxiv.org/abs/2411.07826
作者: Kilian Pfeiffer,Mohamed Aboelenien Ahmed,Ramin Khalili,Jörg Henkel
关键词-EN: Large Language Models, Floating Point Operations, machine learning tasks, Transformer structures, Large Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:In recent years, Large Language Models (LLMs) through Transformer structures have dominated many machine learning tasks, especially text processing. However, these models require massive amounts of data for training and induce high resource requirements, particularly in terms of the large number of Floating Point Operations (FLOPs) and the high amounts of memory needed. To fine-tune such a model in a parameter-efficient way, techniques like Adapter or LoRA have been developed. However, we observe that the application of LoRA, when used in federated learning (FL), while still being parameter-efficient, is memory and FLOP inefficient. Based on that observation, we develop a novel layer finetuning scheme that allows devices in cross-device FL to make use of pretrained neural networks (NNs) while adhering to given resource constraints. We show that our presented scheme outperforms the current state of the art when dealing with homogeneous or heterogeneous computation and memory constraints and is on par with LoRA regarding limited communication, thereby achieving significantly higher accuracies in FL training.

[AI-18] Community Research Earth Digital Intelligence Twin (CREDIT)

链接: https://arxiv.org/abs/2411.07814
作者: John Schreck,Yingkai Sha,William Chapman,Dhamma Kimpara,Judith Berner,Seth McGinnis,Arnold Kazadi,Negin Sobhani,Ben Kirk,David John Gagne II
关键词-EN: numerical weather prediction, Recent advancements, transformed atmospheric modeling, significantly transformed atmospheric, NWP
类目: Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Recent advancements in artificial intelligence (AI) for numerical weather prediction (NWP) have significantly transformed atmospheric modeling. AI NWP models outperform traditional physics-based systems, such as the Integrated Forecast System (IFS), across several global metrics while requiring fewer computational resources. However, existing AI NWP models face limitations related to training datasets and timestep choices, often resulting in artifacts that reduce model performance. To address these challenges, we introduce the Community Research Earth Digital Intelligence Twin (CREDIT) framework, developed at NSF NCAR. CREDIT provides a flexible, scalable, and user-friendly platform for training and deploying AI-based atmospheric models on high-performance computing systems. It offers an end-to-end pipeline for data preprocessing, model training, and evaluation, democratizing access to advanced AI NWP capabilities. We demonstrate CREDIT’s potential through WXFormer, a novel deterministic vision transformer designed to predict atmospheric states autoregressively, addressing common AI NWP issues like compounding error growth with techniques such as spectral normalization, padding, and multi-step training. Additionally, to illustrate CREDIT’s flexibility and state-of-the-art model comparisons, we train the FUXI architecture within this framework. Our findings show that both FUXI and WXFormer, trained on six-hourly ERA5 hybrid sigma-pressure levels, generally outperform IFS HRES in 10-day forecasts, offering potential improvements in efficiency and forecast accuracy. CREDIT’s modular design enables researchers to explore various models, datasets, and training configurations, fostering innovation within the scientific community.

[AI-19] PatchCTG: Patch Cardiotocography Transformer for Antepartum Fetal Health Monitoring

链接: https://arxiv.org/abs/2411.07796
作者: M. Jaleed Khan,Manu Vatish,Gabriel Davis Jones
关键词-EN: high inter-observer variability, Antepartum Cardiotocography, inter-observer variability, leading to inconsistent, fetal health monitoring
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Antepartum Cardiotocography (CTG) is vital for fetal health monitoring, but traditional methods like the Dawes-Redman system are often limited by high inter-observer variability, leading to inconsistent interpretations and potential misdiagnoses. This paper introduces PatchCTG, a transformer-based model specifically designed for CTG analysis, employing patch-based tokenisation, instance normalisation and channel-independent processing to capture essential local and global temporal dependencies within CTG signals. PatchCTG was evaluated on the Oxford Maternity (OXMAT) dataset, comprising over 20,000 CTG traces across diverse clinical outcomes after applying the inclusion and exclusion criteria. With extensive hyperparameter optimisation, PatchCTG achieved an AUC of 77%, with specificity of 88% and sensitivity of 57% at Youden’s index threshold, demonstrating adaptability to various clinical needs. Testing across varying temporal thresholds showed robust predictive performance, particularly with finetuning on data closer to delivery, achieving a sensitivity of 52% and specificity of 88% for near-delivery cases. These findings suggest the potential of PatchCTG to enhance clinical decision-making in antepartum care by providing a reliable, objective tool for fetal health assessment. The source code is available at this https URL.

[AI-20] InvisMark: Invisible and Robust Watermarking for AI-generated Image Provenance

链接: https://arxiv.org/abs/2411.07795
作者: Rui Xu,Mengya (Mia) Hu,Deren Lei,Yaxi Li,David Lowe,Alex Gorevski,Mingyu Wang,Emily Ching,Alex Deng
关键词-EN: content authentication methods, authentication methods, Abstract, robust content authentication, AI-generated
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The proliferation of AI-generated images has intensified the need for robust content authentication methods. We present InvisMark, a novel watermarking technique designed for high-resolution AI-generated images. Our approach leverages advanced neural network architectures and training strategies to embed imperceptible yet highly robust watermarks. InvisMark achieves state-of-the-art performance in imperceptibility (PSNR ≈51, SSIM ≈0.998) while maintaining over 97% bit accuracy across various image manipulations. Notably, we demonstrate the successful encoding of 256-bit watermarks, significantly expanding payload capacity while preserving image quality. This enables the embedding of UUIDs with error correction codes, achieving near-perfect decoding success rates even under challenging image distortions. We also address potential vulnerabilities against advanced attacks and propose mitigation strategies. By combining high imperceptibility, extended payload capacity, and resilience to manipulations, InvisMark provides a robust foundation for ensuring media provenance in an era of increasingly sophisticated AI-generated content. Source code of this paper is available at: this https URL.

[AI-21] Feature Fusion Transferability Aware Transformer for Unsupervised Domain Adaptation WACV

链接: https://arxiv.org/abs/2411.07794
作者: Xiaowei Yu,Zhe Huang,Zao Zhang
关键词-EN: Unsupervised domain adaptation, labeled source domains, unlabeled target domains, Unsupervised domain, Convolutional Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) aims to leverage the knowledge learned from labeled source domains to improve performance on the unlabeled target domains. While Convolutional Neural Networks (CNNs) have been dominant in previous UDA methods, recent research has shown promise in applying Vision Transformers (ViTs) to this task. In this study, we propose a novel Feature Fusion Transferability Aware Transformer (FFTAT) to enhance ViT performance in UDA tasks. Our method introduces two key innovations: First, we introduce a patch discriminator to evaluate the transferability of patches, generating a transferability matrix. We integrate this matrix into self-attention, directing the model to focus on transferable patches. Second, we propose a feature fusion technique to fuse embeddings in the latent space, enabling each embedding to incorporate information from all others, thereby improving generalization. These two components work in synergy to enhance feature representation learning. Extensive experiments on widely used benchmarks demonstrate that our method significantly improves UDA performance, achieving state-of-the-art (SOTA) results.

[AI-22] RedCode: Risky Code Execution and Generation Benchmark for Code Agents NEURIPS2024

链接: https://arxiv.org/abs/2411.07781
作者: Chengquan Guo,Xun Liu,Chulin Xie,Andy Zhou,Yi Zeng,Zinan Lin,Dawn Song,Bo Li
关键词-EN: rapidly increasing capabilities, code, risky code execution, risky, risky code
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents’ ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents’ vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are available at this https URL.

[AI-23] ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization

链接: https://arxiv.org/abs/2411.07762
作者: Weibo Zhao,Yubin Shi,Xinyu Lyu,Wanchen Sui,Shen Li,Yong Li
关键词-EN: poses significant challenges, large language model, achieving effective low-bit, pivotal technique, technique for large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Quantization stands as a pivotal technique for large language model (LLM) serving, yet it poses significant challenges particularly in achieving effective low-bit quantization. The limited numerical mapping makes the quantized model produce a non-trivial error, bringing about intolerable performance degradation. This paper is anchored in the basic idea of model compression objectives, and delves into the layer-wise error distribution of LLMs during post-training quantization. Subsequently, we introduce ASER, an algorithm consisting of (1) Error Reconstruction: low-rank compensation for quantization error with LoRA-style matrices constructed by whitening SVD; (2) Activation Smoothing: outlier extraction to gain smooth activation and better error compensation. ASER is capable of quantizing typical LLMs to low-bit ones, particularly preserving accuracy even in W4A8 per-channel setup. Experimental results show that ASER is competitive among the state-of-the-art quantization algorithms, showing potential to activation quantization, with minor overhead.
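
下面示意 ASER 中“误差重建”的核心思路:对量化误差做截断 SVD,得到 LoRA 风格的低秩补偿因子。此处省略了白化与激活平滑,量化函数也只是演示用的粗糙实现,均为本文假设:

```python
import numpy as np

def lowrank_error_reconstruction(W, quantize, rank=16):
    """量化误差低秩补偿的极简示意(不含白化,假设性实现):
    推理时以 quantize(W) + A @ B 近似原权重。"""
    W_q = quantize(W)
    E = W - W_q                              # 量化误差
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]               # 低秩因子 A
    B = Vt[:rank, :]                         # 低秩因子 B
    return W_q, A, B

def fake_int4(W):                            # 粗糙的对称 4 位量化,仅作演示
    scale = np.abs(W).max() / 7
    return np.clip(np.round(W / scale), -7, 7) * scale

W = np.random.default_rng(0).normal(size=(64, 64))
W_q, A, B = lowrank_error_reconstruction(W, fake_int4)
# 加入低秩补偿后,权重重建误差应显著降低
print(np.linalg.norm(W - W_q), np.linalg.norm(W - (W_q + A @ B)))
```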

[AI-24] Navigation with QPHIL: Quantizing Planner for Hierarchical Implicit Q-Learning

链接: https://arxiv.org/abs/2411.07760
作者: Alexi Canesse,Mathieu Petitbois,Ludovic Denoyer,Sylvain Lamprier,Rémy Portelas
关键词-EN: Offline Reinforcement Learning, Reinforcement Learning, imitation learning, Offline Reinforcement, powerful alternative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Under review. Code will be released upon acceptance

点击查看摘要

Abstract:Offline Reinforcement Learning (RL) has emerged as a powerful alternative to imitation learning for behavior modeling in various domains, particularly in complex navigation tasks. An existing challenge with Offline RL is the signal-to-noise ratio, i.e. how to mitigate incorrect policy updates due to errors in value estimates. Towards this, multiple works have demonstrated the advantage of hierarchical offline RL methods, which decouples high-level path planning from low-level path following. In this work, we present a novel hierarchical transformer-based approach leveraging a learned quantizer of the space. This quantization enables the training of a simpler zone-conditioned low-level policy and simplifies planning, which is reduced to discrete autoregressive prediction. Among other benefits, zone-level reasoning in planning enables explicit trajectory stitching rather than implicit stitching based on noisy value function estimates. By combining this transformer-based planner with recent advancements in offline RL, our proposed approach achieves state-of-the-art results in complex long-distance navigation environments.

[AI-25] Optimizing Traffic Signal Control using High-Dimensional State Representation and Efficient Deep Reinforcement Learning

链接: https://arxiv.org/abs/2411.07759
作者: Lawrence Francis,Blessed Guda,Ahmed Biyabani
关键词-EN: traffic signal control, traffic signal, signal control, dimensional state representations, signal timing
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注: Under Review

点击查看摘要

Abstract:In reinforcement learning-based (RL-based) traffic signal control (TSC), decisions on the signal timing are made based on the available information on vehicles at a road intersection. This forms the state representation for the RL environment which can either be high-dimensional containing several variables or a low-dimensional vector. Current studies suggest that using high dimensional state representations does not lead to improved performance on TSC. However, we argue, with experimental results, that the use of high dimensional state representations can, in fact, lead to improved TSC performance with improvements up to 17.9% of the average waiting time. This high-dimensional representation is obtainable using the cost-effective vehicle-to-infrastructure (V2I) communication, encouraging its adoption for TSC. Additionally, given the large size of the state, we identified the need to have computationally efficient models and explored model compression via pruning.

[AI-26] SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model

链接: https://arxiv.org/abs/2411.07751
作者: Xinyuan Qian,Jiaran Gao,Yaodan Zhang,Qiquan Zhang,Hexin Liu,Leibny Paola Garcia,Haizhou Li
关键词-EN: bring substantial advantages, Speech enhancement plays, substantial advantages, plays an essential, essential role
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Whereas contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e. SAV-SE. To our best knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which eventually improves the speech enhancement performance. Specifically, we propose the VC-S^2E method, which incorporates the Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on public MUSIC, AVSpeech and AudioSet datasets, where the results demonstrate the superiority of VC-S^2E over other competitive methods. We will make the source code publicly available. Project demo page: this https URL

[AI-27] Unlocking Legal Knowledge with Multi-Layered Embedding-Based Retrieval

链接: https://arxiv.org/abs/2411.07739
作者: João Alberto de Oliveira Lima
关键词-EN: multi-layered embedding-based retrieval, embedding-based retrieval method, Retrieval Augmented Generation, work addresses, addresses the challenge
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 27 pages, 10 figures

点击查看摘要

Abstract:This work addresses the challenge of capturing the complexities of legal knowledge by proposing a multi-layered embedding-based retrieval method for legal and legislative texts. Creating embeddings not only for individual articles but also for their components (paragraphs, clauses) and structural groupings (books, titles, chapters, etc), we seek to capture the subtleties of legal information through the use of dense vectors of embeddings, representing it at varying levels of granularity. Our method meets various information needs by allowing the Retrieval Augmented Generation system to provide accurate responses, whether for specific segments or entire sections, tailored to the user’s query. We explore the concepts of aboutness, semantic chunking, and inherent hierarchy within legal texts, arguing that this method enhances the legal information retrieval. Despite the focus being on Brazil’s legislative methods and the Brazilian Constitution, which follow a civil law tradition, our findings should in principle be applicable across different legal systems, including those adhering to common law traditions. Furthermore, the principles of the proposed method extend beyond the legal domain, offering valuable insights for organizing and retrieving information in any field characterized by information encoded in hierarchical text.
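
下面给出多层级嵌入检索的极简示意:同一部法律在“编/章/条/款”等不同粒度各有一条向量,检索时跨粒度统一排序,使系统既能返回具体条款,也能返回整章内容。数据结构与打分方式为本文假设:

```python
import numpy as np

def multi_layer_search(query_vec, units, top_k=3):
    """多粒度法律文本检索的极简示意(假设性实现):
    units: [{"level": "article" | "chapter" | ..., "text": str, "vec": ndarray}]
    跨粒度按余弦相似度统一排序,返回最相关的 top_k 个文本单元。"""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = sorted(units, key=lambda u: cosine(query_vec, u["vec"]),
                    reverse=True)
    return [(u["level"], u["text"]) for u in scored[:top_k]]
```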

[AI-28] No-Reference Point Cloud Quality Assessment via Graph Convolutional Network

链接: https://arxiv.org/abs/2411.07728
作者: Wu Chen,Qiuping Jiang,Wei Zhou,Feng Shao,Guangtao Zhai,Weisi Lin
关键词-EN: visual media format, emerging visual media, realistic visual information, media format, emerging visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: Accepted by IEEE Transactions on Multimedia

点击查看摘要

Abstract:Three-dimensional (3D) point cloud, as an emerging visual media format, is increasingly favored by consumers as it can provide more realistic visual information than two-dimensional (2D) data. Similar to 2D plane images and videos, point clouds inevitably suffer from quality degradation and information loss through multimedia communication systems. Therefore, automatic point cloud quality assessment (PCQA) is of critical importance. In this work, we propose a novel no-reference PCQA method by using a graph convolutional network (GCN) to characterize the mutual dependencies of multi-view 2D projected image contents. The proposed GCN-based PCQA (GC-PCQA) method contains three modules, i.e., multi-view projection, graph construction, and GCN-based quality prediction. First, multi-view projection is performed on the test point cloud to obtain a set of horizontally and vertically projected images. Then, a perception-consistent graph is constructed based on the spatial relations among different projected images. Finally, reasoning on the constructed graph is performed by GCN to characterize the mutual dependencies and interactions between different projected images, and aggregate feature information of multi-view projected images for final quality prediction. Experimental results on two publicly available benchmark databases show that our proposed GC-PCQA can achieve superior performance than state-of-the-art quality assessment metrics. The code will be available at: this https URL.

[AI-29] Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding

链接: https://arxiv.org/abs/2411.07722
作者: Zirui Shao,Chuwei Luo,Zhaoqing Zhu,Hangdi Xing,Zhi Yu,Qi Zheng,Jiajun Bu
关键词-EN: shown impressive capabilities, rapidly growing research, growing research area, significant industrial demand, large language models
类目: Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand in recent years. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it “sees” and what it “understands.” Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (CP) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 68.6% CP consistency. To mitigate the CP knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. This method first ensures task-specific consistency and then connects the cognitive and perceptual knowledge. Our method significantly reduces CP knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks in most scenarios.

[AI-30] Training Data for Large Language Model

链接: https://arxiv.org/abs/2411.07715
作者: Yiming Ju,Huanhuan Ma
关键词-EN: gained widespread attention, models gained widespread, language models gained, widespread attention, gained widespread
类目: Artificial Intelligence (cs.AI)
*备注: in Chinese language

点击查看摘要

Abstract:In 2022, with the release of ChatGPT, large-scale language models gained widespread attention. ChatGPT not only surpassed previous models in terms of parameters and the scale of its pretraining corpus but also achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data. This progress has led enterprises and research institutions to recognize that building smarter and more powerful models relies on rich and high-quality datasets. Consequently, the construction and optimization of datasets have become a critical focus in the field of artificial intelligence. This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models, covering aspects such as data scale, collection methods, data types and characteristics, processing workflows, and provides an overview of available open-source datasets.

[AI-31] New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook

链接: https://arxiv.org/abs/2411.07691
作者: Meng Yang,Tianqing Zhu,Chi Liu,WanLei Zhou,Shui Yu,Philip S. Yu
关键词-EN: neural language processing, pre-trained models, achieve outstanding performance, build pre-trained models, computer vision
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Thanks to the explosive growth of data and the development of computational resources, it is possible to build pre-trained models that can achieve outstanding performance on various tasks, such as natural language processing, computer vision, and more. Despite their powerful capabilities, pre-trained models have also sparked attention to the emerging security challenges associated with their real-world applications. Security and privacy issues, such as leaking privacy information and generating harmful responses, have seriously undermined users’ confidence in these powerful models. Concerns are growing as model performance improves dramatically. Researchers are eager to explore the unique security and privacy issues that have emerged, their distinguishing factors, and how to defend against them. However, the current literature lacks a clear taxonomy of emerging attacks and defenses for pre-trained models, which hinders a high-level and comprehensive understanding of these questions. To fill the gap, we conduct a systematic survey on the security risks of pre-trained models, proposing a taxonomy of attack and defense methods based on the accessibility of pre-trained models’ input and weights in various security test scenarios. This taxonomy categorizes attacks and defenses into No-Change, Input-Change, and Model-Change approaches. With the taxonomy analysis, we capture the unique security and privacy issues of pre-trained models, categorizing and summarizing existing security issues based on their characteristics. In addition, we offer a timely and comprehensive review of each category’s strengths and limitations. Our survey concludes by highlighting potential new research opportunities in the security and privacy of pre-trained models.

[AI-32] World Models: The Safety Perspective

链接: https://arxiv.org/abs/2411.07690
作者: Zifan Zeng,Chongzhe Zhang,Feng Liu,Joseph Sifakis,Qunli Zhang,Shiming Liu,Peng Wang
关键词-EN: Large Language Model, Language Model, World Models, Large Language, concept of World
类目: Artificial Intelligence (cs.AI)
*备注: 8 pages, 3 figures, accepted at the International Workshop on Dependability Modeling and Design (WDMD) during the IEEE International Symposium on Software Reliability Engineering (ISSRE)

点击查看摘要

Abstract:With the proliferation of the Large Language Model (LLM), the concept of World Models (WM) has recently attracted a great deal of attention in the AI research community, especially in the context of AI agents. It is arguably evolving into an essential foundation for building AI agent systems. A WM is intended to help the agent predict the future evolution of environmental states or help the agent fill in missing information so that it can plan its actions and behave safely. The safety property of WM plays a key role in their effective use in critical applications. In this work, we review and analyze the impacts of the current state-of-the-art in WM technology from the point of view of trustworthiness and safety based on a comprehensive survey and the fields of application envisaged. We provide an in-depth analysis of state-of-the-art WMs and derive technical research challenges and their impact in order to call on the research community to collaborate on improving the safety and trustworthiness of WM.

[AI-33] Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG

链接: https://arxiv.org/abs/2411.07688
作者: Zilun Zhang,Haozhan Shen,Tiancheng Zhao,Yuhao Wang,Bin Chen,Yuxiang Cai,Yongheng Shang,Jianwei Yin
关键词-EN: Ultra High Resolution, Large Language Models, Multimodal Large Language, Sensing Multimodal Large, Ultra High
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Ultra High Resolution (UHR) remote sensing imagery (RSI) (e.g. 100,000 × 100,000 pixels or more) poses a significant challenge for current Remote Sensing Multimodal Large Language Models (RSMLLMs). If one chooses to resize the UHR image to the standard input image size, the extensive spatial and contextual information that UHR images contain will be neglected. Otherwise, the original size of these images often exceeds the token limits of standard RSMLLMs, making it difficult to process the entire image and capture long-range dependencies to answer the query based on the abundant visual context. In this paper, we introduce ImageRAG for RS, a training-free framework to address the complexities of analyzing UHR remote sensing imagery. By transforming the UHR remote sensing image analysis task into an image long-context selection task, we design an innovative image contextual retrieval mechanism based on the Retrieval-Augmented Generation (RAG) technique, denoted as ImageRAG. ImageRAG’s core innovation lies in its ability to selectively retrieve and focus on the most relevant portions of the UHR image as visual contexts that pertain to a given query. A fast path and a slow path are proposed in this framework to handle this task efficiently and effectively. ImageRAG allows RSMLLMs to manage extensive context and spatial information from UHR RSI, ensuring the analysis is both accurate and efficient.
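ImageRAG 的关键一步是把超高分辨率影像的分析转化为“图块级长上下文选择”:切块、嵌入、按查询检索最相关的图块。以下为假设性最小示意,embed() 用可复现的随机投影占位(真实场景应使用 CLIP 一类的图文联合嵌入),快/慢路径等论文细节此处从略:

```python
import numpy as np

def embed(x: bytes) -> np.ndarray:
    # 假设的图文联合嵌入(如 CLIP 风格),此处用可复现的随机投影代替
    rng = np.random.default_rng(abs(hash(x)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

def tiles(image: np.ndarray, size: int):
    # 将超高分辨率影像切成 size x size 的图块
    h, w = image.shape
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            yield (y, x), image[y:y + size, x:x + size]

image = np.random.default_rng(0).integers(0, 255, (4096, 4096), dtype=np.uint8)
patches = list(tiles(image, 1024))
keys = np.stack([embed(p.tobytes()) for _, p in patches])

query = embed("机场跑道".encode())         # 文本查询与图块共享嵌入空间(假设)
scores = keys @ query
topk = np.argsort(-scores)[:4]             # 只把最相关的 4 个图块送入 MLLM
print([patches[i][0] for i in topk])
```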

[AI-34] Data-Driven Graph Switching for Cyber-Resilient Control in Microgrids

链接: https://arxiv.org/abs/2411.07686
作者: Suman Rath,Subham Sahoo
关键词-EN: secondary control, secondary control objectives, Distributed microgrids, Artificial Neural Network, secondary control layer
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注: Accepted in IEEE Design Methodologies Conference (DMC) 2024

点击查看摘要

Abstract:Distributed microgrids are conventionally dependent on communication networks to achieve secondary control objectives. This dependence makes them vulnerable to stealth data integrity attacks (DIAs) where adversaries may perform manipulations via infected transmitters and repeaters to jeopardize stability. This paper presents a physics-guided, supervised Artificial Neural Network (ANN)-based framework that identifies communication-level cyberattacks in microgrids by analyzing whether incoming measurements will cause abnormal behavior of the secondary control layer. If abnormalities are detected, the framework iterates through possible spanning tree graph topologies that can be used to fulfill secondary control objectives. Then, a communication network topology that would not create secondary control abnormalities is identified and enforced for maximum stability. By altering the communication graph topology, the framework eliminates the dependence of the secondary control layer on inputs from compromised cyber devices, helping it achieve resilience without instability. Several case studies are provided showcasing the robustness of the framework against False Data Injections and repeater-level Man-in-the-Middle attacks. To understand practical feasibility, robustness is also verified against larger microgrid sizes and in the presence of varying noise levels. Our findings indicate that performance can be affected when attempting scalability in the presence of noise. However, the framework operates robustly in low-noise settings.

[AI-35] Fast Disentangled Slim Tensor Learning for Multi-view Clustering

链接: https://arxiv.org/abs/2411.07685
作者: Deng Xu,Chao Zhang,Zechao Li,Chunlin Chen,Huaxiong Li
关键词-EN: recently received significant, received significant attention, significant attention due, cross-view high-order correlations, recently received
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages,6 figures, will be published to IEEE TMM

点击查看摘要

Abstract:Tensor-based multi-view clustering has recently received significant attention due to its exceptional ability to explore cross-view high-order correlations. However, most existing methods still encounter some limitations. (1) Most of them explore the correlations among different affinity matrices, making them unscalable to large-scale data. (2) Although some methods address it by introducing bipartite graphs, they may result in sub-optimal solutions caused by an unstable anchor selection process. (3) They generally ignore the negative impact of latent semantic-unrelated information in each view. To tackle these issues, we propose a new approach termed fast Disentangled Slim Tensor Learning (DSTL) for multi-view clustering. Instead of focusing on the multi-view graph structures, DSTL directly explores the high-order correlations among multi-view latent semantic representations based on matrix factorization. To alleviate the negative influence of feature redundancy, inspired by robust PCA, DSTL disentangles the latent low-dimensional representation into a semantic-unrelated part and a semantic-related part for each view. Subsequently, two slim tensors are constructed with tensor-based regularization. To further enhance the quality of feature disentanglement, the semantic-related representations are aligned across views through a consensus alignment indicator. Our proposed model is computationally efficient and can be solved effectively. Extensive experiments demonstrate the superiority and efficiency of DSTL over state-of-the-art approaches. The code of DSTL is available at this https URL.

[AI-36] Spike Talk in Power Electronic Grids – Leveraging Post Moore’s Computing Laws

链接: https://arxiv.org/abs/2411.07654
作者: Yubo Song,Subham Sahoo
关键词-EN: Emerging distributed generation, generation demands highly, demands highly reliable, distributed generation demands, resilient coordinating control
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
*备注: The manuscript has been accepted for publication in the Proceedings of 2024 IEEE Design Methodologies for Power Electronics Conference (DMC2024)

点击查看摘要

Abstract:Emerging distributed generation demands highly reliable and resilient coordinating control in microgrids. To improve on these aspects, a spiking neural network is leveraged as a grid-edge intelligence tool to establish a talkative infrastructure, Spike Talk, expediting coordination in next-generation microgrids without any need for communication. This paper unravels the physics behind Spike Talk from the perspective of its distributed infrastructure, which aims to address the Von Neumann Bottleneck. Relying on inferring information via power flows in tie lines, Spike Talk enables adaptive and flexible control and coordination, and features synaptic plasticity that facilitates online and local training. Preliminary case studies are demonstrated with results, while more extensive validations are left as future work.

[AI-37] Understanding Audiovisual Deepfake Detection: Techniques, Challenges, Human Factors and Perceptual Insights

链接: https://arxiv.org/abs/2411.07650
作者: Ammarah Hashmi,Sahibzada Adil Shahzad,Chia-Wen Lin,Yu Tsao,Hsin-Min Wang
关键词-EN: Deep Learning, diverse fields, successfully applied, applied in diverse, deepfake detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Deep Learning has been successfully applied in diverse fields, and its impact on deepfake detection is no exception. Deepfakes are fake yet realistic synthetic content that can be used deceitfully for political impersonation, phishing, slandering, or spreading misinformation. Despite extensive research on unimodal deepfake detection, identifying complex deepfakes through joint analysis of audio and visual streams remains relatively unexplored. To fill this gap, this survey first provides an overview of audiovisual deepfake generation techniques, applications, and their consequences, and then provides a comprehensive review of state-of-the-art methods that combine audio and visual modalities to enhance detection accuracy, summarizing and critically analyzing their strengths and limitations. Furthermore, we discuss existing open source datasets for a deeper understanding, which can contribute to the research community and provide necessary information to beginners who want to analyze deep learning-based audiovisual methods for video forensics. By bridging the gap between unimodal and multimodal approaches, this paper aims to improve the effectiveness of deepfake detection strategies and guide future research in cybersecurity and media integrity.

[AI-38] Exploring Multi-Agent Reinforcement Learning for Unrelated Parallel Machine Scheduling

链接: https://arxiv.org/abs/2411.07634
作者: Maria Zampella,Urtzi Otamendi,Xabier Belaunzaran,Arkaitz Artetxe,Igor G. Olaizola,Giuseppe Longo,Basilio Sierra
关键词-EN: problems pose significant, Scheduling problems pose, Machine Scheduling Problem, Unrelated Parallel Machine, Parallel Machine Scheduling
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
*备注: 11 pages, 5 figures, 4 tables, article submitted to a journal

点击查看摘要

Abstract:Scheduling problems pose significant challenges in resource, industry, and operational management. This paper addresses the Unrelated Parallel Machine Scheduling Problem (UPMS) with setup times and resources using a Multi-Agent Reinforcement Learning (MARL) approach. The study introduces the Reinforcement Learning environment and conducts empirical analyses, comparing MARL with Single-Agent algorithms. The experiments employ various deep neural network policies for single- and Multi-Agent approaches. Results demonstrate the efficacy of the Maskable extension of the Proximal Policy Optimization (PPO) algorithm in Single-Agent scenarios and the Multi-Agent PPO algorithm in Multi-Agent setups. While Single-Agent algorithms perform adequately in reduced scenarios, Multi-Agent approaches reveal challenges in cooperative learning but offer scalable capacity. This research contributes insights into applying MARL techniques to scheduling optimization, emphasizing the need for algorithmic sophistication balanced with scalability for intelligent scheduling solutions.
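摘要中 Maskable PPO 的核心技巧是动作掩码:把当前不可行动作的 logit 置为负无穷,再做 softmax 采样。下面是与具体 RL 库无关的最小示意(假设性代码,logits 与掩码均为虚构示例):

```python
import numpy as np

def masked_sample(logits: np.ndarray, mask: np.ndarray) -> int:
    """按掩码采样动作:不可行动作的 logit 置为 -inf 后再 softmax。"""
    masked = np.where(mask, logits, -np.inf)
    z = masked - masked.max()                 # 数值稳定
    probs = np.exp(z) / np.exp(z).sum()
    return int(np.random.choice(len(logits), p=probs))

# 示例:4 台机器,其中 1、3 号正被占用,不可分配
logits = np.array([0.2, 1.5, -0.3, 0.7])      # 策略网络输出(假设)
mask = np.array([True, False, True, False])   # True 表示动作可行
print(masked_sample(logits, mask))
```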

[AI-39] Optimizing Service Function Chain Mapping in Network Function Virtualization through Simultaneous NF Decomposition and VNF Placement

链接: https://arxiv.org/abs/2411.07606
作者: Asghar Asgharian-Sardroud,Mohammad Hossein Izanlou,Amin Jabbari,Sepehr Mahmoodian Hamedani
关键词-EN: service function chain, Network function virtualization, function virtualization enables, called service function, Virtual Network Functions
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Network function virtualization enables network operators to implement new services through a process called service function chain mapping. The concept of Service Function Chain (SFC) is introduced to provide complex services, which is an ordered set of Network Functions (NF). The network functions of an SFC can be decomposed in several ways into some Virtual Network Functions (VNF). Additionally, the decomposed NFs can be placed (mapped) as VNFs on different machines on the underlying physical infrastructure. Selecting good decompositions and good placements among the possible options greatly affects both costs and service quality metrics. Previous research has addressed NF decomposition and VNF placement as separate problems. However, in this paper, we address both NF decomposition and VNF placement simultaneously as a single problem. Since finding an optimal solution is NP-hard, we have employed heuristic algorithms to solve the problem. Specifically, we have introduced a multiobjective decomposition and mapping VNFs (MODMVNF) method based on the non-dominated sorting genetic multi-objective algorithm (NSGAII) to solve the problem. The goal is to find near-optimal decomposition and mapping on the physical network at the same time to minimize the mapping cost and communication latency of SFC. The comparison of the results of the proposed method with the results obtained by solving ILP formulation of the problem as well as the results obtained from the multi-objective particle swarm algorithm shows the efficiency and effectiveness of the proposed method in terms of cost and communication latency.
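NSGA-II 的基础构件之一是快速非支配排序。下面给出该构件的一个自包含实现示意(两维目标取“映射成本、通信时延”并用玩具数值,与论文 MODMVNF 的具体编码无关):

```python
def non_dominated_sort(costs):
    """NSGA-II 的快速非支配排序(最小化两个目标:映射成本与通信时延)。"""
    n = len(costs)
    dominated = [[] for _ in range(n)]   # i 支配的解
    dom_count = [0] * n                  # 支配 i 的解个数
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if all(a <= b for a, b in zip(costs[i], costs[j])) and costs[i] != costs[j]:
                dominated[i].append(j)
            elif all(a <= b for a, b in zip(costs[j], costs[i])) and costs[i] != costs[j]:
                dom_count[i] += 1
        if dom_count[i] == 0:
            fronts[0].append(i)
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:
            for j in dominated[i]:
                dom_count[j] -= 1
                if dom_count[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
        k += 1
    return fronts[:-1]

# 每个候选解 = (映射成本, 通信时延)
pop = [(3, 9), (4, 5), (6, 2), (5, 6), (7, 7)]
print(non_dominated_sort(pop))   # 第 0 层即帕累托前沿
```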

[AI-40] Overhead-free User-side Recommender Systems

链接: https://arxiv.org/abs/2411.07589
作者: Ryoma Sato
关键词-EN: user-side recommender systems, recommender systems, user-side recommender, recommender, user-side
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Digital Libraries (cs.DL)
*备注: arXiv admin note: text overlap with arXiv:2208.09864 , arXiv:2403.15757

点击查看摘要

Abstract:Traditionally, recommendation algorithms have been designed for service developers. But recently, a new paradigm called user-side recommender systems has been proposed. User-side recommender systems are built and used by end users, in sharp contrast to traditional provider-side recommender systems. Even if the official recommender system offered by the provider is not fair, end users can create and enjoy their own user-side recommender systems by themselves. Although the concept of user-side recommender systems is attractive, the problem is they require tremendous communication costs between the user and the official system. Even the most efficient user-side recommender systems require about 5 times more costs than provider-side recommender systems. Such high costs hinder the adoption of user-side recommender systems. In this paper, we propose overhead-free user-side recommender systems, RecCycle, which realizes user-side recommender systems without any communication overhead. The main idea of RecCycle is to recycle past recommendation results offered by the provider’s recommender systems. The ingredients of RecCycle can be retrieved “for free,” and it greatly reduces the cost of user-side recommendations. In the experiments, we confirm that RecCycle performs as well as state-of-the-art user-side recommendation algorithms while RecCycle reduces costs significantly.
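按摘要“回收官方推荐的历史结果、零额外通信开销”这一思路,可以写出如下玩具示意(类名、重排策略均为本文假设,并非论文 RecCycle 的实现):

```python
from collections import Counter

class RecCycleSketch:
    """示意:回收官方推荐的历史结果,在用户侧零额外通信地重排。"""

    def __init__(self):
        self.seen = Counter()        # 记录官方系统历史上推过的物品
        self.blocked = set()         # 用户自定义的屏蔽规则

    def observe(self, provider_recs):
        self.seen.update(provider_recs)   # “免费”获得的原料:过去的推荐

    def recommend(self, k=3):
        # 用户侧策略:按历史曝光频次重排,并过滤被屏蔽的物品
        ranked = [i for i, _ in self.seen.most_common() if i not in self.blocked]
        return ranked[:k]

rc = RecCycleSketch()
rc.observe(["a", "b", "c"])
rc.observe(["b", "c", "d"])
rc.blocked.add("b")
print(rc.recommend())    # ['c', 'a', 'd']
```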

[AI-41] A Comprehensive Survey of AI-Driven Advancements and Techniques in Automated Program Repair and Code Generation

链接: https://arxiv.org/abs/2411.07586
作者: Avinash Anand,Akshit Gupta,Nishchay Yadav,Shaurya Bajaj
关键词-EN: Large Language Models, core research topics, Automated Program Repair, code generation, Large Language
类目: Artificial Intelligence (cs.AI)
*备注: A survey of recent developments in AI-assisted automated program repair

点击查看摘要

Abstract:Bug fixing and code generation have been core research topics in software development for many years. The recent explosive growth in Large Language Models has completely transformed these spaces, putting in reach incredibly powerful tools for both. In this survey, 27 recent papers have been reviewed and split into two groups: one dedicated to Automated Program Repair (APR) and LLM integration and the other to code generation using LLMs. The first group consists of new methods for bug detection and repair, which include locating semantic errors, security vulnerabilities, and runtime failure bugs. The place of LLMs in reducing manual debugging efforts is emphasized in this work by APR toward context-aware fixes, with innovations that boost accuracy and efficiency in automatic debugging. The second group dwells on code generation, providing an overview of both general-purpose LLMs fine-tuned for programming and task-specific models. It also presents methods to improve code generation, such as identifier-aware training, fine-tuning at the instruction level, and incorporating semantic code structures. This survey work contrasts the methodologies in APR and code generation to identify trends such as using LLMs, feedback loops to enable iterative code improvement and open-source models. It also discusses the challenges of achieving functional correctness and security and outlines future directions for research in LLM-based software development.

[AI-42] Disentangling Tabular Data towards Better One-Class Anomaly Detection

链接: https://arxiv.org/abs/2411.07574
作者: Jianan Ye,Zhaorui Tan,Yijie Hu,Xi Yang,Guangliang Cheng,Kaizhu Huang
关键词-EN: involves accurately conceptualizing, classification setting poses, one-class classification setting, significant challenge, accurately conceptualizing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tabular anomaly detection under the one-class classification setting poses a significant challenge, as it involves accurately conceptualizing “normal” derived exclusively from a single category to discern anomalies from normal data variations. Capturing the intrinsic correlation among attributes within normal samples presents one promising method for learning the concept. To do so, the most recent effort relies on a learnable mask strategy with a reconstruction task. However, this wisdom may suffer from the risk of producing uniform masks, i.e., essentially nothing is masked, leading to less effective correlation learning. To address this issue, we presume that attributes related to others in normal samples can be divided into two non-overlapping and correlated subsets, defined as CorrSets, to capture the intrinsic correlation effectively. Accordingly, we introduce an innovative method that disentangles CorrSets from normal tabular data. To our knowledge, this is a pioneering effort to apply the concept of disentanglement for one-class anomaly detection on tabular data. Extensive experiments on 20 tabular datasets show that our method substantially outperforms the state-of-the-art methods and leads to an average performance improvement of 6.1% on AUC-PR and 2.1% on AUC-ROC.

[AI-43] Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models

链接: https://arxiv.org/abs/2411.07563
作者: Dongrui Han,Mingyu Cui,Jiawen Kang,Xixin Wu,Xunying Liu,Helen Meng
关键词-EN: Large Language Models, responsible for mapping, phonetic representations, crucial step, conversion
类目: Artificial Intelligence (cs.AI)
*备注: accepted by ISCSLP 2024

点击查看摘要

Abstract:Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping graphemes to their corresponding phonetic representations. However, it faces ambiguity problems where the same grapheme can represent multiple phonemes depending on context, posing a challenge for G2P conversion. Inspired by the remarkable success of Large Language Models (LLMs) in handling context-aware scenarios, contextual G2P conversion systems with LLMs’ in-context knowledge retrieval (ICKR) capabilities are proposed to improve disambiguation capability. The efficacy of incorporating ICKR into G2P conversion systems is demonstrated thoroughly on the Librig2p dataset. In particular, the best contextual G2P conversion system using ICKR outperforms the baseline with weighted average phoneme error rate (PER) reductions of 2.0% absolute (28.9% relative). Using GPT-4 in the ICKR system yields a further increase of 3.5% absolute (3.8% relative) on the Librig2p dataset.
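ICKR 的基本形态可以理解为:先从标注库中检索与当前上下文相似的样例,再拼成少样本提示词交给 LLM 消歧。以下为假设性示意,其中的相似度函数只是简陋的词重叠占位,实际可换成向量检索:

```python
def build_icl_prompt(word: str, context: str, knowledge_base, k: int = 2) -> str:
    """检索与当前上下文相似的已标注样例,拼入提示词以消歧(假设性示意)。"""
    # 以共享词数作为简陋的相似度;实际可换成向量检索
    def sim(a: str, b: str) -> int:
        return len(set(a.split()) & set(b.split()))

    examples = sorted(knowledge_base, key=lambda e: -sim(e["context"], context))[:k]
    shots = "\n".join(
        f'Context: {e["context"]}\nWord: {e["word"]}\nPhonemes: {e["phonemes"]}'
        for e in examples
    )
    return f"{shots}\nContext: {context}\nWord: {word}\nPhonemes:"

kb = [
    {"context": "she will read the book tomorrow", "word": "read", "phonemes": "R IY D"},
    {"context": "he read the book yesterday",      "word": "read", "phonemes": "R EH D"},
]
prompt = build_icl_prompt("read", "they read it last night", kb)
print(prompt)   # 将该提示词交给 LLM 完成 G2P 消歧
```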

[AI-44] EUR/USD Exchange Rate Forecasting incorporating Text Mining Based on Pre-trained Language Models and Deep Learning Methods

链接: https://arxiv.org/abs/2411.07560
作者: Xiangyu Shi,Hongcheng Ding,Salaar Faroog,Deshinta Arrova Dewi,Shamsul Nahar Abdullah,Bahiah A Malek
关键词-EN: USD exchange rate, particle swarm optimization, integrates deep learning, approach for EUR, USD exchange
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study introduces a novel approach for EUR/USD exchange rate forecasting that integrates deep learning, textual analysis, and particle swarm optimization (PSO). By incorporating online news and analysis texts as qualitative data, the proposed PSO-LSTM model demonstrates superior performance compared to traditional econometric and machine learning models. The research employs advanced text mining techniques, including sentiment analysis using the RoBERTa-Large model and topic modeling with LDA. Empirical findings underscore the significant advantage of incorporating textual data, with the PSO-LSTM model outperforming benchmark models such as SVM, SVR, ARIMA, and GARCH. Ablation experiments reveal the contribution of each textual data category to the overall forecasting performance. The study highlights the transformative potential of artificial intelligence in finance and paves the way for future research in real-time forecasting and the integration of alternative data sources.

[AI-45] Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models NEURIPS

链接: https://arxiv.org/abs/2411.07559
作者: Tiejin Chen,Kaishen Wang,Hua Wei
关键词-EN: Multi-modal Large Language, Large Language Models, Large Language, induce Multi-modal Large, raise significant safety
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to Neurips SafeGenAi Workshop 2024

点击查看摘要

Abstract:Jailbreaking methods, which induce Multi-modal Large Language Models (MLLMs) to output harmful responses, raise significant safety concerns. Among these methods, gradient-based approaches, which use gradients to generate malicious prompts, have been widely studied due to their high success rates in white-box settings, where full access to the model is available. However, these methods have notable limitations: they require white-box access, which is not always feasible, and involve high memory usage. To address scenarios where white-box access is unavailable, attackers often resort to transfer attacks. In transfer attacks, malicious inputs generated using white-box models are applied to black-box models, but this typically results in reduced attack performance. To overcome these challenges, we propose Zer0-Jack, a method that bypasses the need for white-box access by leveraging zeroth-order optimization. We propose patch coordinate descent to efficiently generate malicious image inputs to directly attack black-box MLLMs, which significantly reduces memory usage further. Through extensive experiments, Zer0-Jack achieves a high attack success rate across various models, surpassing previous transfer-based methods and performing comparably with existing white-box jailbreak techniques. Notably, Zer0-Jack achieves a 95% attack success rate on MiniGPT-4 with the Harmful Behaviors Multi-modal Dataset on a black-box setting, demonstrating its effectiveness. Additionally, we show that Zer0-Jack can directly attack commercial MLLMs such as GPT-4o. Codes are provided in the supplement.
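Zer0-Jack 依赖的零阶优化思想,可以概括为“中心差分估计方向导数 + 按小块(patch)坐标下降”。下面用一个玩具二次损失演示该机制(loss_fn 在真实场景中是只能查询输出的黑盒 MLLM 打分,此处为假设):

```python
import numpy as np

def zeroth_order_step(x, loss_fn, patch, mu=1e-2, lr=0.01):
    """对 x 的一个小块做一次零阶梯度下降(中心差分估计方向导数)。"""
    ys, xs = patch
    u = np.zeros_like(x)
    u[ys, xs] = np.random.randn(*x[ys, xs].shape)   # 只扰动当前小块,节省内存
    g = (loss_fn(x + mu * u) - loss_fn(x - mu * u)) / (2 * mu)
    return x - lr * g * u

# 玩具黑盒:离目标越近损失越小(真实场景是黑盒模型的输出打分)
target = np.ones((8, 8))
loss_fn = lambda img: float(((img - target) ** 2).sum())

blocks = [(slice(0, 4), slice(0, 4)), (slice(0, 4), slice(4, 8)),
          (slice(4, 8), slice(0, 4)), (slice(4, 8), slice(4, 8))]
x = np.zeros((8, 8))
for step in range(400):                 # 按小块轮流做坐标下降
    x = zeroth_order_step(x, loss_fn, blocks[step % 4])
print(round(loss_fn(x), 3))             # 应远小于初始损失 64.0
```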

[AI-46] Model Stealing for Any Low-Rank Language Model

链接: https://arxiv.org/abs/2411.07536
作者: Allen Liu,Ankur Moitra
关键词-EN: carefully chosen queries, chosen queries, machine learning, Mahajan and Zhang, carefully chosen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Model stealing, where a learner tries to recover an unknown model via carefully chosen queries, is a critical problem in machine learning, as it threatens the security of proprietary models and the privacy of data they are trained on. In recent years, there has been particular interest in stealing large language models (LLMs). In this paper, we aim to build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting. We study model stealing for Hidden Markov Models (HMMs), and more generally low-rank language models. We assume that the learner works in the conditional query model, introduced by Kakade, Krishnamurthy, Mahajan and Zhang. Our main result is an efficient algorithm in the conditional query model, for learning any low-rank distribution. In other words, our algorithm succeeds at stealing any language model whose output distribution is low-rank. This improves upon the previous result by Kakade, Krishnamurthy, Mahajan and Zhang, which also requires the unknown distribution to have high “fidelity”, a property that holds only in restricted cases. There are two key insights behind our algorithm: First, we represent the conditional distributions at each timestep by constructing barycentric spanners among a collection of vectors of exponentially large dimension. Second, for sampling from our representation, we iteratively solve a sequence of convex optimization problems that involve projection in relative entropy to prevent compounding of errors over the length of the sequence. This is an interesting example where, at least theoretically, allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance.

[AI-47] Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis

链接: https://arxiv.org/abs/2411.07529
作者: Minda Li,Bhaskar Krishnamachari
关键词-EN: revolutionize software development, automatically generating code, Hypothesis, promise to revolutionize, program specifications
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT’s GPT-3.5-turbo model on LeetCode, a popular platform with algorithmic coding challenges for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT’s performance, with greater gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we conduct automated experiments using Python scripts to generate prompts that instruct ChatGPT to create Python solutions. These solutions are stored and manually submitted on LeetCode to check their correctness. For Hypothesis 1, results show the GPT-3.5-turbo model successfully solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, prompt engineering yields improvements: 14-29% for Chain of Thought Prompting, 38-60% by providing failed test cases in a second feedback prompt, and 33-58% by switching to GPT-4. From a random subset of problems ChatGPT solved in Python, it also solved 78% in Java, 50% in C++, and none in Elixir, Erlang, or Racket. These findings generally validate all three hypotheses.

[AI-48] TIPS: Threat Actor Informed Prioritization of Applications using SecEncoder

链接: https://arxiv.org/abs/2411.07519
作者: Muhammed Fatih Bulut,Acar Tamersoy,Naveed Ahmad,Yingqi Liu,Lloyd Greenwald
关键词-EN: Actor Informed Prioritization, Informed Prioritization, Threat Actor Informed, Prioritization using SecEncoder, paper introduces TIPS
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper introduces TIPS: Threat Actor Informed Prioritization using SecEncoder, a specialized language model for security. TIPS combines the strengths of both encoder and decoder language models to detect and prioritize compromised applications. By integrating threat actor intelligence, TIPS enhances the accuracy and relevance of its detections. Extensive experiments with a real-world benchmark dataset of applications demonstrate TIPS’s high efficacy, achieving an F-1 score of 0.90 in identifying malicious applications. Additionally, in real-world scenarios, TIPS significantly reduces the backlog of investigations for security analysts by 87%, thereby streamlining the threat response process and improving overall security posture.

[AI-49] LLM App Squatting and Cloning

链接: https://arxiv.org/abs/2411.07518
作者: Yinglin Xie,Xinyi Hou,Yanjie Zhao,Kai Chen,Haoyu Wang
关键词-EN: posed longstanding challenges, Large Language Model, Impersonation tactics, mobile app stores, malicious actors exploit
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Impersonation tactics, such as app squatting and app cloning, have posed longstanding challenges in mobile app stores, where malicious actors exploit the names and reputations of popular apps to deceive users. With the rapid growth of Large Language Model (LLM) stores like GPT Store and FlowGPT, these issues have similarly surfaced, threatening the integrity of the LLM app ecosystem. In this study, we present the first large-scale analysis of LLM app squatting and cloning using our custom-built tool, LLMappCrazy. LLMappCrazy covers 14 squatting generation techniques and integrates Levenshtein distance and BERT-based semantic analysis to detect cloning by analyzing app functional similarities. Using this tool, we generated variations of the top 1000 app names and found over 5,000 squatting apps in the dataset. Additionally, we observed 3,509 squatting apps and 9,575 cloning cases across six major platforms. After sampling, we find that 18.7% of the squatting apps and 4.9% of the cloning apps exhibited malicious behavior, including phishing, malware distribution, fake content dissemination, and aggressive ad injection.
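LLMappCrazy 检测抢注的基础之一是 Levenshtein 编辑距离:名称与热门应用“很接近但不相同”即视为可疑。以下是一个自包含的示意实现(阈值 max_dist=2 为本文假设):

```python
def levenshtein(a: str, b: str) -> int:
    """经典动态规划编辑距离。"""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # 删除
                           cur[j - 1] + 1,              # 插入
                           prev[j - 1] + (ca != cb)))   # 替换
        prev = cur
    return prev[-1]

def is_squatting(candidate: str, popular: str, max_dist: int = 2) -> bool:
    # 名称足够接近但又不完全相同,视为疑似抢注
    d = levenshtein(candidate.lower(), popular.lower())
    return 0 < d <= max_dist

for name in ["ChatGPT", "ChetGPT", "Ch4tGPT", "WeatherBot"]:
    print(name, is_squatting(name, "ChatGPT"))
```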

[AI-50] An Attack Traffic Identification Method Based on Temporal Spectrum

链接: https://arxiv.org/abs/2411.07510
作者: Wenwei Xie,Jie Yin,Zihao Chen
关键词-EN: data noise interference, attack traffic detection, network attack detection, existing network attack, address the issues
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 20 pages, 7 figures, 7 tables, 8 formulas

点击查看摘要

Abstract:To address the issues of insufficient robustness, unstable features, and data noise interference in existing network attack detection and identification models, this paper proposes an attack traffic detection and identification method based on temporal spectrum. First, traffic data is segmented by a sliding window to construct a feature sequence and a corresponding label sequence for network traffic. Next, the proposed spectral label generation methods, SSPE and COAP, are applied to transform the label sequence into spectral labels and the feature sequence into temporal features. Spectral labels and temporal features are used to capture and represent behavioral patterns of attacks. Finally, the constructed temporal features and spectral labels are used to train models, which subsequently detect and identify network attack behaviors. Experimental results demonstrate that compared to traditional methods, models trained with the SSPE or COAP method improve identification accuracy by 10%, and exhibit strong robustness, particularly in noisy environments.

[AI-51] FM-TS: Flow Matching for Time Series Generation

链接: https://arxiv.org/abs/2411.07506
作者: Yang Hu,Xiao Wang,Lirong Wu,Huatian Zhang,Stan Z. Li,Sheng Wang,Tianlong Chen
关键词-EN: Time series generation, Time series, analyzing temporal data, unconditional time series, series generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series generation has emerged as an essential tool for analyzing temporal data across numerous fields. While diffusion models have recently gained significant attention in generating high-quality time series, they tend to be computationally demanding and reliant on complex stochastic processes. To address these limitations, we introduce FM-TS, a rectified Flow Matching-based framework for Time Series generation, which simplifies the time series generation process by directly optimizing continuous trajectories. This approach avoids the need for iterative sampling or complex noise schedules typically required in diffusion-based models. FM-TS is more efficient in terms of training and inference. Moreover, FM-TS is highly adaptive, supporting both conditional and unconditional time series generation. Notably, through our novel inference design, the model trained in an unconditional setting can seamlessly generalize to conditional tasks without the need for retraining. Extensive benchmarking across both settings demonstrates that FM-TS consistently delivers superior performance compared to existing approaches while being more efficient in terms of training and inference. For instance, in terms of discriminative score, FM-TS achieves 0.005, 0.019, 0.011, 0.005, 0.053, and 0.106 on the Sines, Stocks, ETTh, MuJoCo, Energy, and fMRI unconditional time series datasets, respectively, significantly outperforming the second-best method which achieves 0.006, 0.067, 0.061, 0.008, 0.122, and 0.167 on the same datasets. We have achieved superior performance in solar forecasting and MuJoCo imputation tasks, significantly enhanced by our innovative t power sampling method. The code is available at this https URL.
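整流流匹配(rectified flow matching)的训练目标可以概括为:在噪声 x0 与数据 x1 的线性插值点 xt 上,让网络回归恒定速度 x1 - x0;采样时从噪声出发沿速度场积分。以下 PyTorch 草图演示这一目标(网络结构与数据均为演示用假设,并非论文 FM-TS 的实现):

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    # 极简速度场网络 v(x_t, t);真实时间序列任务应换成专用结构
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(),
                                 nn.Linear(128, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

dim = 24                                   # 假设的(展平)时间序列片段维度
model = VelocityNet(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(1000):                      # 训练:回归直线轨迹的恒定速度
    x1 = torch.randn(64, dim) * 0.5 + 1.0  # 假设的数据批次
    x0 = torch.randn(64, dim)              # 噪声
    t = torch.rand(64, 1)
    xt = (1 - t) * x0 + t * x1             # 线性插值点
    loss = ((model(xt, t) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                      # 采样:从噪声出发做欧拉积分
    x = torch.randn(8, dim)
    for i in range(10):
        t = torch.full((8, 1), i / 10)
        x = x + 0.1 * model(x, t)
print(x.mean().item())                     # 训练充分时应趋近数据均值 1.0
```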

[AI-52] LAUREL: Learned Augmented Residual Layer ICML

链接: https://arxiv.org/abs/2411.07501
作者: Gaurav Menghani,Ravi Kumar,Sanjiv Kumar
关键词-EN: deep learning methods, Learned Augmented Residual, Augmented Residual Layer, efficient deep learning, residual connection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at the 2nd Efficient Systems for Foundation Models Workshop at the International Conference on Machine Learning (ICML) 2024

点击查看摘要

Abstract:One of the core pillars of efficient deep learning methods is architectural improvements such as the residual/skip connection, which has led to significantly better model convergence and quality. Since then the residual connection has become ubiquitous in not just convolutional neural networks but also transformer-based architectures, the backbone of LLMs. In this paper we introduce Learned Augmented Residual Layer (LAuReL) – a novel generalization of the canonical residual connection – with the goal to be an in-situ replacement of the latter while outperforming on both model quality and footprint metrics. Our experiments show that using LAuReL can help boost performance for both vision and language models. For example, on the ResNet-50, ImageNet 1K task, it achieves 60% of the gains from adding an extra layer, while only adding 0.003% more parameters, and matches it while adding 2.6× fewer parameters.
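按摘要的描述,LAuReL 是对经典残差连接 x + f(x) 的可学习推广。下面给出一种可能形式的草图(可学习标量加低秩旁路,系本文按摘要思路作的假设,并非论文的官方公式):

```python
import torch
import torch.nn as nn

class LearnedAugmentedResidual(nn.Module):
    """经典残差 x + f(x) 的一种推广示意:alpha*f(x) + x + 低秩旁路。
    具体形式为本文假设,仅用于演示思路。"""
    def __init__(self, block: nn.Module, dim: int, rank: int = 4):
        super().__init__()
        self.block = block
        self.alpha = nn.Parameter(torch.ones(1))       # 可学习的残差权重
        self.down = nn.Linear(dim, rank, bias=False)   # 低秩保证参数开销极小
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return self.alpha * self.block(x) + x + self.up(self.down(x))

layer = LearnedAugmentedResidual(nn.Sequential(nn.Linear(64, 64), nn.GELU()),
                                 dim=64)
print(layer(torch.randn(2, 64)).shape)   # torch.Size([2, 64])
```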

[AI-53] Enhancing Link Prediction with Fuzzy Graph Attention Networks and Dynamic Negative Sampling

链接: https://arxiv.org/abs/2411.07482
作者: Jinming Xing
关键词-EN: traditional Graph Neural, Graph Neural Networks, Graph Attention Networks, Graph Neural, understanding complex networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Link prediction is crucial for understanding complex networks but traditional Graph Neural Networks (GNNs) often rely on random negative sampling, leading to suboptimal performance. This paper introduces Fuzzy Graph Attention Networks (FGAT), a novel approach integrating fuzzy rough sets for dynamic negative sampling and enhanced node feature aggregation. Fuzzy Negative Sampling (FNS) systematically selects high-quality negative edges based on fuzzy similarities, improving training efficiency. The FGAT layer incorporates fuzzy rough set principles, enabling robust and discriminative node representations. Experiments on two research collaboration networks demonstrate FGAT’s superior link prediction accuracy, outperforming state-of-the-art baselines by leveraging the power of fuzzy rough sets for effective negative sampling and node feature learning.
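FNS 的要点是不再随机采负边,而是按模糊相似度挑“难负样本”。以下示意用 min/max 比值作为模糊相似度(这一具体选择是本文假设),在所有非边节点对中选相似度最高者作为负样本:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 12, 8
feat = rng.random((n, d))                       # 节点特征,取值 [0, 1]
edges = {(0, 1), (1, 2), (3, 4), (5, 6)}        # 已知正样本边

def fuzzy_similarity(a, b):
    # 一种常见的模糊相似度:逐维 min 与 max 之和的比值(假设性选择)
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

# 在所有非边节点对中,优先选相似度高的“难负样本”
candidates = [(i, j) for i in range(n) for j in range(i + 1, n)
              if (i, j) not in edges]
scored = sorted(candidates,
                key=lambda p: -fuzzy_similarity(feat[p[0]], feat[p[1]]))
hard_negatives = scored[:len(edges)]            # 与正边数量持平
print(hard_negatives)
```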

[AI-54] BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

链接: https://arxiv.org/abs/2411.07461
作者: Anas Awadalla,Le Xue,Manli Shu,An Yan,Jun Wang,Senthil Purushwalkam,Sheng Shen,Hannah Lee,Oscar Lo,Jae Sung Park,Etash Guha,Silvio Savarese,Ludwig Schmidt,Yejin Choi,Caiming Xiong,Ran Xu
关键词-EN: million image-text pairs, factual web-scale alt-text, million image-text, image-text pairs, pairs that bridges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at this https URL

[AI-55] Research on fault diagnosis of nuclear power first-second circuit based on hierarchical multi-granularity classification network

链接: https://arxiv.org/abs/2411.07453
作者: Jiangwen Chen,Siwei Li,Guo Jiang,Cheng Dongzhen,Lin Hua,Wang Wei
关键词-EN: nuclear power plants, nuclear power, nuclear power units, complex electromechanical systems, power plants
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The safe and reliable operation of complex electromechanical systems in nuclear power plants is crucial for the safe production of nuclear power plants and their power units. Therefore, accurate and timely fault diagnosis of nuclear power systems is of great significance for ensuring the safe and reliable operation of nuclear power plants. The existing fault diagnosis methods mainly target a single device or subsystem, making it difficult to analyze the inherent connections and mutual effects between different types of faults at the entire unit level. This article uses the AP1000 full-scale simulator to simulate the important mechanical component failures of some key systems in the primary and secondary circuits of nuclear power units, and constructs a fault dataset. Meanwhile, a hierarchical multi-granularity classification fault diagnosis model based on the EfficientNet large model is proposed, aiming to achieve hierarchical classification of nuclear power faults. The results indicate that the proposed fault diagnosis model can effectively classify faults in different circuits and system components of nuclear power units into hierarchical categories. However, the fault dataset in this study was obtained from a simulator, which may introduce additional information due to parameter redundancy, thereby affecting the diagnostic performance of the model.

[AI-56] Optimizing Data Delivery: Insights from User Preferences on Visuals, Tables, and Text

链接: https://arxiv.org/abs/2411.07451
作者: Reuben Luera,Ryan Rossi,Franck Dernoncourt,Alexa Siu,Sungchul Kim,Tong Yu,Ruiyi Zhang,Xiang Chen,Nedim Lipka,Zhehao Zhang,Seon Gyeom Kim,Tak Yeon Lee
关键词-EN: user, user preference data, user preference, data, table
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we study users’ preferences for seeing a chart, a table, or text in response to a question they ask. This enables us to understand when it is best to show a chart, table, or text to the user for a specific question. For this, we conduct a user study where users are shown a question and asked what they would prefer to see, and we use the data to establish that a user’s personal traits do influence the data outputs that they prefer. Understanding how user characteristics impact a user’s preferences is critical to creating data tools with a better user experience. Additionally, we investigate to what degree an LLM can be used to replicate a user’s preference with and without user preference data. Overall, these findings have significant implications pertaining to the development of data tools and the replication of human preferences using LLMs. Furthermore, this work demonstrates the potential use of LLMs to replicate user preference data, which has major implications for future user modeling and personalization research.

[AI-57] The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving

链接: https://arxiv.org/abs/2411.07447
作者: Kyoungmin Kim,Kijae Hong,Caglar Gulcehre,Anastasia Ailamaki
关键词-EN: Large Language Models, Large Language, usage of Large, Language Models, LLM inference systems
类目: Performance (cs.PF); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The growing usage of Large Language Models (LLMs) highlights the demands and challenges in scalable LLM inference systems, affecting deployment and development processes. On the deployment side, there is a lack of comprehensive analysis on the conditions under which a particular scheduler performs better or worse, with performance varying substantially across different schedulers, hardware, models, and workloads. Manually testing each configuration on GPUs can be prohibitively expensive. On the development side, unpredictable performance and unknown upper limits can lead to inconclusive trial-and-error processes, consuming resources on ideas that end up ineffective. To address these challenges, we introduce INFERMAX, an analytical framework that uses inference cost models to compare various schedulers, including an optimal scheduler formulated as a constraint satisfaction problem (CSP) to establish an upper bound on performance. Our framework offers in-depth analysis and raises essential questions, challenging assumptions and exploring opportunities for more efficient scheduling. Notably, our findings indicate that preempting requests can reduce GPU costs by 30% compared to avoiding preemptions altogether. We believe our methods and insights will facilitate the cost-effective deployment and development of scalable, efficient inference systems and pave the way for cost-based scheduling.

[AI-58] Input-Based Ensemble-Learning Method for Dynamic Memory Configuration of Serverless Computing Functions

链接: https://arxiv.org/abs/2411.07444
作者: Siddharth Agarwal,Maria A. Rodriguez,Rajkumar Buyya
关键词-EN: CPU and network, successful execution, responsible for configuring, configuring function memory, function
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 10 pages, 2 tables, 28 figures, accepted conference paper - UCC’24

点击查看摘要

Abstract:In today’s Function-as-a-Service offerings, a programmer is usually responsible for configuring function memory for its successful execution, which allocates proportional function resources such as CPU and network. However, right-sizing the function memory forces developers to speculate about performance and make ad-hoc configuration decisions. Recent research has highlighted that a function’s input characteristics, such as input size, type and number of inputs, significantly impact its resource demand, run-time performance and costs with fluctuating workloads. This correlation further makes memory configuration a non-trivial task. On that account, an input-aware function memory allocator not only improves developer productivity by completely hiding resource-related decisions but also drives an opportunity to reduce resource wastage and offer a finer-grained cost-optimised pricing scheme. Therefore, we present MemFigLess, a serverless solution that estimates the memory requirement of a serverless function with input-awareness. The framework executes function profiling in an offline stage and trains a multi-output Random Forest Regression model on the collected metrics to invoke input-aware optimal configurations. We evaluate our work with the state-of-the-art approaches on AWS Lambda service to find that MemFigLess is able to capture the input-aware resource relationships and allocate up to 82% fewer resources and save up to 87% of run-time costs.
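MemFigLess 的核心建模步骤,即“以输入特征多输出回归资源需求”,可以用 scikit-learn 的随机森林直接示意(以下特征、目标和数据关系均为合成假设):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# 假设的离线画像数据:特征 = [输入大小MB, 输入个数, 类型编码]
X = rng.random((500, 3)) * [512, 10, 3]
# 目标 = [所需内存MB, 预计运行时长ms],这里用合成关系代替真实测量
y = np.column_stack([
    128 + 2.0 * X[:, 0] + 15 * X[:, 1] + rng.normal(0, 8, 500),
    50 + 0.8 * X[:, 0] + 30 * X[:, 2] + rng.normal(0, 5, 500),
])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)                       # sklearn 原生支持多输出回归

new_invocation = [[256.0, 4, 1]]      # 新请求的输入特征
mem_mb, runtime_ms = model.predict(new_invocation)[0]
print(f"建议内存配置 ≈ {mem_mb:.0f} MB,预计运行 {runtime_ms:.0f} ms")
```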

[AI-59] Automatically Detecting Online Deceptive Patterns in Real-time

链接: https://arxiv.org/abs/2411.07441
作者: Asmit Nayak,Shirley Zhang,Yash Wani,Rishabh Khandelwal,Kassem Fawaz
关键词-EN: exploiting cognitive biases, digital interfaces manipulate, making unintended decisions, interfaces manipulate users, exploiting cognitive
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Deceptive patterns (DPs) in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous across various digital platforms. While efforts to mitigate DPs have emerged from legal and technical perspectives, a significant gap in usable solutions that empower users to identify and make informed decisions about DPs in real-time remains. In this work, we introduce AutoBot, an automated, deceptive pattern detector that analyzes websites’ visual appearances using machine learning techniques to identify and notify users of DPs in real-time. AutoBot employs a two-staged pipeline that processes website screenshots, identifying interactable elements and extracting textual features without relying on HTML structure. By leveraging a custom language model, AutoBot understands the context surrounding these elements to determine the presence of deceptive patterns. We implement AutoBot as a lightweight Chrome browser extension that performs all analyses locally, minimizing latency and preserving user privacy. Through extensive evaluation, we demonstrate AutoBot’s effectiveness in enhancing users’ ability to navigate digital environments safely while providing a valuable tool for regulators to assess and enforce compliance with DP regulations.

[AI-60] Evaluating Detection Thresholds: The Impact of False Positives and Negatives on Super-Resolution Ultrasound Localization Microscopy

链接: https://arxiv.org/abs/2411.07426
作者: Sepideh K. Gharamaleki,Brandon Helfield,Hassan Rivaz
关键词-EN: ultrasound localization microscopy, ULM image quality, offers a high-resolution, microvascular structures, ULM image
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Super-resolution ultrasound imaging with ultrasound localization microscopy (ULM) offers a high-resolution view of microvascular structures. Yet, ULM image quality heavily relies on precise microbubble (MB) detection. Despite the crucial role of localization algorithms, there has been limited focus on the practical pitfalls in MB detection tasks such as setting the detection threshold. This study examines how False Positives (FPs) and False Negatives (FNs) affect ULM image quality by systematically adding controlled detection errors to simulated data. Results indicate that while both FP and FN rates impact Peak Signal-to-Noise Ratio (PSNR) similarly, increasing FP rates from 0% to 20% decreases Structural Similarity Index (SSIM) by 7%, whereas the same FN rates cause a greater drop of around 45%. Moreover, dense MB regions are more resilient to detection errors, while sparse regions show high sensitivity, showcasing the need for robust MB detection frameworks to enhance super-resolution imaging.

[AI-61] Predicting BWR Criticality with Data-Driven Machine Learning Model

链接: https://arxiv.org/abs/2411.07425
作者: Muhammad Rizki Oktavian,Anirudh Tunga,Jonathan Nistor,James Tusar,J. Thomas Gruenwald,Yunlin Xu
关键词-EN: nuclear power plants, operating nuclear power, nuclear power, Large-scale nuclear power, power plants
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:One of the challenges in operating nuclear power plants is to decide the amount of fuel needed in a cycle. Large-scale nuclear power plants are designed to operate at base load, meaning that they are expected to always operate at full power. Economically, a nuclear power plant should burn enough fuel to maintain criticality until the end of a cycle (EOC). If the reactor goes subcritical before the end of a cycle, it may result in early coastdown as the fuel in the core is already depleted. On the contrary, if the reactor still has significant excess reactivity by the end of a cycle, the remaining fuel will go unused. In both cases, the plant may lose a significant amount of money. This work proposes an innovative method based on a data-driven deep learning model to estimate the excess criticality of a boiling water reactor.

[AI-62] Data-Centric Learning Framework for Real-Time Detection of Aiming Beam in Fluorescence Lifetime Imaging Guided Surgery

链接: https://arxiv.org/abs/2411.07395
作者: Mohamed Abul Hassan,Pu Sun,Xiangnan Zhou,Lisanne Kraft,Kelsey T Hadfield,Katjana Ehrlich,Jinyi Qi,Andrew Birkeland,Laura Marcu
关键词-EN: fluorescence lifetime imaging, fiber-based fluorescence lifetime, Transoral Robotic Surgery, lifetime imaging, aiming beam detection
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This study introduces a novel data-centric approach to improve real-time surgical guidance using fiber-based fluorescence lifetime imaging (FLIm). A key aspect of the methodology is the accurate detection of the aiming beam, which is essential for localizing points used to map FLIm measurements onto the tissue region within the surgical field. The primary challenge arises from the complex and variable conditions encountered in the surgical environment, particularly in Transoral Robotic Surgery (TORS). Uneven illumination in the surgical field can cause reflections, reduce contrast, and result in inconsistent color representation, further complicating aiming beam detection. To overcome these challenges, an instance segmentation model was developed using a data-centric training strategy that improves accuracy by minimizing label noise and enhancing detection robustness. The model was evaluated on a dataset comprising 40 in vivo surgical videos, demonstrating a median detection rate of 85%. This performance was maintained when the model was integrated into a clinical system, achieving a similar detection rate of 85% during TORS procedures conducted in patients. The system's computational efficiency, measured at approximately 24 frames per second (FPS), was sufficient for real-time surgical guidance. This study enhances the reliability of FLIm-based aiming beam detection in complex surgical environments, advancing the feasibility of real-time, image-guided interventions for improved surgical precision.

[AI-63] Feature-Space Semantic Invariance: Enhanced OOD Detection for Open-Set Domain Generalization

链接: https://arxiv.org/abs/2411.07392
作者: Haoliang Wang,Chen Zhao,Feng Chen
关键词-EN: Open-set domain generalization, domain generalization addresses, real-world challenge, Feature-space Semantic Invariance, addresses a real-world
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: IEEE BigData 2024, Ph.D. Forum

点击查看摘要

Abstract:Open-set domain generalization addresses a real-world challenge: training a model to generalize across unseen domains (domain generalization) while also detecting samples from unknown classes not encountered during training (open-set recognition). However, most existing approaches tackle these issues separately, limiting their practical applicability. To overcome this limitation, we propose a unified framework for open-set domain generalization by introducing Feature-space Semantic Invariance (FSI). FSI maintains semantic consistency across different domains within the feature space, enabling more accurate detection of OOD instances in unseen domains. Additionally, we adopt a generative model to produce synthetic data with novel domain styles or class labels, enhancing model robustness. Initial experiments show that our method improves AUROC by 9.1% to 18.9% on ColoredMNIST, while also significantly increasing in-distribution classification accuracy.
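
下面是一段极简的 PyTorch 草稿(作者自拟):用“跨域类条件特征中心对齐”来近似摘要中的特征空间语义不变性(FSI)约束;具体损失形式为说明性假设,并非论文定义。

```python
# 示意:跨域同类特征中心对齐损失(FSI 的一种可能读法,非论文实现)
import torch

def fsi_loss(feat_a, y_a, feat_b, y_b, num_classes):
    """feat_*: (N, D) 两个域的特征; y_*: (N,) 共享类别标签。"""
    loss, used = feat_a.new_zeros(()), 0
    for c in range(num_classes):
        ma, mb = y_a == c, y_b == c
        if ma.any() and mb.any():
            # 同一语义类在两个域中的特征中心应彼此接近
            loss = loss + (feat_a[ma].mean(0) - feat_b[mb].mean(0)).pow(2).sum()
            used += 1
    return loss / max(used, 1)

# 玩具用法
fa, fb = torch.randn(32, 64), torch.randn(32, 64)
ya, yb = torch.randint(0, 5, (32,)), torch.randint(0, 5, (32,))
print(fsi_loss(fa, ya, fb, yb, num_classes=5))
```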

[AI-64] Federated Learning Client Pruning for Noisy Labels

链接: https://arxiv.org/abs/2411.07391
作者: Mahdi Morafah,Hojin Chang,Chen Chen,Bill Lin
关键词-EN: preserving data privacy, decentralized edge devices, Federated Learning, enables collaborative model, Federated Learning Client
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across decentralized edge devices while preserving data privacy. However, existing FL methods often assume clean annotated datasets, impractical for resource-constrained edge devices. In reality, noisy labels are prevalent, posing significant challenges to FL performance. Prior approaches attempt label correction and robust training techniques but exhibit limited efficacy, particularly under high noise levels. This paper introduces ClipFL (Federated Learning Client Pruning), a novel framework addressing noisy labels from a fresh perspective. ClipFL identifies and excludes noisy clients based on their performance on a clean validation dataset, tracked using a Noise Candidacy Score (NCS). The framework comprises three phases: pre-client pruning to identify potential noisy clients and calculate their NCS, client pruning to exclude a percentage of clients with the highest NCS, and post-client pruning for fine-tuning the global model with standard FL on clean clients. Empirical evaluation demonstrates ClipFL’s efficacy across diverse datasets and noise levels, achieving accurate noisy client identification, superior performance, faster convergence, and reduced communication costs compared to state-of-the-art FL methods. Our code is available at this https URL.
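
下面用一段极简的 NumPy 示意还原 ClipFL 三阶段剪枝的流程骨架(作者自拟);其中 NCS 的具体计分方式(按干净验证集表现排名最差的次数累加)仅为示意性假设,请以论文为准。

```python
# 示意:基于 Noise Candidacy Score (NCS) 的客户端剪枝骨架(非论文实现)
import numpy as np

rng = np.random.default_rng(1)
n_clients, n_rounds, prune_frac = 20, 10, 0.2
ncs = np.zeros(n_clients)

for _ in range(n_rounds):                       # 阶段 1:pre-client pruning,累计 NCS
    val_acc = rng.uniform(0.5, 0.9, n_clients)  # 占位:各客户端在干净验证集上的精度
    worst = np.argsort(val_acc)[: int(prune_frac * n_clients)]
    ncs[worst] += 1                             # 表现最差者记一次“嫌疑”

k = int(prune_frac * n_clients)                 # 阶段 2:剪掉 NCS 最高的一批客户端
pruned = set(np.argsort(-ncs)[:k].tolist())
clean_clients = [c for c in range(n_clients) if c not in pruned]
print("pruned clients:", sorted(pruned))
# 阶段 3:在 clean_clients 上用标准 FL 微调全局模型(此处从略)
```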

[AI-65] Data-Driven Analysis of AI in Medical Device Software in China: Deep Learning and General AI Trends Based on Regulatory Data

链接: https://arxiv.org/abs/2411.07378
作者: Yu Han,Aaron Ceross,Sarim Ather,Jeroen H.M. Bergmann
关键词-EN: attracting increasing attention, Artificial intelligence, Medical Products Administration, transformative clinical technology, National Medical Products
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial intelligence (AI) in medical device software (MDSW) represents a transformative clinical technology, attracting increasing attention from both the medical community and regulators. In this study, we leverage a data-driven approach to automatically extract and analyze AI-enabled medical devices (AIMD) from the National Medical Products Administration (NMPA) regulatory database. The continued increase in publicly available regulatory data requires scalable methods for analysis. Automation of regulatory information screening is essential to create reproducible insights that can be quickly updated in an ever-changing medical device landscape. More than 4 million entries were assessed, identifying 2,174 MDSW registrations, including 531 standalone applications and 1,643 integrated within medical devices, of which 43 were AI-enabled. It was shown that the leading medical specialties utilizing AIMD include respiratory (20.5%), ophthalmology/endocrinology (12.8%), and orthopedics (10.3%). This approach greatly improves the speed of data extraction, providing a greater ability to compare and contrast. This study provides the first extensive, data-driven exploration of AIMD in China, showcasing the potential of automated regulatory data analysis in understanding and advancing the landscape of AI in medical technology.

[AI-66] Warmstarting for Scaling Language Models

链接: https://arxiv.org/abs/2411.07340
作者: Neeratyoy Mallik,Maciej Janowski,Johannes Hog,Herilalaina Rakotoarison,Aaron Klein,Josif Grabocka,Frank Hutter
关键词-EN: language models paradigm, current large language, Scaling model sizes, performance has worked, worked remarkably
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Scaling model sizes to scale performance has worked remarkably well for the current large language models paradigm. The research and empirical findings of various scaling studies led to novel scaling results and laws that guide subsequent research. High training costs for contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups. One direction to ameliorate the cost of pretraining large models is to warmstart the large-scale training from smaller models that are cheaper to tune. In this work, we attempt to understand if the behavior of optimal hyperparameters can be retained under warmstarting for scaling. We explore simple operations that allow the application of theoretically motivated methods of zero-shot transfer of optimal hyperparameters using \muTransfer. We investigate the aspects that contribute to the speedup in convergence and the preservation of stable training dynamics under warmstarting with \muTransfer. We find that shrinking smaller model weights, zero-padding, and perturbing the resulting larger model with scaled initialization from \muP enable effective warmstarting of \muTransfer.
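
下面给出摘要中提到的一种 warmstart 操作(“缩小小模型权重并零填充到大模型形状”)的极简 PyTorch 示意(作者自拟);缩放系数等均为说明性假设,并非论文的具体规则。

```python
# 示意:将小模型权重缩放后零填充进大模型的初始化张量(非论文实现)
import torch

def warmstart_pad(w_small: torch.Tensor, big_shape, shrink=0.5) -> torch.Tensor:
    """把 shrink * w_small 放进全零大张量的左上角区域。"""
    w_big = torch.zeros(big_shape)
    idx = tuple(slice(0, s) for s in w_small.shape)
    w_big[idx] = shrink * w_small
    return w_big

w = torch.randn(128, 128)           # 来自便宜易调的小模型的权重
w0 = warmstart_pad(w, (512, 512))   # 作为大模型对应层的初始化
print(w0.shape, w0[:128, :128].std().item(), w0[128:, 128:].abs().max().item())
```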

[AI-67] Multimodal Fusion Balancing Through Game-Theoretic Regularization

链接: https://arxiv.org/abs/2411.07335
作者: Konstantinos Kontras,Thomas Strypsteen,Christos Chatzichristos,Paul P. Liang,Matthew Blaschko,Maarten De Vos
关键词-EN: uncovering key dependencies, data sources, complete the picture, dependencies between data, Multimodal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Multimedia (cs.MM)
*备注: 21 pages, 6 figures, 4 tables, 1 algorithm

点击查看摘要

Abstract:Multimodal learning can complete the picture of information extraction by uncovering key dependencies between data sources. However, current systems fail to fully leverage multiple modalities for optimal performance. This has been attributed to modality competition, where modalities strive for training resources, leaving some underoptimized. We show that current balancing methods struggle to train multimodal models that surpass even simple baselines, such as ensembles. This raises the question: how can we ensure that all modalities in multimodal training are sufficiently trained, and that learning from new modalities consistently improves performance? This paper proposes the Multimodal Competition Regularizer (MCR), a new loss component inspired by mutual information (MI) decomposition designed to prevent the adverse effects of competition in multimodal training. Our key contributions are: 1) Introducing game-theoretic principles in multimodal learning, where each modality acts as a player competing to maximize its influence on the final outcome, enabling automatic balancing of the MI terms. 2) Refining lower and upper bounds for each MI term to enhance the extraction of task-relevant unique and shared information across modalities. 3) Suggesting latent space permutations for conditional MI estimation, significantly improving computational efficiency. MCR outperforms all previously suggested training strategies and is the first to consistently improve multimodal learning beyond the ensemble baseline, clearly demonstrating that combining modalities leads to significant performance gains on both synthetic and large real-world datasets.

[AI-68] Harnessing Smartphone Sensors for Enhanced Road Safety: A Comprehensive Dataset and Review

链接: https://arxiv.org/abs/2411.07315
作者: Amith Khandakar,David G. Michelson,Mansura Naznine,Abdus Salam,Md. Nahiduzzaman,Khaled M. Khan,Ponnuthurai Nagaratnam Suganthan,Mohamed Arselene Ayari,Hamid Menouar,Julfikar Haider
关键词-EN: Severe collisions, poor road conditions, collisions can result, result from aggressive, effective monitoring
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 29 pages, 14 figures, journal paper, submitted to Scientific Data Journal

点击查看摘要

Abstract:Severe collisions can result from aggressive driving and poor road conditions, emphasizing the need for effective monitoring to ensure safety. Smartphones, with their array of built-in sensors, offer a practical and affordable solution for road-sensing. However, the lack of reliable, standardized datasets has hindered progress in assessing road conditions and driving patterns. This study addresses this gap by introducing a comprehensive dataset derived from smartphone sensors, which surpasses existing datasets by incorporating a diverse range of sensors including accelerometer, gyroscope, magnetometer, GPS, gravity, orientation, and uncalibrated sensors. These sensors capture extensive parameters such as acceleration force, gravitation, rotation rate, magnetic field strength, and vehicle speed, providing a detailed understanding of road conditions and driving behaviors. The dataset is designed to enhance road safety, infrastructure maintenance, traffic management, and urban planning. By making this dataset available to the community, the study aims to foster collaboration, inspire further research, and facilitate the development of innovative solutions in intelligent transportation systems.

[AI-69] X-DFS: Explainable Artificial Intelligence Guided Design-for-Security Solution Space Exploration

链接: https://arxiv.org/abs/2411.07308
作者: Tanzim Mahfuz,Swarup Bhunia,Prabuddha Chakraborty
关键词-EN: involving diverse entities, integrated circuits predominantly, semiconductor supply chain, globally distributed semiconductor, chain involving diverse
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Design and manufacturing of integrated circuits predominantly use a globally distributed semiconductor supply chain involving diverse entities. The modern semiconductor supply chain has been designed to boost production efficiency, but is filled with major security concerns such as malicious modifications (hardware Trojans), reverse engineering (RE), and cloning. While being deployed, digital systems are also subject to a plethora of threats such as power, timing, and electromagnetic (EM) side-channel attacks. Many Design-for-Security (DFS) solutions have been proposed to deal with these vulnerabilities, and such DFS solutions rely on strategic modifications (e.g., logic locking, side-channel-resilient masking, and dummy logic insertion) of the digital designs to ensure a higher level of security. However, most of these DFS strategies lack robust formalism, are often not human-understandable, and require an extensive amount of human expert effort during their development/use. All of these factors make it difficult to keep up with the ever-growing number of microelectronic vulnerabilities. In this work, we propose X-DFS, an explainable Artificial Intelligence (AI) guided DFS solution-space exploration approach that can dramatically cut down the mitigation strategy development/use time while enriching our understanding of the vulnerability by providing human-understandable decision rationale. We implement X-DFS and comprehensively evaluate it for reverse engineering threats (SAIL, SWEEP, and OMLA) and formalize a generalized mechanism for applying X-DFS to defend against other threats such as hardware Trojans, fault attacks, and side-channel attacks for seamless future extensions.

[AI-70] Artificial Intelligence Ecosystem for Automating Self-Directed Teaching

链接: https://arxiv.org/abs/2411.07300
作者: Tejas Satish Gotavade
关键词-EN: innovative artificial intelligence-driven, artificial intelligence-driven educational, automated teaching assistance, intelligence-driven educational concept, educational concept designed
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 13 pages, 15 figures, 12 references and 1 table

点击查看摘要

Abstract:This research introduces an innovative artificial intelligence-driven educational concept designed to optimize self-directed learning through personalized course delivery and automated teaching assistance. The system leverages fine-tuned AI models to create an adaptive learning environment that encompasses customized roadmaps, automated presentation generation, and three-dimensional modeling for complex concept visualization. By integrating real-time virtual assistance for doubt resolution, the platform addresses the immediate educational needs of learners while promoting autonomous learning practices. This study explores the psychological advantages of self-directed learning and demonstrates how AI automation can enhance educational outcomes through personalized content delivery and interactive support mechanisms. The research contributes to the growing field of educational technology by presenting a comprehensive framework that combines automated content generation, visual learning aids, and intelligent tutoring to create an efficient, scalable solution for modern educational needs. Preliminary findings suggest that this approach not only accommodates diverse learning styles but also strengthens student engagement and knowledge retention through its emphasis on self-paced, independent learning methodologies.

[AI-71] Multi-hop Upstream Preemptive Traffic Signal Control with Deep Reinforcement Learning

链接: https://arxiv.org/abs/2411.07271
作者: Xiaocan Li,Xiaoyu Wang,Ilia Smirnov,Scott Sanner,Baher Abdulhai
关键词-EN: crucial for managing, upstream, urban networks, upstream links, Traffic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Probability (math.PR)
*备注: 5 tables, 12 figures. arXiv admin note: text overlap with arXiv:2409.00753

点击查看摘要

Abstract:Traffic signal control is crucial for managing congestion in urban networks. Existing myopic pressure-based control methods focus only on immediate upstream links, leading to suboptimal green time allocation and increased network delays. Effective signal control, however, inherently requires a broader spatial scope, as traffic conditions further upstream can significantly impact traffic at the current location. This paper introduces a novel concept based on the Markov chain theory, namely multi-hop upstream pressure, that generalizes the conventional pressure to account for traffic conditions beyond the immediate upstream links. This farsighted and compact metric informs the deep reinforcement learning agent to preemptively clear the present queues, guiding the agent to optimize signal timings with a broader spatial awareness. Simulations on synthetic and realistic (Toronto) scenarios demonstrate controllers utilizing multi-hop upstream pressure significantly reduce overall network delay by prioritizing traffic movements based on a broader understanding of upstream congestion.
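
下面用一段极简的 NumPy 草稿示意“多跳上游压力”的一种可能实现:用转移概率矩阵的幂对 k 跳上游排队量加权求和,再与流出排队量做差(作者自拟;衰减系数、跳数与矩阵取值均为说明性假设,并非论文定义)。

```python
# 示意:基于马尔可夫转移矩阵幂的多跳上游压力(非论文实现)
import numpy as np

def multi_hop_pressure(q_in, q_out, P, hops=3, gamma=0.5):
    """q_in/q_out: 各路段入/出向排队量; P[i, j]: 路段 i 的车流转向路段 j 的概率。"""
    upstream = np.zeros_like(q_in, dtype=float)
    Pk = np.eye(len(q_in))
    for k in range(1, hops + 1):
        Pk = Pk @ P                                 # k 步可达性
        upstream += (gamma ** k) * (Pk.T @ q_in)    # k 跳上游将到达本路段的排队质量
    return (q_in + upstream) - q_out                # 压力 = 加权流入 - 流出

q_in = np.array([8.0, 3.0, 5.0])
q_out = np.array([2.0, 6.0, 1.0])
P = np.array([[0.0, 0.7, 0.3],
              [0.2, 0.0, 0.8],
              [0.5, 0.5, 0.0]])
print(multi_hop_pressure(q_in, q_out, P))
```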

[AI-72] Learning From Graph-Structured Data: Addressing Design Issues and Exploring Practical Applications in Graph Representation Learning

链接: https://arxiv.org/abs/2411.07269
作者: Chenqing Hua
关键词-EN: Graph Neural Networks, interacting elements, Neural Networks, social networks, graph representation learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2205.11691 , arXiv:2304.14621

点击查看摘要

Abstract:Graphs serve as fundamental descriptors for systems composed of interacting elements, capturing a wide array of data types, from molecular interactions to social networks and knowledge graphs. In this paper, we present an exhaustive review of the latest advancements in graph representation learning and Graph Neural Networks (GNNs). GNNs, tailored to handle graph-structured data, excel in deriving insights and predictions from intricate relational information, making them invaluable for tasks involving such data. Graph representation learning, a pivotal approach in analyzing graph-structured data, facilitates numerous downstream tasks and applications across machine learning, data mining, biomedicine, and healthcare. Our work delves into the capabilities of GNNs, examining their foundational designs and their application in addressing real-world challenges. We introduce a GNN equipped with an advanced high-order pooling function, adept at capturing complex node interactions within graph-structured data. This pooling function significantly enhances the GNN's efficacy in both node- and graph-level tasks. Additionally, we propose a molecular graph generative model with a GNN as its core framework. This GNN backbone is proficient in learning invariant and equivariant molecular characteristics. Employing these features, the molecular graph generative model is capable of simultaneously learning and generating molecular graphs with atom-bond structures and precise atom positions. Our models undergo thorough experimental evaluations and comparisons with established methods, showcasing their superior performance in addressing diverse real-world challenges with various datasets.

[AI-73] A Survey on Data Markets

链接: https://arxiv.org/abs/2411.07267
作者: Jiayao Zhang,Yuran Bi,Mengye Cheng,Jinfei Liu,Kui Ren,Qiheng Sun,Yihang Wu,Yang Cao,Raul Castro Fernandez,Haifeng Xu,Ruoxi Jia,Yongchan Kwon,Jian Pei,Jiachen T. Wang,Haocheng Xia,Li Xiong,Xiaohui Yu,James Zou
关键词-EN: Data, data markets, markets, data products including, trading data
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Data is the new oil of the 21st century. The growing trend of trading data for greater welfare has led to the emergence of data markets. A data market is any mechanism whereby the exchange of data products including datasets and data derivatives takes place as a result of data buyers and data sellers being in contact with one another, either directly or through mediating agents. It serves as a coordinating mechanism by which several functions, including the pricing and the distribution of data as the most important ones, interact to make the value of data fully exploited and enhanced. In this article, we present a comprehensive survey of this important and emerging direction from the aspects of data search, data productization, data transaction, data pricing, revenue allocation as well as privacy, security, and trust issues. We also investigate the government policies and industry status of data markets across different countries and different domains. Finally, we identify the unresolved challenges and discuss possible future directions for the development of data markets.

[AI-74] Navigating AI in Social Work and Beyond: A Multidisciplinary Review

链接: https://arxiv.org/abs/2411.07245
作者: Matt Victor Dalziel,Krystal Schaffer,Neil Martin
关键词-EN: work profession engages, artificial intelligence, modest goal, goal of drafting, profession engages
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 30 pages

点击查看摘要

Abstract:This review began with the modest goal of drafting a brief commentary on how the social work profession engages with and is impacted by artificial intelligence (AI). However, it quickly became apparent that a deeper exploration was required to adequately capture the profound influence of AI, one of the most transformative and debated innovations in modern history. As a result, this review evolved into an interdisciplinary endeavour, gathering seminal texts, critical articles, and influential voices from across industries and academia. This review aims to provide a comprehensive yet accessible overview, situating AI within broader societal and academic conversations as 2025 dawns. We explore perspectives from leading tech entrepreneurs, cultural icons, CEOs, and politicians alongside the pioneering contributions of AI engineers, innovators, and academics from fields as diverse as mathematics, sociology, philosophy, economics, and more. This review also briefly analyses AI’s real-world impacts, ethical challenges, and implications for social work. It presents a vision for AI-facilitated simulations that could transform social work education through Advanced Personalised Simulation Training (APST). This tool uses AI to tailor high-fidelity simulations to individual student needs, providing real-time feedback and preparing them for the complexities of their future practice environments. We maintain a critical tone throughout, balancing our awe of AI’s remarkable advancements with necessary caution. As AI continues to permeate every professional realm, understanding its subtleties, challenges, and opportunities becomes essential. Those who fully grasp the intricacies of this technology will be best positioned to navigate the impending AI Era.

[AI-75] A Tutorial on Teaching Data Analytics with Generative AI

链接: https://arxiv.org/abs/2411.07244
作者: Robert L. Bray
关键词-EN: large language models, incorporating large language, language models, addresses the challenge, challenge of incorporating
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This tutorial addresses the challenge of incorporating large language models (LLMs), such as ChatGPT, in a data analytics class. It details several new in-class and out-of-class teaching techniques enabled by AI. For example, instructors can parallelize instruction by having students interact with different custom-made GPTs to learn different parts of an analysis and then teach each other what they learned from their AIs. As another example, instructors can turn problem sets into AI tutoring sessions, whereby a custom-made GPT guides a student through the problems, and the student uploads the chatlog for their homework submission. As a third example, you can assign different labs to each section of your class and have each section create AI assistants to help the other sections work through their labs. This tutorial advocates the "programming in English" paradigm, in which students express the desired data transformations in prose and then use AI to generate the corresponding code. Students can wrangle data more effectively by programming in English than by manipulating data in Excel. However, some students will program in English better than others, so you will still derive a robust grade distribution (at least with current LLMs).

[AI-76] Barriers to Complexity-Theoretic Proofs that Achieving AGI Using Machine Learning is Intractable

链接: https://arxiv.org/abs/2411.06498
作者: Michael Guerzhoy
关键词-EN: achieving human-like intelligence, van Rooij, recent paper, complexity-theoretic sense, proved that achieving
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注:

点击查看摘要

Abstract:A recent paper (van Rooij et al. 2024) claims to have proved that achieving human-like intelligence using learning from data is intractable in a complexity-theoretic sense. We identify that the proof relies on an unjustified assumption about the distribution of (input, output) pairs to the system. We briefly discuss that assumption in the context of two fundamental barriers to repairing the proof: the need to precisely define "human-like," and the need to account for the fact that a particular machine learning system will have particular inductive biases that are key to the analysis.

[AI-77] DINO-LG: A Task-Specific DINO Model for Coronary Calcium Scoring

链接: https://arxiv.org/abs/2411.07976
作者: Mahmut S. Gokmen,Cody Bumgardner,Caner Ozcan
关键词-EN: Coronary artery disease, Coronary artery, prevent coronary disease, Coronary artery calcium, CAC
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Developed by Center for Applied Artificial Intelligence (CAAI), University of Kentucky

点击查看摘要

Abstract:Coronary artery disease (CAD) is one of the most common causes of mortality in the world. Coronary artery calcium (CAC) scoring using computed tomography (CT) is key for risk assessment to prevent coronary disease. Previous studies on risk assessment and calcification detection in CT scans primarily use approaches based on the UNET architecture, frequently implemented on pre-built models. However, these models are limited by the scarcity of annotated CT scans containing CAC and suffer from imbalanced datasets, decreasing the performance of CAC segmentation and scoring. In this study, we extend this approach by incorporating the self-supervised learning (SSL) technique of DINO (self-distillation with no labels) to eliminate the limitations of scarce annotated data in CT scans. The DINO model's ability to train without requiring CAC area annotations enhances its robustness in generating distinct features. The DINO model is then trained to focus specifically on calcified areas by using labels, aiming to generate features that effectively capture and highlight key characteristics. The label-guided DINO (DINO-LG) enhances classification by distinguishing CT slices that contain calcification from those that do not, performing 57% better than the standard DINO model in this task. CAC scoring and segmentation tasks are performed by a basic U-NET architecture, fed specifically with CT slices containing calcified areas as identified by the DINO-LG model. This targeted identification performed by the DINO-LG model improves CAC segmentation performance by approximately 10% and significantly increases CAC scoring accuracy.

[AI-78] DuoLift-GAN:Reconstructing CT from Single-view and Biplanar X-Rays with Generative Adversarial Networks

链接: https://arxiv.org/abs/2411.07941
作者: Zhaoxi Zhang,Yueliang Ying
关键词-EN: highly detailed three-dimensional, Computed tomography, detailed three-dimensional, intraoperative settings, highly detailed
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Computed tomography (CT) provides highly detailed three-dimensional (3D) medical images but is costly, time-consuming, and often inaccessible in intraoperative settings (Organization et al. 2011). Recent advancements have explored reconstructing 3D chest volumes from sparse 2D X-rays, such as single-view or orthogonal double-view images. However, current models tend to process 2D images in a planar manner, prioritizing visual realism over structural accuracy. In this work, we introduce DuoLift Generative Adversarial Networks (DuoLift-GAN), a novel architecture with dual branches that independently elevate 2D images and their features into 3D representations. These 3D outputs are merged into a unified 3D feature map and decoded into a complete 3D chest volume, enabling richer 3D information capture. We also present a masked loss function that directs reconstruction towards critical anatomical regions, improving structural accuracy and visual quality. This paper demonstrates that DuoLift-GAN significantly enhances reconstruction accuracy while achieving superior visual realism compared to existing methods.
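
下面是摘要中“掩码损失”思路的一段极简 PyTorch 示意(作者自拟):对关键解剖区域内的重建误差加大权重;权重取值与损失形式均为说明性假设,并非论文的具体定义。

```python
# 示意:面向关键解剖区域的加权(掩码)重建损失(非论文实现)
import torch

def masked_l1(pred, target, mask, w_in=5.0, w_out=1.0):
    """pred/target: (B, 1, D, H, W) 体数据; mask: 同形状,关键区域内为 1,其余为 0。"""
    w = w_in * mask + w_out * (1.0 - mask)
    return (w * (pred - target).abs()).mean()

pred = torch.rand(2, 1, 16, 64, 64)
target = torch.rand(2, 1, 16, 64, 64)
mask = (torch.rand(2, 1, 16, 64, 64) > 0.8).float()
print(masked_l1(pred, target, mask))
```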

[AI-79] AI enhanced diagnosis of Peyronies disease a novel approach using Computer Vision

链接: https://arxiv.org/abs/2411.07684
作者: Yudara Kularathne,Janitha Prathapa,Prarththanan Sothyrajah,Salomi Arasaratnam,Sithira Ambepitiya,Thanveer Ahamed,Dinuka Wijesundara
关键词-EN: diagnosing Peyronie Disease, Peyronie Disease, innovative AI-driven tool, diagnosing Peyronie, men worldwide
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 6 figures, 4 tables

点击查看摘要

Abstract:This study presents an innovative AI-driven tool for diagnosing Peyronie’s Disease (PD), a condition that affects between 0.3% and 13.1% of men worldwide. Our method uses key point detection on both images and videos to measure penile curvature angles, utilizing advanced computer vision techniques. This tool has demonstrated high accuracy in identifying anatomical landmarks, validated against conventional goniometer measurements. Traditional PD diagnosis often involves subjective and invasive methods, which can lead to patient discomfort and inaccuracies. Our approach offers a precise, reliable, and non-invasive diagnostic tool to address these drawbacks. The model distinguishes between PD and normal anatomical changes with a sensitivity of 96.7% and a specificity of 100%. This advancement represents a significant improvement in urological diagnostics, greatly enhancing the efficacy and convenience of PD assessment for healthcare providers and patients.
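
下面用一段极简的 NumPy 示意说明“由关键点计算弯曲角度”的几何原理(作者自拟):取基部、最大弯曲点与末端三个假设的关键点,由两段轴向向量的夹角得到弯曲角;关键点检测模型本身不在此示意范围内。

```python
# 示意:三个关键点 -> 弯曲角度(几何部分,非论文实现)
import numpy as np

def curvature_angle(base, bend, tip):
    v1 = np.asarray(bend) - np.asarray(base)   # 基部到弯曲点的轴向向量
    v2 = np.asarray(tip) - np.asarray(bend)    # 弯曲点到末端的轴向向量
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # 0 度表示笔直

print(curvature_angle((0, 0), (0, 5), (3, 9)))  # 约 36.9 度的弯曲
```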

[AI-80] Reinforcement Learning Framework for Quantitative Trading

链接: https://arxiv.org/abs/2411.07585
作者: Alhassan S. Yasin,Prabdeep S. Gill
关键词-EN: integrates risk management, stock market underscore, risk management strategies, financial stock market, inherent volatility
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
*备注: 8 pages, 9 figures, 3 tables, accepted at ICAIF 2024 FM4TS Workshop

点击查看摘要

Abstract:The inherent volatility and dynamic fluctuations within the financial stock market underscore the necessity for investors to employ a comprehensive and reliable approach that integrates risk management strategies, market trends, and the movement trends of individual securities. By evaluating specific data, investors can make more informed decisions. However, the current body of literature lacks substantial evidence supporting the practical efficacy of reinforcement learning (RL) agents, as many models have only demonstrated success in backtesting using historical data. This highlights the urgent need for a more advanced methodology capable of addressing these challenges. There is a significant disconnect in the effective utilization of financial indicators to better understand the potential market trends of individual securities. The disclosure of successful trading strategies is often restricted within financial markets, resulting in a scarcity of widely documented and published strategies leveraging RL. Furthermore, current research frequently overlooks the identification of financial indicators correlated with various market trends and their potential advantages. This research endeavors to address these complexities by enhancing the ability of RL agents to effectively differentiate between positive and negative buy/sell actions using financial indicators. While we do not address all concerns, this paper provides deeper insights and commentary on the utilization of technical indicators and their benefits within reinforcement learning. This work establishes a foundational framework for further exploration and investigation of more complex scenarios.

[AI-81] Firing Rate Models as Associative Memory: Excitatory-Inhibitory Balance for Robust Retrieval

链接: https://arxiv.org/abs/2411.07388
作者: Simone Betteti,Giacomo Baggio,Francesco Bullo,Sandro Zampieri
关键词-EN: describe local cortical, Firing rate models, local cortical dynamics, dynamical systems widely, Firing rate
类目: Neurons and Cognition (q-bio.NC); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
*备注: 20 pages, 7 figures

点击查看摘要

Abstract:Firing rate models are dynamical systems widely used in applied and theoretical neuroscience to describe local cortical dynamics in neuronal populations. By providing a macroscopic perspective of neuronal activity, these models are essential for investigating oscillatory phenomena, chaotic behavior, and associative memory processes. Despite their widespread use, the application of firing rate models to associative memory networks has received limited mathematical exploration, and most existing studies are focused on specific models. Conversely, well-established associative memory designs, such as Hopfield networks, lack key biologically-relevant features intrinsic to firing rate models, including positivity and interpretable synaptic matrices that reflect excitatory and inhibitory interactions. To address this gap, we propose a general framework that ensures the emergence of re-scaled memory patterns as stable equilibria in the firing rate dynamics. Furthermore, we analyze the conditions under which the memories are locally and globally asymptotically stable, providing insights into constructing biologically-plausible and robust systems for associative memory retrieval.
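
作为参考,下面给出一类常见的发放率(firing rate)模型的动力学方程(示意性写法,未必与论文的具体系统一致):非负激活函数 \Phi 保证发放率取值非负,兴奋/抑制分解使突触矩阵保持可解释性,这正是摘要所强调的生物相关特性。

```latex
\tau\,\dot{x}_i(t) = -x_i(t) + \Phi\Big(\sum_{j=1}^{n} W_{ij}\,x_j(t) + b_i\Big),
\qquad \Phi(u) = \max(u, 0),
\qquad W = W_E - W_I,\ \ W_E,\,W_I \ge 0.
```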

[AI-82] Ensemble Learning for Microbubble Localization in Super-Resolution Ultrasound

链接: https://arxiv.org/abs/2411.07376
作者: Sepideh K. Gharamaleki,Brandon Helfield,Hassan Rivaz
关键词-EN: high spatial resolution, powerful imaging technique, spatial resolution, Super-resolution ultrasound, powerful imaging
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*备注:

点击查看摘要

Abstract:Super-resolution ultrasound (SR-US) is a powerful imaging technique for capturing microvasculature and blood flow at high spatial resolution. However, accurate microbubble (MB) localization remains a key challenge, as errors in localization can propagate through subsequent stages of the super-resolution process, affecting overall performance. In this paper, we explore the potential of ensemble learning techniques to enhance MB localization by increasing detection sensitivity and reducing false positives. Our study evaluates the effectiveness of ensemble methods on both in vivo and simulated outputs of a Deformable DEtection TRansformer (Deformable DETR) network. As a result of our study, we are able to demonstrate the advantages of these ensemble approaches by showing improved precision and recall in MB detection and offering insights into their application in SR-US.

[AI-83] High quality ECG dataset based on MIT-BIH recordings for improved heartbeats classification

链接: https://arxiv.org/abs/2411.07252
作者: Ahmed.S Benmessaoud,Farida Medjani,Yahia Bousseloub,Khalid Bouaita,Dhia Benrahem,Tahar Kezai
关键词-EN: diagnose abnormal heart, abnormal heart waves, cardiovascular diseases, reliable tool, tool for medical
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 4 pages, 5 figures, 5 tables, presented at IEEE COINS 2023, Berlin. Link to IEEE Xplore: this https URL

点击查看摘要

Abstract:Electrocardiogram (ECG) is a reliable tool for medical professionals to detect and diagnose abnormal heart waves that may cause cardiovascular diseases. This paper proposes a methodology to create a new high-quality heartbeat dataset from all 48 of the MIT-BIH recordings. The proposed approach computes an optimal heartbeat size by eliminating outliers and calculating the mean value over 10-second windows. This results in independent QRS-centered heartbeats, avoiding the problem of mixing successive heartbeats. The quality of the newly constructed dataset has been evaluated and compared with existing datasets. To this end, we built and trained a PyTorch 1-D Resnet architecture model that achieved 99.24% accuracy, a 5.7% improvement compared to other methods. Additionally, downsampling the dataset improved the model's execution time by 33% and reduced memory usage by 3x.
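
下面给出“在 10 秒窗口内剔除离群值求平均心拍长度,再以 QRS 为中心切分”这一流程的极简 Python 示意(作者自拟):采样率取 MIT-BIH 的 360 Hz,信号与 R 峰标注均为占位数据,离群值判据为说明性假设。

```python
# 示意:估计心拍窗口长度并切出 QRS 居中的独立心拍(非论文代码)
import numpy as np

fs = 360                                   # MIT-BIH 采样率 (Hz)
sig = np.random.randn(fs * 60)             # 占位:一段 ECG 信号
r_peaks = np.arange(fs, len(sig) - fs, int(0.8 * fs))  # 占位:R 峰标注

def beat_size(r_peaks, fs, win_s=10):
    sizes = []
    for start in range(0, int(r_peaks[-1]), win_s * fs):
        rr = np.diff(r_peaks[(r_peaks >= start) & (r_peaks < start + win_s * fs)])
        if len(rr):
            rr = rr[np.abs(rr - np.median(rr)) < 2 * rr.std() + 1e-9]  # 剔除离群 RR
            if len(rr):
                sizes.append(rr.mean())    # 窗口内平均 RR 间期
    return int(np.mean(sizes))

half = beat_size(r_peaks, fs) // 2
beats = np.stack([sig[r - half: r + half] for r in r_peaks
                  if r - half >= 0 and r + half <= len(sig)])
print(beats.shape)   # (心拍数, 心拍长度),QRS 居中、互不混叠
```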

[AI-84] Neuropsychology and Explainability of AI: A Distributional Approach to the Relationship Between Activation Similarity of Neural Categories in Synthetic Cognition

链接: https://arxiv.org/abs/2411.07243
作者: Michael Pichat,Enola Campoli,William Pogrund,Jourdan Wilson,Michael Veillet-Guillem,Anton Melkozerov,Paloma Pichat,Armanush Gasparian,Samuel Demarchi,Judicael Poumay
关键词-EN: artificial neural networks, human cognitive psychology, relevant heuristic references, developing synthetic explanatory, synthetic explanatory frameworks
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:We propose a neuropsychological approach to the explainability of artificial neural networks, which involves using concepts from human cognitive psychology as relevant heuristic references for developing synthetic explanatory frameworks that align with human modes of thought. The analogical concepts mobilized here, which are intended to create such an epistemological bridge, are those of categorization and similarity, as these notions are particularly suited to the categorical “nature” of the reconstructive information processing performed by artificial neural networks. Our study aims to reveal a unique process of synthetic cognition, that of the categorical convergence of highly activated tokens. We attempt to explain this process with the idea that the categorical segment created by a neuron is actually the result of a superposition of categorical sub-dimensions within its input vector space.

计算机视觉

[CV-0] Material Transforms from Disentangled NeRF Representations

链接: https://arxiv.org/abs/2411.08037
作者: Ivan Lopes,Jean-François Lalonde,Raoul de Charette
关键词-EN: Neural Radiance Field, Reflectance Distribution Functions, Bidirectional Reflectance Distribution, disentangled Neural Radiance, map Bidirectional Reflectance
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:In this paper, we first propose a novel method for transferring material transformations across different scenes. Building on disentangled Neural Radiance Field (NeRF) representations, our approach learns to map Bidirectional Reflectance Distribution Functions (BRDF) from pairs of scenes observed in varying conditions, such as dry and wet. The learned transformations can then be applied to unseen scenes with similar materials, therefore effectively rendering the transformation learned with an arbitrary level of intensity. Extensive experiments on synthetic scenes and real-world objects validate the effectiveness of our approach, showing that it can learn various transformations such as wetness, painting, coating, etc. Our results highlight not only the versatility of our method but also its potential for practical applications in computer graphics. We publish our method implementation, along with our synthetic/real datasets on this https URL

[CV-1] Artistic Neural Style Transfer Algorithms with Activation Smoothing

链接: https://arxiv.org/abs/2411.08014
作者: Xiangtian Li,Han Cao,Zhaoyang Zhang,Jiacheng Hu,Yuhui Jin,Zihao Zhao
关键词-EN: Convolutional Neural Networks, Neural Style Transfer, Neural Networks, creating artistic style, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 8 pages,7 figures

点击查看摘要

Abstract:The works of Gatys et al. demonstrated the capability of Convolutional Neural Networks (CNNs) in creating artistic style images. This process of rendering content images in different styles is called Neural Style Transfer (NST). In this paper, we re-implement image-based NST, fast NST, and arbitrary NST. We also explore utilizing ResNet with activation smoothing in NST. Extensive experimental results demonstrate that smoothing transformation can greatly improve the quality of stylization results.
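
作为参考,下面是图像式 NST 核心的 Gram 矩阵风格损失的极简 PyTorch 示意(特征此处用随机张量占位;实际应取自 VGG 或 ResNet 等网络的中间层激活)。

```python
# 示意:Gram 矩阵风格损失(Gatys et al. 的核心部件,特征为占位数据)
import torch

def gram(feat):                       # feat: (B, C, H, W)
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # 通道间相关性,归一化

def style_loss(feat_gen, feat_style):
    return ((gram(feat_gen) - gram(feat_style)) ** 2).mean()

fg = torch.randn(1, 64, 32, 32)       # 占位:生成图的某层特征
fs_ = torch.randn(1, 64, 32, 32)      # 占位:风格图的同层特征
print(style_loss(fg, fs_))
```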

[CV-2] SimBase: A Simple Baseline for Temporal Video Grounding

链接: https://arxiv.org/abs/2411.07945
作者: Peijun Bao,Alex C. Kot
关键词-EN: temporal video grounding, paper presents SimBase, temporal, paper presents, video grounding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical report

点击查看摘要

Abstract:This paper presents SimBase, a simple yet effective baseline for temporal video grounding. While recent advances in temporal grounding have led to impressive performance, they have also driven network architectures toward greater complexity, with a range of methods to (1) capture temporal relationships and (2) achieve effective multimodal fusion. In contrast, this paper explores the question: How effective can a simplified approach be? To investigate, we design SimBase, a network that leverages lightweight, one-dimensional temporal convolutional layers instead of complex temporal structures. For cross-modal interaction, SimBase only employs an element-wise product instead of intricate multimodal fusion. Remarkably, SimBase achieves state-of-the-art results on two large-scale datasets. As a simple yet powerful baseline, we hope SimBase will spark new ideas and streamline future evaluations in temporal video grounding.
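
下面按摘要描述勾勒 SimBase 的两个简化设计:一维时间卷积做时序建模、逐元素乘积做跨模态融合(作者自拟的极简 PyTorch 草稿,层数、维度与输出头均为说明性假设,并非论文网络)。

```python
# 示意:SimBase 风格的极简时序定位骨架(非论文实现)
import torch
import torch.nn as nn

class SimBaseSketch(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.temporal = nn.Sequential(            # 轻量一维时间卷积
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.ReLU())
        self.head = nn.Conv1d(d, 2, kernel_size=1)  # 每个片段的 start/end 打分

    def forward(self, video, query):
        # video: (B, d, T) 视频片段特征; query: (B, d) 句子特征
        fused = video * query.unsqueeze(-1)        # 跨模态融合:逐元素乘积
        return self.head(self.temporal(fused))    # (B, 2, T)

model = SimBaseSketch()
print(model(torch.randn(2, 256, 64), torch.randn(2, 256)).shape)
```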

[CV-3] Learning Disentangled Representations for Perceptual Point Cloud Quality Assessment via Mutual Information Minimization

链接: https://arxiv.org/abs/2411.07936
作者: Ziyu Shan,Yujie Zhang,Yipeng Liu,Yiling Xu
关键词-EN: Cloud Quality Assessment, human perceptual quality, No-Reference Point Cloud, Quality Assessment, pristine-quality point clouds
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:No-Reference Point Cloud Quality Assessment (NR-PCQA) aims to objectively assess the human perceptual quality of point clouds without relying on pristine-quality point clouds for reference. It is becoming increasingly significant with the rapid advancement of immersive media applications such as virtual reality (VR) and augmented reality (AR). However, current NR-PCQA models attempt to indiscriminately learn point cloud content and distortion representations within a single network, overlooking their distinct contributions to quality information. To address this issue, we propose DisPA, a novel disentangled representation learning framework for NR-PCQA. The framework trains a dual-branch disentanglement network to minimize mutual information (MI) between representations of point cloud content and distortion. Specifically, to fully disentangle representations, the two branches adopt different philosophies: the content-aware encoder is pretrained by a masked auto-encoding strategy, which can allow the encoder to capture semantic information from rendered images of distorted point clouds; the distortion-aware encoder takes a mini-patch map as input, which forces the encoder to focus on low-level distortion patterns. Furthermore, we utilize an MI estimator to estimate the tight upper bound of the actual MI and further minimize it to achieve explicit representation disentanglement. Extensive experimental results demonstrate that DisPA outperforms state-of-the-art methods on multiple PCQA datasets.

[CV-4] Isometric Transformations for Image Augmentation in Mueller Matrix Polarimetry

链接: https://arxiv.org/abs/2411.07918
作者: Christopher Hahne,Omar Rodriguez-Nunez,Éléa Gros,Théotim Lucas,Ekkehard Hewer,Tatiana Novikova,Theoni Maragkou,Philippe Schucht,Richard McKinley
关键词-EN: presenting unique challenges, polarized light interactions, Mueller matrix polarimetry, polarimetry captures essential, captures essential information
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
*备注: preprint

点击查看摘要

Abstract:Mueller matrix polarimetry captures essential information about polarized light interactions with a sample, presenting unique challenges for data augmentation in deep learning due to its distinct structure. While augmentations are an effective and affordable way to enhance dataset diversity and reduce overfitting, standard transformations like rotations and flips do not preserve the polarization properties in Mueller matrix images. To this end, we introduce a versatile simulation framework that applies physically consistent rotations and flips to Mueller matrices, tailored to maintain polarization fidelity. Our experimental results across multiple datasets reveal that conventional augmentations can lead to misleading results when applied to polarimetric data, underscoring the necessity of our physics-based approach. In our experiments, we first compare our polarization-specific augmentations against real-world captures to validate their physical consistency. We then apply these augmentations in a semantic segmentation task, achieving substantial improvements in model generalization and performance. This study underscores the necessity of physics-informed data augmentation for polarimetric imaging in deep learning (DL), paving the way for broader adoption and more robust applications across diverse research in the field. In particular, our framework unlocks the potential of DL models for polarimetric datasets with limited sample sizes. Our code implementation is available at this http URL.
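
下面示意 Mueller 矩阵图像旋转增广中“代数部分”的一种常见写法:样本旋转 θ 时,每个像素的 Mueller 矩阵需同时做 M' = R(θ) M R(-θ) 变换,其中 R 为 4×4 旋转 Mueller 矩阵。符号约定各文献不一,此处仅为说明性假设,并非论文框架本身;与之配套的图像网格空间旋转从略。

```python
# 示意:Mueller 矩阵的旋转一致变换(符号约定为假设,非论文代码)
import numpy as np

def rot_mueller(theta):
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return np.array([[1, 0, 0, 0],
                     [0, c, -s, 0],
                     [0, s,  c, 0],
                     [0, 0, 0, 1.0]])

def rotate_pixelwise(M_img, theta):
    """M_img: (H, W, 4, 4) 逐像素 Mueller 矩阵,仅做代数变换部分。"""
    R, Rinv = rot_mueller(theta), rot_mueller(-theta)
    return np.einsum('ab,hwbc,cd->hwad', R, M_img, Rinv)

M_img = np.tile(np.eye(4), (8, 8, 1, 1))
print(rotate_pixelwise(M_img, np.pi / 2)[0, 0])  # 单位矩阵对旋转不变
```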

[CV-5] TLDR: Traffic Light Detection using Fourier Domain Adaptation in Hostile WeatheR

链接: https://arxiv.org/abs/2411.07901
作者: Ishaan Gakhar,Aryesh Guha,Aryaman Gupta,Amit Agarwal,Durga Toshniwal,Ujjwal Verma
关键词-EN: present significant challenges, traffic light detection, conditions present significant, significant challenges, scarcity of comprehensive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review at IEEE Transactions of Artificial Intelligence. 10 Pages, 7 Figures

点击查看摘要

Abstract:The scarcity of comprehensive datasets in the traffic light detection and recognition domain and the poor performance of state-of-the-art models under hostile weather conditions present significant challenges. To address these issues, this paper proposes a novel approach by merging two widely used datasets, LISA and S2TLD. The merged dataset is further processed to tackle class imbalance, a common problem in this domain. This merged dataset becomes our source domain. Synthetic rain and fog are added to the dataset to create our target domain. We employ Fourier Domain Adaptation (FDA) to create a final dataset with a minimized domain gap between the two datasets, helping the model trained on this final dataset adapt to rainy and foggy weather conditions. Additionally, we explore Semi-Supervised Learning (SSL) techniques to leverage the available data more effectively. Experimental results demonstrate that models trained on FDA-augmented images outperform those trained without FDA across confidence-dependent and independent metrics, like mAP50, mAP50-95, Precision, and Recall. The best-performing model, YOLOv8, achieved a Precision increase of 5.1860%, Recall increase of 14.8009%, mAP50 increase of 9.5074%, and mAP50-95 increase of 19.5035%. On average, percentage increases of 7.6892% in Precision, 19.9069% in Recall, 15.8506% in mAP50, and 23.8099% in mAP50-95 were observed across all models, highlighting the effectiveness of FDA in mitigating the impact of adverse weather conditions on model performance. These improvements pave the way for real-world applications where reliable performance in challenging environmental conditions is critical.
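
下面给出 FDA(Fourier Domain Adaptation, Yang & Soatto 2020)低频幅度谱替换的极简 NumPy 实现示意:保留源图相位,仅将频谱中心的低频幅度换成目标域(雨/雾)图像的幅度;beta 控制替换频带大小,取值为说明性假设。

```python
# 示意:FDA 低频幅度谱交换(经典算法的精简写法,非论文官方代码)
import numpy as np

def fda(src, tgt, beta=0.05):
    """src/tgt: float (H, W, C) 图像,取值 [0, 1];beta 决定被替换的低频带宽。"""
    fs = np.fft.fftshift(np.fft.fft2(src, axes=(0, 1)), axes=(0, 1))
    ft = np.fft.fftshift(np.fft.fft2(tgt, axes=(0, 1)), axes=(0, 1))
    amp, pha = np.abs(fs), np.angle(fs)     # 源图的幅度与相位
    h, w = src.shape[:2]
    b, cy, cx = int(min(h, w) * beta), h // 2, w // 2
    amp[cy - b:cy + b, cx - b:cx + b] = np.abs(ft)[cy - b:cy + b, cx - b:cx + b]
    out = np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * pha), axes=(0, 1)),
                       axes=(0, 1))
    return np.clip(out.real, 0, 1)

src = np.random.rand(128, 128, 3)   # 占位:晴天帧(如 LISA/S2TLD)
tgt = np.random.rand(128, 128, 3)   # 占位:合成雨/雾帧
print(fda(src, tgt).shape)
```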

[CV-6] Rendering-Oriented 3D Point Cloud Attribute Compression using Sparse Tensor-based Transformer

链接: https://arxiv.org/abs/2411.07899
作者: Xiao Huo,Junhui Ho,Shuai Wan,Fuzheng Yang
关键词-EN: point cloud, reconstructed point clouds, point, digital content, techniques has fundamentally
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The evolution of 3D visualization techniques has fundamentally transformed how we interact with digital content. At the forefront of this change is point cloud technology, offering an immersive experience that surpasses traditional 2D representations. However, the massive data size of point clouds presents significant challenges in data compression. Current methods for lossy point cloud attribute compression (PCAC) generally focus on reconstructing the original point clouds with minimal error. However, for point cloud visualization scenarios, the reconstructed point clouds with distortion still need to undergo a complex rendering process, which affects the final user-perceived quality. In this paper, we propose an end-to-end deep learning framework that seamlessly integrates PCAC with differentiable rendering, denoted as rendering-oriented PCAC (RO-PCAC), directly targeting the quality of rendered multiview images for viewing. In a differentiable manner, the impact of the rendering process on the reconstructed point clouds is taken into account. Moreover, we characterize point clouds as sparse tensors and propose a sparse tensor-based transformer, called SP-Trans. By aligning with the local density of the point cloud and utilizing an enhanced local attention mechanism, SP-Trans captures the intricate relationships within the point cloud, further improving feature analysis and synthesis within the framework. Extensive experiments demonstrate that the proposed RO-PCAC achieves state-of-the-art compression performance, compared to existing reconstruction-oriented methods, including traditional, learning-based, and hybrid methods.

[CV-7] Joint multi-dimensional dynamic attention and transformer for general image restoration

链接: https://arxiv.org/abs/2411.07893
作者: Huan Zhang,Xu Zhang,Nian Cai,Jianglei Di,Yun Zhang
关键词-EN: severe degradation due, impairing image quality, Outdoor images, due to rain, suffer from severe
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Outdoor images often suffer from severe degradation due to rain, haze, and noise, impairing image quality and challenging high-level tasks. Current image restoration methods struggle to handle complex degradation while maintaining efficiency. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To leverage the global modeling capabilities of transformers and the local modeling capabilities of convolutions, we use CNNs alone in the encoder-decoder and transformers alone in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to capture diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks: deraining, deblurring, denoising, dehazing, and enhancement, as well as superior performance for high-level vision tasks. The source code will be available at this https URL.

[CV-8] CDXFormer: Boosting Remote Sensing Change Detection with Extended Long Short-Term Memory

链接: https://arxiv.org/abs/2411.07863
作者: Zhenkai Wu,Xiaowen Ma,Rongrong Lian,Zhentao Lin,Wei Zhang
关键词-EN: effectively integrating spatial-temporal, varied conditions, complex scenes, scenes and varied, crucial for accurately
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:In complex scenes and varied conditions, effectively integrating spatial-temporal context is crucial for accurately identifying changes. However, current RS-CD methods lack a balanced consideration of performance and efficiency. CNNs lack global context, Transformers have quadratic computational complexity, and Mambas are restricted by CUDA acceleration. In this paper, we propose CDXFormer, whose core component is a powerful XLSTM-based feature enhancement layer, integrating the advantages of linear computational complexity, global context perception, and strong interpretability. Specifically, we introduce a scale-specific Feature Enhancer layer, incorporating a Cross-Temporal Global Perceptron customized for semantically accurate deep features, and a Cross-Temporal Spatial Refiner customized for detail-rich shallow features. Additionally, we propose a Cross-Scale Interactive Fusion module to progressively interact global change representations with spatial responses. Extensive experimental results demonstrate that CDXFormer achieves state-of-the-art performance across three benchmark datasets, offering a compelling balance between efficiency and accuracy. Code is available at this https URL.

[CV-9] NL-SLAM for OC-VLN: Natural Language Grounded SLAM for Object-Centric VLN

链接: https://arxiv.org/abs/2411.07848
作者: Sonia Raychaudhuri,Duy Ta,Katrina Ashton,Angel X. Chang,Jiuguang Wang,Bernadette Bucher
关键词-EN: robotics navigation methodology, natural language navigation, relative positional navigation, distinct navigation challenges, navigation challenges solved
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Landmark-based navigation (e.g. go to the wooden desk) and relative positional navigation (e.g. move 5 meters forward) are distinct navigation challenges solved very differently in existing robotics navigation methodology. We present a new dataset, OC-VLN, in order to distinctly evaluate grounding object-centric natural language navigation instructions in a method for performing landmark-based navigation. We also propose Natural Language grounded SLAM (NL-SLAM), a method to ground natural language instruction to robot observations and poses. We actively perform NL-SLAM in order to follow object-centric natural language navigation instructions. Our methods leverage pre-trained vision and language foundation models and require no task-specific training. We construct two strong baselines from state-of-the-art methods on related tasks, Object Goal Navigation and Vision Language Navigation, and we show that our approach, NL-SLAM, outperforms these baselines across all our metrics of success on OC-VLN. Finally, we successfully demonstrate the effectiveness of NL-SLAM for performing navigation instruction following in the real world on a Boston Dynamics Spot robot.

[CV-10] Towards Vision Mixture of Experts for Wildlife Monitoring on the Edge

链接: https://arxiv.org/abs/2411.07834
作者: Emmanuel Azuh Mensah,Anderson Lee,Haoran Zhang,Yitong Shan,Kurtis Heimerl
关键词-EN: sensors in industrial, consumer and remote, explosion of IoT, IoT sensors, remote sensing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The explosion of IoT sensors in industrial, consumer, and remote sensing use cases has come with unprecedented demand for computing infrastructure to transmit and analyze petabytes of data. Concurrently, the world is slowly shifting its focus towards more sustainable computing. For these reasons, there has been a recent effort to reduce the footprint of related computing infrastructure, especially that of deep learning algorithms, for advanced insight generation. The 'TinyML' community is actively proposing methods to save communication bandwidth and excessive cloud storage costs while reducing algorithm inference latency and promoting data privacy. Such proposed approaches should ideally process multiple types of data, including time series, audio, satellite images, and video, near the network edge, as multiple data streams have been shown to improve the discriminative ability of learning algorithms, especially for generating fine-grained results. Incidentally, there has been recent work on data-driven conditional computation of subnetworks that has shown real progress in using a single model to share parameters among very different types of inputs such as images and text, reducing the computation requirement of multi-tower multimodal networks. Inspired by this line of work, we explore similar per-patch conditional computation for the first time for mobile vision transformers (vision-only case), which will eventually be used for single-tower multimodal edge models. We evaluate the model on Cornell Sapsucker Woods 60 (SSW60), a fine-grained bird species discrimination dataset. Our initial experiments use 4x fewer parameters compared to MobileViTV2-1.0 with a 1% accuracy drop on the iNaturalist '21 birds test data provided as part of the SSW60 dataset.
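
The per-patch conditional computation idea can be sketched as a top-1 mixture-of-experts router over patch tokens. The sketch below is our own illustration under assumed shapes and a simple switch-style gate; it is not the paper's MobileViT-based model.

```python
# Illustrative per-patch conditional computation: a top-1 router sends each
# patch token to one small expert MLP, so only a fraction of the parameters
# is active per patch. Sizes and routing are assumptions, not the paper's.
import torch
import torch.nn as nn

class PatchMoE(nn.Module):
    def __init__(self, dim=64, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens):                # tokens: (B, N, dim) patch embeddings
        scores = self.router(tokens)          # (B, N, E) routing logits
        choice = scores.argmax(dim=-1)        # hard top-1 expert per patch
        gate = scores.softmax(dim=-1).max(dim=-1).values  # gate for scaling
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = choice == e                # patches routed to expert e
            if mask.any():
                out[mask] = expert(tokens[mask])
        return out * gate.unsqueeze(-1)       # gate keeps the router trainable

out = PatchMoE()(torch.randn(2, 49, 64))     # 49 patches of a 7x7 grid
```

Because only one expert's weights are exercised per patch, the active parameter count per input drops roughly by the number of experts, which is the property that makes such routing attractive on edge devices.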

[CV-11] Reliable-loc: Robust sequential LiDAR global localization in large-scale street scenes based on verifiable cues

链接: https://arxiv.org/abs/2411.07815
作者: Xianghong Zou,Jianping Li,Weitong Wu,Fuxun Liang,Bisheng Yang,Zhen Dong
关键词-EN: Wearable laser scanning, Wearable laser, flexibility and portability, advantages of flexibility, Monte Carlo Localization
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Wearable laser scanning (WLS) systems have the advantages of flexibility and portability. They can be used for determining the user's path within a prior map, which is in great demand for applications in pedestrian navigation, collaborative mapping, augmented reality, and emergency rescue. However, existing LiDAR-based global localization methods suffer from insufficient robustness, especially in complex large-scale outdoor scenes with insufficient features and incomplete coverage of the prior map. To address such challenges, we propose LiDAR-based reliable global localization (Reliable-loc), exploiting the verifiable cues in sequential LiDAR data. First, we propose a Monte Carlo Localization (MCL) method based on spatially verifiable cues, utilizing the rich information embedded in local features to adjust the particles' weights and thereby prevent the particles from converging to erroneous regions. Second, we propose a localization status monitoring mechanism guided by the sequential pose uncertainties, which adaptively switches the localization mode using temporal verifiable cues to avoid failure of the localization system. To validate the proposed Reliable-loc, comprehensive experiments have been conducted on a large-scale heterogeneous point cloud dataset consisting of high-precision vehicle-mounted mobile laser scanning (MLS) point clouds and helmet-mounted WLS point clouds, covering various street scenes with a length of over 20 km. The experimental results indicate that Reliable-loc exhibits high robustness, accuracy, and efficiency in large-scale, complex street scenes, with a position accuracy of 1.66 m and a yaw accuracy of 3.09 degrees, and achieves real-time performance. For the code and detailed experimental results, please refer to this https URL.
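
A toy sketch of the two ingredients named in the abstract, particle reweighting by a verifiable cue and uncertainty-based mode switching, might look as follows. The matching score and the switching threshold are placeholder assumptions, not Reliable-loc's actual cues.

```python
# Toy Monte Carlo Localization step: (1) weight particles by how well local
# features match the prior map, (2) monitor pose uncertainty to decide
# whether to switch localization modes. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
particles = rng.uniform(0, 100, size=(500, 2))   # candidate (x, y) poses
weights = np.ones(500) / 500

def feature_match_score(pose):
    # Placeholder for a spatially verifiable cue: agreement between local
    # LiDAR features at `pose` and the prior map (here: distance to a landmark).
    return np.exp(-np.linalg.norm(pose - np.array([40.0, 60.0])) / 10.0)

# Measurement update: reweight particles by the verifiable-cue score.
weights *= np.array([feature_match_score(p) for p in particles])
weights /= weights.sum()

# Status monitoring: a dispersed weighted pose estimate signals low confidence.
mean = weights @ particles
cov = (particles - mean).T @ ((particles - mean) * weights[:, None])
uncertain = np.trace(cov) > 25.0   # threshold is an arbitrary example value
print("pose:", mean, "switch mode:", uncertain)
```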

[CV-12] Large-scale Remote Sensing Image Target Recognition and Automatic Annotation

链接: https://arxiv.org/abs/2411.07802
作者: Wuzheng Dong
关键词-EN: images called LRSAA, large-area remote sensing, remote sensing images, sensing images called, called LRSAA
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents LRSAA, a method for object recognition and automatic annotation in large-area remote sensing images. The method integrates the YOLOv11 and MobileNetV3-SSD object detection algorithms through ensemble learning to enhance model performance. Furthermore, it employs Poisson disk sampling segmentation techniques and the EIOU metric to optimize the training and inference processes of segmented images, followed by the integration of results. This approach not only reduces the demand for computational resources but also achieves a good balance between accuracy and speed. The source code for this project has been made publicly available on this https URL.

[CV-13] Horticultural Temporal Fruit Monitoring via 3D Instance Segmentation and Re-Identification using Point Clouds

链接: https://arxiv.org/abs/2411.07799
作者: Daniel Fusaro,Federico Magistri,Jens Behley,Alberto Pretto,Cyrill Stachniss
关键词-EN: Robotic fruit monitoring, agricultural production systems, automated agricultural production, fruit monitoring, Robotic fruit
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Submitted to IEEE Robotics and Automation Letters

点击查看摘要

Abstract:Robotic fruit monitoring is a key step toward automated agricultural production systems. Robots can significantly enhance plant and temporal fruit monitoring by providing precise, high-throughput assessments that overcome the limitations of traditional manual methods. Fruit monitoring is a challenging task due to the significant variation in size, shape, orientation, and occlusion of fruits. Also, fruits may be harvested or newly grown between recording sessions. Most methods are 2D image-based and they lack the 3D structure, depth, and spatial information, which represent key aspects of fruit monitoring. 3D colored point clouds, instead, can offer this information but they introduce challenges such as their sparsity and irregularity. In this paper, we present a novel approach for temporal fruit monitoring that addresses point clouds collected in a greenhouse over time. Our method segments fruits using a learning-based instance segmentation approach directly on the point cloud. Each segmented fruit is processed by a 3D sparse convolutional neural network to extract descriptors, which are used in an attention-based matching network to associate fruits with their instances from previous data collections. Experimental results on a real dataset of strawberries demonstrate that our approach outperforms other methods for fruits re-identification over time, allowing for precise temporal fruit monitoring in real and complex scenarios.

[CV-14] Interaction Asymmetry: A General Principle for Learning Composable Abstractions

链接: https://arxiv.org/abs/2411.07784
作者: Jack Brady,Julius von Kügelgen,Sébastien Lachapelle,Simon Buchholz,Thomas Kipf,Wieland Brendel
关键词-EN: Learning disentangled representations, disentangled representations, crucial for generalizing, concepts, Learning disentangled
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint, under review

点击查看摘要

Abstract:Learning disentangled representations of concepts and re-composing them in unseen ways is crucial for generalizing to out-of-domain situations. However, the underlying properties of concepts that enable such disentanglement and compositional generalization remain poorly understood. In this work, we propose the principle of interaction asymmetry which states: “Parts of the same concept have more complex interactions than parts of different concepts”. We formalize this via block diagonality conditions on the (n+1)th-order derivatives of the generator mapping concepts to observed data, where different orders of “complexity” correspond to different n. Using this formalism, we prove that interaction asymmetry enables both disentanglement and compositional generalization. Our results unify recent theoretical results for learning concepts of objects, which we show are recovered as special cases with n=0 or 1. We provide results for up to n=2, thus extending these prior works to more flexible generator functions, and conjecture that the same proof strategies generalize to larger n. Practically, our theory suggests that, to disentangle concepts, an autoencoder should penalize its latent capacity and the interactions between concepts during decoding. We propose an implementation of these criteria using a flexible Transformer-based VAE, with a novel regularizer on the attention weights of the decoder. On synthetic image datasets consisting of objects, we provide evidence that this model can achieve comparable object disentanglement to existing models that use more explicit object-centric priors.
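
One way to read the block-diagonality condition in LaTeX, with notation that is ours rather than the paper's:

```latex
% Let f map concept blocks z_1, ..., z_K to data x. Block diagonality of the
% (n+1)th-order derivatives asks that mixed derivatives vanish whenever the
% differentiated coordinates span more than one concept block:
\[
\frac{\partial^{\,n+1} f}{\partial z_{i_1} \cdots \partial z_{i_{n+1}}} = 0
\quad \text{unless } z_{i_1}, \dots, z_{i_{n+1}} \text{ all lie in the same block,}
\]
% so within-concept interactions may be arbitrarily complex, while
% cross-concept interactions are limited to order at most n.
```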

[CV-15] Novel View Synthesis with Pixel-Space Diffusion Models

链接: https://arxiv.org/abs/2411.07765
作者: Noam Elata,Bahjat Kawar,Yaron Ostrovsky-Berman,Miriam Farber,Ron Sokolovsky
关键词-EN: single input image, single input, input image, Synthesizing, challenging task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Synthesizing a novel view from a single input image is a challenging task. Traditionally, this task was approached by estimating scene depth, warping, and inpainting, with machine learning models enabling parts of the pipeline. More recently, generative models are being increasingly employed in novel view synthesis (NVS), often encompassing the entire end-to-end system. In this work, we adapt a modern diffusion model architecture for end-to-end NVS in the pixel space, substantially outperforming previous state-of-the-art (SOTA) techniques. We explore different ways to encode geometric information into the network. Our experiments show that while these methods may enhance performance, their impact is minor compared to utilizing improved generative models. Moreover, we introduce a novel NVS training scheme that utilizes single-view datasets, capitalizing on their relative abundance compared to their multi-view counterparts. This leads to improved generalization capabilities to scenes with out-of-domain content.

[CV-16] AdaSemiCD: An Adaptive Semi-Supervised Change Detection Method Based on Pseudo-Label Evaluation

链接: https://arxiv.org/abs/2411.07758
作者: Ran Lingyan,Wen Dongcheng,Zhuo Tao,Zhang Shizhou,Zhang Xiuwei,Zhang Yanning
关键词-EN: bi-temporal image pairs, image pairs captured, Change Detection, remote sensing, essential field
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Change Detection (CD) is an essential field in remote sensing, with a primary focus on identifying areas of change in bi-temporal image pairs captured at varying intervals of the same region by a satellite. The data annotation process for the CD task is both time-consuming and labor-intensive. To make better use of the scarce labeled data and abundant unlabeled data, we present an adaptive dynamic semi-supervised learning method, AdaSemiCD, to improve the use of pseudo-labels and optimize the training process. First, due to the extreme class imbalance inherent in CD, the model tends to focus on the background class and easily confuses the boundaries of target objects. Considering these two points, we develop a measurable evaluation metric for pseudo-labels that enhances the representation of information entropy via class rebalancing and amplification of confusing areas, giving a larger weight to foreground change objects. Subsequently, to enhance the reliability of sample-wise pseudo-labels, we introduce the AdaFusion module, which is capable of dynamically identifying the most uncertain region and substituting it with more trustworthy content. Lastly, to ensure better training stability, we introduce the AdaEMA module, which updates the teacher model using only batches of trusted samples. Experimental results from the LEVIR-CD, WHU-CD, and CDD datasets validate the efficacy and universality of our proposed adaptive training framework.

[CV-17] Constraint Learning for Parametric Point Cloud

链接: https://arxiv.org/abs/2411.07747
作者: Xi Cheng,Ruiqi Lei,Di Huang,Zhichao Liao,Fengyuan Piao,Yan Chen,Pingfa Feng,Long Zeng
关键词-EN: CAD shapes, Parametric point clouds, Feature Learning Network, point cloud learning, CAD
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Parametric point clouds, sampled from CAD shapes, have become increasingly prevalent in industrial manufacturing. However, most existing point cloud learning methods focus on geometric features, such as local and global features, or on developing efficient convolution operations, overlooking the constraints inherent in CAD shapes, an important attribute whose neglect limits these methods' ability to fully comprehend CAD shapes. To address this issue, we analyze the effect of constraints and propose a deep-learning-friendly representation for them; on this basis, we develop the Constraint Feature Learning Network (CstNet) to extract and leverage constraints. CstNet consists of two stages. Stage 1 extracts constraints from B-Rep data or point clouds. Stage 2 leverages coordinates and constraints to enhance the comprehension of CAD shapes. Additionally, we built the Parametric 20,000 Multi-modal Dataset to address the scarcity of labeled B-Rep datasets. Experiments demonstrate that CstNet achieves state-of-the-art performance on both public and the proposed CAD shape datasets. To the best of our knowledge, CstNet is the first constraint-based learning method tailored for CAD shape analysis.

[CV-18] Efficient 3D Perception on Multi-Sweep Point Cloud with Gumbel Spatial Pruning

链接: https://arxiv.org/abs/2411.07742
作者: Jianhao Li,Tianyu Sun,Xueqian Zhang,Zhongdao Wang,Bailan Feng,Hengshuang Zhao
关键词-EN: paper studies point, paper studies, studies point cloud, outdoor environments, LiDAR sweeps
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper studies point cloud perception in outdoor environments. Existing methods face limitations in recognizing distant or occluded objects due to the sparse nature of outdoor point clouds. In this work, we observe a significant mitigation of this problem by accumulating multiple temporally consecutive LiDAR sweeps, resulting in a remarkable improvement in perception accuracy. However, the computation cost also increases, hindering previous approaches from utilizing a large number of LiDAR sweeps. To tackle this challenge, we find that a considerable portion of points in the accumulated point cloud is redundant, and discarding these points has minimal impact on perception accuracy. We introduce a simple yet effective Gumbel Spatial Pruning (GSP) layer that dynamically prunes points based on sampling learned end-to-end. The GSP layer is decoupled from other network components and thus can be seamlessly integrated into existing point cloud network architectures. Without incurring additional computational overhead, we increase the number of LiDAR sweeps from 10, a common practice, to as many as 40. Consequently, there is a significant enhancement in perception performance. For instance, in the nuScenes 3D object detection and BEV map segmentation tasks, our pruning strategy improves the vanilla TransL baseline and other baseline methods.
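
A minimal sketch of a Gumbel-based pruning gate is given below, assuming per-point features and a straight-through Gumbel-softmax; the scoring layer and shapes are our assumptions, not the paper's GSP layer.

```python
# Each point gets a learned keep/drop logit, sampled with the straight-through
# Gumbel-softmax so pruning stays end-to-end differentiable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelPrune(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.score = nn.Linear(dim, 2)   # logits for (drop, keep) per point

    def forward(self, feats, tau=1.0):   # feats: (B, N, dim) point features
        logits = self.score(feats)       # (B, N, 2)
        # hard=True gives one-hot keep/drop in the forward pass with a
        # softmax gradient in the backward pass (straight-through estimator)
        gate = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1:]  # (B, N, 1)
        return feats * gate, gate        # pruned points contribute zeros

feats = torch.randn(2, 1024, 32, requires_grad=True)
pruned, gate = GumbelPrune()(feats)
print("kept fraction:", gate.mean().item())
```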

[CV-19] 3D Focusing-and-Matching Network for Multi-Instance Point Cloud Registration NEURIPS2024

链接: https://arxiv.org/abs/2411.07740
作者: Liyuan Zhang,Le Hui,Qi Liu,Bo Li,Yuchao Dai
关键词-EN: model point cloud, point cloud registration, point cloud, Multi-instance point cloud, model point
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Multi-instance point cloud registration aims to estimate the pose of all instances of a model point cloud in the whole scene. Existing methods all adopt the strategy of first obtaining the global correspondence and then clustering to obtain the pose of each instance. However, due to the cluttered and occluded objects in the scene, it is difficult to obtain an accurate correspondence between the model point cloud and all instances in the scene. To this end, we propose a simple yet powerful 3D focusing-and-matching network for multi-instance point cloud registration by learning multiple pair-wise point cloud registrations. Specifically, we first present a 3D multi-object focusing module to locate the center of each object and generate object proposals. By using self-attention and cross-attention to associate the model point cloud with structurally similar objects, we can locate potential matching instances by regressing object centers. Then, we propose a 3D dual-masking instance matching module to estimate the pose between the model point cloud and each object proposal. It applies an instance mask and an overlap mask to accurately predict the pair-wise correspondence. Extensive experiments on two public benchmarks, Scan2CAD and ROBI, show that our method achieves a new state-of-the-art performance on the multi-instance point cloud registration task. Code is available at this https URL.

[CV-20] ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction

链接: https://arxiv.org/abs/2411.07725
作者: Dubing Chen,Jin Fang,Wencheng Han,Xinjing Cheng,Junbo Yin,Chenzhong Xu,Fahad Shahbaz Khan,Jianbing Shen
关键词-EN: providing spatiotemporal cues, Vision-based semantic occupancy, Vision-based semantic, flow prediction plays, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-based semantic occupancy and flow prediction plays a crucial role in providing spatiotemporal cues for real-world tasks, such as autonomous driving. Existing methods prioritize higher accuracy to cater to the demands of these tasks. In this work, we strive to improve performance by introducing a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. First, we introduce an occlusion-aware adaptive lifting mechanism with a depth denoising technique to improve the robustness of 2D-to-3D feature transformation and reduce the reliance on depth priors. Second, we strengthen the semantic consistency between 3D features and their original 2D modalities by utilizing shared semantic prototypes to jointly constrain both 2D and 3D features. This is complemented by confidence- and category-based sampling strategies to tackle long-tail challenges in 3D space. To alleviate the feature encoding burden in the joint prediction of semantics and flow, we propose a BEV cost volume-based prediction method that links flow and semantic features through a cost volume and employs a classification-regression supervision scheme to address the varying flow scales in dynamic scenes. Our purely convolutional architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy, attaining state-of-the-art results on multiple benchmarks. On Occ3D, when training without the camera visible mask, our ALOcc achieves an absolute gain of 2.5% in terms of RayIoU while operating at a speed comparable to the state of the art, using the same input size (256×704) and a ResNet-50 backbone. Our method also achieves 2nd place in the CVPR24 Occupancy and Flow Prediction Competition.

[CV-21] EMPERROR: A Flexible Generative Perception Error Model for Probing Self-Driving Planners

链接: https://arxiv.org/abs/2411.07719
作者: Niklas Hanselmann,Simon Doll,Marius Cordts,Hendrik P.A. Lensch,Andreas Geiger
关键词-EN: real-world traffic, promising direction, handle the complexities, complexities of real-world, shown great progress
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:To handle the complexities of real-world traffic, learning planners for self-driving from data is a promising direction. While recent approaches have shown great progress, they typically assume a setting in which the ground-truth world state is available as input. However, when deployed, planning needs to be robust to the long-tail of errors incurred by a noisy perception system, which is often neglected in evaluation. To address this, previous work has proposed drawing adversarial samples from a perception error model (PEM) mimicking the noise characteristics of a target object detector. However, these methods use simple PEMs that fail to accurately capture all failure modes of detection. In this paper, we present EMPERROR, a novel transformer-based generative PEM, apply it to stress-test an imitation learning (IL)-based planner and show that it imitates modern detectors more faithfully than previous work. Furthermore, it is able to produce realistic noisy inputs that increase the planner’s collision rate by up to 85%, demonstrating its utility as a valuable tool for a more complete evaluation of self-driving planners.

[CV-22] Emotion Classification of Children Expressions

链接: https://arxiv.org/abs/2411.07708
作者: Sanchayan Vivekananthan
关键词-EN: Convolutional Block Attention, facial expressions, paper proposes, Block Attention modules, children emotions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper proposes a process for building a classification model for facial expressions. The proposed process aids in the categorisation of children's emotions into two classes, namely 'Happy' and 'Sad'. Since existing emotion recognition algorithms primarily train on adult faces, the model is developed using advanced concepts such as Squeeze-and-Excitation blocks, Convolutional Block Attention Modules, and robust data augmentation. Stable Diffusion image synthesis was used to expand and diversify the dataset, generating realistic and varied training samples. The model, designed with Batch Normalisation, Dropout, and SE attention mechanisms for the classification of children's emotions, achieved an accuracy rate of 89%, as these methods improve the precision of emotion recognition in children. The study emphasises the need for more specific models in emotion detection systems for the young generation, with specific direction on how young people can be assisted in managing their emotions while online.
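
For reference, the Squeeze-and-Excitation block the abstract builds on can be sketched as follows. This is the standard SE design (Hu et al.); the reduction ratio and the surrounding classifier are not taken from the paper.

```python
# Squeeze-and-Excitation: global-average-pool the feature map ("squeeze"),
# then learn per-channel gates ("excitation") that reweight the channels.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                     # squeeze: (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1) # excitation: (B, C, 1, 1)
        return x * w                               # channel-wise reweighting

y = SEBlock(64)(torch.randn(1, 64, 28, 28))
```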

[CV-23] Evaluating the Generation of Spatial Relations in Text and Image Generative Models

链接: https://arxiv.org/abs/2411.07664
作者: Shang Hong Sim,Clarence Lee,Alvin Tan,Cheston Tan
关键词-EN: crucial cognitive ability, Large Language Models, crucial cognitive, cognitive ability, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Understanding spatial relations is a crucial cognitive ability for both humans and AI. While current research has predominantly focused on the benchmarking of text-to-image (T2I) models, we propose a more comprehensive evaluation that includes both T2I and Large Language Models (LLMs). As spatial relations are naturally understood in a visuo-spatial manner, we develop an approach to convert LLM outputs into an image, thereby allowing us to evaluate both T2I models and LLMs visually. We examined the spatial relation understanding of 8 prominent generative models (3 T2I models and 5 LLMs) on a set of 10 common prepositions, and assessed the feasibility of automatic evaluation methods. Surprisingly, we found that T2I models only achieve subpar performance despite their impressive general image-generation abilities. Even more surprisingly, our results show that LLMs are significantly more accurate than T2I models in generating spatial relations, despite being primarily trained on textual data. We examined reasons for model failures and highlight gaps that can be filled to enable more spatially faithful generations.

[CV-24] HMIL: Hierarchical Multi-Instance Learning for Fine-Grained Whole Slide Image Classification

链接: https://arxiv.org/abs/2411.07660
作者: Cheng Jin,Luyang Luo,Huangjing Lin,Jun Hou,Hao Chen
关键词-EN: personalized treatment strategies, enabling precise cancer, precise cancer diagnosis, precision oncology, enabling precise
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review

点击查看摘要

Abstract:Fine-grained classification of whole slide images (WSIs) is essential in precision oncology, enabling precise cancer diagnosis and personalized treatment strategies. The core of this task involves distinguishing subtle morphological variations within the same broad category of gigapixel-resolution images, which presents a significant challenge. While the multi-instance learning (MIL) paradigm alleviates the computational burden of WSIs, existing MIL methods often overlook hierarchical label correlations, treating fine-grained classification as a flat multi-class classification task. To overcome these limitations, we introduce a novel hierarchical multi-instance learning (HMIL) framework. By facilitating hierarchical alignment of the inherent relationships between different levels of labels at both the instance and bag level, our approach provides a more structured and informative learning process. Specifically, HMIL incorporates a class-wise attention mechanism that aligns hierarchical information at both the instance and bag levels. Furthermore, we introduce supervised contrastive learning to enhance the discriminative capability for fine-grained classification, and a curriculum-based dynamic weighting module to adaptively balance the hierarchical features during training. Extensive experiments on our large-scale cytology cervical cancer (CCC) dataset and two public histology datasets, BRACS and PANDA, demonstrate the state-of-the-art class-wise and overall performance of our HMIL framework. Our source code is available at this https URL.
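
The class-wise attention idea builds on attention-based MIL pooling. A minimal sketch of that building block (after Ilse et al.) is below; the hierarchical label alignment, contrastive learning, and curriculum weighting parts of HMIL are not shown.

```python
# Attention-based MIL pooling: instance (patch) features are aggregated into
# one bag (slide) embedding with learned attention weights.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, dim=128, hidden=64, num_classes=2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, instances):                 # (N, dim) patches of one slide
        a = self.attn(instances).softmax(dim=0)   # (N, 1) weights over instances
        bag = (a * instances).sum(dim=0)          # weighted bag embedding
        return self.head(bag), a                  # bag logits + instance weights

logits, attn = AttentionMIL()(torch.randn(500, 128))
```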

[CV-25] Maritime Search and Rescue Missions with Aerial Images: A Survey

链接: https://arxiv.org/abs/2411.07649
作者: Juan P. Martinez-Esteso,Francisco J. Castellanos,Jorge Calvo-Zaragoza,Antonio Javier Gallego
关键词-EN: Unmanned Aerial Vehicles, vital importance, speed of response, response by search, search and rescue
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The speed of response by search and rescue teams at sea is of vital importance, as survival may depend on it. Recent technological advancements have led to the development of more efficient systems for locating individuals involved in a maritime incident, such as the use of Unmanned Aerial Vehicles (UAVs) equipped with cameras and other integrated sensors. Over the past decade, several researchers have contributed to the development of automatic systems capable of detecting people using aerial images, particularly by leveraging the advantages of deep learning. In this article, we provide a comprehensive review of the existing literature on this topic. We analyze the methods proposed to date, including both traditional techniques and more advanced approaches based on machine learning and neural networks. Additionally, we take into account the use of synthetic data to cover a wider range of scenarios without the need to deploy a team to collect data, which is one of the major obstacles for these systems. Overall, this paper situates the reader in the field of detecting people at sea using aerial images by quickly identifying the most suitable methodology for each scenario, as well as providing an in-depth discussion and direction for future trends.

[CV-26] xCG: Explainable Cell Graphs for Survival Prediction in Non-Small Cell Lung Cancer ML4H

链接: https://arxiv.org/abs/2411.07643
作者: Marvin Sextro,Gabriel Dernbach,Kai Standvoss,Simon Schallenberg,Frederick Klauschen,Klaus-Robert Müller,Maximilian Alber,Lukas Ruff
关键词-EN: data-driven precision medicine, predict oncology patient, provide critical insights, Understanding how deep, support clinical decision-making
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 11 pages

点击查看摘要

Abstract:Understanding how deep learning models predict oncology patient risk can provide critical insights into disease progression, support clinical decision-making, and pave the way for trustworthy and data-driven precision medicine. Building on recent advances in the spatial modeling of the tumor microenvironment using graph neural networks, we present an explainable cell graph (xCG) approach for survival prediction. We validate our model on a public cohort of imaging mass cytometry (IMC) data for 416 cases of lung adenocarcinoma. We explain survival predictions in terms of known phenotypes on the cell level by computing risk attributions over cell graphs, for which we propose an efficient grid-based layer-wise relevance propagation (LRP) method. Our ablation studies highlight the importance of incorporating the cancer stage and model ensembling to improve the quality of risk estimates. Our xCG method, together with the IMC data, is made publicly available to support further research.

[CV-27] Breaking the Low-Rank Dilemma of Linear Attention

链接: https://arxiv.org/abs/2411.07635
作者: Qihang Fan,Huaibo Huang,Ran He
关键词-EN: notoriously computationally expensive, posing significant challenges, linear attention, Softmax attention, attention
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far more efficient solution by reducing the complexity to linear levels. However, compared to Softmax attention, linear attention often experiences significant performance degradation. Our experiments indicate that this performance drop is due to the low-rank nature of linear attention’s feature map, which hinders its ability to adequately model complex spatial information. In this paper, to break the low-rank dilemma of linear attention, we conduct rank analysis from two perspectives: the KV buffer and the output features. Consequently, we introduce Rank-Augmented Linear Attention (RALA), which rivals the performance of Softmax attention while maintaining linear complexity and high efficiency. Based on RALA, we construct the Rank-Augmented Vision Linear Transformer (RAVLT). Extensive experiments demonstrate that RAVLT achieves excellent performance across various vision tasks. Specifically, without using any additional labels, data, or supervision during training, RAVLT achieves an 84.4% Top-1 accuracy on ImageNet-1k with only 26M parameters and 4.6G FLOPs. This result significantly surpasses previous linear attention mechanisms, fully illustrating the potential of RALA. Code will be available at this https URL.
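
To see why linear attention reaches linear complexity, note that with a positive feature map the attention product can be computed right-to-left, so the N×N score matrix is never formed. A minimal sketch with the common elu+1 feature map follows; RALA's rank augmentation itself is not reproduced here.

```python
# Linear attention: phi(Q) (phi(K)^T V) computed right-to-left is O(N d^2)
# instead of O(N^2 d) for Softmax attention.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, N, d); phi(x) = elu(x) + 1 keeps features positive
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)           # (B, d, d) KV buffer
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)  # (B, N, d)

q = k = v = torch.randn(1, 4096, 64)
out = linear_attention(q, k, v)   # no 4096 x 4096 attention matrix allocated
```

The (d×d) KV buffer in this formulation is exactly the object whose rank the paper analyzes: its rank is bounded by d, which is one lens on the low-rank dilemma RALA targets.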

[CV-28] Leveraging Previous Steps: A Training-free Fast Solver for Flow Diffusion

链接: https://arxiv.org/abs/2411.07627
作者: Kaiyu Song,Hanjiang Lai
关键词-EN: Flow diffusion models, recently shown potential, high generation quality, generation tasks due, Flow diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Flow diffusion models (FDMs) have recently shown potential in generation tasks due to their high generation quality. However, current ordinary differential equation (ODE) solvers for FDMs, e.g., the Euler solver, still suffer from slow generation, since ODE solvers need a large number of function evaluations (NFE) to maintain high-quality generation. In this paper, we propose a novel training-free flow-solver to reduce NFE while maintaining high-quality generation. The key insight of the flow-solver is to leverage the previous steps to reduce the NFE: a cache is created to reuse the results of previous steps. Specifically, the Taylor expansion is first used to approximate the ODE. To calculate the high-order derivatives of the Taylor expansion, the flow-solver uses the previous steps and a polynomial interpolation to approximate them, where the order we can approximate equals the number of cached previous steps. We also prove that the flow-solver has a smaller approximation error and faster generation speed. Experimental results on CIFAR-10, CelebA-HQ, LSUN-Bedroom, LSUN-Church, ImageNet, and real text-to-image generation prove the efficiency of the flow-solver. Specifically, the flow-solver improves the FID-30K from 13.79 to 6.75 and from 46.64 to 19.49 with NFE=10 on CIFAR-10 and LSUN-Church, respectively.
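
The caching idea can be illustrated with a toy multistep update: derivative terms of the Taylor expansion are estimated from cached velocity evaluations rather than from extra NFE. The sketch below uses a single cached step (an Adams-Bashforth-style second-order update) and a made-up velocity field; the paper's polynomial interpolation of arbitrary order is not reproduced.

```python
# Reuse the previous velocity evaluation to estimate the first derivative,
# giving a second-order Taylor update at one NFE per step.
import numpy as np

def velocity(x, t):                 # stand-in for the learned flow field
    return -x * (1.0 - t)

x, t, h = np.array([1.0]), 0.0, 0.1
cache = velocity(x, t)              # first step: plain Euler, cache the eval
x, t = x + h * cache, t + h
for _ in range(9):
    v = velocity(x, t)              # one NFE per step
    dv = (v - cache) / h            # first derivative from the cached step
    x = x + h * v + 0.5 * h**2 * dv # second-order Taylor update, no extra NFE
    cache, t = v, t + h
print(x)
```

With m cached steps, the same trick fits an order-m polynomial through the cached velocities, which is where the "order equals number of cached steps" statement in the abstract comes from.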

[CV-29] Unraveling the Connections between Flow Matching and Diffusion Probabilistic Models in Training-free Conditional Generation

链接: https://arxiv.org/abs/2411.07625
作者: Kaiyu Song,Hanjiang Lai
关键词-EN: unconditional diffusion models, Training-free conditional generation, conditional generation, mature unconditional diffusion, Training-free conditional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Training-free conditional generation aims to leverage unconditional diffusion models to implement conditional generation, where flow matching (FM) and diffusion probabilistic models (DPMs) are two mature unconditional diffusion models that achieve high-quality generation. This paper asks two questions: What are the underlying connections between FM and DPMs in training-free conditional generation? Can we leverage DPMs to improve training-free conditional generation for FM? We first show that a probabilistic diffusion path can be associated with both FM and DPMs. Then, we reformulate the ordinary differential equation (ODE) of FM based on the score function of DPMs, so that the conditions in FM can be incorporated as in DPMs. Finally, we propose two posterior sampling methods to estimate the conditional term and achieve training-free conditional generation with FM. Experimental results show that our proposed method can be applied to various conditional generation tasks and generates higher-quality results than state-of-the-art methods.

[CV-30] Mix from Failure: Confusion-Pairing Mixup for Long-Tailed Recognition

链接: https://arxiv.org/abs/2411.07621
作者: Youngseok Yoon,Sangwoo Hong,Hyungjoon Joo,Yao Qin,Haewon Jeong,Jungwoo Lee
关键词-EN: computer vision problem, real-world class distribution, artificial uniform, computer vision, real-world class
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Long-tailed image recognition is a computer vision problem that considers a real-world class distribution rather than an artificially uniform one. Existing methods typically sidestep the problem by i) adjusting a loss function, ii) decoupling classifier learning, or iii) proposing a new multi-head architecture called experts. In this paper, we tackle the problem from a different perspective: augmenting the training dataset to enhance the sample diversity of minority classes. Specifically, our method, namely Confusion-Pairing Mixup (CP-Mix), estimates the confusion distribution of the model and handles the data deficiency problem by augmenting samples from confusion pairs in real time. In this way, CP-Mix trains the model to mitigate its weakness and distinguish the pairs of classes it frequently misclassifies. In addition, CP-Mix utilizes a novel mixup formulation to handle the bias in decision boundaries that originates from the imbalanced dataset. Extensive experiments demonstrate that CP-Mix outperforms existing methods for long-tailed image recognition and successfully relieves the confusion of the classifier.
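
CP-Mix builds on the standard mixup operation; a minimal sketch follows, where the confusion pair (i, j) is assumed to be given and the Beta prior on lambda is the usual choice. Both are our assumptions; CP-Mix's actual formulation, which draws pairs from the model's confusion distribution, differs.

```python
# Standard mixup: convexly combine two inputs and their one-hot labels.
import torch

def mixup(x_i, y_i, x_j, y_j, alpha=1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x_i + (1 - lam) * x_j          # mixed input
    y = lam * y_i + (1 - lam) * y_j          # mixed soft label
    return x, y

x_i, x_j = torch.randn(3, 32, 32), torch.randn(3, 32, 32)
y_i, y_j = torch.eye(10)[3], torch.eye(10)[7]   # a frequently confused pair
x_mix, y_mix = mixup(x_i, y_i, x_j, y_j)
```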

[CV-31] Artificial Intelligence for Biomedical Video Generation

链接: https://arxiv.org/abs/2411.07619
作者: Linyuan Li,Jianing Qiu,Anujit Saha,Lin Li,Poyuan Li,Mengxian He,Ziyu Guo,Wu Yuan
关键词-EN: Intelligence Generated Content, Artificial Intelligence Generated, Generated Content, Artificial Intelligence, Intelligence Generated
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As a prominent subfield of Artificial Intelligence Generated Content (AIGC), video generation has achieved notable advancements in recent years. The introduction of Sora-like models represents a pivotal breakthrough in video generation technologies, significantly enhancing the quality of synthesized videos. Particularly in the realm of biomedicine, video generation technology has shown immense potential in areas such as medical concept explanation, disease simulation, and biomedical data augmentation. In this article, we thoroughly examine the latest developments in video generation models and explore their applications, challenges, and future opportunities in the biomedical sector. We have conducted an extensive review and compiled a comprehensive list of datasets from various sources to facilitate the development and evaluation of video generative models in biomedicine. Given the rapid progress in this field, we have also created a GitHub repository to regularly update the advances of biomedical video generation at: this https URL

[CV-32] Quantum Information-Empowered Graph Neural Network for Hyperspectral Change Detection

链接: https://arxiv.org/abs/2411.07608
作者: Chia-Hsiang Lin,Tzu-Hsuan Lin,Jocelyn Chanussot
关键词-EN: critical remote sensing, remote sensing technique, Earth surface, Change detection, hyperspectral change detection
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: This work has been accepted by IEEE Transactions on Geoscience and Remote Sensing (TGRS)

点击查看摘要

Abstract:Change detection (CD) is a critical remote sensing technique for identifying changes in the Earth's surface over time. The outstanding substance identifiability of hyperspectral images (HSIs) has significantly enhanced the detection accuracy, making hyperspectral change detection (HCD) an essential technology. The detection accuracy can be further improved by leveraging the graph structure of HSIs, motivating us to adopt graph neural networks (GNNs) for solving HCD. For the first time, this work introduces the quantum deep network (QUEEN) into HCD. Unlike GNNs and CNNs, which both extract affine-computing features, QUEEN provides fundamentally different unitary-computing features. We demonstrate that through the unitary feature extraction procedure, QUEEN provides radically new information for deciding whether there is a change or not. Hierarchically, a graph feature learning (GFL) module exploits the graph structure of the bitemporal HSIs at the superpixel level, while a quantum feature learning (QFL) module learns quantum features at the pixel level, as a complement to GFL that preserves pixel-level detailed spatial information not retained in the superpixels. In the final classification stage, a quantum classifier is designed to cooperate with a traditional fully connected classifier. The superior HCD performance of the proposed QUEEN-empowered GNN (i.e., QUEEN-G) is experimentally demonstrated on real hyperspectral datasets.

[CV-33] Grounded Video Caption Generation

链接: https://arxiv.org/abs/2411.07584
作者: Evangelos Kazakos,Cordelia Schmid,Josef Sivic
关键词-EN: video caption generation, grounded video caption, video caption, grounded video, caption generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose a new task, dataset and model for grounded video caption generation. This task unifies captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally consistent bounding boxes. We introduce the following contributions. First, we present a task definition and a manually annotated test dataset for this task, referred to as GROunded Video Caption Generation (GROC). Second, we introduce a large-scale automatic annotation method leveraging an existing model for grounded still image captioning together with an LLM for summarising frame-level captions into temporally consistent captions in video. Furthermore, we prompt the LLM to track by language – classifying noun phrases from the frame-level captions into noun phrases of the video-level generated caption. We apply this approach to videos from the HowTo100M dataset, which results in a new large-scale training dataset, called HowToGround, with automatically annotated captions and spatio-temporally consistent bounding boxes with coherent natural language labels. Third, we introduce a new grounded video caption generation model, called VideoGround, and train the model on the new automatically annotated HowToGround dataset. Finally, results of our VideoGround model set the state of the art for the new task of grounded video caption generation. We perform extensive ablations and demonstrate the importance of key technical contributions of our model.

[CV-34] Semantic segmentation on multi-resolution optical and microwave data using deep learning

链接: https://arxiv.org/abs/2411.07581
作者: Jai G Singla,Bakul Vaghela
关键词-EN: convolutional neural networks, convolutional neural, deep learning, implemented convolutional neural, neural network based
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Presently, deep learning and convolutional neural networks (CNNs) are widely used in the fields of image processing, image classification, object identification, and many more. In this work, we implemented a convolutional neural network based modified U-Net model and a VGG-UNet model to automatically identify objects from satellite imagery captured using high-resolution Indian remote sensing satellites, and then to classify satellite data pixel-wise into various classes. In this paper, Cartosat 2S (~1 m spatial resolution) datasets were used, and deep learning models were implemented to detect building shapes and ships from the test datasets with an accuracy of more than 95%. In another experiment, microwave data (of varied resolution) from RISAT-1 was taken as input, and ships and trees were detected with an accuracy of 96% from these datasets. For the classification of images into multiple classes, a deep learning model was trained on multispectral Cartosat images. Model-generated results were then tested using ground truth. Multi-label classification results were obtained with an accuracy (IoU) of better than 95%. In total, six different problems were attempted using deep learning models, and IoU accuracies in the range of 85% to 98% were achieved depending on the degree of complexity.

[CV-35] Projecting Gaussian Ellipsoids While Avoiding Affine Projection Approximation

链接: https://arxiv.org/abs/2411.07579
作者: Han Qi,Tao Cai,Xiyue Han
关键词-EN: dominated novel-view synthesis, real-time rendering speed, Gaussian Splatting, ellipsoid-based projection method, dominated novel-view
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting has dominated novel-view synthesis with its real-time rendering speed and state-of-the-art rendering quality. However, during the rendering process, the use of the Jacobian of the affine approximation of the projection transformation leads to inevitable errors, resulting in blurriness, artifacts, and a lack of scene consistency in the final rendered images. To address this issue, we introduce an ellipsoid-based projection method to calculate the projection of a Gaussian ellipsoid, the primitive of 3D Gaussian Splatting, onto the image plane. As our proposed ellipsoid-based projection method cannot handle Gaussian ellipsoids with camera origins inside them or with parts lying below the z=0 plane in camera space, we designed a pre-filtering strategy. Experiments over multiple widely adopted benchmark datasets show that our ellipsoid-based projection method can enhance the rendering quality of 3D Gaussian Splatting and its extensions.
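
For context, the affine approximation being replaced is the EWA-splatting Jacobian that standard 3DGS uses to project a camera-space Gaussian:

```latex
% Standard 3DGS (following EWA splatting) projects a Gaussian with
% camera-space mean t = (t_x, t_y, t_z) and world-space covariance \Sigma
% using the Jacobian J of the perspective projection at t:
\[
J = \begin{pmatrix}
f_x / t_z & 0 & -f_x t_x / t_z^{2} \\
0 & f_y / t_z & -f_y t_y / t_z^{2}
\end{pmatrix},
\qquad
\Sigma' = J W \Sigma W^{\top} J^{\top},
\]
% where W is the world-to-camera rotation and f_x, f_y the focal lengths.
% J is only a first-order (affine) expansion, exact solely at the mean,
% which is the source of the projection errors the ellipsoid-based method avoids.
```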

[CV-36] Atmospheric turbulence restoration by diffeomorphic image registration and blind deconvolution

链接: https://arxiv.org/abs/2411.07578
作者: Jerome Gilles,Tristan Dagobert,Carlo De Franchis
关键词-EN: atmospheric turbulence, paper to improve, altered by atmospheric, Abstract, improve images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A novel approach is presented in this paper to improve images which are altered by atmospheric turbulence. Two new algorithms are presented based on two combinations of a blind deconvolution block, an elastic registration block and a temporal filter block. The algorithms are tested on real images acquired in the desert in New Mexico by the NATO RTG40 group.

[CV-37] IR image databases generation under target intrinsic thermal variability constraints

链接: https://arxiv.org/abs/2411.07577
作者: Jerome Gilles,Stephane Landeau,Tristan Dagobert,Philippe Chevalier,Christian Bolut
关键词-EN: ATR assessment purposes, generation for ATR, ATR assessment, assessment purposes, paper deals
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2411.06695

点击查看摘要

Abstract:This paper deals with the problem of infrared image database generation for ATR assessment purposes. Huge databases are required for quantitative and objective performance evaluations. We propose a method that superimposes targets and occultants on backgrounds under image quality metric constraints to generate realistic images. We also propose a method to generate target signatures with intrinsic thermal variability based on 3D models plated with real infrared textures.

[CV-38] Génération de bases de données images IR sous contraintes avec variabilité thermique intrinsèque des cibles

链接: https://arxiv.org/abs/2411.07575
作者: Jerome Gilles,Stephane Landeau,Tristan Dagobert,Philippe Chevalier,Christian Bolut
关键词-EN: eventually with occultants, permits to simulate, simulate images, infrared imagery, imagery by superimposition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: in French language, GRETSI Symposium on Signal and Image Processing, Dijon, France, September 2009

点击查看摘要

Abstract:In this communication, we propose a method that makes it possible to simulate images of targets in infrared imagery by superimposing vehicle signatures on backgrounds, possibly with occultants. We develop a principle that allows us to generate different thermal configurations of target signatures. This method enables us to easily generate huge datasets for the performance evaluation of ATR algorithms.

[CV-39] Multi-task Feature Enhancement Network for No-Reference Image Quality Assessment

链接: https://arxiv.org/abs/2411.07556
作者: Li Yu
关键词-EN: Image Quality Assessment, numerous recent studies, Quality Assessment, Image Quality, No-Reference Image Quality
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Due to the scarcity of labeled samples in Image Quality Assessment (IQA) datasets, numerous recent studies have proposed multi-task based strategies, which explore feature information from other tasks or domains to boost the IQA task. Nevertheless, multi-task based No-Reference Image Quality Assessment (NR-IQA) methods encounter several challenges. First, existing methods have not explicitly exploited texture details, which significantly influence image quality. Second, multi-task methods conventionally integrate features through simple operations such as addition or concatenation, thereby diminishing the network's capacity to accurately represent distorted features. To tackle these challenges, we introduce a novel multi-task NR-IQA framework. Our framework consists of three key components: a high-frequency extraction network, a quality estimation network, and a distortion-aware network. The high-frequency extraction network is designed to guide the model's focus towards high-frequency information, which is highly related to texture details. Meanwhile, the distortion-aware network extracts distortion-related features to distinguish different distortion types. To effectively integrate features from different tasks, a feature fusion module is developed based on an attention mechanism. Empirical results from five standard IQA databases confirm that our method not only achieves high performance but also exhibits robust generalization ability.

[CV-40] GaussianCut: Interactive segmentation via graph cut for 3D Gaussian Splatting

链接: https://arxiv.org/abs/2411.07555
作者: Umangi Jain,Ashkan Mirzaei,Igor Gilitschenski
关键词-EN: interactive multiview segmentation, method for interactive, interactive multiview, Gaussians, scene Gaussians
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce GaussianCut, a new method for interactive multiview segmentation of scenes represented as 3D Gaussians. Our approach allows for selecting the objects to be segmented by interacting with a single view. It accepts intuitive user input, such as point clicks, coarse scribbles, or text. Using 3D Gaussian Splatting (3DGS) as the underlying scene representation simplifies the extraction of objects of interest which are considered to be a subset of the scene’s Gaussians. Our key idea is to represent the scene as a graph and use the graph-cut algorithm to minimize an energy function to effectively partition the Gaussians into foreground and background. To achieve this, we construct a graph based on scene Gaussians and devise a segmentation-aligned energy function on the graph to combine user inputs with scene properties. To obtain an initial coarse segmentation, we leverage 2D image/video segmentation models and further refine these coarse estimates using our graph construction. Our empirical evaluations show the adaptability of GaussianCut across a diverse set of scenes. GaussianCut achieves competitive performance with state-of-the-art approaches for 3D segmentation without requiring any additional segmentation-aware training.
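
The graph-cut step can be illustrated on a toy graph: Gaussians as nodes, terminal capacities from user evidence, pairwise capacities from affinity, and an s-t minimum cut as the partition. All capacities below are made-up numbers; the paper's energy terms are more elaborate.

```python
# Toy s-t minimum cut over four "Gaussians" using networkx.
import networkx as nx

G = nx.DiGraph()
unary = {0: 0.9, 1: 0.8, 2: 0.2, 3: 0.1}            # P(foreground) per Gaussian
pairwise = {(0, 1): 0.7, (1, 2): 0.3, (2, 3): 0.6}  # spatial/color affinity

for n, p in unary.items():
    G.add_edge("src", n, capacity=p)             # cost of labeling n background
    G.add_edge(n, "sink", capacity=1.0 - p)      # cost of labeling n foreground
for (a, b), w in pairwise.items():
    G.add_edge(a, b, capacity=w)                 # smoothness in both directions
    G.add_edge(b, a, capacity=w)

cut_value, (fg, bg) = nx.minimum_cut(G, "src", "sink")
print("foreground Gaussians:", sorted(fg - {"src"}))   # here: [0, 1]
```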

[CV-41] Depthwise Separable Convolutions with Deep Residual Convolutions

链接: https://arxiv.org/abs/2411.07544
作者: Md Arid Hasan,Krishno Dey
关键词-EN: computing enables researchers, Xception architecture, edge computing enables, optimize Xception architecture, Xception
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Course Project Report

点击查看摘要

Abstract:The recent advancement of edge computing enables researchers to optimize various deep learning architectures for deployment on edge devices. In this study, we aim to optimize the Xception architecture, one of the most popular deep learning architectures for computer vision applications. The Xception architecture is highly effective for object detection tasks; however, it comes with a significant computational cost, which sometimes hinders its deployment on resource-constrained edge devices. To address this, we propose an optimized Xception architecture tailored for edge devices, aiming for lightweight and efficient deployment. We combine the depthwise separable convolutions of the Xception architecture with deep residual convolutions to develop a small and efficient model for edge devices. The resultant architecture reduces parameters, memory usage, and computational load. The proposed architecture is evaluated on the CIFAR-10 object detection dataset. Our evaluation shows that the proposed architecture is smaller in parameter size and requires less training time, while outperforming the original Xception architecture.
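
The parameter saving of depthwise separable convolutions is easy to verify: for C_in = C_out = 128 and a 3×3 kernel, a standard convolution needs 128·128·9 = 147,456 weights, while a depthwise (128·9 = 1,152) plus pointwise (128·128 = 16,384) pair needs 17,536, roughly 8.4× fewer. Below is a minimal residual block in this style; the paper's exact block layout is not reproduced.

```python
# Depthwise separable convolution with a residual connection: a per-channel
# spatial conv (groups=ch) followed by a 1x1 pointwise conv.
import torch
import torch.nn as nn

class DSConvResidual(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, 3, padding=1, groups=ch, bias=False)
        self.pointwise = nn.Conv2d(ch, ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(ch)
        self.act = nn.ReLU()

    def forward(self, x):
        y = self.pointwise(self.depthwise(x))
        return self.act(self.bn(y) + x)       # deep residual connection

block = DSConvResidual()
n_params = sum(p.numel() for p in block.parameters())
print(n_params)   # dominated by the 128*9 + 128*128 conv weights
```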

[CV-42] HiCoM: Hierarchical Coherent Motion for Streamable Dynamic Scene with 3D Gaussian Splatting NEURIPS2024

链接: https://arxiv.org/abs/2411.07541
作者: Qiankun Gao,Jiarui Meng,Chengxiang Wen,Jie Chen,Jian Zhang
关键词-EN: faces significant challenges, multi-view streaming videos, streaming videos faces, videos faces significant, online reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024; Code is avaliable at this https URL

点击查看摘要

Abstract:The online reconstruction of dynamic scenes from multi-view streaming videos faces significant challenges in training, rendering and storage efficiency. Harnessing superior learning speed and real-time rendering capabilities, 3D Gaussian Splatting (3DGS) has recently demonstrated considerable potential in this field. However, 3DGS can be inefficient in terms of storage and prone to overfitting by excessively growing Gaussians, particularly with limited views. This paper proposes an efficient framework, dubbed HiCoM, with three key components. First, we construct a compact and robust initial 3DGS representation using a perturbation smoothing strategy. Next, we introduce a Hierarchical Coherent Motion mechanism that leverages the inherent non-uniform distribution and local consistency of 3D Gaussians to swiftly and accurately learn motions across frames. Finally, we continually refine the 3DGS with additional Gaussians, which are later merged into the initial 3DGS to maintain consistency with the evolving scene. To preserve a compact representation, an equivalent number of low-opacity Gaussians that minimally impact the representation are removed before processing subsequent frames. Extensive experiments conducted on two widely used datasets show that our framework improves learning efficiency of the state-of-the-art methods by about 20% and reduces the data storage by 85% , achieving competitive free-viewpoint video synthesis quality but with higher robustness and stability. Moreover, by parallel learning multiple frames simultaneously, our HiCoM decreases the average training wall time to 2 seconds per frame with negligible performance degradation, substantially boosting real-world applicability and responsiveness.

[CV-43] GUS-IR: Gaussian Splatting with Unified Shading for Inverse Rendering

链接: https://arxiv.org/abs/2411.07478
作者: Zhihao Liang,Hongdong Li,Kui Jia,Kailing Guo,Qi Zhang
关键词-EN: inverse rendering problem, inverse rendering, intrinsic physical attributes, rendering problem, generally termed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 11 figures

点击查看摘要

Abstract:Recovering the intrinsic physical attributes of a scene from images, generally termed the inverse rendering problem, has been a central and challenging task in computer vision and computer graphics. In this paper, we present GUS-IR, a novel framework designed to address the inverse rendering problem for complicated scenes featuring rough and glossy surfaces. This paper starts by analyzing and comparing the effectiveness of two prominent shading techniques popularly used for inverse rendering, forward shading and deferred shading, in handling complex materials. More importantly, we propose a unified shading solution that combines the advantages of both techniques for better decomposition. In addition, we analyze normal modeling in 3D Gaussian Splatting (3DGS) and utilize the shortest axis as the normal for each particle in GUS-IR, along with a depth-related regularization, resulting in improved geometric representation and better shape reconstruction. Furthermore, we enhance the probe-based baking scheme proposed by GS-IR to achieve more accurate ambient occlusion modeling and to better handle indirect illumination. Extensive experiments have demonstrated the superior performance of GUS-IR in achieving precise intrinsic decomposition and geometric representation, supporting many downstream tasks (such as relighting and retouching) in computer vision, graphics, and extended reality.

[CV-44] Semi-Truths: A Large-Scale Dataset of AI-Augmented Images for Evaluating Robustness of AI-Generated Image detectors NEURIPS2024

链接: https://arxiv.org/abs/2411.07472
作者: Anisha Pal,Julia Kruk,Mansi Phute,Manognya Bhattaram,Diyi Yang,Duen Horng Chau,Judy Hoffman
关键词-EN: pose significant risks, applications in art, dissemination of misinformation, impactful applications, technologies also pose
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:Text-to-image diffusion models have impactful applications in art, design, and entertainment, yet these technologies also pose significant risks by enabling the creation and dissemination of misinformation. Although recent advancements have produced AI-generated image detectors that claim robustness against various augmentations, their true effectiveness remains uncertain. Do these detectors reliably identify images with different levels of augmentation? Are they biased toward specific scenes or data distributions? To investigate, we introduce SEMI-TRUTHS, featuring 27,600 real images, 223,400 masks, and 1,472,700 AI-augmented images that feature targeted and localized perturbations produced using diverse augmentation techniques, diffusion models, and data distributions. Each augmented image is accompanied by metadata for standardized and targeted evaluation of detector robustness. Our findings suggest that state-of-the-art detectors exhibit varying sensitivities to the types and degrees of perturbations, data distributions, and augmentation methods used, offering new insights into their performance and limitations. The code for the augmentation and evaluation pipeline is available at this https URL.

[CV-45] MSEG-VCUQ: Multimodal SEGmentation with Enhanced Vision Foundation Models Convolutional Neural Networks and Uncertainty Quantification for High-Speed Video Phase Detection Data

链接: https://arxiv.org/abs/2411.07463
作者: Chika Maduabuchi,Ericmoore Jossou,Matteo Bucci
关键词-EN: High-speed video, chemical processing, nuclear reactors, detecting vapor, vital in nuclear
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Under Review in EAAI

点击查看摘要

Abstract:Purpose: High-speed video (HSV) phase detection (PD) segmentation is vital in nuclear reactors, chemical processing, and electronics cooling for detecting vapor, liquid, and microlayer phases. Traditional segmentation models face pixel-level accuracy and generalization issues in multimodal data. MSEG-VCUQ introduces VideoSAM, a hybrid framework leveraging convolutional neural networks (CNNs) and transformer-based vision models to enhance segmentation accuracy and generalizability across complex multimodal PD tasks. Methods: VideoSAM combines U-Net CNN and the Segment Anything Model (SAM) for advanced feature extraction and segmentation across diverse HSV PD modalities, spanning fluids like water, FC-72, nitrogen, and argon under varied heat flux conditions. The framework also incorporates uncertainty quantification (UQ) to assess pixel-based discretization errors, delivering reliable metrics such as contact line density and dry area fraction under experimental conditions. Results: VideoSAM outperforms SAM and modality-specific CNN models in segmentation accuracy, excelling in environments with complex phase boundaries, overlapping bubbles, and dynamic liquid-vapor interactions. Its hybrid architecture supports cross-dataset generalization, adapting effectively to varying modalities. The UQ module provides accurate error estimates, enhancing the reliability of segmentation outputs for advanced HSV PD research. Conclusion: MSEG-VCUQ, via VideoSAM, offers a robust solution for HSV PD segmentation, addressing previous limitations with advanced deep learning and UQ techniques. The open-source datasets and tools introduced enable scalable, precise, and adaptable segmentation for multimodal PD datasets, supporting advancements in HSV analysis and autonomous experimentation. The codes and data used for this paper are publicly available at: this https URL

[CV-46] MureObjectStitch: Multi-reference Image Composition

链接: https://arxiv.org/abs/2411.07462
作者: Jiaxuan Chen,Bo Zhang,Li Niu
关键词-EN: Generative image composition, realistic composite image, image composition aims, foreground object, Generative image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generative image composition aims to regenerate the given foreground object in the background image to produce a realistic composite image. In this work, we propose an effective finetuning strategy for generative image composition model, in which we finetune a pretrained model using one or more images containing the same foreground object. Moreover, we propose a multi-reference strategy, which allows the model to take in multiple reference images of the foreground object. The experiments on MureCOM dataset verify the effectiveness of our method.

[CV-47] Tracing the Roots: Leveraging Temporal Dynamics in Diffusion Trajectories for Origin Attribution

链接: https://arxiv.org/abs/2411.07449
作者: Andreas Floros,Seyed-Mohsen Moosavi-Dezfooli,Pier Luigi Dragotti
关键词-EN: revolutionized image synthesis, garnering significant research, significant research interest, recent years, research interest
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models have revolutionized image synthesis, garnering significant research interest in recent years. Diffusion is an iterative algorithm in which samples are generated step-by-step, starting from pure noise. This process introduces the notion of diffusion trajectories, i.e., paths from the standard Gaussian distribution to the target image distribution. In this context, we study discriminative algorithms operating on these trajectories. Specifically, given a pre-trained diffusion model, we consider the problem of classifying images as part of the training dataset, generated by the model or originating from an external source. Our approach demonstrates the presence of patterns across steps that can be leveraged for classification. We also conduct ablation studies, which reveal that using higher-order gradient features to characterize the trajectories leads to significant performance gains and more robust algorithms.
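As an illustration of trajectory-based featurization, the following hedged sketch uses finite differences of per-step latents as a stand-in for the higher-order gradient features the abstract mentions; the feature construction and classifier choice are assumptions, not the paper's exact pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def trajectory_features(latents: np.ndarray) -> np.ndarray:
    """latents: (T, D) array of flattened diffusion states x_T ... x_0.
    Stack per-step norms of first- and second-order differences along
    the trajectory as a simple feature vector."""
    d1 = np.diff(latents, n=1, axis=0)   # first-order changes between steps
    d2 = np.diff(latents, n=2, axis=0)   # second-order changes (curvature)
    return np.concatenate([np.linalg.norm(d1, axis=1),
                           np.linalg.norm(d2, axis=1)])

# X = np.stack([trajectory_features(t) for t in trajectories])
# y labels: training-set member / model-generated / external source
# clf = LogisticRegression(max_iter=1000).fit(X, y)
```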

[CV-48] All-in-one Weather-degraded Image Restoration via Adaptive Degradation-aware Self-prompting Model

链接: https://arxiv.org/abs/2411.07445
作者: Yuanbo Wen,Tao Gao,Ziqi Li,Jing Zhang,Kaihao Zhang,Ting Chen
关键词-EN: weather-degraded image restoration, leveraging degradation-aware priors, Existing approaches, image restoration suffer, weather-degraded image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing approaches for all-in-one weather-degraded image restoration suffer from inefficiencies in leveraging degradation-aware priors, resulting in sub-optimal performance in adapting to different weather conditions. To this end, we develop an adaptive degradation-aware self-prompting model (ADSM) for all-in-one weather-degraded image restoration. Specifically, our model employs the contrastive language-image pre-training model (CLIP) to facilitate the training of our proposed latent prompt generators (LPGs), which represent three types of latent prompts to characterize the degradation type, degradation property and image caption. Moreover, we integrate the acquired degradation-aware prompts into the time embedding of diffusion model to improve degradation perception. Meanwhile, we employ the latent caption prompt to guide the reverse sampling process using the cross-attention mechanism, thereby guiding the accurate image reconstruction. Furthermore, to accelerate the reverse sampling procedure of diffusion model and address the limitations of frequency perception, we introduce a wavelet-oriented noise estimating network (WNE-Net). Extensive experiments conducted on eight publicly available datasets demonstrate the effectiveness of our proposed approach in both task-specific and all-in-one applications.

[CV-49] XPoint: A Self-Supervised Visual-State-Space based Architecture for Multispectral Image Registration

链接: https://arxiv.org/abs/2411.07430
作者: Ismail Can Yagmur,Hasan F. Ates,Bahadir K. Gunturk
关键词-EN: presents significant challenges, significant challenges due, Accurate multispectral image, non-linear intensity variations, matching presents significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 11 figures, 1 table, Journal

点击查看摘要

Abstract:Accurate multispectral image matching presents significant challenges due to non-linear intensity variations across spectral modalities, extreme viewpoint changes, and the scarcity of labeled datasets. Current state-of-the-art methods are typically specialized for a single spectral difference, such as visible-infrared, and struggle to adapt to other modalities due to their reliance on expensive supervision, such as depth maps or camera poses. To address the need for rapid adaptation across modalities, we introduce XPoint, a self-supervised, modular image-matching framework designed for adaptive training and fine-tuning on aligned multispectral datasets, allowing users to customize key components based on their specific tasks. XPoint employs modularity and self-supervision to allow for the adjustment of elements such as the base detector, which generates pseudo-ground-truth keypoints invariant to viewpoint and spectrum variations. The framework integrates a VMamba encoder, pretrained on segmentation tasks, for robust feature extraction, and includes three joint decoder heads: two are dedicated to interest point and descriptor extraction; and a task-specific homography regression head imposes geometric constraints for superior performance in tasks like image registration. This flexible architecture enables quick adaptation to a wide range of modalities, demonstrated by training on Optical-Thermal data and fine-tuning on settings such as visual-near infrared, visual-infrared, visual-longwave infrared, and visual-synthetic aperture radar. Experimental results show that XPoint consistently outperforms or matches state-of-the-art methods in feature matching and image registration tasks across five distinct multispectral datasets. Our source code is available at this https URL.

[CV-50] Generalization of Brady-Yong Algorithm for Fast Hough Transform to Arbitrary Image Size

链接: https://arxiv.org/abs/2411.07351
作者: Danil Kazimirov,Dmitry Nikolaev,Ekaterina Rybakova,Arseniy Terekhin
关键词-EN: X-ray computed tomography, widespread tool harnessed, discrete Radon, general image processing, processing to X-ray
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 2 figures. Accepted to Symposium on Pattern Recognition and Applications 2024 (SPRA 2024)

点击查看摘要

Abstract:Nowadays, the Hough (discrete Radon) transform (HT/DRT) has proved to be an extremely powerful and widespread tool harnessed in a number of application areas, ranging from general image processing to X-ray computed tomography. Efficient utilization of the HT to solve applied problems demands its acceleration and increased accuracy. Along with this, most fast algorithms for computing the HT, especially the pioneering Brady-Yong algorithm, operate on power-of-two size input images and are not adapted for arbitrary size images. This paper presents a new algorithm for calculating the HT for images of arbitrary size. It generalizes the Brady-Yong algorithm from which it inherits the optimal computational complexity. Moreover, the algorithm makes it possible to compute the HT with considerably higher accuracy than the existing algorithm. In addition, the paper provides a theoretical analysis of the computational complexity and accuracy of the proposed algorithm. The conclusions drawn from the experiments conform with the theoretical results.
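For readers who want a reference point, a brute-force discrete Radon transform for arbitrary-size images is available in scikit-image; the sketch below is only a slow baseline for comparison, not the paper's fast generalized Brady-Yong algorithm:

```python
import numpy as np
from skimage.transform import radon

# Arbitrary (non power-of-two) image size -- the case the paper targets.
image = np.zeros((100, 75))
image[40:60, 30:45] = 1.0

theta = np.linspace(0.0, 180.0, 180, endpoint=False)
# circle=False pads the image, so any rectangular size is accepted.
sinogram = radon(image, theta=theta, circle=False)  # brute-force reference DRT
print(sinogram.shape)  # (projection_length, number_of_angles)
```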

[CV-51] Exploring Variational Autoencoders for Medical Image Generation: A Comprehensive Study

链接: https://arxiv.org/abs/2411.07348
作者: Khadija Rais,Mohamed Amroune,Abdelmadjid Benmachiche,Mohamed Yassine Haouam
关键词-EN: shown advanced researchers, Variational autoencoder, common techniques, shown advanced, advanced researchers
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: for associated mpeg file, see this https URL

点击查看摘要

Abstract:The variational autoencoder (VAE) is one of the most common techniques in the field of medical image generation; the architecture has attracted growing research interest in recent years and has developed into numerous variants. VAEs are advantageous for enriching small datasets and datasets with imbalanced classes by generating additional samples, which is the essence of data augmentation. This paper provides a comprehensive review of studies on VAEs in medical imaging, with a special focus on their ability to create synthetic images close to real data so that they can be used for data augmentation. It reviews the important architectures and methods used to develop VAEs for medical images and compares them with other generative models such as GANs on issues such as image quality and the diversity of generated samples. We discuss recent developments and applications in several medical fields, highlighting the ability of VAEs to improve segmentation and classification accuracy.

[CV-52] SE(3) Equivariant Ray Embeddings for Implicit Multi-View Depth Estimation NEURIPS2024

链接: https://arxiv.org/abs/2411.07326
作者: Yinshuang Xu,Dian Chen,Katherine Liu,Sergey Zakharov,Rares Ambrus,Kostas Daniilidis,Vitor Guizilini
关键词-EN: Incorporating inductive bias, embedding geometric entities, Incorporating inductive, bias by embedding, input has proven
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Incorporating inductive bias by embedding geometric entities (such as rays) as input has proven successful in multi-view learning. However, the methods adopting this technique typically lack equivariance, which is crucial for effective 3D learning. Equivariance serves as a valuable inductive prior, aiding in the generation of robust multi-view features for 3D scene understanding. In this paper, we explore the application of equivariant multi-view learning to depth estimation, not only recognizing its significance for computer vision and robotics but also addressing the limitations of previous research. Most prior studies have either overlooked equivariance in this setting or achieved only approximate equivariance through data augmentation, which often leads to inconsistencies across different reference frames. To address this issue, we propose to embed SE(3) equivariance into the Perceiver IO architecture. We employ Spherical Harmonics for positional encoding to ensure 3D rotation equivariance, and develop a specialized equivariant encoder and decoder within the Perceiver IO architecture. To validate our model, we applied it to the task of stereo depth estimation, achieving state-of-the-art results on real-world datasets without explicit geometric constraints or extensive data augmentation.
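A small sketch of spherical-harmonics positional encoding for ray directions, in the spirit of the rotation-equivariant encoding described above; the real-SH construction and degree cutoff here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.special import sph_harm

def sh_encoding(directions: np.ndarray, max_degree: int = 3) -> np.ndarray:
    """Encode unit ray directions (N, 3) with spherical harmonics up to
    `max_degree`, giving features that transform predictably under 3D
    rotations (the basis of an equivariant positional encoding)."""
    x, y, z = directions.T
    theta = np.arctan2(y, x) % (2 * np.pi)   # azimuthal angle
    phi = np.arccos(np.clip(z, -1.0, 1.0))   # polar angle
    feats = []
    for l in range(max_degree + 1):
        for m in range(-l, l + 1):
            Y = sph_harm(m, l, theta, phi)   # complex-valued harmonic
            # A common real-valued construction (up to normalization).
            feats.append(Y.real if m >= 0 else Y.imag)
    return np.stack(feats, axis=-1)          # (N, (max_degree + 1)**2)
```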

[CV-53] GPU-Accelerated Inverse Lithography Towards High Quality Curvy Mask Generation

链接: https://arxiv.org/abs/2411.07311
作者: Haoyu Yang,Haoxing Ren
关键词-EN: Inverse Lithography Technology, Inverse Lithography, Lithography Technology, photo mask design, promising solution
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
*备注: 10 pages, 5 figures, Accepted by International Symposium on Physical Design (ISPD), 2025, Austin TX

点击查看摘要

Abstract:Inverse Lithography Technology (ILT) has emerged as a promising solution for photo mask design and optimization. Relying on multi-beam mask writers, ILT enables the creation of free-form curvilinear mask shapes that enhance printed wafer image quality and process window. However, a major challenge in implementing curvilinear ILT for large-scale production is mask rule checking, an area currently under development by foundries and EDA vendors. Although recent research has incorporated mask complexity into the optimization process, much of it focuses on reducing e-beam shots, which does not align with the goals of curvilinear ILT. In this paper, we introduce a GPU-accelerated ILT algorithm that improves not only contour quality and process window but also the precision of curvilinear mask shapes. Our experiments on open benchmarks demonstrate a significant advantage of our algorithm over leading academic ILT engines.

[CV-54] ViTOC: Vision Transformer and Object-aware Captioner

链接: https://arxiv.org/abs/2411.07265
作者: Feiyang Huang
关键词-EN: Vision Transformer, Object-aware Captioner, paper presents ViTOC, generated descriptions, paper presents
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents ViTOC (Vision Transformer and Object-aware Captioner), a novel vision-language model for image captioning that addresses the challenges of accuracy and diversity in generated descriptions. Unlike conventional approaches, ViTOC employs a dual-path architecture based on Vision Transformer and object detector, effectively fusing global visual features and local object information through learnable vectors. The model introduces an innovative object-aware prompting strategy that significantly enhances its capability in handling long-tail data. Experiments on the standard COCO dataset demonstrate that ViTOC outperforms baseline models across all evaluation metrics, achieving 71.26 and 17.82 on CIDEr and SPICE, respectively. Additionally, we propose a reference-free evaluation method based on CLIP to further validate the model’s effectiveness. By utilizing pretrained visual model parameters, ViTOC achieves efficient end-to-end training.
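A hedged sketch of the CLIP-based reference-free evaluation idea: score a caption by the cosine similarity between CLIP image and text embeddings. The checkpoint and scoring details are assumptions, not necessarily what the paper uses:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings,
    usable as a reference-free caption quality signal."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```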

[CV-55] Commissioning An All-Sky Infrared Camera Array for Detection Of Airborne Objects

链接: https://arxiv.org/abs/2411.07956
作者: Laura Dominé,Ankit Biswas,Richard Cloete,Alex Delacroix,Andriy Fedorenko,Lucas Jacaruso,Ezra Kelderman,Eric Keto,Sarah Little,Abraham Loeb,Eric Masson,Mike Prior,Forrest Schultz,Matthew Szenher,Wes Watters,Abby White
关键词-EN: Unidentified Aerial Phenomena, Aerial Phenomena, Unidentified Aerial, kinematics purportedly reside, publicly available scientific
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:To date there is little publicly available scientific data on Unidentified Aerial Phenomena (UAP) whose properties and kinematics purportedly reside outside the performance envelope of known phenomena. To address this deficiency, the Galileo Project is designing, building, and commissioning a multi-modal ground-based observatory to continuously monitor the sky and conduct a rigorous long-term aerial census of all aerial phenomena, including natural and human-made. One of the key instruments is an all-sky infrared camera array using eight uncooled long-wave infrared FLIR Boson 640 cameras. Their calibration includes a novel extrinsic calibration method using airplane positions from Automatic Dependent Surveillance-Broadcast (ADS-B) data. We establish a first baseline for the system performance over five months of field operation, using a real-world dataset derived from ADS-B data, synthetic 3-D trajectories, and a hand-labelled real-world dataset. We report acceptance rates (e.g. viewable airplanes that are recorded) and detection efficiencies (e.g. recorded airplanes which are successfully detected) for a variety of weather conditions, range and aircraft size. We reconstruct approximately 500,000 trajectories of aerial objects from this commissioning period. A toy outlier search focused on large sinuosity of the 2-D reconstructed trajectories flags about 16% of trajectories as outliers. After manual review, 144 trajectories remain ambiguous: they are likely mundane objects but cannot be elucidated at this stage of development without distance and kinematics estimation or other sensor modalities. Our observed count of ambiguous outliers combined with systematic uncertainties yields an upper limit of 18,271 outliers for the five-month interval at a 95% confidence level. This likelihood-based method to evaluate significance is applicable to all of our future outlier searches.
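The sinuosity criterion behind the toy outlier search can be stated in a few lines; this sketch assumes trajectories are (N, 2) arrays and that the flagging threshold is user-chosen:

```python
import numpy as np

def sinuosity(track: np.ndarray) -> float:
    """track: (N, 2) reconstructed 2-D trajectory. Sinuosity is the
    ratio of the traversed path length to the straight-line
    end-to-end distance (1.0 = perfectly straight)."""
    path = np.linalg.norm(np.diff(track, axis=0), axis=1).sum()
    chord = np.linalg.norm(track[-1] - track[0])
    return path / max(chord, 1e-9)

# flags = [sinuosity(t) > threshold for t in trajectories]
```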

[CV-56] LapGSR: Laplacian Reconstructive Network for Guided Thermal Super-Resolution

链接: https://arxiv.org/abs/2411.07750
作者: Aditya Kasliwal,Ishaan Gakhar,Aryan Kamani,Pratinav Seth,Ujjwal Verma
关键词-EN: autonomous navigation, fusion of multi-modal, widely studied, RGB color images, gesture recognition
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the last few years, the fusion of multi-modal data has been widely studied for various applications such as robotics, gesture recognition, and autonomous navigation. Indeed, high-quality visual sensors are expensive, and consumer-grade sensors produce low-resolution images. Researchers have developed methods to combine RGB color images with non-visual data, such as thermal, to overcome this limitation to improve resolution. Fusing multiple modalities to produce visually appealing, high-resolution images often requires dense models with millions of parameters and a heavy computational load, which is commonly attributed to the intricate architecture of the model. We propose LapGSR, a multimodal, lightweight, generative model incorporating Laplacian image pyramids for guided thermal super-resolution. This approach uses a Laplacian Pyramid on RGB color images to extract vital edge information, which is then used to bypass heavy feature map computation in the higher layers of the model in tandem with a combined pixel and adversarial loss. LapGSR preserves the spatial and structural details of the image while also being efficient and compact. This results in a model with significantly fewer parameters than other SOTA models while demonstrating excellent results on two cross-domain datasets, ULB17-VT and VGTSR.
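A minimal sketch of the Laplacian pyramid decomposition that supplies the edge information described above, using OpenCV; the level count and dtype handling are illustrative choices:

```python
import cv2

def laplacian_pyramid(rgb, levels: int = 3):
    """Build a Laplacian pyramid; the band-pass levels carry the edge
    information used to guide thermal super-resolution.
    Use a float32 image so band-pass values can be negative."""
    pyramid, current = [], rgb
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        pyramid.append(cv2.subtract(current, up))  # band-pass (edges)
        current = down
    pyramid.append(current)  # low-frequency residual
    return pyramid
```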

[CV-57] SegQC: a segmentation network-based framework for multi-metric segmentation quality control and segmentation error detection in volumetric medical images

链接: https://arxiv.org/abs/2411.07601
作者: Bella Specktor-Fadida,Liat Ben-Sira,Dafna Ben-Bashat,Leo Joskowicz
关键词-EN: segmentation error, facilitating model development, segmentation, error, segmentation error probabilities
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 28 pages, 9 figures

点击查看摘要

Abstract:Quality control of structures segmentation in volumetric medical images is important for identifying segmentation errors in clinical practice and for facilitating model development. This paper introduces SegQC, a novel framework for segmentation quality estimation and segmentation error detection. SegQC computes an estimate measure of the quality of a segmentation in volumetric scans and in their individual slices and identifies possible segmentation error regions within a slice. The key components include: 1. SegQC-Net, a deep network that inputs a scan and its segmentation mask and outputs segmentation error probabilities for each voxel in the scan; 2. three new segmentation quality metrics, two overlap metrics and a structure size metric, computed from the segmentation error probabilities; 3. a new method for detecting possible segmentation errors in scan slices computed from the segmentation error probabilities. We introduce a new evaluation scheme to measure segmentation error discrepancies based on an expert radiologist's corrections of automatically produced segmentations, which yields smaller observer variability and is closer to actual segmentation errors. We demonstrate SegQC on three fetal structures in 198 fetal MRI scans: fetal brain, fetal body and the placenta. To assess the benefits of SegQC, we compare it to the unsupervised Test Time Augmentation (TTA)-based quality estimation. Our studies indicate that SegQC outperforms TTA-based quality estimation in terms of Pearson correlation and MAE for fetal body and fetal brain structures segmentation. Our segmentation error detection method achieved recall and precision rates of 0.77 and 0.48 for fetal body, and 0.74 and 0.55 for fetal brain segmentation error detection, respectively. SegQC enhances segmentation metrics estimation for whole scans and individual slices, as well as provides error regions detection.

[CV-58] Uncertainty-Aware Test-Time Adaptation for Inverse Consistent Diffeomorphic Lung Image Registration

链接: https://arxiv.org/abs/2411.07567
作者: Muhammad F. A. Chaudhary,Stephanie M. Aguilera,Arie Nakhmani,Joseph M. Reinhardt,Surya P. Bhatt,Sandeep Bodduluri
关键词-EN: ensures smooth invertible, smooth invertible transformations, Diffeomorphic deformable image, deformable image registration, image registration ensures
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:Diffeomorphic deformable image registration ensures smooth invertible transformations across inspiratory and expiratory chest CT scans. Yet, in practice, deep learning-based diffeomorphic methods struggle to capture large deformations between inspiratory and expiratory volumes, and therefore lack inverse consistency. Existing methods also fail to account for model uncertainty, which can be useful for improving performance. We propose an uncertainty-aware test-time adaptation framework for inverse consistent diffeomorphic lung registration. Our method uses Monte Carlo (MC) dropout to estimate spatial uncertainty that is used to improve model performance. We train and evaluate our method for inspiratory-to-expiratory CT registration on a large cohort of 675 subjects from the COPDGene study, achieving a higher Dice similarity coefficient (DSC) between the lung boundaries (0.966) compared to both VoxelMorph (0.953) and TransMorph (0.953). Our method demonstrates consistent improvements in the inverse registration direction as well with an overall DSC of 0.966, higher than VoxelMorph (0.958) and TransMorph (0.956). Paired t-tests indicate statistically significant improvements.
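A compact sketch of the Monte Carlo dropout uncertainty estimate the method relies on; it assumes the network contains dropout layers, with the batch-norm caveat noted in the comment:

```python
import torch

def mc_dropout_uncertainty(model, volume: torch.Tensor, n_samples: int = 20):
    """Keep dropout active at inference and average several stochastic
    forward passes; the per-voxel standard deviation serves as a
    spatial uncertainty map."""
    model.train()  # enables dropout (note: also affects batch-norm layers)
    with torch.no_grad():
        preds = torch.stack([model(volume) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```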

[CV-59] A Novel Automatic Real-time Motion Tracking Method for Magnetic Resonance Imaging-guided Radiotherapy: Leveraging the Enhanced Tracking-Learning-Detection Framework with Automatic Segmentation

链接: https://arxiv.org/abs/2411.07503
作者: Shengqi Chen,Zilin Wang,Jianrong Dai,Shirui Qin,Ying Cao,Ruiao Zhao,Jiayun Chen,Guohua Wu,Yuan Tang
关键词-EN: motion tracking, motion tracking method, tracking, ETLD, delivery of effective
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph); Tissues and Organs (q-bio.TO)
*备注:

点击查看摘要

Abstract:Objective: Ensuring the precision in motion tracking for MRI-guided Radiotherapy (MRIgRT) is crucial for the delivery of effective treatments. This study refined the motion tracking accuracy in MRIgRT through the innovation of an automatic real-time tracking method, leveraging an enhanced Tracking-Learning-Detection (ETLD) framework coupled with automatic segmentation. Methods: We developed a novel MRIgRT motion tracking method by integrating two primary methods: the ETLD framework and an improved Chan-Vese model (ICV), named ETLD+ICV. The TLD framework was upgraded to suit real-time cine MRI, including advanced image preprocessing, no-reference image quality assessment, an enhanced median-flow tracker, and a refined detector with dynamic search region adjustments. Additionally, ICV was combined for precise coverage of the target volume, which refined the segmented region frame by frame using tracking results, with key parameters optimized. Tested on 3.5D MRI scans from 10 patients with liver metastases, our method ensures precise tracking and accurate segmentation vital for MRIgRT. Results: An evaluation of 106,000 frames across 77 treatment fractions revealed sub-millimeter tracking errors of less than 0.8mm, with over 99% precision and 98% recall for all subjects, underscoring the robustness and efficacy of the ETLD. Moreover, the ETLD+ICV yielded a dice global score of more than 82% for all subjects, demonstrating the proposed method’s extensibility and precise target volume coverage. Conclusions: This study successfully developed an automatic real-time motion tracking method for MRIgRT that markedly surpasses current methods. The novel method not only delivers exceptional precision in tracking and segmentation but also demonstrates enhanced adaptability to clinical demands, positioning it as an indispensable asset in the quest to augment the efficacy of radiotherapy treatments.

[CV-60] Quantifying Knowledge Distillation Using Partial Information Decomposition NEURIPS2024

链接: https://arxiv.org/abs/2411.07483
作者: Pasan Dissanayake,Faisal Hamman,Barproda Halder,Ilia Sucholutsky,Qiuyi Zhang,Sanghamitra Dutta
关键词-EN: deploying complex machine, complex machine learning, machine learning models, resource-constrained environments, effective method
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Accepted at NeurIPS 2024 Machine Learning and Compression Workshop

点击查看摘要

Abstract:Knowledge distillation provides an effective method for deploying complex machine learning models in resource-constrained environments. It typically involves training a smaller student model to emulate either the probabilistic outputs or the internal feature representations of a larger teacher model. By doing so, the student model often achieves substantially better performance on a downstream task compared to when it is trained independently. Nevertheless, the teacher’s internal representations can also encode noise or additional information that may not be relevant to the downstream task. This observation motivates our primary question: What are the information-theoretic limits of knowledge transfer? To this end, we leverage a body of work in information theory called Partial Information Decomposition (PID) to quantify the distillable and distilled knowledge of a teacher’s representation corresponding to a given student and a downstream task. Moreover, we demonstrate that this metric can be practically used in distillation to address challenges caused by the complexity gap between the teacher and the student representations.
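For context, the "emulate the probabilistic outputs" setup the abstract refers to is standard logit-matching distillation; a sketch with an illustrative temperature, not the paper's PID machinery:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between temperature-softened teacher and student
    distributions -- the classic knowledge distillation objective."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```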

[CV-61] T2-Only Prostate Cancer Prediction by Meta-Learning from Bi-Parametric MR Imaging

链接: https://arxiv.org/abs/2411.07416
作者: Weixi Yi,Yipei Wang,Natasha Thorley,Alexander Ng,Shonit Punwani,Veeru Kasivisvanathan,Dean C. Barratt,Shaheer Ullah Saeed,Yipeng Hu
关键词-EN: Current imaging-based prostate, Current imaging-based, greater accuracy improvement, cancer diagnosis requires, potentially greater accuracy
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Code: this https URL

点击查看摘要

Abstract:Current imaging-based prostate cancer diagnosis requires both MR T2-weighted (T2w) and diffusion-weighted imaging (DWI) sequences, with additional sequences for potentially greater accuracy improvement. However, measuring diffusion patterns in DWI sequences can be time-consuming, prone to artifacts and sensitive to imaging parameters. While machine learning (ML) models have demonstrated radiologist-level accuracy in detecting prostate cancer from these two sequences, this study investigates the potential of ML-enabled methods using only the T2w sequence as input during inference time. We first discuss the technical feasibility of such a T2-only approach, and then propose a novel ML formulation, where DWI sequences - readily available for training purposes - are only used to train a meta-learning model, which subsequently only uses T2w sequences at inference. Using multiple datasets from more than 3,000 prostate cancer patients, we report superior or comparable performance in localising radiologist-identified prostate cancer using our proposed T2-only models, compared with alternative models using T2-only or both sequences as input. Real patient cases are presented and discussed to demonstrate, for the first time, the exclusively true-positive cases from models with different input sequences.

[CV-62] Artificial Intelligence-Informed Handheld Breast Ultrasound for Screening: A Systematic Review of Diagnostic Test Accuracy

链接: https://arxiv.org/abs/2411.07322
作者: Arianna Bunnell,Dustin Valdez,Fredrik Strand,Yannik Glaser,Peter Sadowski,John A. Shepherd
关键词-EN: Background, BUS, Preferred Reporting Items, Studies, Abstract
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Background. Breast cancer screening programs using mammography have led to significant mortality reduction in high-income countries. However, many low- and middle-income countries lack resources for mammographic screening. Handheld breast ultrasound (BUS) is a low-cost alternative but requires substantial training. Artificial intelligence (AI) enabled BUS may aid in both the detection (perception) and classification (interpretation) of breast cancer. Materials and Methods. This review (CRD42023493053) is reported in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analysis) and SWiM (Synthesis Without Meta-analysis) guidelines. PubMed and Google Scholar were searched from January 1, 2016 to December 12, 2023. A meta-analysis was not attempted. Studies are grouped according to their AI task type, application time, and AI task. Study quality is assessed using the QUality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. Results. Of 763 candidate studies, 314 total full texts were reviewed. 34 studies are included. The AI tasks of included studies are as follows: 1 frame selection, 6 detection, 11 segmentation, and 16 classification. In total, 5.7 million BUS images from over 185,000 patients were used for AI training or validation. A single study included a prospective testing set. 79% of studies were at high or unclear risk of bias. Conclusion. There has been encouraging development of AI for BUS. Despite studies demonstrating high performance across all identified tasks, the evidence supporting AI-enhanced BUS generally lacks robustness. High-quality model validation will be key to realizing the potential for AI-enhanced BUS in increasing access to screening in resource-limited environments.

机器学习

[LG-0] Optimal Control of Mechanical Ventilators with Learned Respiratory Dynamics

链接: https://arxiv.org/abs/2411.07971
作者: Isaac Ronald Ward,Dylan M. Asmar,Mansur Arief,Jana Krystofova Mike,Mykel J. Kochenderfer
关键词-EN: Respiratory Distress Syndrome, strategies significantly impacts, significantly impacts, Distress Syndrome, mechanical ventilator management
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 2024 IEEE 37th International Symposium on Computer-Based Medical Systems (CBMS), 7 pages, 3 figures

点击查看摘要

Abstract:Deciding on appropriate mechanical ventilator management strategies significantly impacts the health outcomes for patients with respiratory diseases. Acute Respiratory Distress Syndrome (ARDS) is one such disease that requires careful ventilator operation to be effectively treated. In this work, we frame the management of ventilators for patients with ARDS as a sequential decision making problem using the Markov decision process framework. We implement and compare controllers based on clinical guidelines contained in the ARDSnet protocol, optimal control theory, and learned latent dynamics represented as neural networks. The Pulse Physiology Engine’s respiratory dynamics simulator is used to establish a repeatable benchmark, gather simulated data, and quantitatively compare these controllers. We score performance in terms of measured improvement in established ARDS health markers (pertaining to improved respiratory rate, oxygenation, and vital signs). Our results demonstrate that techniques leveraging neural networks and optimal control can automatically discover effective ventilation management strategies without access to explicit ventilator management procedures or guidelines (such as those defined in the ARDSnet protocol).

[LG-1] Sleep Staging from Airflow Signals Using Fourier Approximations of Persistence Curves

链接: https://arxiv.org/abs/2411.07964
作者: Shashank Manjunath,Hau-Tieng Wu,Aarti Sathyanarayana
关键词-EN: typically manually performed, sleep technologists based, perform sleep staging, challenging task, typically manually
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sleep staging is a challenging task, typically manually performed by sleep technologists based on electroencephalogram and other biosignals of patients taken during overnight sleep studies. Recent work aims to leverage automated algorithms to perform sleep staging not based on electroencephalogram signals, but rather based on the airflow signals of subjects. Prior work uses ideas from topological data analysis (TDA), specifically Hermite function expansions of persistence curves (HEPC) to featurize airflow signals. However, finite order HEPC captures only partial information. In this work, we propose Fourier approximations of persistence curves (FAPC), and use this technique to perform sleep staging based on airflow signals. We analyze performance using an XGBoost model on 1155 pediatric sleep studies taken from the Nationwide Children’s Hospital Sleep DataBank (NCHSDB), and find that FAPC methods provide complementary information to HEPC methods alone, leading to a 4.9% increase in performance over baseline methods.
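A minimal sketch of the FAPC featurization step as described: sample a persistence curve on a uniform grid and keep the leading Fourier coefficient magnitudes. The grid, coefficient count, and downstream model are assumptions:

```python
import numpy as np

def fourier_persistence_features(curve: np.ndarray, n_coeffs: int = 16) -> np.ndarray:
    """curve: a persistence curve sampled on a uniform grid.
    Keep the magnitudes of its first `n_coeffs` Fourier coefficients
    as a fixed-length feature vector."""
    spectrum = np.fft.rfft(curve)
    return np.abs(spectrum[:n_coeffs])

# Features from each airflow epoch's persistence curve can then be
# fed to a gradient-boosted model, e.g. xgboost.XGBClassifier().
```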

[LG-2] On the Convergence of Continual Federated Learning Using Incrementally Aggregated Gradients

链接: https://arxiv.org/abs/2411.07959
作者: Satish Kumar Keshri,Nazreen Shah,Ranjitha Prasad
关键词-EN: Continual Federated Learning, enable Continual Federated, Continual Federated, propose Continual Federated, enhance the efficiency
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The holy grail of machine learning is to enable Continual Federated Learning (CFL) to enhance the efficiency, privacy, and scalability of AI systems while learning from streaming data. The primary challenge of a CFL system is to overcome global catastrophic forgetting, wherein the accuracy of the global model trained on new tasks declines on the old tasks. In this work, we propose Continual Federated Learning with Aggregated Gradients (C-FLAG), a novel replay-memory based federated strategy consisting of edge-based gradient updates on memory and aggregated gradients on the current data. We provide convergence analysis of the C-FLAG approach which addresses forgetting and bias while converging at a rate of O(1/\sqrt{T}) over T communication rounds. We formulate an optimization sub-problem that minimizes catastrophic forgetting, translating CFL into an iterative algorithm with adaptive learning rates that ensure seamless learning across tasks. We empirically show that C-FLAG outperforms several state-of-the-art baselines on both task and class-incremental settings with respect to metrics such as accuracy and forgetting.

[LG-3] Learning Memory Mechanisms for Decision Making through Demonstrations

链接: https://arxiv.org/abs/2411.07954
作者: William Yue,Bo Liu,Peter Stone
关键词-EN: Partially Observable Markov, Observable Markov Decision, Markov Decision Processes, Partially Observable, Observable Markov
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:In Partially Observable Markov Decision Processes, integrating an agent’s history into memory poses a significant challenge for decision-making. Traditional imitation learning, relying on observation-action pairs for expert demonstrations, fails to capture the expert’s memory mechanisms used in decision-making. To capture memory processes as demonstrations, we introduce the concept of memory dependency pairs (p, q), indicating that events at time p are recalled for decision-making at time q. We introduce AttentionTuner to leverage memory dependency pairs in Transformers and find significant improvements across several tasks compared to standard Transformers when evaluated on Memory Gym and the Long-term Memory Benchmark. Code is available at this https URL.
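One plausible way to leverage a memory dependency pair (p, q) is to supervise the attention that step q pays to step p; this hedged sketch is an interpretation of that idea, not AttentionTuner's actual loss:

```python
import torch
import torch.nn.functional as F

def memory_attention_loss(attn, pairs):
    """attn: (T, T) row-stochastic attention weights for one head
    (rows = query steps). For each memory dependency pair (p, q),
    encourage query step q to place attention mass on key step p
    via a negative log-likelihood term."""
    losses = [F.nll_loss((attn[q] + 1e-9).log().unsqueeze(0),
                         torch.tensor([p]))
              for p, q in pairs]
    return torch.stack(losses).mean()
```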

[LG-4] Prediction of Acoustic Communication Performance for AUVs using Gaussian Process Classification

链接: https://arxiv.org/abs/2411.07933
作者: Yifei Gao,Harun Yetkin,McMahon James,Daniel J. Stilwell
关键词-EN: Cooperating autonomous underwater, autonomous underwater vehicles, actions effectively, underwater acoustic communication, coordinate their actions
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cooperating autonomous underwater vehicles (AUVs) often rely on acoustic communication to coordinate their actions effectively. However, the reliability of underwater acoustic communication decreases as the communication range between vehicles increases. Consequently, teams of cooperating AUVs typically make conservative assumptions about the maximum range at which they can communicate reliably. To address this limitation, we propose a novel approach that involves learning a map representing the probability of successful communication based on the locations of the transmitting and receiving vehicles. This probabilistic communication map accounts for factors such as the range between vehicles, environmental noise, and multi-path effects at a given location. In pursuit of this goal, we investigate the application of Gaussian process binary classification to generate the desired communication map. We specialize existing results to this specific binary classification problem and explore methods to incorporate uncertainty in vehicle location into the mapping process. Furthermore, we compare the prediction performance of the probability communication map generated using binary classification with that of a signal-to-noise ratio (SNR) communication map generated using Gaussian process regression. Our approach is experimentally validated using communication and navigation data collected during trials with a pair of Virginia Tech 690 AUVs.
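A sketch of the Gaussian process binary classification step with scikit-learn, mapping transmitter/receiver positions to a probability of successful communication; the data, kernel, and length scale below are placeholders:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# X: transmitter/receiver positions stacked as (n, 4) = (tx_x, tx_y, rx_x, rx_y)
# y: 1 if the acoustic packet was received successfully, 0 otherwise
X = np.random.rand(200, 4) * 100.0            # placeholder field-trial data
y = (np.random.rand(200) > 0.4).astype(int)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=20.0)).fit(X, y)
p_success = gpc.predict_proba(X[:5])[:, 1]    # probabilistic communication map
```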

[LG-5] A Stochastic Optimization Framework for Private and Fair Learning From Decentralized Data

链接: https://arxiv.org/abs/2411.07889
作者: Devansh Gupta,A.S. Poornash,Andrew Lowy,Meisam Razaviyayn
关键词-EN: Machine learning models, Machine learning, medical records, trained on sensitive, learning models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models are often trained on sensitive data (e.g., medical records and race/gender) that is distributed across different “silos” (e.g., hospitals). These federated learning models may then be used to make consequential decisions, such as allocating healthcare resources. Two key challenges emerge in this setting: (i) maintaining the privacy of each person’s data, even if other silos or an adversary with access to the central server tries to infer this data; (ii) ensuring that decisions are fair to different demographic groups (e.g., race/gender). In this paper, we develop a novel algorithm for private and fair federated learning (FL). Our algorithm satisfies inter-silo record-level differential privacy (ISRL-DP), a strong notion of private FL requiring that silo i’s sent messages satisfy record-level differential privacy for all i. Our framework can be used to promote different fairness notions, including demographic parity and equalized odds. We prove that our algorithm converges under mild smoothness assumptions on the loss function, whereas prior work required strong convexity for convergence. As a byproduct of our analysis, we obtain the first convergence guarantee for ISRL-DP nonconvex-strongly concave min-max FL. Experiments demonstrate the state-of-the-art fairness-accuracy tradeoffs of our algorithm across different privacy levels.
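The record-level privacy mechanism underlying ISRL-DP can be sketched as per-record gradient clipping plus Gaussian noise inside each silo; this omits the fairness objective and is only an illustrative step, not the paper's full algorithm:

```python
import torch

def private_silo_update(per_record_grads: torch.Tensor,
                        clip_norm: float = 1.0,
                        noise_mult: float = 1.0) -> torch.Tensor:
    """Record-level DP step inside one silo: clip each record's gradient
    to `clip_norm`, average, and add Gaussian noise calibrated to the
    clipping norm (standard Gaussian mechanism)."""
    norms = per_record_grads.norm(dim=1, keepdim=True)
    clipped = per_record_grads * (clip_norm / norms).clamp(max=1.0)
    mean = clipped.mean(dim=0)
    noise = torch.randn_like(mean) * (noise_mult * clip_norm / len(per_record_grads))
    return mean + noise
```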

[LG-6] Evidential time-to-event prediction model with well-calibrated uncertainty estimation

链接: https://arxiv.org/abs/2411.07853
作者: Ling Huang,Yucheng Xing,Swapnil Mishra,Thierry Denoeux,Mengling Feng
关键词-EN: Survival analysis, treatment recommendations, valuable insights, prognosis and treatment, analysis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time-to-event analysis, or Survival analysis, provides valuable insights into clinical prognosis and treatment recommendations. However, this task is typically more challenging than other regression tasks due to the censored observations. Moreover, concerns regarding the reliability of predictions persist among clinicians, mainly attributed to the absence of confidence assessment, robustness, and calibration of prediction. To address those challenges, we introduce an evidential regression model designed especially for time-to-event prediction tasks, with which the most plausible event time is directly quantified by aggregated Gaussian random fuzzy numbers (GRFNs). The GRFNs are a newly introduced family of random fuzzy subsets of the real line that generalizes both Gaussian random variables and Gaussian possibility distributions. Different from conventional methods that construct models based on strict data distribution, e.g., proportional hazard function, our model only assumes the event time is encoded in a real-line GRFN without any strict distribution assumption, therefore offering more flexibility in complex data scenarios. Furthermore, the epistemic and aleatory uncertainty regarding the event time is quantified within the aggregated GRFN as well. Our model can, therefore, provide more detailed clinical decision-making guidance with two more degrees of information. The model is fit by minimizing a generalized negative log-likelihood function that accounts for data censoring based on uncertainty evidence reasoning. Experimental results on simulated datasets with varying data distributions and censoring scenarios, as well as on real-world datasets across diverse clinical settings and tasks, demonstrate that our model achieves both accurate and reliable performance, outperforming state-of-the-art methods.

[LG-7] FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

链接: https://arxiv.org/abs/2411.07837
作者: Philip Zmushko,Aleksandr Beznosikov,Martin Takáč,Samuel Horváth
关键词-EN: large language models, increasingly demands larger, demands larger volumes, volumes of GPU, GPU memory
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the increase in the number of parameters in large language models, the process of pre-training and fine-tuning increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA (Hu et al., 2021)), low-rank gradient projection (GaLore (Zhao et al., 2024)), and blockwise optimization (BAdam (Luo et al., 2024)) have been proposed. However, in all these algorithms, the effective rank of the weight updates remains low, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a new memory-efficient optimization framework. FRUGAL leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD (Bernstein et al., 2018). Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches across various fixed memory budgets, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
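A toy sketch of the gradient-splitting idea for a single flattened parameter: a stateful Adam-like update inside a chosen subspace (projection P) and state-free signSGD on the residual. The projection choice, state handling, and hyperparameters are all assumptions for illustration:

```python
import torch

def frugal_style_step(param, grad, P, adam_state, lr=1e-3, betas=(0.9, 0.999)):
    """param, grad: (d,) tensors; P: (d, r) orthonormal projection.
    The subspace component gets an Adam-like stateful update; the
    full-rank residual gets a state-free sign update."""
    g_low = P @ (P.T @ grad)           # component in the chosen subspace
    g_res = grad - g_low               # full-rank residual directions
    m, v, t = adam_state
    t += 1
    m = betas[0] * m + (1 - betas[0]) * g_low
    v = betas[1] * v + (1 - betas[1]) * g_low ** 2
    m_hat, v_hat = m / (1 - betas[0] ** t), v / (1 - betas[1] ** t)
    param -= lr * (m_hat / (v_hat.sqrt() + 1e-8) + g_res.sign())
    return param, (m, v, t)
```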

[LG-8] Dynamical-VAE-based Hindsight to Learn the Causal Dynamics of Factored-POMDPs

链接: https://arxiv.org/abs/2411.07832
作者: Chao Han,Debabrota Basu,Michael Mangan,Eleni Vasilaki,Aditya Gilra
关键词-EN: Markov Decision Processes, Partially Observable Markov, Observable Markov Decision, machine learning, underlying environmental dynamics
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Learning representations of underlying environmental dynamics from partial observations is a critical challenge in machine learning. In the context of Partially Observable Markov Decision Processes (POMDPs), state representations are often inferred from the history of past observations and actions. We demonstrate that incorporating future information is essential to accurately capture causal dynamics and enhance state representations. To address this, we introduce a Dynamical Variational Auto-Encoder (DVAE) designed to learn causal Markovian dynamics from offline trajectories in a POMDP. Our method employs an extended hindsight framework that integrates past, current, and multi-step future information within a factored-POMDP setting. Empirical results reveal that this approach uncovers the causal graph governing hidden state transitions more effectively than history-based and typical hindsight-based models.

[LG-9] Suite-IN: Aggregating Motion Features from Apple Suite for Robust Inertial Navigation

链接: https://arxiv.org/abs/2411.07828
作者: Lan Sun,Songpengcheng Xia,Junyuan Deng,Jiarui Yang,Zengyuan Lai,Qi Wu,Ling Pei
关键词-EN: rapid development, headphones equipped, wearable technology, traditional pedestrian dead, pedestrian dead reckoning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of wearable technology, devices like smartphones, smartwatches, and headphones equipped with IMUs have become essential for applications such as pedestrian positioning. However, traditional pedestrian dead reckoning (PDR) methods struggle with diverse motion patterns, while recent data-driven approaches, though improving accuracy, often lack robustness due to reliance on a single device. In our work, we attempt to enhance the positioning performance using the low-cost commodity IMUs embedded in the wearable devices. We propose a multi-device deep learning framework named Suite-IN, aggregating motion data from Apple Suite for inertial navigation. Motion data captured by sensors on different body parts contains both local and global motion information, making it essential to reduce the negative effects of localized movements and extract global motion representations from multiple devices.

[LG-10] Dual-Criterion Model Aggregation in Federated Learning: Balancing Data Quantity and Quality

链接: https://arxiv.org/abs/2411.07816
作者: Haizhou Zhang,Xianjia Yu,Tomi Westerlund
关键词-EN: privacy-preserving collaborative learning, Federated learning, key methods, methods for privacy-preserving, privacy-preserving collaborative
类目: Machine Learning (cs.LG)
*备注: 6 pages

点击查看摘要

Abstract:Federated learning (FL) has become one of the key methods for privacy-preserving collaborative learning, as it enables the transfer of models without requiring local data exchange. Within the FL framework, an aggregation algorithm is recognized as one of the most crucial components for ensuring the efficacy and security of the system. Existing average aggregation algorithms typically assume that all client-trained data holds equal value or that weights are based solely on the quantity of data contributed by each client. In contrast, alternative approaches involve training the model locally after aggregation to enhance adaptability. However, these approaches fundamentally ignore the inherent heterogeneity between different clients’ data and the complexity of variations in data at the aggregation stage, which may lead to a suboptimal global model. To address these issues, this study proposes a novel dual-criterion weighted aggregation algorithm involving the quantity and quality of data from the client node. Specifically, we quantify the data used for training and perform multiple rounds of local model inference accuracy evaluation on a specialized dataset to assess the data quality of each client. These two factors are utilized as weights within the aggregation process, applied through a dynamically weighted summation of these two factors. This approach allows the algorithm to adaptively adjust the weights, ensuring that every client can contribute to the global model, regardless of their data’s size or initial quality. Our experiments show that the proposed algorithm outperforms several existing state-of-the-art aggregation approaches on both a general-purpose open-source dataset, CIFAR-10, and a dataset specific to visual obstacle avoidance.
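A minimal sketch of dual-criterion weighted averaging over client state dicts; the fixed linear mixing with `alpha` stands in for the paper's dynamic weighting and is an assumption:

```python
import torch

def dual_criterion_aggregate(states, quantities, qualities, alpha=0.5):
    """Weighted average of client state dicts (float tensors). Each
    client's weight mixes its normalized data quantity with its
    measured data quality; `alpha` and the normalization are
    illustrative choices, not the paper's exact scheme."""
    qn = torch.tensor(quantities, dtype=torch.float32)
    ql = torch.tensor(qualities, dtype=torch.float32)
    w = alpha * qn / qn.sum() + (1 - alpha) * ql / ql.sum()
    return {k: sum(w[i] * states[i][k] for i in range(len(states)))
            for k in states[0]}
```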

[LG-11] Federated Low-Rank Adaptation with Differential Privacy over Wireless Networks

链接: https://arxiv.org/abs/2411.07806
作者: Tianqu Kang,Zixin Wang,Hengtao He,Jun Zhang,Shenghui Song,Khaled B. Letaief
关键词-EN: large pre-trained foundation, Fine-tuning large pre-trained, pre-trained foundation models, presents considerable computational, devices presents considerable
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Signal Processing (eess.SP)
*备注: 6 pages, 3 figures, submitted to IEEE ICC 2025

点击查看摘要

Abstract:Fine-tuning large pre-trained foundation models (FMs) on distributed edge devices presents considerable computational and privacy challenges. Federated fine-tuning (FedFT) mitigates some privacy issues by facilitating collaborative model training without the need to share raw data. To lessen the computational burden on resource-limited devices, combining low-rank adaptation (LoRA) with federated learning enables parameter-efficient fine-tuning. Additionally, the split FedFT architecture partitions an FM between edge devices and a central server, reducing the necessity for complete model deployment on individual devices. However, the risk of privacy eavesdropping attacks in FedFT remains a concern, particularly in sensitive areas such as healthcare and finance. In this paper, we propose a split FedFT framework with differential privacy (DP) over wireless networks, where the inherent wireless channel noise in the uplink transmission is utilized to achieve DP guarantees without adding an extra artificial noise. We shall investigate the impact of the wireless noise on convergence performance of the proposed framework. We will also show that by updating only one of the low-rank matrices in the split FedFT with DP, the proposed method can mitigate the noise amplification effect. Simulation results will demonstrate that the proposed framework achieves higher accuracy under strict privacy budgets compared to baseline methods.

[LG-12] Kernel-based retrieval models for hyperspectral image data optimized with Kernel Flows

链接: https://arxiv.org/abs/2411.07800
作者: Zina-Sabrina Duma,Tuomas Sihvonen,Jouni Susiluoto,Otto Lamminpää,Heikki Haario,Satu-Pia Reinikainen
关键词-EN: performance depends heavily, Kernel-based statistical methods, Principal Component Regression, performance depends, depends heavily
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Kernel-based statistical methods are efficient, but their performance depends heavily on the selection of kernel parameters. In the literature, studies on optimizing kernel-based chemometric methods are limited and often reduced to grid search. Previously, the authors introduced Kernel Flows (KF) to learn kernel parameters for Kernel Partial Least-Squares (K-PLS) regression. KF is easy to implement and helps minimize overfitting. In cases of high collinearity between spectra and biogeophysical quantities in spectroscopy, simpler methods like Principal Component Regression (PCR) may be more suitable. In this study, we propose a new KF-type approach to optimize Kernel Principal Component Regression (K-PCR) and test it alongside KF-PLS. Both methods are benchmarked against non-linear regression techniques using two hyperspectral remote sensing datasets.
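For orientation, a bare-bones kernel PCR fit is sketched below; `gamma` is the kernel parameter that a Kernel-Flows-style objective would tune, and the component count is illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_pcr_fit(X, y, gamma=0.1, n_components=10):
    """Kernel PCR sketch: double-center the kernel matrix, then regress
    y on its leading principal component scores. Predicting new points
    would require reusing the same centering and projection."""
    K = rbf_kernel(X, X, gamma=gamma)
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                            # double centering
    vals, vecs = np.linalg.eigh(Kc)           # eigenvalues in ascending order
    V = vecs[:, -n_components:]               # top principal directions
    scores = V * np.sqrt(np.maximum(vals[-n_components:], 1e-12))
    beta, *_ = np.linalg.lstsq(scores, y, rcond=None)
    return beta, V
```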

[LG-13] Spatially Regularized Graph Attention Autoencoder Framework for Detecting Rainfall Extremes

链接: https://arxiv.org/abs/2411.07753
作者: Mihir Agarwal,Progyan Das,Udit Bhatia
关键词-EN: Graph Attention Autoencoder, Attention Autoencoder, Graph Attention Network, Graph Attention, Indian Meteorological Department
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel Graph Attention Autoencoder (GAE) with spatial regularization to address the challenge of scalable anomaly detection in spatiotemporal rainfall data across India from 1990 to 2015. Our model leverages a Graph Attention Network (GAT) to capture spatial dependencies and temporal dynamics in the data, further enhanced by a spatial regularization term ensuring geographic coherence. We construct two graph datasets employing rainfall, pressure, and temperature attributes from the Indian Meteorological Department and ERA5 Reanalysis on Single Levels, respectively. Our network operates on graph representations of the data, where nodes represent geographic locations, and edges, inferred through event synchronization, denote significant co-occurrences of rainfall events. Through extensive experiments, we demonstrate that our GAE effectively identifies anomalous rainfall patterns across the Indian landscape. Our work paves the way for sophisticated spatiotemporal anomaly detection methodologies in climate science, contributing to better climate change preparedness and response strategies.

[LG-14] Exploring the loss landscape of regularized neural networks via convex duality

链接: https://arxiv.org/abs/2411.07729
作者: Sungyoon Kim,Aaron Mishkin,Mert Pilanci
关键词-EN: arbitrary global optimum, equivalent convex problem, regularized neural networks, neural networks, optimal solutions
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We discuss several aspects of the loss landscape of regularized neural networks: the structure of stationary points, connectivity of optimal solutions, path with nonincreasing loss to arbitrary global optimum, and the nonuniqueness of optimal solutions, by casting the problem into an equivalent convex problem and considering its dual. Starting from two-layer neural networks with scalar output, we first characterize the solution set of the convex problem using its dual and further characterize all stationary points. With the characterization, we show that the topology of the global optima goes through a phase transition as the width of the network changes, and construct counterexamples where the problem may have a continuum of optimal solutions. Finally, we show that the solution set characterization and connectivity results can be extended to different architectures, including two-layer vector-valued neural networks and parallel three-layer neural networks.

[LG-15] Convergence Rate Analysis of LION

链接: https://arxiv.org/abs/2411.07724
作者: Yiming Dong,Huan Li,Zhouchen Lin
关键词-EN: neural network training, evoLved sIgn mOmeNtum, deep neural network, large scale networks, simple sign update
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The LION (evoLved sIgn mOmeNtum) optimizer for deep neural network training was found by Google via program search; its simple sign update nevertheless shows impressive performance in training large-scale networks. Although previous studies have investigated its convergence properties, a comprehensive analysis, especially of the convergence rate, is still desirable. Recognizing that LION can be regarded as solving a specific constrained problem, this paper focuses on demonstrating its convergence to the Karush-Kuhn-Tucker (KKT) point at the rate of $\mathcal{O}(\sqrt{d}\,K^{-1/4})$ measured by the gradient $\ell_1$ norm, where $d$ is the problem dimension and $K$ is the number of iteration steps. Going a step further, we remove the constraint and establish that LION converges to the critical point of the general unconstrained problem at the same rate. This rate not only delivers the currently optimal dependence on the problem dimension $d$ but also tightly matches the theoretical lower bound for nonconvex stochastic optimization algorithms, which is typically measured using the gradient $\ell_2$ norm, with respect to the number of iterations $K$. Through extensive experiments, we not only demonstrate that LION achieves lower loss and higher performance compared to standard SGD, but also empirically confirm that the gradient $\ell_1/\ell_2$ norm ratio aligns with $\Theta(\sqrt{d})$, thus proving that our convergence rate matches the theoretical lower bound with respect to $d$ in the empirical sense.
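LION 的符号动量更新规则是公开的,下面用 NumPy 给出单步更新的示意实现,便于对照上文关于梯度 $\ell_1$ 范数收敛速率的讨论;各超参数默认值仅为常见取值,属本文假设。

```python
import numpy as np

def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """LION 单步更新:先用插值动量的符号确定方向,再做解耦权重衰减。"""
    update = np.sign(beta1 * m + (1 - beta1) * g)  # 符号更新,步长与梯度幅值无关
    w = w - lr * (update + weight_decay * w)
    m = beta2 * m + (1 - beta2) * g                # 动量以第二组系数更新
    return w, m
```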

[LG-16] OWLed: Outlier-weighed Layerwise Pruning for Efficient Autonomous Driving Framework

链接: https://arxiv.org/abs/2411.07711
作者: Jiaxi Li,Lu Yin,Xilu Wang
关键词-EN: offers promising enhancements, Large Language Models, integration of Large, systems offers promising, autonomous driving
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) into autonomous driving systems offers promising enhancements in environmental understanding and decision-making. However, the substantial computational demands of deploying LLMs locally on vehicles render this approach unfeasible for real-world automotive applications. To address this challenge, we introduce OWLed, the Outlier-Weighed Layerwise Pruning for Efficient Autonomous Driving Framework that leverages outlier-weighted layerwise sparsity for model compression. Our method assigns non-uniform sparsity ratios to different layers based on the distribution of outlier features, significantly reducing the model size without the need for fine-tuning. To ensure the compressed model adapts well to autonomous driving tasks, we incorporate driving environment data into both the calibration and pruning processes. Our empirical studies reveal that the encoder component is more sensitive to pruning than the LLM, highlighting its critical role in the system. Experimental results demonstrate that OWLed outperforms existing methods in perception, action prediction, and language understanding while substantially lowering computational requirements. These findings underscore the potential of combining advanced pruning techniques with LLMs to develop efficient and robust autonomous driving systems capable of handling complex scenarios. Code will be made publicly available.
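下面是一个假设性的逐层稀疏率分配草图,体现摘要中“按离群特征分布非均匀分配各层稀疏率”的思想:离群激活占比越高的层剪得越少。离群判定倍数 m 与摆动幅度 lam 均为本文假设的超参数,并非原论文设定。

```python
import numpy as np

def layerwise_sparsity_from_outliers(layer_acts, target_sparsity=0.7, m=5.0, lam=0.08):
    """layer_acts: 每层一份激活样本(np.ndarray)。
    离群比例越高的层分配越低的稀疏率(剪得越少)。"""
    ratios = np.array([np.mean(np.abs(a) > m * np.abs(a).mean()) for a in layer_acts])
    r = (ratios - ratios.min()) / (ratios.max() - ratios.min() + 1e-12)
    sparsity = target_sparsity + lam * (1.0 - 2.0 * r)  # 围绕目标稀疏率上下摆动
    return np.clip(sparsity, 0.0, 1.0)
```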

[LG-17] Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning

链接: https://arxiv.org/abs/2411.07700
作者: Stefan Pranger,Hana Chockler,Martin Tappler,Bettina Könighofer
关键词-EN: Deep Reinforcement Learning, Reinforcement Learning, Deep Reinforcement, trained policy vary, state space
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many Deep Reinforcement Learning (RL) problems, decisions in a trained policy vary in significance for the expected safety and performance of the policy. Since RL policies are very complex, testing efforts should concentrate on states in which the agent’s decisions have the highest impact on the expected outcome. In this paper, we propose a novel model-based method to rigorously compute a ranking of state importance across the entire state space. We then focus our testing efforts on the highest-ranked states. In this paper, we focus on testing for safety. However, the proposed methods can be easily adapted to test for performance. In each iteration, our testing framework computes optimistic and pessimistic safety estimates. These estimates provide lower and upper bounds on the expected outcomes of the policy execution across all modeled states in the state space. Our approach divides the state space into safe and unsafe regions upon convergence, providing clear insights into the policy’s weaknesses. Two important properties characterize our approach. (1) Optimal Test-Case Selection: At any time in the testing process, our approach evaluates the policy in the states that are most critical for safety. (2) Guaranteed Safety: Our approach can provide formal verification guarantees over the entire state space by sampling only a fraction of the policy. Any safety properties assured by the pessimistic estimate are formally proven to hold for the policy. We provide a detailed evaluation of our framework on several examples, showing that our method discovers unsafe policy behavior with low testing effort.

[LG-18] What Do Learning Dynamics Reveal About Generalization in LLM Reasoning?

链接: https://arxiv.org/abs/2411.07681
作者: Katie Kang,Amrith Setlur,Dibya Ghosh,Jacob Steinhardt,Claire Tomlin,Sergey Levine,Aviral Kumar
关键词-EN: abilities remain elusive, modern large language, problem-solving abilities remain, large language models, remain elusive
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the remarkable capabilities of modern large language models (LLMs), the mechanisms behind their problem-solving abilities remain elusive. In this work, we aim to better understand how the learning dynamics of LLM finetuning shapes downstream generalization. Our analysis focuses on reasoning tasks, whose problem structure allows us to distinguish between memorization (the exact replication of reasoning steps from the training data) and performance (the correctness of the final solution). We find that a model’s generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy: the accuracy of model samples on training queries before they begin to copy the exact reasoning steps from the training set. On the dataset level, this metric is able to reliably predict test accuracy, achieving $R^2$ of around or exceeding 0.9 across various models (Llama3 8B, Gemma2 9B), datasets (GSM8k, MATH), and training configurations. On a per-example level, this metric is also indicative of whether individual model predictions are robust to perturbations in the training query. By connecting a model’s learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies. We focus on data curation as an example, and show that prioritizing examples with low pre-memorization accuracy leads to 1.5-2x improvements in data efficiency compared to i.i.d. data scaling, and outperforms other standard data curation techniques.
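作为参考,下面用几行 Python 勾勒“预记忆训练准确率”的一种简化计算:只统计尚未逐字复制训练集推理步骤的样本的正确率。用精确字符串匹配判定“复制”是本文的假设性简化,并非原文的判定方式。

```python
def pre_memorization_accuracy(samples, train_solutions, is_correct):
    """samples: 模型在各训练查询上的采样输出;train_solutions: 对应的训练集推理步骤;
    is_correct(sample) -> bool 判断最终答案是否正确。
    仅统计尚未逐字复制训练推理的样本(精确匹配近似"复制",属假设)。"""
    kept = [s for s, sol in zip(samples, train_solutions) if s.strip() != sol.strip()]
    if not kept:
        return 0.0
    return sum(is_correct(s) for s in kept) / len(kept)
```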

[LG-19] Safe Exploitative Play with Untrusted Type Beliefs NEURIPS2024

链接: https://arxiv.org/abs/2411.07679
作者: Tongxin Li,Tinashe Handina,Shaolei Ren,Adam Wierman
关键词-EN: unknown behaviors, rich history, controlling a single, system composed, composed of multiple
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: 26 pages, NeurIPS 2024

点击查看摘要

Abstract:The combination of the Bayesian game and learning has a rich history, with the idea of controlling a single agent in a system composed of multiple agents with unknown behaviors given a set of types, each specifying a possible behavior for the other agents. The idea is to plan an agent’s own actions with respect to those types which it believes are most likely to maximize the payoff. However, the type beliefs are often learned from past actions and likely to be incorrect. With this perspective in mind, we consider an agent in a game with type predictions of other components, and investigate the impact of incorrect beliefs on the agent’s payoff. In particular, we formally define a tradeoff between risk and opportunity by comparing the payoff obtained against the optimal payoff, which is represented by a gap caused by trusting or distrusting the learned beliefs. Our main results characterize the tradeoff by establishing upper and lower bounds on the Pareto front for both normal-form and stochastic Bayesian games, with numerical results provided.

[LG-20] Rethinking Structure Learning For Graph Neural Networks

链接: https://arxiv.org/abs/2411.07672
作者: Yilun Zheng,Zhuofan Zhang,Ziming Wang,Xiang Li,Sitao Luan,Xiaojiang Peng,Lihui Chen
关键词-EN: Graph Neural Networks, Neural Networks, Graph Structure Learning, effectively addressing issues, improve GNN performance
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To improve the performance of Graph Neural Networks (GNNs), Graph Structure Learning (GSL) has been extensively applied to reconstruct or refine original graph structures, effectively addressing issues like heterophily, over-squashing, and noisy structures. While GSL is generally thought to improve GNN performance, it often leads to longer training times and more hyperparameter tuning. Besides, the distinctions among current GSL methods remain ambiguous from the perspective of GNN training, and there is a lack of theoretical analysis to quantify their effectiveness. Recent studies further suggest that, under fair comparisons with the same hyperparameter tuning, GSL does not consistently outperform baseline GNNs. This motivates us to ask a critical question: is GSL really useful for GNNs? To address this question, this paper makes two key contributions. First, we propose a new GSL framework, which includes three steps: GSL base (the representation used for GSL) construction, new structure construction, and view fusion, to better understand the effectiveness of GSL in GNNs. Second, after graph convolution, we analyze the differences in mutual information (MI) between node representations derived from the original topology and those from the newly constructed topology. Surprisingly, our empirical observations and theoretical analysis show that no matter which type of graph structure construction methods are used, after feeding the same GSL bases to the newly constructed graph, there is no MI gain compared to the original GSL bases. To fairly reassess the effectiveness of GSL, we conduct ablation experiments and find that it is the pretrained GSL bases that enhance GNN performance, and in most cases, GSL cannot improve GNN performance. This finding encourages us to rethink the essential components in GNNs, such as self-training and structural encoding, in GNN design rather than GSL.

[LG-21] Is Graph Convolution Always Beneficial For Every Feature?

链接: https://arxiv.org/abs/2411.07663
作者: Yilun Zheng,Xiang Li,Sitao Luan,Xiaojiang Peng,Lihui Chen
关键词-EN: Graph Neural Networks, Neural Networks, processing structured data, demonstrated strong capabilities, Graph Neural
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated strong capabilities in processing structured data. While traditional GNNs typically treat each feature dimension equally during graph convolution, we raise an important question: Is the graph convolution operation equally beneficial for each feature? If not, the convolution operation on certain feature dimensions can possibly lead to harmful effects, even worse than the convolution-free models. In prior studies, to assess the impacts of graph convolution on features, people proposed metrics based on feature homophily to measure feature consistency with the graph topology. However, these metrics have shown unsatisfactory alignment with GNN performance and have not been effectively employed to guide feature selection in GNNs. To address these limitations, we introduce a novel metric, Topological Feature Informativeness (TFI), to distinguish between GNN-favored and GNN-disfavored features, where its effectiveness is validated through both theoretical analysis and empirical observations. Based on TFI, we propose a simple yet effective Graph Feature Selection (GFS) method, which processes GNN-favored and GNN-disfavored features separately, using GNNs and non-GNN models. Compared to original GNNs, GFS significantly improves the extraction of useful topological information from each feature with comparable computational costs. Extensive experiments show that after applying GFS to 8 baseline and state-of-the-art (SOTA) GNN architectures across 10 datasets, 83.75% of the GFS-augmented cases show significant performance boosts. Furthermore, our proposed TFI metric outperforms other feature selection methods. These results validate the effectiveness of both GFS and TFI. Additionally, we demonstrate that GFS’s improvements are robust to hyperparameter tuning, highlighting its potential as a universal method for enhancing various GNN architectures.

[LG-22] Top-nσ: Not All Logits Are You Need

链接: https://arxiv.org/abs/2411.07641
作者: Chenxia Tang,Jianchun Liu,Hongli Xu,Liusheng Huang
关键词-EN: Large language models, Large language, typically employ greedy, language models, typically employ
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) typically employ greedy decoding or low-temperature sampling for reasoning tasks, reflecting a perceived trade-off between diversity and accuracy. We challenge this convention by introducing top-$n\sigma$, a novel sampling method that operates directly on pre-softmax logits by leveraging a statistical threshold. Our key insight is that logits naturally separate into a Gaussian-distributed noisy region and a distinct informative region, enabling efficient token filtering without complex probability manipulations. Unlike existing methods (e.g., top-$p$, min-$p$) that inadvertently include more noise tokens at higher temperatures, top-$n\sigma$ maintains a stable sampling space regardless of temperature scaling. We also provide a theoretical analysis of top-$n\sigma$ to better understand its behavior. The extensive experimental results across four reasoning-focused datasets demonstrate that our method not only outperforms existing sampling approaches but also surpasses greedy decoding, while maintaining consistent performance even at high temperatures.
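按摘要的描述,top-$n\sigma$ 直接在 pre-softmax logits 上施加统计阈值:仅保留不低于最大 logit 减去 $n$ 倍标准差的 token。下面的 NumPy 草图照此思路实现,温度处理的先后顺序等细节为本文假设。

```python
import numpy as np

def top_nsigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    """仅保留 logit >= max - n * std 的 token,再在保留集合上做温度采样。"""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    mask = logits >= logits.max() - n * logits.std()   # 统计阈值过滤
    scaled = np.where(mask, logits / temperature, -np.inf)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)
```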

[LG-23] Decision Feedback In-Context Symbol Detection over Block-Fading Channels

链接: https://arxiv.org/abs/2411.07600
作者: Li Fan,Jing Yang,Cong Shen
关键词-EN: Pre-trained Transformers, demonstrated exceptional capabilities, in-context learning, model update, demonstrated exceptional
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Pre-trained Transformers, through in-context learning (ICL), have demonstrated exceptional capabilities to adapt to new tasks using example prompts without model update. Transformer-based wireless receivers, where prompts consist of the pilot data in the form of transmitted and received signal pairs, have shown high estimation accuracy when pilot data are abundant. However, pilot information is often costly and limited in practice. In this work, we propose the DEcision Feedback IN-contExt Detection (DEFINED) solution as a new wireless receiver design, which bypasses channel estimation and directly performs symbol detection using the (sometimes extremely) limited pilot data. The key innovation in DEFINED is the proposed decision feedback mechanism in ICL, where we sequentially incorporate the detected symbols into the prompts to improve the detections for subsequent symbols. Extensive experiments across a broad range of wireless communication settings demonstrate that DEFINED achieves significant performance improvements, in some cases only needing a single pilot pair.

[LG-24] Overcoming the Curse of Dimensionality in Reinforcement Learning Through Approximate Factorization

链接: https://arxiv.org/abs/2411.07591
作者: Chenbei Lu,Laixi Shi,Zaiwei Chen,Chenye Wu,Adam Wierman
关键词-EN: Reinforcement Learning, curse of dimensionality, high sample complexity, fact that large-scale, exponentially high sample
类目: Machine Learning (cs.LG)
*备注: 61 pages, 10 figures

点击查看摘要

Abstract:Reinforcement Learning (RL) algorithms are known to suffer from the curse of dimensionality, which refers to the fact that large-scale problems often lead to exponentially high sample complexity. A common solution is to use deep neural networks for function approximation; however, such approaches typically lack theoretical guarantees. To provably address the curse of dimensionality, we observe that many real-world problems exhibit task-specific model structures that, when properly leveraged, can improve the sample efficiency of RL. Building on this insight, we propose overcoming the curse of dimensionality by approximately factorizing the original Markov decision processes (MDPs) into smaller, independently evolving MDPs. This factorization enables the development of sample-efficient RL algorithms in both model-based and model-free settings, with the latter involving a variant of variance-reduced Q-learning. We provide improved sample complexity guarantees for both proposed algorithms. Notably, by leveraging model structure through the approximate factorization of the MDP, the dependence of sample complexity on the size of the state-action space can be exponentially reduced. Numerically, we demonstrate the practicality of our proposed methods through experiments on both synthetic MDP tasks and a wind farm-equipped storage control problem.

[LG-25] Unraveling the Gradient Descent Dynamics of Transformers

链接: https://arxiv.org/abs/2411.07538
作者: Bingqing Song,Boran Han,Shuai Zhang,Jie Ding,Mingyi Hong
关键词-EN: achieved remarkable success, theoretical foundation explaining, fully developed, Transformer, achieved remarkable
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:While the Transformer architecture has achieved remarkable success across various domains, a thorough theoretical foundation explaining its optimization dynamics is yet to be fully developed. In this study, we aim to bridge this understanding gap by answering the following two core questions: (1) Which types of Transformer architectures allow Gradient Descent (GD) to achieve guaranteed convergence? and (2) Under what initial conditions and architectural specifics does the Transformer achieve rapid convergence during training? By analyzing the loss landscape of a single Transformer layer using Softmax and Gaussian attention kernels, our work provides concrete answers to these questions. Our findings demonstrate that, with appropriate weight initialization, GD can train a Transformer model (with either kernel type) to achieve a global optimal solution, especially when the input embedding dimension is large. Nonetheless, certain scenarios highlight potential pitfalls: training a Transformer using the Softmax attention kernel may sometimes lead to suboptimal local solutions. In contrast, the Gaussian attention kernel exhibits much more favorable behavior. Our empirical study further validates the theoretical findings.

[LG-26] Accident Impact Prediction based on a deep convolutional and recurrent neural network model

链接: https://arxiv.org/abs/2411.07537
作者: Pouyan Sajadi,Mahya Qorbani,Sobhan Moosavi,Erfan Hassannayebi
关键词-EN: substantial economic burden, resulting in numerous, numerous fatalities, burden each year, threat to public
类目: Machine Learning (cs.LG)
*备注: 28 pages, 18 figures

点击查看摘要

Abstract:Traffic accidents pose a significant threat to public safety, resulting in numerous fatalities, injuries, and a substantial economic burden each year. The development of predictive models capable of real-time forecasting of post-accident impact using readily available data can play a crucial role in preventing adverse outcomes and enhancing overall safety. However, existing accident predictive models encounter two main challenges: first, reliance on either costly or non-real-time data, and second the absence of a comprehensive metric to measure post-accident impact accurately. To address these limitations, this study proposes a deep neural network model known as the cascade model. It leverages readily available real-world data from Los Angeles County to predict post-accident impacts. The model consists of two components: Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN). The LSTM model captures temporal patterns, while the CNN extracts patterns from the sparse accident dataset. Furthermore, an external traffic congestion dataset is incorporated to derive a new feature called the “accident impact” factor, which quantifies the influence of an accident on surrounding traffic flow. Extensive experiments were conducted to demonstrate the effectiveness of the proposed hybrid machine learning method in predicting the post-accident impact compared to state-of-the-art baselines. The results reveal a higher precision in predicting minimal impacts (i.e., cases with no reported accidents) and a higher recall in predicting more significant impacts (i.e., cases with reported accidents).
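下面用 PyTorch 勾勒摘要所述的级联结构:LSTM 分支处理时间序列特征,CNN 分支提取稀疏事故栅格数据的模式,两路特征拼接后输出影响等级。具体层数与维度均为本文假设,并非原论文配置。

```python
import torch
import torch.nn as nn

class CascadeModel(nn.Module):
    """LSTM 捕捉时间模式 + CNN 提取稀疏事故数据模式的级联结构草图。"""
    def __init__(self, seq_feat=8, grid_ch=1, hidden=64, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(seq_feat, hidden, batch_first=True)
        self.cnn = nn.Sequential(
            nn.Conv2d(grid_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.head = nn.Linear(hidden + 16 * 4 * 4, n_classes)

    def forward(self, seq, grid):
        _, (h, _) = self.lstm(seq)                      # seq: (B, T, seq_feat)
        s = self.cnn(grid)                              # grid: (B, grid_ch, H, W)
        return self.head(torch.cat([h[-1], s], dim=1))
```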

[LG-27] Effective Virtual Reality Teleoperation of an Upper-body Humanoid with Modified Task Jacobians and Relaxed Barrier Functions for Self-Collision Avoidance IROS2022

链接: https://arxiv.org/abs/2411.07534
作者: Steven Jens Jorgensen,Ravi Bhadeshiya
关键词-EN: Virtual Reality, humanoid while ensuring, effectively teleoperate, teleoperate an upper-body, upper-body humanoid
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: XR Robotics Workshop, IROS 2022

点击查看摘要

Abstract:We present an approach for retargeting off-the-shelf Virtual Reality (VR) trackers to effectively teleoperate an upper-body humanoid while ensuring self-collision-free motions. Key to the effectiveness was the proper assignment of trackers to joint sets via modified task Jacobians and relaxed barrier functions for self-collision avoidance. The approach was validated on Apptronik’s Astro hardware by demonstrating manipulation capabilities in a table-top environment with pick-and-place box packing and a two-handed box pick-up and handover task.

[LG-28] Collaborative and Federated Black-box Optimization: A Bayesian Optimization Perspective

链接: https://arxiv.org/abs/2411.07523
作者: Raed Al Kontar
关键词-EN: heterogeneous black-box functions, collaborative sequential experimentation, federated black-box optimization, Bayesian optimization perspective, heterogeneous black-box
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We focus on collaborative and federated black-box optimization (BBOpt), where agents optimize their heterogeneous black-box functions through collaborative sequential experimentation. From a Bayesian optimization perspective, we address the fundamental challenges of distributed experimentation, heterogeneity, and privacy within BBOpt, and propose three unifying frameworks to tackle these issues: (i) a global framework where experiments are centrally coordinated, (ii) a local framework that allows agents to make decisions based on minimal shared information, and (iii) a predictive framework that enhances local surrogates through collaboration to improve decision-making. We categorize existing methods within these frameworks and highlight key open questions to unlock the full potential of federated BBOpt. Our overarching goal is to shift federated learning from its predominantly descriptive/predictive paradigm to a prescriptive one, particularly in the context of BBOpt - an inherently sequential decision-making problem.

[LG-29] Bayesian Deep Learning Approach for Real-time Lane-based Arrival Curve Reconstruction at Intersection using License Plate Recognition Data

链接: https://arxiv.org/abs/2411.07515
作者: Yang He,Chengchuan An,Jiawei Lu,Yao-Jan Wu,Zhenbo Lu,Jingxin Xia
关键词-EN: traffic control systems, proactive traffic control, connected vehicle environments, traffic arrival information, accurate traffic arrival
类目: Machine Learning (cs.LG)
*备注: accepted by T-ITS

点击查看摘要

Abstract:The acquisition of real-time and accurate traffic arrival information is of vital importance for proactive traffic control systems, especially in partially connected vehicle environments. License plate recognition (LPR) data that record both vehicle departures and identities are proven to be desirable in reconstructing lane-based arrival curves in previous works. Existing LPR data-based methods are predominantly designed for reconstructing historical arrival curves. For real-time reconstruction of multi-lane urban roads, it is pivotal to determine the lane choice of real-time link-based arrivals, which has not been exploited in previous studies. In this study, we propose a Bayesian deep learning approach for real-time lane-based arrival curve reconstruction, in which the lane choice patterns and uncertainties of link-based arrivals are both characterized. Specifically, the learning process is designed to effectively capture the relationship between partially observed link-based arrivals and lane-based arrivals, which can be physically interpreted as lane choice proportion. Moreover, the lane choice uncertainties are characterized using Bayesian parameter inference techniques, minimizing arrival curve reconstruction uncertainties, especially in low LPR data matching rate conditions. Real-world experiment results conducted in multiple matching rate scenarios demonstrate the superiority and necessity of lane choice modeling in reconstructing arrival curves.

[LG-30] Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

链接: https://arxiv.org/abs/2411.07514
作者: Ruiquan Huang,Yingbin Liang,Jing Yang
关键词-EN: Distributionally robust offline, uncertainty set, Distributionally robust, offline reinforcement learning, worst environment
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Distributionally robust offline reinforcement learning (RL) aims to find a policy that performs the best under the worst environment within an uncertainty set using an offline dataset collected from a nominal model. While recent advances in robust RL focus on Markov decision processes (MDPs), robust non-Markovian RL is limited to planning problems where the transitions in the uncertainty set are known. In this paper, we study the learning problem of robust offline non-Markovian RL. Specifically, when the nominal model admits a low-rank structure, we propose a new algorithm, featuring a novel dataset distillation and a lower confidence bound (LCB) design for robust values under different types of the uncertainty set. We also derive new dual forms for these robust values in non-Markovian RL, making our algorithm more amenable to practical implementation. By further introducing a novel type-I concentrability coefficient tailored for offline low-rank non-Markovian decision processes, we prove that our algorithm can find an $\epsilon$-optimal robust policy using $O(1/\epsilon^2)$ offline samples. Moreover, we extend our algorithm to the case when the nominal model does not have a specific structure. With a new type-II concentrability coefficient, the extended algorithm also enjoys polynomial sample efficiency under all different types of the uncertainty set.

[LG-31] AdaSS: a One-Shot Supernet Approach for Automatic Embedding Size Search in Deep Recommender System

链接: https://arxiv.org/abs/2411.07504
作者: He Wei,Yuekui Yang,Yang Zhang,Haiyang Wu,Meixi Liu,Shaoping Ma
关键词-EN: Deep Learning Recommendation, Deep Learning, embedding sizes, AES, embedding
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Learning Recommendation Models (DLRMs) utilize the embedding layer to represent various categorical features. Traditional DLRMs adopt a unified embedding size for all features, leading to suboptimal performance and redundant parameters. Thus, many Automatic Embedding size Search (AES) works focus on obtaining mixed embedding sizes with strong model performance. However, previous AES works can hardly address several challenges together: (1) The search results of embedding sizes are unstable; (2) Recommendation effectiveness with AES results is unsatisfactory; (3) Memory cost of embeddings is uncontrollable. To address these challenges, we propose a novel one-shot AES framework called AdaSS, in which a supernet encompassing various candidate embeddings is built and AES is performed as searching network architectures within it. Our framework contains two main stages: In the first stage, we decouple training parameters from searching embedding sizes, and propose the Adaptive Sampling method to yield a well-trained supernet, which further helps to produce stable AES results. In the second stage, to obtain embedding sizes that benefit the model effect, we design a reinforcement learning search process which utilizes the supernet trained previously. Meanwhile, to adapt searching to specific resource constraints, we introduce the resource competition penalty to balance the model effectiveness and memory cost of embeddings. We conduct extensive experiments on public datasets to show the superiority of AdaSS. Our method could improve AUC by about 0.3% while saving about 20% of model parameters. Empirical analysis also shows that the stability of search results in AdaSS significantly exceeds that of other methods.

[LG-32] Privacy-Preserving Verifiable Neural Network Inference Service ACSA

链接: https://arxiv.org/abs/2411.07468
作者: Arman Riasi,Jorge Guajardo,Thang Hoang
关键词-EN: Machine learning, revolutionized data analysis, pattern recognition, limited accessibility, analysis and pattern
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at the Annual Computer Security Applications Conference (ACSAC) 2024. Source code: https://github.com/vt-asaplab/vPIN

点击查看摘要

Abstract:Machine learning has revolutionized data analysis and pattern recognition, but its resource-intensive training has limited accessibility. Machine Learning as a Service (MLaaS) simplifies this by enabling users to delegate their data samples to an MLaaS provider and obtain the inference result using a pre-trained model. Despite its convenience, leveraging MLaaS poses significant privacy and reliability concerns to the client. Specifically, sensitive information from the client inquiry data can be leaked to an adversarial MLaaS provider. Meanwhile, the lack of a verifiability guarantee can potentially result in biased inference results or even unfair payment issues. While existing trustworthy machine learning techniques, such as those relying on verifiable computation or secure computation, offer solutions to privacy and reliability concerns, they fall short of simultaneously protecting the privacy of client data and providing provable inference verifiability. In this paper, we propose vPIN, a privacy-preserving and verifiable CNN inference scheme that preserves privacy for client data samples while ensuring verifiability for the inference. vPIN makes use of partial homomorphic encryption and commit-and-prove succinct non-interactive argument of knowledge techniques to achieve desirable security properties. In vPIN, we develop various optimization techniques to minimize the proving circuit for homomorphic inference evaluation, thereby improving the efficiency and performance of our technique. We fully implemented and evaluated our vPIN scheme on standard datasets (e.g., MNIST, CIFAR-10). Our experimental results show that vPIN achieves high efficiency in terms of proving time, verification time, and proof size, while providing client data privacy guarantees and provable verifiability.

[LG-33] Machines and Mathematical Mutations: Using GNNs to Characterize Quiver Mutation Classes

链接: https://arxiv.org/abs/2411.07467
作者: Jesse He,Helen Jenne,Herman Chau,Davis Brown,Mark Raugas,Sara Billey,Henry Kvinge
关键词-EN: increasingly valuable tool, identify subtle patterns, tool in mathematics, review and analyze, increasingly valuable
类目: Machine Learning (cs.LG); High Energy Physics - Theory (hep-th); Combinatorics (math.CO)
*备注:

点击查看摘要

Abstract:Machine learning is becoming an increasingly valuable tool in mathematics, enabling one to identify subtle patterns across collections of examples so vast that they would be impossible for a single researcher to feasibly review and analyze. In this work, we use graph neural networks to investigate quiver mutation – an operation that transforms one quiver (or directed multigraph) into another – which is central to the theory of cluster algebras with deep connections to geometry, topology, and physics. In the study of cluster algebras, the question of mutation equivalence is of fundamental concern: given two quivers, can one efficiently determine if one quiver can be transformed into the other through a sequence of mutations? Currently, this question has only been resolved in specific cases. In this paper, we use graph neural networks and AI explainability techniques to discover mutation equivalence criteria for the previously unknown case of quivers of type $\tilde{D}_n$. Along the way, we also show that even without explicit training to do so, our model captures structure within its hidden representation that allows us to reconstruct known criteria from type $D_n$, adding to the growing evidence that modern machine learning models are capable of learning abstract and general rules from mathematical data.

[LG-34] Fast unsupervised ground metric learning with tree-Wasserstein distance

链接: https://arxiv.org/abs/2411.07432
作者: Kira M. Düsterwald,Samo Hromadka,Makoto Yamada
关键词-EN: unsupervised ground metric, ground metric learning, ground metric, WSV, unsupervised ground
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The performance of unsupervised methods such as clustering depends on the choice of distance metric between features, or ground metric. Commonly, ground metrics are decided with heuristics or learned via supervised algorithms. However, since many datasets are unlabelled, unsupervised ground metric learning approaches have been introduced. One recent, promising option uses Wasserstein singular vectors (WSV), which emerge when computing optimal transport distances between features and samples simultaneously. While WSV is effective, it has complexity $\mathcal{O}(n^5)$, which is prohibitively expensive in some applications. In this work, we propose to augment the WSV method by embedding samples and features on trees, on which we compute the tree-Wasserstein distance (TWD). We demonstrate theoretically and empirically that the algorithm converges to a better approximation of the full WSV approach than the best known alternatives, and does so with $\mathcal{O}(n^3)$ complexity. In addition, we prove that the initial tree structure can be chosen flexibly, since tree geometry does not constrain the richness of the approximation up to the number of edge weights. This proof suggests a fast, recursive algorithm for computing the tree parameter basis set, which we find crucial to realising the efficiency gains at scale. Finally, we employ the tree-WSV algorithm to several single-cell RNA sequencing genomics datasets, demonstrating its scalability and utility for unsupervised cell-type clustering problems. These results poise unsupervised ground metric learning with TWD as a low-rank approximation of WSV with the potential for widespread low-compute application.
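树上 Wasserstein 距离有一个经典闭式:对每条边,以边权乘以该边下方子树内两分布质量差的绝对值并求和。下面的草图按此式实现,假设节点已按拓扑序编号(父节点编号小于子节点),仅用于说明 TWD 的计算方式,并非原论文的完整算法。

```python
import numpy as np

def tree_wasserstein(mu, nu, parent, edge_w):
    """parent[i] 为节点 i 的父节点(根为 -1),edge_w[i] 为 i 与父节点之间的边权。"""
    mu = np.asarray(mu, dtype=float).copy()
    nu = np.asarray(nu, dtype=float).copy()
    dist = 0.0
    for i in range(len(parent) - 1, 0, -1):  # 自叶向根累积子树质量
        dist += edge_w[i] * abs(mu[i] - nu[i])
        mu[parent[i]] += mu[i]
        nu[parent[i]] += nu[i]
    return dist
```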

[LG-35] Just Label the Repeats for In-The-Wild Audio-to-Score Alignment

链接: https://arxiv.org/abs/2411.07428
作者: Irmak Bukey,Michael Feffer,Chris Donahue
关键词-EN: sheet music scans, high-quality offline alignment, high-quality offline, sheet music, sheet music induced
类目: ound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 25th International Society for Music Information Retrieval Conference, San Francisco, 2024

点击查看摘要

Abstract:We propose an efficient workflow for high-quality offline alignment of in-the-wild performance audio and corresponding sheet music scans (images). Recent work on audio-to-score alignment extends dynamic time warping (DTW) to be theoretically able to handle jumps in sheet music induced by repeat signs; this method requires no human annotations, but we show that it often yields low-quality alignments. As an alternative, we propose a workflow and interface that allows users to quickly annotate jumps (by clicking on repeat signs), requiring a small amount of human supervision but yielding much higher quality alignments on average. Additionally, we refine audio and score feature representations to improve alignment quality by: (1) integrating measure detection into the score feature representation, and (2) using raw onset prediction probabilities from a music transcription model instead of piano roll. We propose an evaluation protocol for audio-to-score alignment that computes the distance between the estimated and ground truth alignment in units of measures. Under this evaluation, we find that our proposed jump annotation workflow and improved feature representations together improve alignment accuracy by 150% relative to prior work (33% to 82%).
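作为背景,下面给出标准 DTW 累计代价的最小实现;它无法处理反复记号引起的跳转,而这正是摘要中通过人工标注来解决的部分。函数名与接口为本文假设。

```python
import numpy as np

def dtw_cost(cost):
    """cost[i, j] 为音频帧 i 与乐谱特征帧 j 的局部距离,返回最优对齐的累计代价。"""
    n, m = cost.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```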

[LG-36] Comparing Targeting Strategies for Maximizing Social Welfare with Limited Resources

链接: https://arxiv.org/abs/2411.07414
作者: Vibhhu Sharma,Bryan Wilder
关键词-EN: receive limited-resource interventions, individuals receive limited-resource, human services, learning is increasingly, receive limited-resource
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Machine learning is increasingly used to select which individuals receive limited-resource interventions in domains such as human services, education, development, and more. However, it is often not apparent what the right quantity is for models to predict. In particular, policymakers rarely have access to data from a randomized controlled trial (RCT) that would enable accurate estimates of treatment effects – which individuals would benefit more from the intervention. Observational data is more likely to be available, creating a substantial risk of bias in treatment effect estimates. Practitioners instead commonly use a technique termed “risk-based targeting” where the model is just used to predict each individual’s status quo outcome (an easier, non-causal task). Those with higher predicted risk are offered treatment. There is currently almost no empirical evidence to inform which choices lead to the most effective machine learning-informed targeting strategies in social domains. In this work, we use data from 5 real-world RCTs in a variety of domains to empirically assess such choices. We find that risk-based targeting is almost always inferior to targeting based on even biased estimates of treatment effects. Moreover, these results hold even when the policymaker has strong normative preferences for assisting higher-risk individuals. Our results imply that, despite the widespread use of risk prediction models in applied settings, practitioners may be better off incorporating even weak evidence about heterogeneous causal effects to inform targeting.

[LG-37] ODEStream: A Buffer-Free Online Learning Framework with ODE-based Adaptor for Streaming Time Series Forecasting

链接: https://arxiv.org/abs/2411.07413
作者: Futoon M.Abushaqra,Hao Xue,Yongli Ren,Flora D.Salim
关键词-EN: Addressing the challenges, real-world predictive modelling, predictive modelling, challenges of irregularity, irregularity and concept
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Addressing the challenges of irregularity and concept drift in streaming time series is crucial in real-world predictive modelling. Previous studies in time series continual learning often propose models that require buffering of long sequences, potentially restricting the responsiveness of the inference system. Moreover, these models are typically designed for regularly sampled data, an unrealistic assumption in real-world scenarios. This paper introduces ODEStream, a novel buffer-free continual learning framework that incorporates a temporal isolation layer that integrates temporal dependencies within the data. Simultaneously, it leverages the capability of neural ordinary differential equations to process irregular sequences and generate a continuous data representation, enabling seamless adaptation to changing dynamics in a data streaming scenario. Our approach focuses on learning how the dynamics and distribution of historical data change with time, facilitating the direct processing of streaming sequences. Evaluations on benchmark real-world datasets demonstrate that ODEStream outperforms the state-of-the-art online learning and streaming analysis baselines, providing accurate predictions over extended periods while minimising performance degradation over time by learning how the sequence dynamics change.

[LG-38] Identifying Differential Patient Care Through Inverse Intent Inference

链接: https://arxiv.org/abs/2411.07372
作者: Hyewon Jeong,Siddharth Nayak,Taylor Killian,Sanjat Kanjilal,Marzyeh Ghassemi
关键词-EN: life-threatening condition defined, end-organ dysfunction due, Surviving Sepsis Campaign, dysregulated host response, sepsis treatment guidelines
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sepsis is a life-threatening condition defined by end-organ dysfunction due to a dysregulated host response to infection. Although the Surviving Sepsis Campaign has launched and has been releasing sepsis treatment guidelines to unify and normalize the care for sepsis patients, it has been reported in numerous studies that disparities in care exist across the trajectory of patient stay in the emergency department and intensive care unit. Here, we apply a number of reinforcement learning techniques including behavioral cloning, imitation learning, and inverse reinforcement learning, to learn the optimal policy in the management of septic patient subgroups using expert demonstrations. Then we estimate the counterfactual optimal policies by applying the model to another subset of unseen medical populations and identify the difference in care by comparing it to the real policy. Our data comes from the sepsis cohort of MIMIC-IV and the clinical data warehouses of the Mass General Brigham healthcare system. The ultimate objective of this work is to use the optimal learned policy function to estimate the counterfactual treatment policy and identify deviations across sub-populations of interest. We hope this approach would help us identify any disparities in care and also changes in care in response to the publication of national sepsis treatment guidelines.

[LG-39] Factorised Active Inference for Strategic Multi-Agent Interactions

链接: https://arxiv.org/abs/2411.07362
作者: Jaime Ruiz-Serra,Patrick Sweeney,Michael S. Harré
关键词-EN: Understanding how individual, individual agents make, make strategic decisions, diverse as economics, multi-agent systems
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding how individual agents make strategic decisions within collectives is important for advancing fields as diverse as economics, neuroscience, and multi-agent systems. Two complementary approaches can be integrated to this end. The Active Inference framework (AIF) describes how agents employ a generative model to adapt their beliefs about and behaviour within their environment. Game theory formalises strategic interactions between agents with potentially competing objectives. To bridge the gap between the two, we propose a factorisation of the generative model whereby each agent maintains explicit, individual-level beliefs about the internal states of other agents, and uses them for strategic planning in a joint context. We apply our model to iterated general-sum games with 2 and 3 players, and study the ensemble effects of game transitions, where the agents’ preferences (game payoffs) change over time. This non-stationarity, beyond that caused by reciprocal adaptation, reflects a more naturalistic environment in which agents need to adapt to changing social contexts. Finally, we present a dynamical analysis of key AIF quantities: the variational free energy (VFE) and the expected free energy (EFE) from numerical simulation data. The ensemble-level EFE allows us to characterise the basins of attraction of games with multiple Nash Equilibria under different conditions, and we find that it is not necessarily minimised at the aggregate level. By integrating AIF and game theory, we can gain deeper insights into how intelligent collectives emerge, learn, and optimise their actions in dynamic environments, both cooperative and non-cooperative.

[LG-40] SynRL: Aligning Synthetic Clinical Trial Data with Human-preferred Clinical Endpoints Using Reinforcement Learning

链接: https://arxiv.org/abs/2411.07317
作者: Trisha Das,Zifeng Wang,Afrah Shafquat,Mandis Beigi,Jason Mezey,Jimeng Sun
关键词-EN: sharing patient records, data, medical interventions, federal regulations, privacy concerns
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Each year, hundreds of clinical trials are conducted to evaluate new medical interventions, but sharing patient records from these trials with other institutions can be challenging due to privacy concerns and federal regulations. To help mitigate privacy concerns, researchers have proposed methods for generating synthetic patient data. However, existing approaches for generating synthetic clinical trial data disregard the usage requirements of these data, including maintaining specific properties of clinical outcomes, and only use post hoc assessments that are not coupled with the data generation process. In this paper, we propose SynRL which leverages reinforcement learning to improve the performance of patient data generators by customizing the generated data to meet the user-specified requirements for synthetic data outcomes and endpoints. Our method includes a data value critic function to evaluate the quality of the generated data and uses reinforcement learning to align the data generator with the users’ needs based on the critic’s feedback. We performed experiments on four clinical trial datasets and demonstrated the advantages of SynRL in improving the quality of the generated synthetic data while keeping the privacy risks low. We also show that SynRL can be utilized as a general framework that can customize data generation of multiple types of synthetic data generators. Our code is available at this https URL.

[LG-41] Anomaly Detection in OKTA Logs using Autoencoders

链接: https://arxiv.org/abs/2411.07314
作者: Jericho Cain,Hayden Beadles,Karthik Venkatesan
关键词-EN: detect cybersecurity events, back periods, Okta logs, today to detect, detect cybersecurity
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 11 pages, 3 tables, 8 figures, Databricks AI Summit 2024

点击查看摘要

Abstract:Okta logs are used today to detect cybersecurity events using various rule-based models with restricted look-back periods. These rule-based approaches have limitations, such as limited retrospective analysis, a predefined rule set, and susceptibility to generating false positives. To address this, we adopt unsupervised techniques, specifically employing autoencoders. To properly use an autoencoder, we first transform and simplify the complex log data we receive from our users. This transformed and filtered data is then fed into the autoencoder, and the output is evaluated.
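下面是一个自编码器异常检测的最小 PyTorch 草图:在变换后的日志特征上训练重构,以重构误差的高分位数作为异常阈值。网络结构与阈值分位数均为本文假设,仅示意摘要中的流程。

```python
import torch
import torch.nn as nn

def autoencoder_anomalies(X, epochs=50, threshold_q=0.99):
    """X: (N, d) 的日志特征矩阵;返回布尔张量,True 表示重构误差异常偏大。"""
    X = torch.as_tensor(X, dtype=torch.float32)
    d = X.shape[1]
    model = nn.Sequential(nn.Linear(d, max(d // 2, 1)), nn.ReLU(),
                          nn.Linear(max(d // 2, 1), d))   # 瓶颈式编码-解码
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):                               # 全批量训练,仅作示意
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), X)
        loss.backward()
        opt.step()
    with torch.no_grad():
        err = ((model(X) - X) ** 2).mean(dim=1)
    return err > torch.quantile(err, threshold_q)
```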

[LG-42] Merit-Based Sortition in Decentralized Systems

链接: https://arxiv.org/abs/2411.07302
作者: J. M. Diederik Kruijssen,Renata Valieva,Kenneth Peluso,Nicholas Emmons,Steven N. Longmore(Allora Foundation)
关键词-EN: optimizing resource efficiency, satisfying computational limitations, total participant pool, resource efficiency, goal of satisfying
类目: Multiagent Systems (cs.MA); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures; appeared in ADI (October 2024)

点击查看摘要

Abstract:In decentralized systems, it is often necessary to select an ‘active’ subset of participants from the total participant pool, with the goal of satisfying computational limitations or optimizing resource efficiency. This selection can sometimes be made at random, mirroring the sortition practice invented in classical antiquity aimed at achieving a high degree of statistical representativeness. However, the recent emergence of specialized decentralized networks that solve concrete coordination problems and are characterized by measurable success metrics often requires prioritizing performance optimization over representativeness. We introduce a simple algorithm for ‘merit-based sortition’, in which the quality of each participant influences its probability of being drafted into the active set, while simultaneously retaining representativeness by allowing inactive participants an infinite number of chances to be drafted into the active set with non-zero probability. Using a suite of numerical experiments, we demonstrate that our algorithm boosts the quality metric describing the performance of the active set by 2 times the intrinsic stochasticity. This implies that merit-based sortition ensures a statistically significant performance boost to the drafted, ‘active’ set, while retaining the property of classical, random sortition that it enables upward mobility from a much larger ‘inactive’ set. This way, merit-based sortition fulfils a key requirement for decentralized systems in need of performance optimization.
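按摘要思路,“按质抽签”可实现为:入选概率与质量分成正比,同时叠加一个均匀的保底概率,保证任何参与者的入选概率都非零,从而保留“向上流动”的可能。下面的 NumPy 草图是该思路的一种假设性实现,并非原论文算法的精确复现。

```python
import numpy as np

def merit_based_draft(quality, k, floor=0.05, rng=None):
    """quality: 各参与者的质量分;k: 活跃集大小;floor: 均匀保底概率的总占比。"""
    rng = rng or np.random.default_rng()
    q = np.asarray(quality, dtype=float)
    p = (1.0 - floor) * q / q.sum() + floor / len(q)   # 质量加权 + 均匀保底
    return rng.choice(len(q), size=k, replace=False, p=p)
```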

[LG-43] ASTD Patterns for Integrated Continuous Anomaly Detection In Data Logs

链接: https://arxiv.org/abs/2411.07272
作者: Chaymae El Jabri,Marc Frappier,Pierre-Martin Tardif
关键词-EN: ensemble anomaly detection, anomaly detection systems, paper investigates, learning models, data streams
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the use of the ASTD language for ensemble anomaly detection in data logs. It uses a sliding window technique for continuous learning in data streams, coupled with updating learning models upon the completion of each window to maintain accurate detection and align with current data trends. It proposes ASTD patterns for combining learning models, especially in the context of unsupervised learning, which is commonly used for data streams. To facilitate this, a new ASTD operator is proposed, the Quantified Flow, which enables the seamless combination of learning models while ensuring that the specification remains concise. Our contribution is a specification pattern, highlighting the capacity of ASTDs to abstract and modularize anomaly detection systems. The ASTD language provides a unique approach to develop data flow anomaly detection systems, grounded in the combination of processes through the graphical representation of the language operators. This simplifies the design task for developers, who can focus primarily on defining the functional operations that constitute the system.

[LG-44] Analysis and Forecasting of the Dynamics of a Floating Wind Turbine Using Dynamic Mode Decomposition

链接: https://arxiv.org/abs/2411.07263
作者: Giorgio Palma,Andrea Bardazzi,Alessia Lucarelli,Chiara Pilloton,Andrea Serani,Claudio Lugni,Matteo Diez
关键词-EN: Dynamic Mode Decomposition, data-driven equation-free modeling, Mode Decomposition, hexafloat floating offshore, Dynamic Mode
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:This article presents a data-driven equation-free modeling of the dynamics of a hexafloat floating offshore wind turbine based on the Dynamic Mode Decomposition (DMD). The DMD is here used to provide a modal analysis and extract knowledge from the dynamic system. A forecasting algorithm for the motions, accelerations, and forces acting on the floating system, as well as the height of the incoming waves, the wind speed, and the power extracted by the wind turbine, is developed by using a methodological extension called Hankel-DMD, that includes time-delayed copies of the states in an augmented state vector. All the analyses are performed on experimental data collected from an operating prototype. The quality of the forecasts obtained varying two main hyperparameters of the algorithm, namely the number of delayed copies and the length of the observation time, is assessed using three different error metrics, each analyzing complementary aspects of the prediction. A statistical analysis exposed the existence of optimal values for the algorithm hyperparameters. Results show the approach’s capability for short-term future estimates of the system’s state, which can be used for real-time prediction and control. Furthermore, a novel Stochastic Hankel-DMD formulation is introduced by considering hyperparameters as stochastic variables. The stochastic version of the method not only enriches the prediction with its related uncertainty but is also found to improve the normalized root mean square error up to 10% on a statistical basis compared to the deterministic counterpart.
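Hankel-DMD 的核心是把若干时移副本堆叠成增广状态,对一步转移算子做最小二乘拟合后迭代外推。下面的 NumPy 草图演示这一流程;延迟份数、预测步数均为示例取值,且未包含原文的随机化超参数(Stochastic Hankel-DMD)版本。

```python
import numpy as np

def hankel_dmd_forecast(x, delays=10, steps=5):
    """x: (d, T) 的多变量时序。返回 (d, steps) 的外推预测。"""
    d, T = x.shape
    H = np.vstack([x[:, i:T - delays + i + 1] for i in range(delays)])  # Hankel 堆叠
    X, Y = H[:, :-1], H[:, 1:]
    A = Y @ np.linalg.pinv(X)      # 一步转移算子(实践中常先做截断 SVD 正则化)
    state = H[:, -1]
    preds = []
    for _ in range(steps):
        state = A @ state
        preds.append(state[-d:])   # 增广状态中最新时刻的分量即预测值
    return np.array(preds).T
```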

[LG-45] Ozone level forecasting in Mexico City with temporal features and interactions

链接: https://arxiv.org/abs/2411.07259
作者: J. M. Sánchez Cerritos,J. A. Martínez-Cadena,A. Marín-López,J. Delgado-Fernández
关键词-EN: negatively impacts human, impacts human health, Tropospheric ozone, atmospheric pollutant, pollutant that negatively
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:Tropospheric ozone is an atmospheric pollutant that negatively impacts human health and the environment. Precise estimation of ozone levels is essential for preventive measures and mitigating its effects. This work compares the accuracy of multiple regression models in forecasting ozone levels in Mexico City, first without adding temporal features and interactions, and then with these features included. Our findings show that incorporating temporal features and interactions improves the accuracy of the models.
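
As a hedged illustration of "temporal features and interactions" (the column names, calendar features, and model are our assumptions, not the authors' exact pipeline), one can expand a timestamped dataset with calendar features and pairwise interaction terms before fitting a regression:

```python
# Augment a regression with temporal features and interactions, then fit.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def add_temporal_features(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["hour"] = ts.dt.hour            # diurnal ozone cycle
    out["dayofweek"] = ts.dt.dayofweek  # weekday/weekend traffic patterns
    out["month"] = ts.dt.month          # seasonality
    return out.drop(columns=[ts_col])

# Usage sketch (hypothetical raw_df with a timestamp column and numeric covariates):
# X = add_temporal_features(raw_df)
# X_inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)
# model = LinearRegression().fit(X_inter, y)
```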

[LG-46] Model Reconstruction Using Counterfactual Explanations: A Perspective From Polytope Theory NEURIPS2024

链接: https://arxiv.org/abs/2405.05369
作者: Pasan Dissanayake,Sanghamitra Dutta
关键词-EN: minimum input perturbation, favorable model outcome, Counterfactual explanations provide, input perturbation, model
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:Counterfactual explanations provide ways of achieving a favorable model outcome with minimum input perturbation. However, counterfactual explanations can also be leveraged to reconstruct the model by strategically training a surrogate model to give similar predictions as the original (target) model. In this work, we analyze how model reconstruction using counterfactuals can be improved by further leveraging the fact that the counterfactuals also lie quite close to the decision boundary. Our main contribution is to derive novel theoretical relationships between the error in model reconstruction and the number of counterfactual queries required using polytope theory. Our theoretical analysis leads us to propose a strategy for model reconstruction that we call Counterfactual Clamping Attack (CCA) which trains a surrogate model using a unique loss function that treats counterfactuals differently than ordinary instances. Our approach also alleviates the related problem of decision boundary shift that arises in existing model reconstruction approaches when counterfactuals are treated as ordinary instances. Experimental results demonstrate that our strategy improves fidelity between the target and surrogate model predictions on several datasets.
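
One plausible reading of "treats counterfactuals differently than ordinary instances" is a one-sided penalty that only fires when the surrogate pushes a counterfactual back across the 0.5 boundary, exploiting the fact that counterfactuals sit just on the favorable side. The PyTorch sketch below encodes that reading; it is our illustrative interpretation, not the paper's exact CCA loss:

```python
# Surrogate training loss: plain BCE on ordinary query points, plus a
# clamped one-sided term on counterfactual points.
import torch
import torch.nn.functional as F

def cca_style_loss(p_ord: torch.Tensor, y_ord: torch.Tensor,
                   p_cf: torch.Tensor) -> torch.Tensor:
    """p_ord/p_cf: surrogate probabilities; y_ord: target-model labels."""
    bce = F.binary_cross_entropy(p_ord, y_ord)
    # Counterfactuals lie just on the favorable side of the target's boundary,
    # so penalize only predictions that fall below 0.5.
    clamp = torch.clamp(0.5 - p_cf, min=0.0).mean()
    return bce + clamp
```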

[LG-47] Doubly Robust Regression Discontinuity Designs

链接: https://arxiv.org/abs/2411.07978
作者: Masahiro Kato
关键词-EN: doubly robust, study introduces, introduces a doubly, regression discontinuity, regression
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This study introduces a doubly robust (DR) estimator for regression discontinuity (RD) designs. In RD designs, treatment effects are estimated in a quasi-experimental setting where treatment assignment depends on whether a running variable surpasses a predefined cutoff. A common approach in RD estimation is to apply nonparametric regression methods, such as local linear regression. In such an approach, the validity relies heavily on the consistency of nonparametric estimators and is limited by the nonparametric convergence rate, thereby preventing $\sqrt{n}$-consistency. To address these issues, we propose the DR-RD estimator, which combines two distinct estimators for the conditional expected outcomes. If either of these estimators is consistent, the treatment effect estimator remains consistent. Furthermore, due to the debiasing effect, our proposed estimator achieves $\sqrt{n}$-consistency if both regression estimators satisfy certain mild conditions, which also simplifies statistical inference.

[LG-48] Tukey g-and-h neural network regression for non-Gaussian data

链接: https://arxiv.org/abs/2411.07957
作者: Arthur P. Guillaumin,Natalia Efremova
关键词-EN: Tukey g-and-h transform, flexible parametric transform, normal random variable, paper addresses non-Gaussian, standard normal random
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper addresses non-Gaussian regression with neural networks via the use of the Tukey g-and-h transform. The Tukey g-and-h transform is a flexible parametric transform with two parameters g and h which, when applied to a standard normal random variable, introduces both skewness and kurtosis, resulting in a distribution commonly called the Tukey g-and-h distribution. Specific values of g and h produce good approximations to other families of distributions, such as the Cauchy and Student-t distributions. The flexibility of the Tukey g-and-h distribution has driven its popularity in the statistical community, in applied sciences and finance. In this work we consider the training of a neural network to predict the parameters of a Tukey g-and-h distribution in a regression framework via the minimization of the corresponding negative log-likelihood, despite the latter having no closed-form expression. We demonstrate the efficiency of our procedure in simulated examples and apply our method to a real-world dataset of global crop yield for several types of crops. Finally, we show how we can carry out a goodness-of-fit analysis between the predicted distributions and the test data. A PyTorch implementation is made available on GitHub and as a PyPI package.
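
For reference, the standard Tukey g-and-h transform takes the textbook form $\tau_{g,h}(z) = \frac{e^{gz}-1}{g}\,e^{hz^{2}/2}$ (location and scale omitted; the $g \to 0$ limit is $z\,e^{hz^{2}/2}$). A few lines of numpy show how g injects skewness and h injects heavy tails:

```python
# Tukey g-and-h transform of a standard normal draw (no location/scale).
import numpy as np

def tukey_gh(z: np.ndarray, g: float, h: float) -> np.ndarray:
    """tau_{g,h}(z) = (exp(g z) - 1)/g * exp(h z^2 / 2), with the g -> 0 limit handled."""
    core = z if abs(g) < 1e-8 else (np.exp(g * z) - 1.0) / g
    return core * np.exp(h * z**2 / 2.0)

# Skewed, heavy-tailed sample: g controls skewness, h >= 0 controls kurtosis.
rng = np.random.default_rng(0)
y = tukey_gh(rng.standard_normal(10_000), g=0.3, h=0.1)
```

Because the transform has no closed-form inverse in general, evaluating the likelihood of observed data requires inverting it numerically, which is exactly why the paper's negative log-likelihood has no closed-form expression.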

[LG-49] CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR ICASSP2025

链接: https://arxiv.org/abs/2411.07607
作者: Wei Zhou,Junteng Jia,Leda Sari,Jay Mahadeokar,Ozlem Kalinli
关键词-EN: integrate audio encoders, gained growing interest, CTC compressor, CTC compressor based, CTC
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: submitted to ICASSP2025

点击查看摘要

Abstract:CTC compressor can be an effective approach to integrate audio encoders to decoder-only models, which has gained growing interest for different speech applications. In this work, we propose a novel CTC compressor based joint speech and text training (CJST) framework for decoder-only ASR. CJST matches speech and text modalities from both directions by exploring a simple modality adaptor and several features of the CTC compressor, including sequence compression, on-the-fly forced peaky alignment and CTC class embeddings. Experimental results on the Librispeech and TED-LIUM2 corpora show that the proposed CJST achieves an effective text injection without the need of duration handling, leading to the best performance for both in-domain and cross-domain scenarios. We also provide a comprehensive study on CTC compressor, covering various compression modes, edge case handling and behavior under both clean and noisy data conditions, which reveals the most robust setting to use CTC compressor for decoder-only models.

[LG-50] Exogenous Randomness Empowering Random Forests

链接: https://arxiv.org/abs/2411.07554
作者: Tianxing Mei,Yingying Fan,Jinchi Lv
关键词-EN: exogenous randomness, tree-building rules independent, training data, empirical insights, random forests
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 103 pages, 10 figures

点击查看摘要

Abstract:We offer theoretical and empirical insights into the impact of exogenous randomness on the effectiveness of random forests with tree-building rules independent of training data. We formally introduce the concept of exogenous randomness and identify two types of commonly existing randomness: Type I from feature subsampling, and Type II from tie-breaking in tree-building processes. We develop non-asymptotic expansions for the mean squared error (MSE) for both individual trees and forests and establish sufficient and necessary conditions for their consistency. In the special example of the linear regression model with independent features, our MSE expansions are more explicit, providing more understanding of the random forests’ mechanisms. It also allows us to derive an upper bound on the MSE with explicit consistency rates for trees and forests. Guided by our theoretical findings, we conduct simulations to further explore how exogenous randomness enhances random forest performance. Our findings unveil that feature subsampling reduces both the bias and variance of random forests compared to individual trees, serving as an adaptive mechanism to balance bias and variance. Furthermore, our results reveal an intriguing phenomenon: the presence of noise features can act as a “blessing” in enhancing the performance of random forests thanks to feature subsampling.
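
The abstract's claim about Type I exogenous randomness (feature subsampling) is easy to probe empirically. The small simulation below, whose data-generating process and hyperparameters are our own assumptions purely for illustration, compares a single deep tree with a feature-subsampled forest on data containing noise features:

```python
# Feature subsampling (max_features) typically lowers test MSE of a forest
# relative to a single deep tree, even with noise features present.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                       # 10 signal + 10 noise features
y = X[:, :10] @ rng.normal(size=10) + rng.normal(size=2000)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)
forest = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                               random_state=0).fit(Xtr, ytr)
for name, m in [("tree", tree), ("forest", forest)]:
    print(name, np.mean((m.predict(Xte) - yte) ** 2))
```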

[LG-51] ADMM for Structured Fractional Minimization

链接: https://arxiv.org/abs/2411.07496
作者: Ganzhao Yuan
关键词-EN: weakly convex square, convex square root, weakly convex, convex square, convex nonsmooth function
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:We consider a class of structured fractional minimization problems, where the numerator includes a differentiable function, a simple nonconvex nonsmooth function, a concave nonsmooth function, and a convex nonsmooth function composed with a linear operator, while the denominator is a continuous function that is either weakly convex or has a weakly convex square root. These problems are widespread and span numerous essential applications in machine learning and data science. Existing methods are mainly based on subgradient methods and smoothing proximal gradient methods, which may suffer from slow convergence and numerical stability issues. In this paper, we introduce FADMM, the first Alternating Direction Method of Multipliers tailored for this class of problems. FADMM decouples the original problem into linearized proximal subproblems, featuring two variants: one using Dinkelbach's parametric method (FADMM-D) and the other using the quadratic transform method (FADMM-Q). By introducing a novel Lyapunov function, we establish that FADMM converges to $\epsilon$-approximate critical points of the problem within an oracle complexity of $\mathcal{O}(1/\epsilon^{3})$. Our experiments on synthetic and real-world data for sparse Fisher discriminant analysis, robust Sharpe ratio minimization, and robust sparse recovery demonstrate the effectiveness of our approach.
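
FADMM-D builds on Dinkelbach's parametric method, which replaces the ratio $f(x)/g(x)$ with a sequence of parametric subproblems $\min_x f(x) - \lambda_k g(x)$ and updates $\lambda_{k+1} = f(x_k)/g(x_k)$. The generic sketch below uses a plain scipy inner solver standing in for the paper's linearized proximal ADMM steps, on a smooth toy problem:

```python
# Dinkelbach's parametric scheme for fractional minimization f(x)/g(x), g > 0.
import numpy as np
from scipy.optimize import minimize

def dinkelbach(f, g, x0, iters=30, tol=1e-8):
    x, lam = np.asarray(x0, dtype=float), 0.0
    for _ in range(iters):
        # Inner subproblem: minimize f(x) - lam * g(x) from the current iterate.
        x = minimize(lambda z: f(z) - lam * g(z), x).x
        new_lam = f(x) / g(x)
        if abs(new_lam - lam) < tol:   # lam has converged to the optimal ratio
            break
        lam = new_lam
    return x, lam

# Toy example: minimize (||x||^2 + 1) / (x[0] + 2) starting from x0 = [1.0].
x_star, ratio = dinkelbach(lambda z: z @ z + 1.0, lambda z: z[0] + 2.0, [1.0])
```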

[LG-52] Constructing Gaussian Processes via Samplets

链接: https://arxiv.org/abs/2411.07277
作者: Marcel Neugebauer
关键词-EN: Gaussian Processes face, Gaussian Processes, face two primary, large datasets, datasets and selecting
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Gaussian Processes face two primary challenges: constructing models for large datasets and selecting the optimal model. This master’s thesis tackles these challenges in the low-dimensional case. We examine recent convergence results to identify models with optimal convergence rates and pinpoint essential parameters. Utilizing this model, we propose a Samplet-based approach to efficiently construct and train the Gaussian Processes, reducing the cubic computational complexity to a log-linear scale. This method facilitates optimal regression while maintaining efficient performance.

[LG-53] Empirical Quantum Advantage Analysis of Quantum Kernel in Gene Expression Data

链接: https://arxiv.org/abs/2411.07276
作者: Arpita Ghosh,MD Muhtasim Fuad,Seemanta Bhattacharjee
关键词-EN: quantum machine learning, classification models demonstrates, machine learning, learning classification models, models demonstrates
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 5 pages

点击查看摘要

Abstract:The incorporation of quantum ansatz with machine learning classification models demonstrates the ability to extract patterns from data for classification tasks. However, taking advantage of the enhanced computational power of quantum machine learning necessitates dealing with various constraints. In this paper, we focus on constraints like finding suitable datasets where quantum advantage is achievable and evaluating the relevance of features chosen by classical and quantum methods. Additionally, we compare quantum and classical approaches using benchmarks and estimate the computational complexity of quantum circuits to assess real-world usability. For our experimental validation, we selected the gene expression dataset, given the critical role of genetic variations in regulating physiological behavior and disease susceptibility. Through this study, we aim to contribute to the advancement of quantum machine learning methodologies, offering valuable insights into their potential for addressing complex classification challenges in various domains.

[LG-54] SPDIM: Source-Free Unsupervised Conditional and Label Shifts Adaptation in EEG

链接: https://arxiv.org/abs/2411.07249
作者: Shanglin Li,Motoaki Kawanabe,Reinmar J. Kobler
关键词-EN: days and subjects, nature of electroencephalography, posing a significant, EEG, non-stationary nature
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The non-stationary nature of electroencephalography (EEG) introduces distribution shifts across domains (e.g., days and subjects), posing a significant challenge to EEG-based neurotechnology generalization. Without labeled calibration data for target domains, the problem is a source-free unsupervised domain adaptation (SFUDA) problem. For scenarios with constant label distribution, Riemannian geometry-aware statistical alignment frameworks on the symmetric positive definite (SPD) manifold are considered state-of-the-art. However, many practical scenarios, including EEG-based sleep staging, exhibit label shifts. Here, we propose a geometric deep learning framework for SFUDA problems under specific distribution shifts, including label shifts. We introduce a novel, realistic generative model and show that prior Riemannian statistical alignment methods on the SPD manifold can compensate for specific marginal and conditional distribution shifts but hurt generalization under label shifts. As a remedy, we propose a parameter-efficient manifold optimization strategy termed SPDIM. SPDIM uses the information maximization principle to learn a single SPD-manifold-constrained parameter per target domain. In simulations, we demonstrate that SPDIM can compensate for the shifts under our generative model. Moreover, using public EEG-based brain-computer interface and sleep staging datasets, we show that SPDIM outperforms prior approaches.

信息检索

[IR-0] A Theoretical Analysis of Recommendation Loss Functions under Negative Sampling

链接: https://arxiv.org/abs/2411.07770
作者: Giulia Di Teodoro,Federico Siciliano,Nicola Tonellotto,Fabrizio Silvestri
关键词-EN: Recommender Systems, Bayesian Personalized Ranking, music streaming, social media, pivotal in diverse
类目: Information Retrieval (cs.IR)
*备注: main paper 8 pages, 4 figures

点击查看摘要

Abstract:Recommender Systems (RSs) are pivotal in diverse domains such as e-commerce, music streaming, and social media. This paper conducts a comparative analysis of prevalent loss functions in RSs: Binary Cross-Entropy (BCE), Categorical Cross-Entropy (CCE), and Bayesian Personalized Ranking (BPR). Exploring the behaviour of these loss functions across varying negative sampling settings, we reveal that BPR and CCE are equivalent when one negative sample is used. Additionally, we demonstrate that all losses share a common global minimum. Evaluation of RSs mainly relies on ranking metrics known as Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR). We produce bounds of the different losses for negative sampling settings to establish a probabilistic lower bound for NDCG. We show that the BPR bound on NDCG is weaker than that of BCE, contradicting the common assumption that BPR is superior to BCE in RSs training. Experiments on five datasets and four models empirically support these theoretical findings. Our code is available at this https URL.
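
The single-negative equivalence is a two-line identity: softmax cross-entropy over the pair {positive, negative} reduces to $-\log\sigma(s_{\mathrm{pos}} - s_{\mathrm{neg}})$, which is exactly the BPR loss. A quick numerical check:

```python
# BPR and sampled CCE coincide when exactly one negative is drawn.
import numpy as np

def bpr(s_pos: float, s_neg: float) -> float:
    # -log sigmoid(s_pos - s_neg)
    return -np.log(1.0 / (1.0 + np.exp(-(s_pos - s_neg))))

def cce(s_pos: float, s_neg: float) -> float:
    # Softmax cross-entropy over the two logits, positive as the target class.
    logits = np.array([s_pos, s_neg])
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

print(bpr(2.0, 0.5), cce(2.0, 0.5))  # identical up to float rounding
```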

[IR-1] Advancing Sustainability via Recommender Systems: A Survey

链接: https://arxiv.org/abs/2411.07658
作者: Xin Zhou,Lei Zhang,Honglei Zhang,Yixin Zhang,Xiaoxiong Zhang,Jie Zhang,Zhiqi Shen
关键词-EN: consumption collectively precipitating, Human behavioral patterns, collectively precipitating substantial, precipitating substantial ecological, resource consumption collectively
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
*备注: 20pages, 10 figures. Working paper: this https URL

点击查看摘要

Abstract:Human behavioral patterns and consumption paradigms have emerged as pivotal determinants in environmental degradation and climate change, with quotidian decisions pertaining to transportation, energy utilization, and resource consumption collectively precipitating substantial ecological impacts. Recommender systems, which generate personalized suggestions based on user preferences and historical interaction data, exert considerable influence on individual behavioral trajectories. However, conventional recommender systems predominantly optimize for user engagement and economic metrics, inadvertently neglecting the environmental and societal ramifications of their recommendations, potentially catalyzing over-consumption and reinforcing unsustainable behavioral patterns. Given their instrumental role in shaping user decisions, there exists an imperative need for sustainable recommender systems that incorporate sustainability principles to foster eco-conscious and socially responsible choices. This comprehensive survey addresses this critical research gap by presenting a systematic analysis of sustainable recommender systems. As these systems can simultaneously advance multiple sustainability objectives–including resource conservation, sustainable consumer behavior, and social impact enhancement–examining their implementations across distinct application domains provides a more rigorous analytical framework. Through a methodological analysis of domain-specific implementations encompassing transportation, food, buildings, and auxiliary sectors, we can better elucidate how these systems holistically advance sustainability objectives while addressing sector-specific constraints and opportunities. Moreover, we delineate future research directions for evolving recommender systems beyond sustainability advocacy toward fostering environmental resilience and social consciousness in society.

[IR-2] Towards Automated Model Design on Recommender Systems

链接: https://arxiv.org/abs/2411.07569
作者: Tunhou Zhang,Dehua Cheng,Yuchen He,Zhengxing Chen,Xiaoliang Dai,Liang Xiong,Yudong Liu,Feng Cheng,Yufan Cao,Feng Yan,Hai Li,Yiran Chen,Wei Wen
关键词-EN: developing AI-based recommender, AI-based recommender systems, Automated Machine Learning, increasing popularity, created new opportunities
类目: Information Retrieval (cs.IR)
*备注: Accepted in ACM Transactions on Recommender Systems. arXiv admin note: substantial text overlap with arXiv:2207.07187

点击查看摘要

Abstract:The increasing popularity of deep learning models has created new opportunities for developing AI-based recommender systems. Designing recommender systems using deep neural networks requires careful architecture design, and further optimization demands extensive co-design efforts on jointly optimizing model architecture and hardware. Design automation, such as Automated Machine Learning (AutoML), is necessary to fully exploit the potential of recommender model design, including model choices and model-hardware co-design strategies. We introduce a novel paradigm that utilizes weight sharing to explore abundant solution spaces. Our paradigm creates a large supernet to search for optimal architectures and co-design strategies to address the challenges of data multi-modality and heterogeneity in the recommendation domain. From a model perspective, the supernet includes a variety of operators, dense connectivity, and dimension search options. From a co-design perspective, it encompasses versatile Processing-In-Memory (PIM) configurations to produce hardware-efficient models. Our solution space’s scale, heterogeneity, and complexity pose several challenges, which we address by proposing various techniques for training and evaluating the supernet. Our crafted models show promising results on three Click-Through Rates (CTR) prediction benchmarks, outperforming both manually designed and AutoML-crafted models with state-of-the-art performance when focusing solely on architecture search. From a co-design perspective, we achieve 2x FLOPs efficiency, 1.8x energy efficiency, and 1.5x performance improvements in recommender models.

[IR-3] Feature Interaction Fusion Self-Distillation Network For CTR Prediction

链接: https://arxiv.org/abs/2411.07508
作者: Lei Sang,Qiuze Ru,Honghao Li,Yiwen Zhang,Xindong Wu
关键词-EN: Click-Through Rate, online advertising, recommender systems, search engines, plays a vital
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Click-Through Rate (CTR) prediction plays a vital role in recommender systems, online advertising, and search engines. Most of the current approaches model feature interactions through stacked or parallel structures, with some employing knowledge distillation for model compression. However, we observe some limitations with these approaches: (1) In parallel structure models, the explicit and implicit components are executed independently and simultaneously, which leads to insufficient information sharing within the feature set. (2) The introduction of knowledge distillation technology brings about the problems of complex teacher-student framework design and low knowledge transfer efficiency. (3) The dataset and the process of constructing high-order feature interactions contain significant noise, which limits the model’s effectiveness. To address these limitations, we propose FSDNet, a CTR prediction framework incorporating a plug-and-play fusion self-distillation module. Specifically, FSDNet forms connections between explicit and implicit feature interactions at each layer, enhancing the sharing of information between different features. The deepest fusion layer is then used as the teacher model, utilizing self-distillation to guide the training of shallow layers. Empirical evaluation across four benchmark datasets validates the framework’s efficacy and generalization capabilities. The code is available on this https URL.
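
A hedged PyTorch sketch of the fusion self-distillation idea follows: the deepest fusion layer's prediction serves as the teacher, and each shallower layer is trained against both the labels and the teacher. The layer structure, loss combination, and weighting alpha are illustrative assumptions, not FSDNet's exact design:

```python
# Self-distillation across fusion layers for binary CTR prediction.
import torch
import torch.nn.functional as F

def self_distill_loss(layer_logits: list[torch.Tensor],
                      labels: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    teacher = layer_logits[-1].detach()              # deepest fusion layer as teacher
    loss = F.binary_cross_entropy_with_logits(layer_logits[-1], labels)
    for logits in layer_logits[:-1]:                 # shallow layers: label + teacher terms
        loss = loss + F.binary_cross_entropy_with_logits(logits, labels)
        loss = loss + alpha * F.mse_loss(torch.sigmoid(logits), torch.sigmoid(teacher))
    return loss
```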

[IR-4] Music Discovery Dialogue Generation Using Human Intent Analysis and Large Language Models

链接: https://arxiv.org/abs/2411.07439
作者: SeungHeon Doh,Keunwoo Choi,Daeyong Kwon,Taesu Kim,Juhan Nam
关键词-EN: conversational music retrieval, music retrieval system, conversational music, large language model, music
类目: ound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: Accepted for publication at the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

点击查看摘要

Abstract:A conversational music retrieval system can help users discover music that matches their preferences through dialogue. To achieve this, a conversational music retrieval system should seamlessly engage in multi-turn conversation by 1) understanding user queries and 2) responding with natural language and retrieved music. A straightforward solution would be a data-driven approach utilizing such conversation logs. However, few datasets are available for the research and are limited in terms of volume and quality. In this paper, we present a data generation framework for rich music discovery dialogue using a large language model (LLM) and user intents, system actions, and musical attributes. This is done by i) dialogue intent analysis using grounded theory, ii) generating attribute sequences via cascading database filtering, and iii) generating utterances using large language models. By applying this framework to the Million Song dataset, we create LP-MusicDialog, a Large Language Model based Pseudo Music Dialogue dataset, containing over 288k music conversations using more than 319k music items. Our evaluation shows that the synthetic dataset is competitive with an existing, small human dialogue dataset in terms of dialogue consistency, item relevance, and naturalness. Furthermore, using the dataset, we train a conversational music retrieval model and show promising results.
