本篇博文主要展示 2024-11-19 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2024-11-19)

今日共更新725篇论文,其中:

  • 自然语言处理77篇(Computation and Language (cs.CL))
  • 人工智能153篇(Artificial Intelligence (cs.AI))
  • 计算机视觉197篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习215篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Bi-Mamba: Towards Accurate 1-Bit State Space Models

【速读】: 该论文试图解决Mamba模型在大规模语言模型(LLM)训练和部署中面临的计算复杂度高、内存需求大以及能源消耗显著的问题。解决方案的关键在于引入Bi-Mamba,这是一种可扩展且强大的1-bit Mamba架构,旨在通过低比特表示实现更高效的LLM。Bi-Mamba模型通过自回归蒸馏损失从头开始训练,实验结果表明其在语言建模任务中性能与全精度模型相当,且显著优于后训练二值化(PTB)的Mamba基线,同时大幅减少了内存占用和能源消耗。这一研究开创了在低比特表示下线性计算复杂度的LLM框架,并为未来针对1-bit Mamba的专用硬件设计奠定了基础。

链接: https://arxiv.org/abs/2411.11843
作者: Shengkun Tang,Liqun Ma,Haonan Li,Mingjie Sun,Zhiqiang Shen
关键词-EN: typical selective state-space, limitations of Transformers, selective state-space model, significant inference-time memory, inference-time memory requirements
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:The typical selective state-space model (SSM) of Mamba addresses several limitations of Transformers, such as quadratic computational complexity with sequence length and significant inference-time memory requirements due to the key-value cache. However, the growing size of Mamba models continues to pose training and deployment challenges and raises environmental concerns due to considerable energy consumption. In this work, we introduce Bi-Mamba, a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models with multiple sizes across 780M, 1.3B, and 2.7B. Bi-Mamba models are trained from scratch on data volume as regular LLM pertaining using an autoregressive distillation loss. Extensive experimental results on language modeling demonstrate that Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16) and much better accuracy than post-training-binarization (PTB) Mamba baselines, while significantly reducing memory footprint and energy consumption compared to the original Mamba model. Our study pioneers a new linear computational complexity LLM framework under low-bit representation and facilitates the future design of specialized hardware tailored for efficient 1-bit Mamba-based LLMs.
摘要:典型的选择性状态空间模型 (SSM) Mamba 解决了 Transformer 的几个局限性,例如与序列长度相关的二次计算复杂性以及由于键值缓存导致的显著推理时间内存需求。然而,Mamba 模型规模的不断增长继续带来训练和部署的挑战,并由于大量能源消耗而引发环境担忧。在这项工作中,我们引入了 Bi-Mamba,这是一种可扩展且强大的 1 比特 Mamba 架构,专为更高效的大语言模型设计,提供 780M、1.3B 和 2.7B 多种规模。Bi-Mamba 模型使用自回归蒸馏损失,在与常规大语言模型预训练相当的数据量上从头开始训练。在语言建模上的广泛实验结果表明,Bi-Mamba 实现了与其全精度对应模型(例如 FP16 或 BF16)相当的性能,并且在精度上远超训练后二值化 (PTB) Mamba 基线,同时相比原始 Mamba 模型显著减少了内存占用和能源消耗。我们的研究开创了在低比特表示下具有线性计算复杂度的大语言模型框架,并为未来设计面向高效 1 比特 Mamba 大语言模型的专用硬件奠定了基础。
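下面给出一个极简的示意代码(假设性实现,并非论文官方代码):用"权重符号二值化 + 逐输出通道缩放 + straight-through 梯度"构造 1-bit 线性层,帮助理解低比特表示的基本思路;其中 BinaryLinear 的命名、缩放方式与超参数均为本文示意。

```python
import torch
import torch.nn as nn

class BinaryLinear(nn.Module):
    """1-bit 权重线性层示意:前向用 ±1 权重乘以逐通道缩放因子,反向梯度仍回传到全精度权重。"""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean(dim=1, keepdim=True)   # 逐输出通道的缩放因子
        w_bin = torch.sign(w) * scale               # 二值化(±1)后再缩放
        w_q = (w_bin - w).detach() + w              # straight-through:前向用 w_bin,反向对 w 求导
        return nn.functional.linear(x, w_q)

if __name__ == "__main__":
    layer = BinaryLinear(16, 8)
    y = layer(torch.randn(4, 16))
    print(y.shape)  # torch.Size([4, 8])
```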

[NLP-1] Tackling prediction tasks in relational databases with LLMs

【速读】: 该论文试图解决大型语言模型(LLMs)在关系数据库预测任务中的应用问题,特别是由于关系数据库中表的相互关联、复杂关系和异构数据类型导致的性能不佳的问题。解决方案的关键在于利用最近引入的RelBench基准测试,证明了即使是直接应用LLMs也能在这些任务中取得有竞争力的表现。这一发现确立了LLMs作为关系数据库机器学习任务中的一个有前景的新基线,并鼓励了该领域的进一步研究。

链接: https://arxiv.org/abs/2411.11829
作者: Marek Wydmuch,Łukasz Borchmann,Filip Graliński
关键词-EN: large language models, remains largely unexplored, databases remains largely, demonstrated exceptional performance, relational databases remains
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Databases (cs.DB)
备注:


Abstract:Though large language models (LLMs) have demonstrated exceptional performance across numerous problems, their application to predictive tasks in relational databases remains largely unexplored. In this work, we address the notion that LLMs cannot yield satisfactory results on relational databases due to their interconnected tables, complex relationships, and heterogeneous data types. Using the recently introduced RelBench benchmark, we demonstrate that even a straightforward application of LLMs achieves competitive performance on these tasks. These findings establish LLMs as a promising new baseline for ML on relational databases and encourage further research in this direction.
摘要:尽管大语言模型(Large Language Models, LLMs)在众多问题上展示了卓越的性能,但其在关系数据库预测任务中的应用仍未得到充分探索。本文针对大语言模型因关系数据库中相互关联的表格、复杂的关系以及异质数据类型而难以取得满意结果的观点进行了探讨。通过使用最近引入的 RelBench 基准测试,我们展示了即使是大语言模型的直接应用,也能在这些任务上取得具有竞争力的表现。这些发现确立了大语言模型作为关系数据库机器学习领域一个有前景的新基准,并鼓励了该方向的进一步研究。

[NLP-2] CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task

【速读】: 该论文试图解决汉语拼音缩写转换为汉字的问题,这是中文拼写校正(Chinese Spelling Correction, CSC)领域中的一个重要分支。由于拼音缩写信息量有限,实现精确转换具有挑战性。论文提出的解决方案是CNMBert,即zh-CN拼音多掩码Bert模型(zh-CN Pinyin Multi-mask Bert Model)。该模型在包含10,424个样本的汉语拼音缩写测试数据集上,超越了少样本GPT模型,实现了59.63%的MRR(Mean Reciprocal Rank),是解决这一问题的关键。

链接: https://arxiv.org/abs/2411.11770
作者: Zishuo Feng,Feng Cao
关键词-EN: Chinese Spelling Correction, Chinese characters represents, Spelling Correction, Chinese Spelling, converting Hanyu Pinyin
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures


Abstract:The task of converting Hanyu Pinyin abbreviations to Chinese characters represents a significant branch within the domain of Chinese Spelling Correction (CSC). This task is typically one of text-length alignment, however, due to the limited informational content in pinyin abbreviations, achieving accurate conversion is challenging. In this paper, we propose CNMBert which stands for zh-CN Pinyin Multi-mask Bert Model as a solution to this issue. CNMBert surpasses few-shot GPT models, achieving a 59.63% MRR on a 10,424-sample Hanyu Pinyin abbreviation test dataset.
摘要:将汉语拼音缩写转换为汉字是中文拼写校正(Chinese Spelling Correction, CSC)领域中的一个重要分支。这一任务通常涉及文本长度的对齐,但由于拼音缩写的信息量有限,实现精确转换具有挑战性。本文提出了一种名为 CNMBert 的模型,即 zh-CN 拼音多掩码 Bert 模型,作为解决这一问题的方案。CNMBert 超越了少样本 GPT 模型,在包含 10,424 个样本的汉语拼音缩写测试数据集上达到了 59.63% 的 MRR(Mean Reciprocal Rank)。
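文中报告的 MRR(Mean Reciprocal Rank,平均倒数排名)可以用下面的小例子直观说明:对每条样本取正确答案在候选列表中排名的倒数,再对所有样本取平均(示例数据为虚构,仅用于解释指标,与论文实验无关)。

```python
def mean_reciprocal_rank(ranked_candidates, gold_answers):
    """MRR:每条样本取正确答案排名的倒数(未命中记 0),再对全部样本求平均。"""
    total = 0.0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        rr = 0.0
        for rank, cand in enumerate(candidates, start=1):
            if cand == gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(gold_answers)

# 虚构示例:三条拼音缩写,各自的候选按模型置信度从高到低排列
preds = [
    ["可能", "可以", "克隆"],   # 正确答案排第 2 -> 1/2
    ["北京", "背景", "必经"],   # 正确答案排第 1 -> 1
    ["代码", "打卡", "大会"],   # 未命中 -> 0
]
golds = ["可以", "北京", "回家"]
print(mean_reciprocal_rank(preds, golds))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```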

[NLP-3] Drowning in Documents: Consequences of Scaling Reranker Inference

【速读】: 该论文试图解决的问题是现有重排序器(rerankers)在全检索任务中的性能表现,特别是当它们被用于对初始检索系统返回的大量文档进行重排序时,是否仍然能够保持高效。论文通过实验揭示了一个令人意外的趋势:尽管重排序器在理论上被认为是更有效的,但在处理大量文档时,它们的效果逐渐减弱,甚至可能降低检索质量。关键在于,论文发现重排序器在处理过多文档时,往往会错误地给与查询无关的文档高分,这表明现有的重排序器在处理大规模检索任务时存在局限性。解决方案的关键在于推动未来研究,以改进重排序器的性能,特别是在处理大量文档时的准确性和效率。

链接: https://arxiv.org/abs/2411.11767
作者: Mathew Jacob,Erik Lindgren,Matei Zaharia,Michael Carbin,Omar Khattab,Andrew Drozdov
关键词-EN: typically cross-encoders, initial IR systems, retrieved by cheaper, cheaper initial, Abstract
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:


Abstract:Rerankers, typically cross-encoders, are often used to re-score the documents retrieved by cheaper initial IR systems. This is because, though expensive, rerankers are assumed to be more effective. We challenge this assumption by measuring reranker performance for full retrieval, not just re-scoring first-stage retrieval. Our experiments reveal a surprising trend: the best existing rerankers provide diminishing returns when scoring progressively more documents and actually degrade quality beyond a certain limit. In fact, in this setting, rerankers can frequently assign high scores to documents with no lexical or semantic overlap with the query. We hope that our findings will spur future research to improve reranking.
摘要:重排序器,通常是交叉编码器,常用于对由成本较低的初始信息检索系统检索到的文档进行重新评分。这是因为,尽管重排序器成本较高,但人们普遍认为它们更为有效。我们通过测量重排序器在完整检索任务中的表现,而不仅仅是重新评分第一阶段的检索结果,来挑战这一假设。我们的实验揭示了一个令人惊讶的趋势:当对越来越多的文档进行评分时,最佳现有重排序器的回报逐渐减少,并且在超过某个限度后,实际上会降低质量。事实上,在这种设置下,重排序器经常会给与查询在词汇或语义上没有重叠的文档分配高分。我们希望我们的发现能够激发未来对重排序技术的改进研究。
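为便于理解"廉价的第一阶段检索 + 交叉编码器重排序"这一流水线,以及论文所考察的"重排序的文档数逐步增大"的设定,下面给出一个玩具示意;其中 first_stage_retrieve 与 cross_encoder_score 都是占位实现,真实系统应替换为 BM25/稠密检索与真正的交叉编码器模型。

```python
def first_stage_retrieve(query, corpus, k):
    """第一阶段检索示意:按词重叠度排序取前 k(真实系统中可用 BM25 或稠密检索)。"""
    overlap = lambda doc: len(set(query.split()) & set(doc.split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def cross_encoder_score(query, doc):
    """交叉编码器打分的占位实现:真实场景应对 (query, doc) 拼接后整体编码打分。"""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q)

def rerank(query, corpus, k):
    candidates = first_stage_retrieve(query, corpus, k)
    return sorted(candidates, key=lambda doc: cross_encoder_score(query, doc), reverse=True)

corpus = [
    "rerankers rescore retrieved documents",
    "cats sleep all day",
    "dense retrieval with encoders",
]
for k in (1, 2, 3):  # 论文关注的正是 k 增大后重排序质量反而可能下降的现象
    print(k, rerank("rerank retrieved documents", corpus, k))
```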

[NLP-4] The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning

【速读】: 该论文试图解决大模态模型(LMMs)在跨文化情境下的有效性问题,特别是由于数据和模型主要以西方为中心,导致其在跨文化任务中的表现受限。解决方案的关键在于引入MosAIC,一个多代理框架(Multi-Agent framework),通过赋予LMMs不同的文化角色(cultural personas)来增强跨文化图像描述(Image Captioning)的能力。该框架通过多代理交互,显著提升了单一模型在不同文化背景下的表现,并提供了一个文化丰富的图像描述数据集和一种文化适应性评估指标(culture-adaptable metric),以量化和评估图像描述中的文化信息。

链接: https://arxiv.org/abs/2411.11758
作者: Longju Bai,Angana Borah,Oana Ignat,Rada Mihalcea
关键词-EN: Large Multimodal Models, Large Multimodal, exhibit impressive performance, exhibit impressive, Multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:


Abstract:Large Multimodal Models (LMMs) exhibit impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of most data and models. Conversely, multi-agent models have shown significant capability in solving complex tasks. Our study evaluates the collective performance of LMMs in a multi-agent interaction setting for the novel task of cultural image captioning. Our contributions are as follows: (1) We introduce MosAIC, a Multi-Agent framework to enhance cross-cultural Image Captioning using LMMs with distinct cultural personas; (2) We provide a dataset of culturally enriched image captions in English for images from China, India, and Romania across three datasets: GeoDE, GD-VCR, CVQA; (3) We propose a culture-adaptable metric for evaluating cultural information within image captions; and (4) We show that the multi-agent interaction outperforms single-agent models across different metrics, and offer valuable insights for future research. Our dataset and models can be accessed at this https URL.
摘要:大型多模态模型 (Large Multimodal Models, LMMs) 在多种多模态任务中展现出令人印象深刻的表现。然而,由于大多数数据和模型主要以西方为中心,它们在跨文化背景下的有效性仍然有限。相比之下,多智能体模型在解决复杂任务方面显示出显著的能力。本研究评估了在多智能体交互环境中,LMMs 在新颖的文化图像描述任务中的集体表现。我们的贡献如下:(1) 我们引入了 MosAIC,一个多智能体框架,通过具有不同文化身份的 LMMs 来增强跨文化图像描述;(2) 我们提供了一个包含文化丰富图像描述的数据集,涵盖中国、印度和罗马尼亚的图像,并基于三个数据集:GeoDE、GD-VCR、CVQA;(3) 我们提出了一种可适应文化的评价指标,用于评估图像描述中的文化信息;(4) 我们展示了多智能体交互在不同评价指标上优于单智能体模型,并为未来的研究提供了宝贵的见解。我们的数据集和模型可通过此 https URL 访问。

[NLP-5] Advacheck at GenAI Detection Task 1: AI Detection Powered by Domain-Aware Multi-Tasking

【速读】: 该论文旨在解决单语种文本中机器生成文本与人类书写文本的识别问题,这是GenAI Detection Task 1竞赛中的一个子任务。解决方案的关键在于采用了一种多任务架构,该架构在多个分类头之间共享一个Transformer编码器。其中一个分类头负责二元分类,区分人类书写文本和机器生成文本,而其他分类头则是辅助的多类别分类器,用于区分来自特定数据集的不同领域文本。通过训练多类别分类器来区分数据中的不同领域,系统能够更好地理解样本特征,从而在测试集上实现了83.07%的宏F1分数,超越了基线10%,并在官方排名中获得第一名。进一步的消融、错误和表示分析表明,多任务学习优于单任务模式,并且同时任务在嵌入空间中形成了集群结构。

链接: https://arxiv.org/abs/2411.11736
作者: German Gritsai,Anastasia Voznyuk,Ildar Khabutdinov,Andrey Grabovoy
关键词-EN: GenAI Detection Task, designed by Advacheck, Advacheck team, GenAI Detection, shared Transformer Encoder
类目: Computation and Language (cs.CL)
备注:


Abstract:The paper describes a system designed by Advacheck team to recognise machine-generated and human-written texts in the monolingual subtask of GenAI Detection Task 1 competition. Our developed system is a multi-task architecture with shared Transformer Encoder between several classification heads. One head is responsible for binary classification between human-written and machine-generated texts, while the other heads are auxiliary multiclass classifiers for texts of different domains from particular datasets. As multiclass heads were trained to distinguish the domains presented in the data, they provide a better understanding of the samples. This approach led us to achieve the first place in the official ranking with 83.07% macro F1-score on the test set and bypass the baseline by 10%. We further study obtained system through ablation, error and representation analyses, finding that multi-task learning outperforms single-task mode and simultaneous tasks form a cluster structure in embeddings space.
摘要:本文描述了Advacheck团队为GenAI Detection Task 1竞赛中的单语子任务设计的系统,用于识别机器生成和人类撰写的文本。我们开发的系统是一种多任务架构,具有多个分类头之间共享的Transformer编码器。其中一个头负责人类撰写和机器生成文本的二元分类,而其他头则是特定数据集中不同领域文本的辅助多类分类器。由于多类头被训练用于区分数据中呈现的领域,它们提供了对样本的更好理解。这种方法使我们在官方排名中以83.07%的宏F1分数在测试集上获得第一名,并超越了基线10%。我们通过消融、错误和表示分析进一步研究了获得的系统,发现多任务学习优于单任务模式,并且同时任务在嵌入空间中形成了集群结构。
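下面用 PyTorch 给出"共享 Transformer 编码器 + 二分类头 + 辅助领域多分类头"这一多任务结构的极简示意;层数、隐藏维度、池化方式与损失组合均为假设,实际系统基于预训练编码器,细节以论文为准。

```python
import torch
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    """共享编码器,多个分类头:一个判别人写/机器生成,另一个做辅助的领域多分类。"""
    def __init__(self, hidden=256, num_domains=5, vocab=30522):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.binary_head = nn.Linear(hidden, 2)             # 人写 vs 机器生成
        self.domain_head = nn.Linear(hidden, num_domains)   # 辅助任务:领域分类

    def forward(self, input_ids):
        h = self.encoder(self.embed(input_ids))             # (batch, seq, hidden)
        pooled = h.mean(dim=1)                               # 简化的平均池化
        return self.binary_head(pooled), self.domain_head(pooled)

model = MultiTaskDetector()
ids = torch.randint(0, 30522, (2, 32))
logits_bin, logits_dom = model(ids)
loss = nn.functional.cross_entropy(logits_bin, torch.tensor([0, 1])) \
     + nn.functional.cross_entropy(logits_dom, torch.tensor([2, 4]))  # 两个任务的损失相加
print(logits_bin.shape, logits_dom.shape, float(loss))
```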

[NLP-6] Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment

【速读】: 该论文试图解决的问题是如何通过提示使大型语言模型(LLMs)改变其初始决策,并使其与既定的伦理框架保持一致。解决方案的关键在于设计两个实验来评估LLMs在道德模糊场景下的易受影响性,以及它们在预定义伦理框架下的对齐能力。第一个实验通过道德模糊场景评估基础代理LLM的初始决策如何被说服代理修改,第二个实验则通过提示LLMs采用基于哲学理论的特定价值对齐来评估其对伦理框架的适应性。研究结果表明,LLMs在道德情境中确实可以被说服,说服的成功与否取决于模型类型、场景复杂性和对话长度等因素。特别是,来自同一家公司但规模不同的LLMs在伦理说服方面的表现存在显著差异,这突显了它们在易受伦理说服方面的变异性。

链接: https://arxiv.org/abs/2411.11731
作者: Allison Huang,Yulu Niki Pi,Carlos Mougan
关键词-EN: large language models, Base Agent LLM, Base Agent initial, Agent initial decisions, Base Agent
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:We explore how large language models (LLMs) can be influenced by prompting them to alter their initial decisions and align them with established ethical frameworks. Our study is based on two experiments designed to assess the susceptibility of LLMs to moral persuasion. In the first experiment, we examine the susceptibility to moral ambiguity by evaluating a Base Agent LLM on morally ambiguous scenarios and observing how a Persuader Agent attempts to modify the Base Agent’s initial decisions. The second experiment evaluates the susceptibility of LLMs to align with predefined ethical frameworks by prompting them to adopt specific value alignments rooted in established philosophical theories. The results demonstrate that LLMs can indeed be persuaded in morally charged scenarios, with the success of persuasion depending on factors such as the model used, the complexity of the scenario, and the conversation length. Notably, LLMs of distinct sizes but from the same company produced markedly different outcomes, highlighting the variability in their susceptibility to ethical persuasion.
摘要:我们探讨了如何通过提示大语言模型 (LLM) 来改变其初始决策,使其与既定的伦理框架保持一致。本研究基于两个实验,旨在评估 LLM 对道德说服的敏感性。在第一个实验中,我们通过评估一个基础 AI 智能体 (Base Agent) 在道德模糊情境下的表现,观察说服者 AI 智能体 (Persuader Agent) 如何尝试修改基础 AI 智能体的初始决策,以此考察其对道德模糊性的敏感性。第二个实验则通过提示 LLM 采纳基于既定哲学理论的特定价值对齐方式,评估其与预定义伦理框架对齐的敏感性。结果表明,LLM 在涉及道德的情境中确实可以被说服,说服的成功与否取决于模型类型、情境复杂性及对话长度等因素。值得注意的是,来自同一家公司但规模不同的 LLM 产生了显著不同的结果,突显了它们对伦理说服敏感性的差异。

[NLP-7] FedCoLLM: A Parameter-Efficient Federated Co-tuning Framework for Large and Small Language Models

【速读】: 该论文试图解决大语言模型 (Large Language Models, LLMs) 与下游客户端的小语言模型 (Small Language Models, SLMs) 之间无法实现同步互促增强的问题。解决方案的关键在于提出了一种新颖且参数高效的联邦学习框架 FedCoLLM,该框架通过在 LLMs 和 SLMs 之间引入轻量级适配器 (lightweight adapters),实现了服务器端 LLMs 知识向客户端 SLMs 的适应性传递,同时从客户端获取领域洞察以丰富 LLMs。这种方法不仅尊重数据隐私,还显著减少了计算和通信开销,从而在保持数据隐私的前提下,有效提升了客户端 SLMs 的性能,并使 LLMs 在未直接访问客户端数据的情况下,达到了与直接微调相当的效果。

链接: https://arxiv.org/abs/2411.11707
作者: Tao Fan,Yan Kang,Guoqiang Ma,Lixin Fan,Kai Chen,Qiang Yang
关键词-EN: Large Language Models, adapting Large Language, Small Language Models, Language Models, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:By adapting Large Language Models (LLMs) to domain-specific tasks or enriching them with domain-specific knowledge, we can fully harness the capabilities of LLMs. Nonetheless, a gap persists in achieving simultaneous mutual enhancement between the server’s LLM and the downstream clients’ Small Language Models (SLMs). To address this, we propose FedCoLLM, a novel and parameter-efficient federated framework designed for co-tuning LLMs and SLMs. This approach is aimed at adaptively transferring server-side LLMs knowledge to clients’ SLMs while simultaneously enriching the LLMs with domain insights from the clients. To accomplish this, FedCoLLM utilizes lightweight adapters in conjunction with SLMs, facilitating knowledge exchange between server and clients in a manner that respects data privacy while also minimizing computational and communication overhead. Our evaluation of FedCoLLM, utilizing various public LLMs and SLMs across a range of NLP text generation tasks, reveals that the performance of clients’ SLMs experiences notable improvements with the assistance of the LLMs. Simultaneously, the LLMs enhanced via FedCoLLM achieves comparable performance to that obtained through direct fine-tuning on clients’ data.
摘要:通过将大语言模型 (Large Language Models, LLMs) 适应于特定领域的任务或通过领域特定知识对其进行丰富,我们可以充分发挥 LLMs 的能力。然而,在实现服务器端 LLM 与下游客户端小语言模型 (Small Language Models, SLMs) 之间的同步互促方面仍存在差距。为此,我们提出了 FedCoLLM,这是一种新颖且参数高效的联邦框架,专门设计用于共同微调 LLMs 和 SLMs。该方法旨在自适应地将服务器端 LLMs 的知识传递给客户端的 SLMs,同时通过客户端的领域洞察来丰富 LLMs。为实现这一目标,FedCoLLM 利用轻量级适配器与 SLMs 结合,促进服务器与客户端之间的知识交换,同时尊重数据隐私并最小化计算和通信开销。我们对 FedCoLLM 的评估,利用了多种公共 LLMs 和 SLMs 在各种自然语言处理 (NLP) 文本生成任务中的表现,结果显示客户端的 SLMs 在 LLMs 的帮助下性能显著提升。同时,通过 FedCoLLM 增强的 LLMs 达到了与直接在客户端数据上微调相当的性能。
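论文中服务器与客户端的知识交换依赖轻量适配器;作为参数高效联邦更新的一个直观示意,下面给出 FedAvg 式的适配器参数平均——只聚合适配器权重而不动基座模型(张量形状与聚合方式均为本文假设,FedCoLLM 实际采用的交换/蒸馏机制以原文为准)。

```python
import copy
import torch

def average_adapters(client_adapters):
    """FedAvg 式参数平均:逐参数名对各客户端的适配器权重取均值。"""
    avg = copy.deepcopy(client_adapters[0])
    for name in avg:
        stacked = torch.stack([sd[name].float() for sd in client_adapters])
        avg[name] = stacked.mean(dim=0)
    return avg

# 虚构的两个客户端适配器(各只有一组低秩投影权重)
client_a = {"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)}
client_b = {"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)}
server_adapter = average_adapters([client_a, client_b])
print({k: tuple(v.shape) for k, v in server_adapter.items()})
```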

[NLP-8] Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search

【速读】: 该论文试图解决大语言模型 (LLMs) 在推理能力上的不足问题,特别是在数学推理任务中的表现。解决方案的关键在于引入奖励引导的树搜索算法 (reward-guided tree search algorithms),通过集成策略模型 (policy model)、奖励模型 (reward model) 和搜索算法,构建一个动态扩展的树搜索框架。该框架的核心是利用策略模型在奖励模型的引导下,探索和扩展树结构,从而生成更准确的推理路径和解决方案。通过这种方法,论文在四个具有挑战性的数学推理数据集上进行了广泛评估,显著提升了 LLMs 的推理能力。

链接: https://arxiv.org/abs/2411.11694
作者: Jinhao Jiang,Zhipeng Chen,Yingqian Min,Jie Chen,Xiaoxue Cheng,Jiapeng Wang,Yiru Tang,Haoxiang Sun,Jia Deng,Wayne Xin Zhao,Zheng Liu,Dong Yan,Jian Xie,Zhongyuan Wang,Ji-Rong Wen
关键词-EN: garnered significant attention, test-time scaling, largely due, released by OpenAI, scaling has garnered
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: LLM;Complex Reasoning;Math


Abstract:Recently, test-time scaling has garnered significant attention from the research community, largely due to the substantial advancements of the o1 model released by OpenAI. By allocating more computational resources during the inference phase, large language models~(LLMs) can extensively explore the solution space by generating more thought tokens or diverse solutions, thereby producing more accurate responses. However, developing an o1-like reasoning approach is challenging, and researchers have been making various attempts to advance this open area of research. In this paper, we present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms. This framework is implemented by integrating the policy model, reward model, and search algorithm. It is primarily constructed around a tree search algorithm, where the policy model navigates a dynamically expanding tree guided by a specially trained reward model. We thoroughly explore various design considerations necessary for implementing this framework and provide a detailed report of the technical aspects. To assess the effectiveness of our approach, we focus on mathematical reasoning tasks and conduct extensive evaluations on four challenging datasets, significantly enhancing the reasoning abilities of LLMs.
摘要:近年来,测试时缩放技术引起了研究界的广泛关注,这在很大程度上归功于 OpenAI 发布的 o1 模型的重大进展。通过在推理阶段分配更多的计算资源,大语言模型 (LLMs) 可以通过生成更多的思维 Token 或多样化的解决方案来广泛探索解空间,从而产生更准确的响应。然而,开发类似 o1 的推理方法具有挑战性,研究人员一直在尝试推进这一开放领域的研究。本文中,我们初步探索了通过奖励引导的树搜索算法来增强 LLMs 的推理能力。该框架通过整合策略模型、奖励模型和搜索算法来实现。它主要围绕树搜索算法构建,其中策略模型在专门训练的奖励模型的引导下导航动态扩展的树。我们全面探讨了实现该框架所需的各种设计考虑,并详细报告了技术方面的内容。为了评估我们方法的有效性,我们专注于数学推理任务,并在四个具有挑战性的数据集上进行了广泛的评估,显著增强了 LLMs 的推理能力。
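下面用一个玩具最佳优先搜索示意"策略模型扩展节点、奖励模型打分引导搜索"的基本流程;expand 与 reward 在真实框架中分别对应策略 LLM 的采样与训练好的奖励模型,这里用简单函数代替,仅作示意。

```python
import heapq

def reward_guided_search(root, expand, reward, max_nodes=50):
    """最佳优先树搜索:每次弹出当前奖励最高的节点,扩展其子节点并用奖励模型打分。"""
    frontier = [(-reward(root), root)]
    best = (root, reward(root))
    visited = 0
    while frontier and visited < max_nodes:
        neg_r, node = heapq.heappop(frontier)
        visited += 1
        if -neg_r > best[1]:
            best = (node, -neg_r)
        for child in expand(node):
            heapq.heappush(frontier, (-reward(child), child))
    return best

# 玩具例子:推理路径是数字序列,策略每步追加一个 token,奖励偏好和恰为 10 的路径
expand = lambda path: [path + (t,) for t in (1, 2, 3)] if len(path) < 4 else []
reward = lambda path: -abs(10 - sum(path))
print(reward_guided_search((), expand, reward))  # 期望输出类似 ((3, 3, 3, 1), 0)
```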

[NLP-9] Chapter 7 Review of Data-Driven Generative AI Models for Knowledge Extraction from Scientific Literature in Healthcare

【速读】: 该论文试图解决自然语言处理(NLP)在生成式文本摘要(abstractive text summarization)方面的发展问题,并将其与现有的抽取式摘要技术进行比较。解决方案的关键在于评估和分析从20世纪50年代至今的文本摘要技术发展历程,特别是预训练语言模型如BERT和GPT的引入。论文通过筛选和评估60项研究,最终选择了7项进行深入分析,并提供了GPT-3与GPT-4在科学文本摘要中的对比示例。尽管NLP在生成简短文本摘要方面尚未完全发挥其潜力,但论文指出,随着相关问题的逐步解决,这些模型将逐渐应用于实际场景。

链接: https://arxiv.org/abs/2411.11635
作者: Leon Kopitar,Primoz Kocbek,Lucija Gosak,Gregor Stiglic
关键词-EN: Generative Pre-training Transformers, abstractive NLP-based text, NLP-based text summarization, text summarization approaches, Bidirectional Encoder Representations
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures, 1 table


Abstract:This review examines the development of abstractive NLP-based text summarization approaches and compares them to existing techniques for extractive summarization. A brief history of text summarization from the 1950s to the introduction of pre-trained language models such as Bidirectional Encoder Representations from Transformer (BERT) and Generative Pre-training Transformers (GPT) are presented. In total, 60 studies were identified in PubMed and Web of Science, of which 29 were excluded and 24 were read and evaluated for eligibility, resulting in the use of seven studies for further analysis. This chapter also includes a section with examples including an example of a comparison between GPT-3 and state-of-the-art GPT-4 solutions in scientific text summarisation. Natural language processing has not yet reached its full potential in the generation of brief textual summaries. As there are acknowledged concerns that must be addressed, we can expect gradual introduction of such models in practise.
摘要:本文综述了基于自然语言处理 (NLP) 的生成式文本摘要方法的发展,并将其与现有的抽取式摘要技术进行了比较。简要回顾了从 20 世纪 50 年代到引入预训练语言模型(如 Transformer 的双向编码器表示 (BERT) 和生成式预训练 Transformer (GPT))的文本摘要历史。总共在 PubMed 和 Web of Science 中识别了 60 项研究,其中 29 项被排除,24 项被阅读并评估其合格性,最终选择了 7 项研究进行进一步分析。本章还包括一个示例部分,其中包括在科学文本摘要中 GPT-3 与最先进的 GPT-4 解决方案之间的比较示例。自然语言处理在生成简短文本摘要方面尚未充分发挥其潜力。由于存在一些公认的问题需要解决,我们可以预期这些模型将在实践中逐步引入。

[NLP-10] Federated Incremental Named Entity Recognition

【速读】: 该论文试图解决联邦命名实体识别 (Federated Named Entity Recognition, FNER) 中由于新实体类型不断增加和新客户端不定期加入导致的异质性遗忘问题。解决方案的关键在于提出了一种本地-全局遗忘防御模型 (Local-Global Forgetting Defense, LGFD)。具体来说,针对本地客户端内的遗忘问题,采用了结构知识蒸馏损失 (structural knowledge distillation loss) 来保留潜在空间的特征结构,并通过伪标签引导的跨类型对比损失 (pseudo-label-guided inter-type contrastive loss) 增强不同实体类型的区分能力,从而有效保留本地客户端中已学习到的知识。针对跨客户端的遗忘问题,提出了任务切换监控器 (task switching monitor),能够在隐私保护的前提下自动识别新实体类型,并存储最新的旧全局模型用于知识蒸馏和伪标签生成。实验结果表明,LGFD模型在性能上显著优于比较方法。

链接: https://arxiv.org/abs/2411.11623
作者: Duzhen Zhang,Yahan Yu,Chenxing Li,Jiahua Dong,Dong Yu
关键词-EN: Named Entity Recognition, Federated Named Entity, Federated Named, Entity Recognition, Named Entity
类目: Computation and Language (cs.CL)
备注: Under Review


Abstract:Federated Named Entity Recognition (FNER) boosts model training within each local client by aggregating the model updates of decentralized local clients, without sharing their private data. However, existing FNER methods assume fixed entity types and local clients in advance, leading to their ineffectiveness in practical applications. In a more realistic scenario, local clients receive new entity types continuously, while new local clients collecting novel data may irregularly join the global FNER training. This challenging setup, referred to here as Federated Incremental NER, renders the global model suffering from heterogeneous forgetting of old entity types from both intra-client and inter-client perspectives. To overcome these challenges, we propose a Local-Global Forgetting Defense (LGFD) model. Specifically, to address intra-client forgetting, we develop a structural knowledge distillation loss to retain the latent space’s feature structure and a pseudo-label-guided inter-type contrastive loss to enhance discriminative capability over different entity types, effectively preserving previously learned knowledge within local clients. To tackle inter-client forgetting, we propose a task switching monitor that can automatically identify new entity types under privacy protection and store the latest old global model for knowledge distillation and pseudo-labeling. Experiments demonstrate significant improvement of our LGFD model over comparison methods.
摘要:联邦命名实体识别 (Federated Named Entity Recognition, FNER) 通过聚合分散的本地客户端的模型更新来增强每个本地客户端的模型训练,而无需共享其私有数据。然而,现有的 FNER 方法假设实体类型和本地客户端是预先固定的,这导致其在实际应用中的效果不佳。在更现实的场景中,本地客户端会持续接收新的实体类型,同时收集新数据的新的本地客户端可能会不定期地加入全局 FNER 训练。这种具有挑战性的设置,我们称之为联邦增量命名实体识别 (Federated Incremental NER),使得全局模型在从客户端内部和客户端间两个角度来看,都遭受对旧实体类型的异质性遗忘。为了克服这些挑战,我们提出了一种本地-全局遗忘防御 (Local-Global Forgetting Defense, LGFD) 模型。具体来说,为了解决客户端内部的遗忘问题,我们开发了一种结构知识蒸馏损失,以保留潜在空间的特征结构,并引入了一种伪标签引导的跨类型对比损失,以增强对不同实体类型的区分能力,从而有效保留本地客户端中先前学习到的知识。为了应对客户端间的遗忘问题,我们提出了一种任务切换监控器,该监控器能够在隐私保护下自动识别新的实体类型,并存储最新的旧全局模型以进行知识蒸馏和伪标签生成。实验结果表明,我们的 LGFD 模型在对比方法中显著提升了性能。

[NLP-11] OASIS: Open Agents Social Interaction Simulations on One Million Agents

【速读】: 该论文试图解决现有基于规则的代理模型(ABMs)在模拟社交媒体平台(如X、Reddit)时,难以扩展到大规模用户和多样化现象的问题。解决方案的关键是提出了OASIS,一个可扩展且通用的社交媒体模拟器。OASIS通过整合动态更新的环境(如动态社交网络和帖子信息)、多样化的行为空间(如关注、评论)以及推荐系统(基于兴趣和热度评分),能够模拟多达一百万用户的行为。此外,OASIS的设计使其易于扩展到不同的社交媒体平台,从而研究大规模群体现象和行为。通过复现多种社会现象(如信息传播、群体极化和羊群效应),OASIS展示了其在数字环境中研究复杂系统的潜力。

链接: https://arxiv.org/abs/2411.11581
作者: Ziyi Yang,Zaibin Zhang,Zirui Zheng,Yuxian Jiang,Ziyue Gan,Zhiyu Wang,Zijian Ling,Jinsong Chen,Martz Ma,Bowen Dong,Prateek Gupta,Shuyue Hu,Zhenfei Yin,Guohao Li,Xu Jia,Lijun Wang,Bernard Ghanem,Huchuan Lu,Wanli Ouyang,Yu Qiao,Philip Torr,Jing Shao
关键词-EN: enhancing rule-based agent-based, realistic large language, social media platforms, rule-based agent-based models, large language model
类目: Computation and Language (cs.CL)
备注:


Abstract:There has been a growing interest in enhancing rule-based agent-based models (ABMs) for social media platforms (i.e., X, Reddit) with more realistic large language model (LLM) agents, thereby allowing for a more nuanced study of complex systems. As a result, several LLM-based ABMs have been proposed in the past year. While they hold promise, each simulator is specifically designed to study a particular scenario, making it time-consuming and resource-intensive to explore other phenomena using the same ABM. Additionally, these models simulate only a limited number of agents, whereas real-world social media platforms involve millions of users. To this end, we propose OASIS, a generalizable and scalable social media simulator. OASIS is designed based on real-world social media platforms, incorporating dynamically updated environments (i.e., dynamic social networks and post information), diverse action spaces (i.e., following, commenting), and recommendation systems (i.e., interest-based and hot-score-based). Additionally, OASIS supports large-scale user simulations, capable of modeling up to one million users. With these features, OASIS can be easily extended to different social media platforms to study large-scale group phenomena and behaviors. We replicate various social phenomena, including information spreading, group polarization, and herd effects across X and Reddit platforms. Moreover, we provide observations of social phenomena at different agent group scales. We observe that the larger agent group scale leads to more enhanced group dynamics and more diverse and helpful agents’ opinions. These findings demonstrate OASIS’s potential as a powerful tool for studying complex systems in digital environments.
摘要:近年来,随着对复杂系统进行更细致研究的兴趣日益增长,人们越来越关注如何通过引入更真实的大语言模型 (LLM) 智能体来增强基于规则的智能体模型 (ABMs),特别是在社交媒体平台(如 X、Reddit)上的应用。因此,过去一年中提出了多个基于 LLM 的 ABMs。尽管这些模拟器具有潜力,但它们各自专为特定场景设计,使得使用同一 ABM 探索其他现象既耗时又资源密集。此外,这些模型仅模拟有限数量的智能体,而现实世界的社交媒体平台涉及数百万用户。为此,我们提出了 OASIS,一个可泛化且可扩展的社交媒体模拟器。OASIS 基于真实世界的社交媒体平台设计,整合了动态更新的环境(如动态社交网络和帖子信息)、多样化的行动空间(如关注、评论)以及推荐系统(如基于兴趣和热门评分)。此外,OASIS 支持大规模用户模拟,能够建模多达一百万用户。凭借这些特性,OASIS 可以轻松扩展到不同的社交媒体平台,用于研究大规模群体现象和行为。我们复现了多种社会现象,包括信息传播、群体极化和羊群效应,涵盖 X 和 Reddit 平台。此外,我们提供了在不同智能体群体规模下对社会现象的观察。我们发现,智能体群体规模越大,群体动态越增强,智能体的意见也越多样化和有帮助。这些发现展示了 OASIS 作为研究数字环境中复杂系统的强大工具的潜力。

[NLP-12] Addressing Hallucinations in Language Models with Knowledge Graph Embeddings as an Additional Modality

【速读】: 该论文试图解决大型语言模型(LLMs)中存在的幻觉问题(hallucinations),即模型生成不准确或虚假信息的现象。解决方案的关键在于将知识图谱(Knowledge Graphs, KGs)作为新的模态引入到LLMs中,通过将输入文本转换为知识图谱嵌入(KG embeddings),并使用适配器(adapter)将这些嵌入整合到语言模型空间中,从而在不依赖外部检索过程的情况下提高模型的准确性。具体实现中,论文创建了包含超过300万条Wikipedia文本及其对应实体嵌入的WikiEntities数据集,用于训练实体链接模型和适配器。该方法无需对语言模型本身进行微调,仅训练适配器即可,确保模型在其他任务上的性能不受影响。实验结果表明,该方法在HaluEval、True-False基准测试和FEVER数据集上显著提升了模型的表现,有效减少了幻觉现象。

链接: https://arxiv.org/abs/2411.11531
作者: Viktoriia Chekalina,Anton Razzigaev,Elizaveta Goncharova,Andrey Kuznetsov
关键词-EN: incorporating Knowledge Graphs, Knowledge Graphs, Large Language Models, Large Language, incorporating Knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:In this paper we present an approach to reduce hallucinations in Large Language Models (LLMs) by incorporating Knowledge Graphs (KGs) as an additional modality. Our method involves transforming input text into a set of KG embeddings and using an adapter to integrate these embeddings into the language model space, without relying on external retrieval processes. To facilitate this, we created WikiEntities, a dataset containing over 3 million Wikipedia texts annotated with entities from Wikidata and their corresponding embeddings from PyTorch-BigGraph. This dataset serves as a valuable resource for training Entity Linking models and adapting the described method to various LLMs using specialized adapters. Our method does not require fine-tuning of the language models themselves; instead, we only train the adapter. This ensures that the model’s performance on other tasks is not affected. We trained an adapter for the Mistral 7B, LLaMA 2-7B (chat), and LLaMA 3-8B (instruct) models using this dataset and demonstrated that our approach improves performance on the HaluEval, True-False benchmarks and FEVER dataset. The results indicate that incorporating KGs as a new modality can effectively reduce hallucinations and improve the factual accuracy of language models, all without the need for external retrieval.
摘要:本文提出了一种通过引入知识图谱 (Knowledge Graphs, KGs) 作为额外模态来减少大语言模型 (Large Language Models, LLMs) 中幻觉现象的方法。我们的方法涉及将输入文本转换为一组 KG 嵌入,并使用适配器将这些嵌入整合到语言模型空间中,而无需依赖外部检索过程。为此,我们创建了 WikiEntities,这是一个包含超过 300 万条维基百科文本的数据集,这些文本通过 Wikidata 中的实体及其对应的 PyTorch-BigGraph 嵌入进行注释。该数据集作为训练实体链接模型和使用专用适配器将所述方法应用于各种 LLMs 的宝贵资源。我们的方法不需要对语言模型本身进行微调;相反,我们仅训练适配器。这确保了模型在其他任务上的性能不受影响。我们使用该数据集为 Mistral 7B、LLaMA 2-7B (chat) 和 LLaMA 3-8B (instruct) 模型训练了适配器,并证明我们的方法在 HaluEval、True-False 基准和 FEVER 数据集上的性能有所提升。结果表明,将 KGs 作为新模态引入可以有效减少幻觉现象,并提高语言模型的实际准确性,而无需外部检索。
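下面给出"实体向量经轻量适配器映射到语言模型隐空间,再与文本 token 嵌入拼接"这一思路的极简示意;适配器结构与各维度大小均为假设,论文中的具体实现以原文为准。

```python
import torch
import torch.nn as nn

class KGAdapter(nn.Module):
    """把知识图谱实体向量投影到语言模型隐空间的两层 MLP 适配器(示意)。"""
    def __init__(self, kg_dim=200, lm_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(kg_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))

    def forward(self, entity_embs):           # (batch, num_entities, kg_dim)
        return self.proj(entity_embs)         # (batch, num_entities, lm_dim)

adapter = KGAdapter()
entity_embs = torch.randn(1, 3, 200)          # 假设输入文本链接到 3 个实体
token_embs = torch.randn(1, 12, 1024)         # 原始输入文本的 token 嵌入(示意)
inputs_embeds = torch.cat([adapter(entity_embs), token_embs], dim=1)  # 作为额外"模态"前置拼接
print(inputs_embeds.shape)                    # torch.Size([1, 15, 1024])
```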


[NLP-13] Search Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering

【速读】: 该论文试图解决基础模型在提供有效监督信号方面的挑战,特别是在生成式 AI (Generative AI) 和大规模监督信号开发日益重要的背景下。解决方案的关键在于提出了一种名为“验证器工程 (verifier engineering)”的新型后训练范式。验证器工程的核心是通过一系列自动化验证器来执行验证任务,并向基础模型提供有意义的反馈。该过程被系统地分为三个阶段:搜索、验证和反馈,每个阶段都有其独特的研究进展和技术应用。论文认为,验证器工程是实现人工通用智能 (Artificial General Intelligence) 的基本途径。

链接: https://arxiv.org/abs/2411.11504
作者: Xinyan Guan,Yanjiang Liu,Xinyu Lu,Boxi Cao,Ben He,Xianpei Han,Le Sun,Jie Lou,Bowen Yu,Yaojie Lu,Hongyu Lin
关键词-EN: scalable supervision signals, supervision signals, evolution of machine, machine learning, learning has increasingly
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:


Abstract:The evolution of machine learning has increasingly prioritized the development of powerful models and more scalable supervision signals. However, the emergence of foundation models presents significant challenges in providing effective supervision signals necessary for further enhancing their capabilities. Consequently, there is an urgent need to explore novel supervision signals and technical approaches. In this paper, we propose verifier engineering, a novel post-training paradigm specifically designed for the era of foundation models. The core of verifier engineering involves leveraging a suite of automated verifiers to perform verification tasks and deliver meaningful feedback to foundation models. We systematically categorize the verifier engineering process into three essential stages: search, verify, and feedback, and provide a comprehensive review of state-of-the-art research developments within each stage. We believe that verifier engineering constitutes a fundamental pathway toward achieving Artificial General Intelligence.
摘要:机器学习的演进越来越注重开发更强大的模型和更具扩展性的监督信号。然而,基础模型的出现为提供有效监督信号以进一步增强其能力带来了重大挑战。因此,迫切需要探索新的监督信号和技术方法。本文提出了一种名为“验证器工程”的新型后训练范式,专门针对基础模型时代而设计。验证器工程的核心在于利用一系列自动化验证器执行验证任务,并向基础模型提供有意义的反馈。我们将验证器工程过程系统地分为三个关键阶段:搜索、验证和反馈,并对每个阶段内的最新研究进展进行了全面综述。我们相信,验证器工程是实现通用人工智能(Artificial General Intelligence)的基本途径。
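下面的玩具代码示意"搜索-验证-反馈"一轮迭代的基本骨架;model_generate 与各 verifier 均为虚构的占位函数,仅用于说明三个阶段如何衔接,不代表论文的具体实现。

```python
import random

def verifier_engineering_step(model_generate, verifiers, prompt, num_candidates=4):
    """一轮示意:搜索(采样候选)-> 验证(验证器打分)-> 反馈(返回最优候选及得分)。"""
    candidates = [model_generate(prompt) for _ in range(num_candidates)]      # 搜索
    score = lambda c: sum(v(prompt, c) for v in verifiers) / len(verifiers)   # 验证
    best = max(candidates, key=score)
    return best, score(best)                                                  # 反馈,可用于继续训练或拒绝采样

# 虚构例子:验证器检查"输出是否为纯数字"与"长度是否合理"
fake_generate = lambda p: random.choice(["42", "我不知道", "4 2", "answer: 42"])
verifiers = [lambda p, c: float(c.strip().isdigit()), lambda p, c: float(len(c) < 10)]
print(verifier_engineering_step(fake_generate, verifiers, "6 * 7 = ?"))
```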

[NLP-14] Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models

【速读】: 该论文试图解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在安全性方面的问题,特别是如何利用视觉模态引入的不可预见领域来绕过模型的安全防护。解决方案的关键在于揭示了LVLMs的两个基本特性:普遍推理能力和安全雪球效应。基于这些特性,论文提出了一个名为Safety Snowball Agent (SSA)的新型代理框架,通过利用代理的自主性和工具使用能力来实现对LVLMs的越狱。SSA的工作流程包括两个主要阶段:初始响应生成阶段,工具根据潜在的有害意图生成或检索越狱图像;以及有害雪球效应阶段,通过精炼的后续提示逐步诱导出有害输出。实验结果表明,SSA能够利用几乎任何图像诱导LVLMs产生不安全内容,成功率很高,这为生成式多模态系统的安全性提出了新的挑战。

链接: https://arxiv.org/abs/2411.11496
作者: Chenhang Cui,Gelei Deng,An Zhang,Jingnan Zheng,Yicong Li,Lianli Gao,Tianwei Zhang,Tat-Seng Chua
关键词-EN: Large Vision-Language Models, Recent advances, Vision-Language Models, advances in Large, Large Vision-Language
类目: Computation and Language (cs.CL)
备注:


Abstract:Recent advances in Large Vision-Language Models (LVLMs) have showcased strong reasoning abilities across multiple modalities, achieving significant breakthroughs in various real-world applications. Despite this great success, the safety guardrail of LVLMs may not cover the unforeseen domains introduced by the visual modality. Existing studies primarily focus on eliciting LVLMs to generate harmful responses via carefully crafted image-based jailbreaks designed to bypass alignment defenses. In this study, we reveal that a safe image can be exploited to achieve the same jailbreak consequence when combined with additional safe images and prompts. This stems from two fundamental properties of LVLMs: universal reasoning capabilities and safety snowball effect. Building on these insights, we propose Safety Snowball Agent (SSA), a novel agent-based framework leveraging agents’ autonomous and tool-using abilities to jailbreak LVLMs. SSA operates through two principal stages: (1) initial response generation, where tools generate or retrieve jailbreak images based on potential harmful intents, and (2) harmful snowballing, where refined subsequent prompts induce progressively harmful outputs. Our experiments demonstrate that \ours can use nearly any image to induce LVLMs to produce unsafe content, achieving high success jailbreaking rates against the latest LVLMs. Unlike prior works that exploit alignment flaws, \ours leverages the inherent properties of LVLMs, presenting a profound challenge for enforcing safety in generative multimodal systems. Our code is avaliable at \urlthis https URL.
摘要:近年来,大视觉语言模型 (Large Vision-Language Models, LVLMs) 的进展展示了其在多模态推理方面的强大能力,在各种实际应用中取得了显著突破。尽管取得了如此巨大的成功,LVLMs 的安全防护措施可能无法覆盖视觉模态引入的不可预见领域。现有研究主要集中在通过精心设计的基于图像的越狱手段,诱导 LVLMs 生成有害响应,以绕过对齐防御。在本研究中,我们揭示了一个安全的图像在与额外的安全图像和提示结合时,可以被利用以达到相同的越狱效果。这源于 LVLMs 的两个基本特性:普遍推理能力和安全雪球效应。基于这些洞察,我们提出了安全雪球智能体 (Safety Snowball Agent, SSA),这是一个利用智能体自主性和工具使用能力来越狱 LVLMs 的新型智能体框架。SSA 通过两个主要阶段运作:(1) 初始响应生成,工具根据潜在的有害意图生成或检索越狱图像;(2) 有害雪球效应,通过精炼的后续提示逐步诱导出更具危害性的输出。我们的实验表明,\ours 可以使用几乎任何图像诱导 LVLMs 生成不安全内容,对最新的 LVLMs 实现高成功率的越狱。与先前利用对齐缺陷的工作不同,\ours 利用了 LVLMs 的固有特性,为生成多模态系统的安全性实施带来了深刻挑战。我们的代码可在 \urlthis https URL 获取。

[NLP-15] Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts

【速读】: 该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在评估中忽视抽象价值维度(如个性和价值观)的问题。解决方案的关键在于引入Value-Spectrum,这是一个基于Schwartz价值维度的视觉问答基准,旨在评估VLMs在处理涉及核心价值观的内容时的表现。通过构建包含超过50,000个短视频的数据库,涵盖家庭、健康、爱好、社会和技术等多个主题,并开发自动化视频浏览和分析的VLM代理管道,论文不仅揭示了现有VLMs在价值导向内容上的显著差异,还探索了模型在明确提示下扮演特定角色的能力,从而为VLMs在价值基础任务上的进步提供了全面的评估工具,并推动了更复杂角色扮演AI代理的开发。

链接: https://arxiv.org/abs/2411.11479
作者: Jingxuan Li,Yuning Yang,Shengqi Yang,Yizhou Zhao,Ying Nian Wu
关键词-EN: overlooking abstract aspects, expanded multimodal applications, overlooking abstract, multimodal applications, object recognition
类目: Computation and Language (cs.CL)
备注:


Abstract:The rapid advancement of Vision-Language Models (VLMs) has expanded multimodal applications, yet evaluations often focus on basic tasks like object recognition, overlooking abstract aspects such as personalities and values. To address this gap, we introduce Value-Spectrum, a visual question-answering benchmark aimed at assessing VLMs based on Schwartz’s value dimensions, which capture core values guiding people’s beliefs and actions across cultures. We constructed a vectorized database of over 50,000 short videos sourced from TikTok, YouTube Shorts, and Instagram Reels, covering multiple months and a wide array of topics such as family, health, hobbies, society, and technology. We also developed a VLM agent pipeline to automate video browsing and analysis. Benchmarking representative VLMs on Value-Spectrum reveals significant differences in their responses to value-oriented content, with most models exhibiting a preference for hedonistic topics. Beyond identifying natural preferences, we explored the ability of VLM agents to adopt specific personas when explicitly prompted, revealing insights into the models’ adaptability in role-playing scenarios. These findings highlight the potential of Value-Spectrum as a comprehensive evaluation set for tracking VLM advancements in value-based tasks and for developing more sophisticated role-playing AI agents.
摘要:视觉-语言模型 (Vision-Language Models, VLMs) 的快速发展扩展了多模态应用,然而评估往往集中在物体识别等基础任务上,忽视了人格和价值观等抽象方面。为了填补这一空白,我们引入了价值光谱 (Value-Spectrum),这是一个视觉问答基准,旨在基于 Schwartz 的价值维度评估 VLMs,这些维度捕捉了跨文化引导人们信仰和行为的核心价值观。我们构建了一个包含超过 50,000 个短视频的向量化数据库,这些视频来自 TikTok、YouTube Shorts 和 Instagram Reels,涵盖了多个时间段和广泛的主题,如家庭、健康、爱好、社会和技术。我们还开发了一个 VLM 智能体管道,用于自动化视频浏览和分析。在价值光谱上对代表性 VLMs 进行基准测试,揭示了它们在价值导向内容上的响应存在显著差异,大多数模型表现出对享乐主义主题的偏好。除了识别自然偏好外,我们还探讨了 VLM 智能体在明确提示下采用特定人格的能力,揭示了模型在角色扮演场景中的适应性。这些发现突显了价值光谱作为全面评估集的潜力,用于追踪 VLM 在基于价值的任务中的进展,并开发更复杂的角色扮演 AI 智能体。

[NLP-16] Re-examining learning linear functions in context

【速读】: 该论文旨在探讨上下文学习(In Context Learning, ICL)在不同训练和测试设置下,对于从零开始训练的多种尺寸的transformer模型的表现,并揭示ICL在泛化到训练分布之外数据时的局限性。解决方案的关键在于识别并分析这些模型在面对非训练分布数据时的系统性失败,从而揭示ICL策略与标准解决方案之间的差异。

链接: https://arxiv.org/abs/2411.11465
作者: Omar Naim,Guilhem Fouilhé,Nicholas Asher
关键词-EN: context learning, range of problems, attractive method, method of solving, solving a wide
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:


Abstract:In context learning (ICL) is an attractive method of solving a wide range of problems. Inspired by Garg et al. (2022), we look closely at ICL in a variety of train and test settings for several transformer models of different sizes trained from scratch. Our study complements prior work by pointing out several systematic failures of these models to generalize to data not in the training distribution, thereby showing some limitations of ICL. We find that models adopt a strategy for this task that is very different from standard solutions.
摘要:上下文学习(In Context Learning, ICL)是一种解决广泛问题的吸引人方法。受 Garg 等人(2022)的启发,我们深入研究了在多种训练和测试设置下,从头开始训练的多个不同大小的 Transformer 模型。我们的研究通过指出这些模型在泛化到训练分布之外的数据时出现的几个系统性失败,补充了先前的工作,从而揭示了 ICL 的一些局限性。我们发现,模型采用了一种与标准解决方案截然不同的策略来完成这一任务。

[NLP-17] Causal Effect of Group Diversity on Redundancy and Coverage in Peer-Reviewing

【速读】: 该论文试图解决的问题是如何通过多样化的评审者组合来提高同行评审过程的有效性。解决方案的关键在于提出了两个评估评审效用的指标:评审覆盖率(review coverage)和评审冗余度(review redundancy)。研究假设多样化的评审者将表现出高覆盖率和低冗余度。通过观察性数据分析,研究发现,主题多样性、资历层次差异和出版网络差异的评审者组合能够提高评审覆盖率,而组织多样性和地理多样性则未显示出显著影响。此外,除了地理多样性之外的所有多样性维度都能降低评审冗余度,特别是出版网络多样性还能带来不同的视角,从而降低冗余度。研究建议在评审者分配过程中考虑这些多样性维度,以优化评审过程。

链接: https://arxiv.org/abs/2411.11437
作者: Navita Goyal,Ivan Stelmakh,Nihar Shah,Hal Daumé III
关键词-EN: mitigate individual biases, review, aiming to gather, individual biases, large host
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Applications (stat.AP)
备注:


Abstract:A large host of scientific journals and conferences solicit peer reviews from multiple reviewers for the same submission, aiming to gather a broader range of perspectives and mitigate individual biases. In this work, we reflect on the role of diversity in the slate of reviewers assigned to evaluate a submitted paper as a factor in diversifying perspectives and improving the utility of the peer-review process. We propose two measures for assessing review utility: review coverage – reviews should cover most contents of the paper – and review redundancy – reviews should add information not already present in other reviews. We hypothesize that reviews from diverse reviewers will exhibit high coverage and low redundancy. We conduct a causal study of different measures of reviewer diversity on review coverage and redundancy using observational data from a peer-reviewed conference with approximately 5,000 submitted papers. Our study reveals disparate effects of different diversity measures on review coverage and redundancy. Our study finds that assigning a group of reviewers that are topically diverse, have different seniority levels, or have distinct publication networks leads to broader coverage of the paper or review criteria, but we find no evidence of an increase in coverage for reviewer slates with reviewers from diverse organizations or geographical locations. Reviewers from different organizations, seniority levels, topics, or publications networks (all except geographical diversity) lead to a decrease in redundancy in reviews. Furthermore, publication network-based diversity alone also helps bring in varying perspectives (that is, low redundancy), even within specific review criteria. Our study adopts a group decision-making perspective for reviewer assignments in peer review and suggests dimensions of diversity that can help guide the reviewer assignment process.
摘要:众多科学期刊和会议为同一篇投稿征集多位审稿人的评审意见,旨在收集更广泛的观点并减少个人偏见。在此研究中,我们探讨了分配给评审某篇论文的审稿人多样性在拓宽视角和提升同行评审过程效用方面的作用。我们提出了两种评估评审效用的指标:评审覆盖率——评审应涵盖论文的大部分内容——和评审冗余度——评审应提供其他评审中未包含的信息。我们假设来自不同背景的审稿人将展现出高覆盖率和低冗余度。我们利用来自一个约5000篇投稿的同行评审会议的观测数据,对不同审稿人多样性指标对评审覆盖率和冗余度的影响进行了因果研究。研究发现,不同多样性指标对评审覆盖率和冗余度的影响各异。研究显示,分配一组在主题、资历或出版网络方面具有多样性的审稿人,可以带来更广泛的论文或评审标准覆盖,但未发现来自不同机构或地理位置的审稿人组合能显著提高覆盖率。来自不同机构、资历、主题或出版网络(地理位置除外)的审稿人能减少评审的冗余度。此外,仅基于出版网络的多样性也能引入不同的视角(即低冗余度),即使在特定的评审标准内也是如此。本研究从群体决策的角度出发,为同行评审中的审稿人分配提供了视角,并指出了有助于指导审稿人分配过程的多样性维度。

[NLP-18] Membership Inference Attack against Long-Context Large Language Models

【速读】: 该论文试图解决长上下文语言模型 (Long-Context Language Models, LCLMs) 在处理大规模外部数据时可能面临的隐私风险问题。解决方案的关键在于首次提出了针对 LCLMs 的六种成员推断攻击 (Membership Inference Attack, MIA) 策略,并通过广泛的实验验证了这些攻击策略的有效性。具体来说,论文通过分析模型生成的内容与外部文档之间的生成损失和语义相似度,来推断某个文档是否被包含在 LCLMs 的上下文中。实验结果表明,这些攻击策略在大多数情况下能够准确推断成员状态,例如在 Multi-document QA 数据集上使用 LongChat-7b-v1.5-32k 模型时,攻击的 F1-score 达到了 90.66%,从而揭示了 LCLMs 输入上下文中存在显著的成员信息泄露风险。此外,论文还探讨了 LCLMs 易受此类攻击的潜在原因。

链接: https://arxiv.org/abs/2411.11424
作者: Zixiong Wang,Gaoyang Liu,Yang Yang,Chen Wang
关键词-EN: Large Language Models, Large Language, Language Models, Long-Context Language Models, Recent advances
类目: Computation and Language (cs.CL)
备注:


Abstract:Recent advances in Large Language Models (LLMs) have enabled them to overcome their context window limitations, and demonstrate exceptional retrieval and reasoning capacities on longer context. Question-answering systems augmented with Long-Context Language Models (LCLMs) can automatically search massive external data and incorporate it into their contexts, enabling faithful predictions and reducing issues such as hallucinations and knowledge staleness. Existing studies targeting LCLMs mainly concentrate on addressing the so-called lost-in-the-middle problem or improving the inference efficiency, leaving their privacy risks largely unexplored. In this paper, we aim to bridge this gap and argue that integrating all information into the long context makes it a repository of sensitive information, which often contains private data such as medical records or personal identities. We further investigate the membership privacy within LCLMs external context, with the aim of determining whether a given document or sequence is included in the LCLMs context. Our basic idea is that if a document lies in the context, it will exhibit a low generation loss or a high degree of semantic similarity to the contents generated by LCLMs. We for the first time propose six membership inference attack (MIA) strategies tailored for LCLMs and conduct extensive experiments on various popular models. Empirical results demonstrate that our attacks can accurately infer membership status in most cases, e.g., 90.66% attack F1-score on Multi-document QA datasets with LongChat-7b-v1.5-32k, highlighting significant risks of membership leakage within LCLMs input contexts. Furthermore, we examine the underlying reasons why LCLMs are susceptible to revealing such membership information.
摘要:近年来,大语言模型 (LLM) 的技术进步使其能够克服上下文窗口的限制,并在更长的上下文中展现出卓越的检索和推理能力。配备长上下文语言模型 (LCLM) 的问答系统能够自动搜索大量外部数据并将其整合到上下文中,从而实现准确的预测并减少诸如幻觉和知识陈旧等问题。现有的针对 LCLM 的研究主要集中在解决所谓的“中间迷失”问题或提高推理效率上,而对其隐私风险的探讨则相对较少。本文旨在填补这一空白,并指出将所有信息整合到长上下文中使其成为敏感信息的存储库,这些信息通常包含医疗记录或个人身份等私人数据。我们进一步研究了 LCLM 外部上下文中的成员隐私问题,旨在确定给定的文档或序列是否包含在 LCLM 的上下文中。我们的基本思路是,如果一个文档存在于上下文中,它将表现出较低的生成损失或与 LCLM 生成内容的高度语义相似性。我们首次提出了六种针对 LCLM 的成员推断攻击 (MIA) 策略,并在多种流行模型上进行了广泛的实验。实证结果表明,我们的攻击在大多数情况下能够准确推断成员状态,例如在 LongChat-7b-v1.5-32k 模型上对多文档问答数据集的攻击 F1 分数达到 90.66%,突显了 LCLM 输入上下文中成员信息泄露的重大风险。此外,我们还探讨了 LCLM 为何容易泄露此类成员信息的根本原因。
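作为背景补充,下面示意最基础的"损失阈值"式成员推断:若某文档在模型条件下的生成损失低于阈值,就推断它出现在上下文中;toy_logprob 为虚构的打分函数,论文提出的六种攻击策略还结合了语义相似度等信号,细节以原文为准。

```python
import math

def sequence_loss(logprob_fn, document):
    """文档的平均负对数似然(示意):逐词累加条件对数概率后取平均并取负。"""
    toks = document.split()
    nll = -sum(logprob_fn(toks[:i], toks[i]) for i in range(1, len(toks)))
    return nll / max(len(toks) - 1, 1)

def loss_threshold_mia(logprob_fn, document, threshold):
    """损失低于阈值 -> 推断该文档在模型上下文中(即"成员")。"""
    return sequence_loss(logprob_fn, document) < threshold

# 玩具打分:假设模型对"见过"的文档给出更高的下一词概率
seen = "the quick brown fox jumps"
toy_logprob = lambda prefix, tok: math.log(0.9) if " ".join(prefix + [tok]) in seen else math.log(0.1)

print(loss_threshold_mia(toy_logprob, seen, threshold=1.0))                     # True:损失低
print(loss_threshold_mia(toy_logprob, "lorem ipsum dolor sit", threshold=1.0))  # False:损失高
```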

[NLP-19] Rethinking Thinking Tokens: Understanding Why They Underperform in Practice

【速读】: 该论文试图解决的问题是评估和改进无监督方法“思考标记 (Thinking Tokens, TT)”在语言模型中的推理能力。解决方案的关键在于揭示TT方法在多个基准测试中表现不佳的原因,即其依赖单一嵌入 (embedding) 导致学习信号不一致和梯度噪声增加。通过全面的实证分析,论文验证了这一假设,并讨论了其对未来无监督推理研究在大型语言模型 (LLMs) 中的影响。

链接: https://arxiv.org/abs/2411.11371
作者: Sreeram Vennam,David Valente,David Herel,Ponnurangam Kumaraguru
关键词-EN: Thinking Tokens, language models, method to facilitate, Thinking, Tokens
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:


Abstract:Thinking Tokens (TT) have been proposed as an unsupervised method to facilitate reasoning in language models. However, despite their conceptual appeal, our findings show that TTs marginally improves performance and consistently underperforms compared to Chain-of-Thought (CoT) reasoning across multiple benchmarks. We hypothesize that this underperformance stems from the reliance on a single embedding for TTs, which results in inconsistent learning signals and introduces noisy gradients. This paper provides a comprehensive empirical analysis to validate this hypothesis and discusses the implications for future research on unsupervised reasoning in LLMs.
摘要:思考 Token (Thinking Tokens, TT) 被提出作为一种无监督方法,以促进语言模型中的推理能力。然而,尽管其在概念上具有吸引力,我们的研究发现,TT 在多个基准测试中仅略微提升了性能,并且始终不如思维链 (Chain-of-Thought, CoT) 推理表现出色。我们假设这种表现不佳的原因在于 TT 依赖于单一嵌入表示,这导致了不一致的学习信号并引入了噪声梯度。本文通过全面的实证分析来验证这一假设,并讨论了其对未来在大语言模型中无监督推理研究的影响。

[NLP-20] MAIRA-Seg: Enhancing Radiology Report Generation with Segmentation-Aware Multimodal Large Language Models ML4H2024

【速读】: 该论文试图解决的问题是如何通过引入像素级别的分割掩码(segmentation masks)来提升多模态大语言模型(MLLMs)在放射报告生成中的细粒度图像解释能力。解决方案的关键在于提出了MAIRA-Seg框架,该框架通过结合语义分割掩码和胸部X光片(CXRs)来生成放射报告。具体步骤包括训练专家分割模型以获取CXR中放射学特定结构的掩码伪标签,然后在MAIRA模型的基础上,集成一个可训练的分割标记提取器,利用这些掩码伪标签,并采用掩码感知的提示(mask-aware prompting)来生成初步的放射报告。实验结果表明,MAIRA-Seg在公开的MIMIC-CXR数据集上优于非分割基线模型,证实了使用分割掩码可以增强MLLMs的细致推理能力,从而可能有助于更好的临床结果。

链接: https://arxiv.org/abs/2411.11362
作者: Harshita Sharma,Valentina Salvatelli,Shaury Srivastav,Kenza Bouzid,Shruthi Bannur,Daniel C. Castro,Maximilian Ilse,Sam Bond-Taylor,Mercy Prasanna Ranjit,Fabian Falck,Fernando Pérez-García,Anton Schwaighofer,Hannah Richardson,Maria Teodora Wetscherek,Stephanie L. Hyland,Javier Alvarez-Valle
关键词-EN: chest X-rays, radiology report generation, report generation, growing interest, interest in applying
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted as Proceedings Paper at ML4H 2024


Abstract:There is growing interest in applying AI to radiology report generation, particularly for chest X-rays (CXRs). This paper investigates whether incorporating pixel-level information through segmentation masks can improve fine-grained image interpretation of multimodal large language models (MLLMs) for radiology report generation. We introduce MAIRA-Seg, a segmentation-aware MLLM framework designed to utilize semantic segmentation masks alongside CXRs for generating radiology reports. We train expert segmentation models to obtain mask pseudolabels for radiology-specific structures in CXRs. Subsequently, building on the architectures of MAIRA, a CXR-specialised model for report generation, we integrate a trainable segmentation tokens extractor that leverages these mask pseudolabels, and employ mask-aware prompting to generate draft radiology reports. Our experiments on the publicly available MIMIC-CXR dataset show that MAIRA-Seg outperforms non-segmentation baselines. We also investigate set-of-marks prompting with MAIRA and find that MAIRA-Seg consistently demonstrates comparable or superior performance. The results confirm that using segmentation masks enhances the nuanced reasoning of MLLMs, potentially contributing to better clinical outcomes.
摘要:越来越多的研究关注将人工智能应用于放射学报告生成,特别是针对胸部X光片(CXRs)。本文探讨了通过分割掩码引入像素级信息是否能提升多模态大语言模型(MLLMs)在放射学报告生成中的细粒度图像解释能力。我们提出了MAIRA-Seg,这是一个分割感知的多模态大语言模型框架,旨在利用语义分割掩码与CXRs共同生成放射学报告。我们训练了专家分割模型,以获取CXRs中放射学特定结构的掩码伪标签。随后,基于MAIRA的架构,这是一个专门用于报告生成的CXR模型,我们整合了一个可训练的分割Token提取器,该提取器利用这些掩码伪标签,并采用掩码感知的提示生成初步放射学报告。我们在公开的MIMIC-CXR数据集上的实验表明,MAIRA-Seg优于非分割基线模型。我们还研究了使用MAIRA的标记集提示,发现MAIRA-Seg始终表现出相当或更优的性能。结果证实,使用分割掩码增强了MLLMs的细致推理能力,可能有助于改善临床结果。

[NLP-21] Mitigating Knowledge Conflicts in Language Model-Driven Question Answering

【速读】: 该论文试图解决生成式序列到序列任务(如文档问答和摘要生成)中由于参数化知识与答案之间的不当关联导致的模型幻觉(hallucination)问题。解决方案的关键在于通过显式关联输入源与生成内容来缓解幻觉现象,特别是在问答任务中,通过加强实体与其描述在训练时的关联,以改善模型在推理阶段的性能。

链接: https://arxiv.org/abs/2411.11344
作者: Han Cao,Zhaoyang Zhang,Xiangtian Li,Chufan Wu,Hansong Zhang,Wenqing Zhang
关键词-EN: sequence generation tasks, abstract summarization typically, summarization typically requires, Knowledge-aware sequence, retrieved contextual information
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:Knowledge-aware sequence to sequence generation tasks such as document question answering and abstract summarization typically requires two types of knowledge: encoded parametric knowledge and retrieved contextual information. Previous work show improper correlation between parametric knowledge and answers in the training set could cause the model ignore input information at test time, resulting in un-desirable model behaviour such as over-stability and hallucination. In this work, we argue that hallucination could be mitigated via explicit correlation between input source and generated content. We focus on a typical example of hallucination, entity-based knowledge conflicts in question answering, where correlation of entities and their description at training time hinders model behaviour during inference.
摘要:知识感知的序列到序列生成任务,如文档问答和摘要生成,通常需要两种类型的知识:编码的参数化知识(encoded parametric knowledge)和检索的上下文信息(retrieved contextual information)。以往的研究表明,训练集中参数化知识与答案之间的不当关联可能导致模型在测试时忽略输入信息,从而产生诸如过度稳定性和幻觉(hallucination)等不良模型行为。在本研究中,我们认为通过输入源与生成内容之间的显式关联可以缓解幻觉问题。我们特别关注幻觉的一个典型例子,即问答中的基于实体的知识冲突,其中训练时实体与其描述之间的关联在推理过程中阻碍了模型的行为。

[NLP-22] Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation

【速读】: 该论文试图解决低资源语言翻译中的性能问题,特别是在将英语翻译成这些语言时。解决方案的关键在于引入了一种基于检索的方法,通过专注于关键词的翻译和从现有数据中检索相应的例子来提高翻译质量。这种方法通过更有效地利用现有资源,展示了在提高词级准确性和整体语义理解方面的潜力。

链接: https://arxiv.org/abs/2411.11295
作者: Peng Shu,Junhao Chen,Zhengliang Liu,Hui Wang,Zihao Wu,Tianyang Zhong,Yiwei Li,Huaqin Zhao,Hanqi Jiang,Yi Pan,Yifan Zhou,Constance Owl,Xiaoming Zhai,Ninghao Liu,Claudio Saunt,Tianming Liu
关键词-EN: demonstrated remarkable success, Large Language Models, Large Language, tasks and domains, demonstrated remarkable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:Large Language Models (LLMs) have demonstrated remarkable success across a wide range of tasks and domains. However, their performance in low-resource language translation, particularly when translating into these languages, remains underexplored. This gap poses significant challenges, as linguistic barriers hinder the cultural preservation and development of minority communities. To address this issue, this paper introduces a novel retrieval-based method that enhances translation quality for low-resource languages by focusing on key terms, which involves translating keywords and retrieving corresponding examples from existing data. To evaluate the effectiveness of this method, we conducted experiments translating from English into three low-resource languages: Cherokee, a critically endangered indigenous language of North America; Tibetan, a historically and culturally significant language in Asia; and Manchu, a language with few remaining speakers. Our comparison with the zero-shot performance of GPT-4o and LLaMA 3.1 405B, highlights the significant challenges these models face when translating into low-resource languages. In contrast, our retrieval-based method shows promise in improving both word-level accuracy and overall semantic understanding by leveraging existing resources more effectively.
摘要:大语言模型 (LLMs) 在众多任务和领域中展示了显著的成功。然而,其在低资源语言翻译中的表现,特别是翻译到这些语言时,仍未得到充分探索。这一差距带来了重大挑战,因为语言障碍阻碍了少数族群的文化保存和发展。为解决这一问题,本文提出了一种基于检索的新方法,通过专注于关键词来提升低资源语言的翻译质量,具体包括翻译关键词并从现有数据中检索相应示例。为评估该方法的有效性,我们进行了从英语到三种低资源语言的翻译实验:切罗基语,北美濒危严重的原住民语言;藏语,亚洲历史和文化上具有重要意义的语言;以及满语,仅有少数使用者的语言。与 GPT-4o 和 LLaMA 3.1 405B 的零样本性能相比,我们的实验突显了这些模型在翻译低资源语言时面临的重大挑战。相比之下,我们的基于检索的方法通过更有效地利用现有资源,显示出在提高词级准确性和整体语义理解方面的潜力。
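下面示意"先翻译关键词、再检索含关键词的双语例句、最后拼装成提示词"这一基于检索的流程;词典与例句内容均为虚构占位,提示词格式也仅作示意,并非论文使用的模板。

```python
def build_translation_prompt(sentence, keyword_dict, example_pool, top_k=2):
    """组装提示词:关键词释义 + 检索到的双语例句 + 待翻译句子。"""
    hits = {w: keyword_dict[w] for w in sentence.split() if w in keyword_dict}
    examples = [ex for ex in example_pool if any(w in ex["src"] for w in hits)][:top_k]
    lines = ["Translate the sentence into the target low-resource language."]
    lines += [f"Keyword: {w} -> {t}" for w, t in hits.items()]
    lines += [f"Example: {ex['src']} -> {ex['tgt']}" for ex in examples]
    lines.append(f"Sentence: {sentence}")
    return "\n".join(lines)

# 虚构的小词典与例句库(真实方法依赖已有的低资源语言平行数据)
keyword_dict = {"water": "<target-word-1>", "mountain": "<target-word-2>"}
example_pool = [{"src": "the water is cold", "tgt": "<target-sentence-1>"}]
print(build_translation_prompt("bring water to the mountain", keyword_dict, example_pool))
```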

[NLP-23] LP Data Pipeline: Lightweight Purpose-driven Data Pipeline for Large Language Models

【速读】: 该论文试图解决大规模语言模型(LLMs)数据集创建过程中依赖GPU进行质量筛选所带来的时间成本高和计算资源限制问题。解决方案的关键是引入轻量级、目的驱动(Lightweight, Purpose-driven, LP)数据管道框架,该框架完全基于CPU运行,简化了数据集提取、过滤和整理的流程。通过遵循四个核心原则,LP数据管道显著减少了数据准备时间和成本,同时保持了高质量的数据,并能够创建针对特定领域和语言定制的数据集,从而增强了LLMs在专业场景中的适用性。这一解决方案有望降低LLM开发的门槛,使更多组织能够更容易地访问和利用LLMs。

链接: https://arxiv.org/abs/2411.11289
作者: Yungi Kim,Hyunsoo Ha,Seonghoon Yang,Sukyung Lee,Jihoo Kim,Chanjun Park
关键词-EN: Creating high-quality, large language models, GPU-accelerated models, relies on resource-intensive, making the process
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:


Abstract:Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits accessibility for organizations lacking significant computational infrastructure. To address this issue, we introduce the Lightweight, Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs to streamline the processes of dataset extraction, filtering, and curation. Based on our four core principles, the LP Data Pipeline significantly reduces preparation time and cost while maintaining high data quality. Importantly, our pipeline enables the creation of purpose-driven datasets tailored to specific domains and languages, enhancing the applicability of LLMs in specialized contexts. We anticipate that our pipeline will lower the barriers to LLM development, enabling a wide range of organizations to access LLMs more easily.
摘要:为大语言模型(Large Language Models, LLMs)创建高质量、大规模的数据集通常依赖于资源密集型、GPU加速的模型进行质量筛选,这使得整个过程既耗时又昂贵。这种对GPU的依赖限制了缺乏大规模计算基础设施的组织的可访问性。为解决这一问题,我们引入了轻量级、目标驱动的(Lightweight, Purpose-driven, LP)数据管道框架,该框架完全基于CPU运行,以简化数据集的提取、过滤和整理过程。基于我们的四个核心原则,LP数据管道显著减少了准备时间和成本,同时保持了高数据质量。重要的是,我们的管道能够创建针对特定领域和语言定制的目标驱动型数据集,增强了LLMs在专业场景中的适用性。我们预计,我们的管道将降低LLM开发的门槛,使更广泛的组织能够更轻松地访问LLMs。
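下面给出一个纯 CPU 的"抽取-过滤-整理"玩具流水线,示意如何用廉价规则得到面向特定领域的数据子集;所有规则与正则均为示意,LP Data Pipeline 的四条核心原则与实际实现以论文为准。

```python
import re

def extract(raw_docs):
    """抽取:去掉 HTML 标签等噪声(示意)。"""
    for doc in raw_docs:
        yield re.sub(r"<[^>]+>", " ", doc)

def keep(doc, min_chars=20, banned=("lorem ipsum",)):
    """过滤:基于长度与关键词的廉价规则,全部在 CPU 上完成。"""
    text = doc.strip().lower()
    return len(text) >= min_chars and not any(b in text for b in banned)

def curate(raw_docs, domain_pattern):
    """整理:只保留匹配目标领域正则的文档,得到"目的驱动"的子集。"""
    pat = re.compile(domain_pattern, re.I)
    return [d.strip() for d in extract(raw_docs) if keep(d) and pat.search(d)]

docs = [
    "<p>Patients with diabetes require regular monitoring.</p>",
    "lorem ipsum dolor sit amet",
    "<div>The court ruled that the contract was void.</div>",
]
print(curate(docs, r"patient|diagnos|clinic"))  # 只保留医疗领域的那一条
```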

[NLP-24] VersaTune: Fine-Tuning Multi-Ability LLMs Efficiently

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在监督微调(Supervised Fine-Tuning, SFT)过程中面临的跨领域知识遗忘问题。解决方案的关键在于引入了一种名为VersaTune的新型数据组合框架,该框架通过动态调整不同领域知识的权重,确保在微调过程中既能提升目标领域的性能,又能有效减少其他领域知识的遗忘。具体步骤包括:首先检测基础模型中各领域知识的分布情况,然后根据模型现有知识分布组合训练数据,最后在微调过程中根据各领域的学习潜力和遗忘程度动态调整权重。实验结果表明,VersaTune显著提升了多领域任务的综合性能,并在特定领域优化时减少了其他领域性能的下降。

链接: https://arxiv.org/abs/2411.11266
作者: Keer Lu,Keshi Zhao,Zheng Liang,Da Pan,Shusen Zhang,Xin Wu,Weipeng Chen,Zenan Zhou,Guosheng Dong,Bin Cui,Wentao Zhang
关键词-EN: Large Language Models, Large Language, exhibit remarkable capabilities, handling multiple tasks, Language Models
类目: Computation and Language (cs.CL)
备注:


Abstract:Large Language Models (LLMs) exhibit remarkable capabilities in handling multiple tasks across domains due to their emergent properties. These capabilities are further augmented during the Supervised Fine-Tuning (SFT) phase. Despite their potential, existing work mainly focuses on domain-specific enhancements during fine-tuning, the challenge of which lies in catastrophic forgetting of knowledge across other domains. In this study, we introduce VersaTune, a novel data composition framework designed for enhancing LLMs’ overall multi-ability performances during fine-tuning. We categorize knowledge into distinct domains including law, medicine, finance, science, code. We begin with detecting the distribution of domain-specific knowledge within the base model, followed by the composition of training data that aligns with the model’s existing knowledge distribution. During the fine-tuning process, weights of different domains are dynamically adjusted based on their learnable potential and forgetting degree. Experimental results demonstrate that VersaTune achieves significant improvements in multi-domain performance, with a 35.21% enhancement in comprehensive multi-domain tasks. Additionally, in scenarios where specific domain optimization is required, VersaTune reduces the degradation of performance in other domains by 38.77%, without compromising the target domain’s training efficacy.
摘要:大语言模型(LLMs)因其涌现特性而在跨领域处理多项任务时展现出卓越的能力。这些能力在监督微调(SFT)阶段进一步增强。尽管潜力巨大,现有研究主要集中在微调阶段的领域特定增强上,其挑战在于其他领域知识的灾难性遗忘。在本研究中,我们提出了VersaTune,这是一种新颖的数据组合框架,旨在在微调过程中提升LLMs的整体多能力表现。我们将知识划分为不同的领域,包括法律、医学、金融、科学和代码。首先,我们检测基础模型中领域特定知识的分布情况,然后组合与模型现有知识分布相符的训练数据。在微调过程中,根据各领域的可学习潜力和遗忘程度动态调整不同领域的权重。实验结果表明,VersaTune在多领域性能上实现了显著提升,综合多领域任务的性能提升了35.21%。此外,在需要特定领域优化的场景中,VersaTune在不损害目标领域训练效果的情况下,将其他领域性能的下降减少了38.77%。
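
为便于理解上述"根据各领域的可学习潜力与遗忘程度动态调整权重"的思路,下面给出一个示意性的 Python 草图(并非论文原实现,其中 learnable_potential、forgetting_degree 的打分方式与 alpha 等参数均为假设),展示如何在每个训练阶段后重新归一化各领域的数据采样权重。

```python
import numpy as np

domains = ["law", "medicine", "finance", "science", "code"]

def learnable_potential(eval_gain: float) -> float:
    # 假设:以近期在该领域评测上的提升幅度近似"可学习潜力"
    return max(eval_gain, 0.0)

def forgetting_degree(base_score: float, current_score: float) -> float:
    # 假设:以相对基础模型的性能下降幅度近似"遗忘程度"
    return max(base_score - current_score, 0.0)

def update_domain_weights(domains, eval_gains, base_scores, current_scores, alpha=0.5):
    """按 可学习潜力 + alpha * 遗忘程度 重新分配各领域的数据采样权重(示意)。"""
    scores = np.array([
        learnable_potential(eval_gains[d])
        + alpha * forgetting_degree(base_scores[d], current_scores[d])
        + 1e-8                                    # 防止全零
        for d in domains
    ])
    return dict(zip(domains, scores / scores.sum()))

# 玩具示例:代码领域仍有较大提升空间、医学领域出现遗忘,两者应获得更多采样权重
eval_gains     = {"law": 0.01, "medicine": 0.00, "finance": 0.02, "science": 0.01, "code": 0.05}
base_scores    = {d: 0.60 for d in domains}
current_scores = {"law": 0.60, "medicine": 0.55, "finance": 0.61, "science": 0.60, "code": 0.62}
print(update_domain_weights(domains, eval_gains, base_scores, current_scores))
```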

[NLP-25] Large corpora and large language models : a replicable method for automating grammatical annotation

【速读】: 该论文试图解决大规模文本语料库中语言特征的手动标注难题。解决方案的关键在于利用大型语言模型(Large Language Models)通过提示工程(Prompt Engineering)、训练和评估,辅助语言学家进行语法标注。具体方法包括构建一个可复制的监督学习流程,应用于英语评价动词结构“consider X (as) (to be) Y”的形式变异研究,基于Claude 3.5 Sonnet模型和Davies的NOW及EnTenTen21语料库数据。该方法在少量训练数据下实现了超过90%的模型准确率,验证了其在未来大规模标注任务中的有效性,并强调了AI辅助工具在语言学研究中的广泛应用潜力。

链接: https://arxiv.org/abs/2411.11260
作者: Cameron Morin,Matti Marttinen Larsson
关键词-EN: rapid quantitative growth, created practical difficulties, manually annotate large, annotate large data, text corpora
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Much linguistic research relies on annotated datasets of features extracted from text corpora, but the rapid quantitative growth of these corpora has created practical difficulties for linguists to manually annotate large data samples. In this paper, we present a replicable, supervised method that leverages large language models for assisting the linguist in grammatical annotation through prompt engineering, training, and evaluation. We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction ‘consider X (as) (to be) Y’, based on the large language model Claude 3.5 Sonnet and corpus data from Davies’ NOW and EnTenTen21 (SketchEngine). Overall, we reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data, validating the method for the annotation of very large quantities of tokens of the construction in the future. We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change, underlining the value of AI copilots as tools for future linguistic research.
摘要:许多语言学研究依赖于从文本语料库中提取特征的标注数据集,但这些语料库的快速量化增长给语言学家手动标注大量数据样本带来了实际困难。本文提出了一种可复制的、有监督的方法,利用大语言模型通过提示工程、训练和评估来辅助语言学家进行语法标注。我们介绍了一种方法论流程,应用于英语评价性动词结构“consider X (as) (to be) Y”的正式变异案例研究,基于大语言模型 Claude 3.5 Sonnet 和 Davies 的 NOW 及 EnTenTen21 (SketchEngine) 语料库数据。总体而言,我们在仅使用少量训练数据的情况下,在保留测试样本上达到了超过 90% 的模型准确率,验证了该方法在未来标注大量该结构 Token 的可行性。我们讨论了我们的结果在更广泛语法结构和语法变异与变化案例研究中的普遍性,强调了 AI 智能体作为未来语言学研究工具的价值。
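
下面是一个示意性的 Python 草图,说明这类"提示工程 + 少样本示例 + 人工校验"的语法标注流程大致如何组织(与论文实际使用的 Claude 3.5 Sonnet 及具体提示无关;call_llm 为假设的模型调用接口,此处以占位函数代替,标签集合也仅为示例)。

```python
FEW_SHOT = """Sentence: They consider him to be a genius.
Label: to_be

Sentence: We consider this proposal as a threat.
Label: as
"""

PROMPT_TEMPLATE = (
    "You are annotating the English construction 'consider X (as) (to be) Y'.\n"
    "Classify the variant used in the sentence as one of: bare, as, to_be.\n\n"
    "{examples}\nSentence: {sentence}\nLabel:"
)

def call_llm(prompt: str) -> str:
    # 占位:实际流程中此处调用所选的大语言模型 API(假设)
    return "bare"

def annotate(sentences):
    """对每个句子构造提示并收集模型给出的标签(示意)。"""
    results = []
    for s in sentences:
        prompt = PROMPT_TEMPLATE.format(examples=FEW_SHOT, sentence=s)
        label = call_llm(prompt).strip().lower()
        results.append((s, label))
    return results

# 少量人工标注的保留样本可用于计算准确率,再决定是否扩大自动标注规模
print(annotate(["I consider the matter closed."]))
```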

[NLP-26] ZeFaV: Boosting Large Language Models for Zero-shot Fact Verification PRICAI2024

【速读】: 该论文试图解决大语言模型在事实验证任务中的性能提升问题,提出了一个基于零样本学习(zero-shot)的事实验证框架ZeFaV。解决方案的关键在于利用大语言模型的上下文学习能力,通过提取声明中实体间的关系,并以关系逻辑的形式重新组织证据信息,然后将这些信息与原始证据结合,生成用于事实验证模型判断的上下文,从而提高事实验证的准确性。实验结果表明,ZeFaV在HoVer和FEVEROUS两个多跳事实验证数据集上的表现与现有最先进的方法相当。

链接: https://arxiv.org/abs/2411.11247
作者: Son T. Luu,Hiep Nguyen,Trung Vo,Le-Minh Nguyen
关键词-EN: large language models, model provide verdicts, relationally logical form, in-context learning ability, fact verification task
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This pre-print has been published in PRICAI 2024: Trends in Artificial Intelligence. The published version is available at this https URL

点击查看摘要

Abstract:In this paper, we propose ZeFaV - a zero-shot based fact-checking verification framework to enhance the performance of large language models on the fact verification task by leveraging their in-context learning ability to extract the relations among the entities within a claim, re-organize the information from the evidence in a relationally logical form, and combine the above information with the original evidence to generate the context from which our fact-checking model provides verdicts for the input claims. We conducted empirical experiments to evaluate our approach on two multi-hop fact-checking datasets including HoVer and FEVEROUS, and achieved results comparable to other state-of-the-art fact verification methods.
摘要:本文提出了一种基于零样本的事实核查验证框架——ZeFaV。该框架通过利用大语言模型的上下文学习能力,从声明中提取实体间的关系,并以关系逻辑的形式重新组织证据中的信息,然后将这些信息与原始证据结合,生成用于事实核查模型判断输入声明的上下文。我们在两个多跳事实核查数据集(包括HoVer和FEVEROUS)上进行了实证实验,结果显示我们的方法在事实核查任务中的表现与当前最先进的方法相当。
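
以下为一个示意性的 Python 草图,按摘要描述的三步(抽取声明中的实体关系、把证据重组为关系逻辑形式、与原始证据拼接后给出裁决)组织零样本事实核查流程;call_llm 为假设的模型调用接口,提示词内容与裁决标签也仅作示意,并非论文原实现。

```python
def call_llm(prompt: str) -> str:
    # 占位:实际使用时替换为所选大语言模型的调用(假设)
    return ""

def extract_relations(claim: str) -> str:
    prompt = f"List the relations between entities in this claim as triples:\n{claim}"
    return call_llm(prompt)

def reorganize_evidence(evidence: list[str], relations: str) -> str:
    joined = "\n".join(evidence)
    prompt = (
        "Rewrite the evidence below as relational, logical statements "
        f"relevant to these relations:\n{relations}\n\nEvidence:\n{joined}"
    )
    return call_llm(prompt)

def verify(claim: str, evidence: list[str]) -> str:
    relations = extract_relations(claim)
    logical_form = reorganize_evidence(evidence, relations)
    context = logical_form + "\n\nOriginal evidence:\n" + "\n".join(evidence)
    prompt = (
        f"Context:\n{context}\n\nClaim: {claim}\n"
        "Answer SUPPORTED, REFUTED or NOT ENOUGH INFO:"
    )
    return call_llm(prompt)

# 用法示例(由于 call_llm 为占位函数,这里只演示调用方式)
print(verify("Paris is the capital of Germany.",
             ["Paris is the capital and largest city of France.",
              "Berlin is the capital of Germany."]))
```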

[NLP-27] MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis

【速读】: 该论文试图解决的问题是评估生成式模型和多模态大语言模型 (MLLMs) 在情感分析中的能力,特别是它们在理解和表达人类情感方面的准确性。解决方案的关键在于引入了一个名为 MEMO-Bench 的综合基准,该基准包含 7,145 张描绘六种不同情感的肖像,由 12 个文本到图像 (T2I) 模型生成。MEMO-Bench 提供了一个框架,用于评估 T2I 模型和 MLLMs 在情感分析中的表现,并采用从粗粒度到细粒度的渐进评估方法,以提供更详细和全面的情感分析能力评估。实验结果表明,现有的 T2I 模型在生成积极情感方面比消极情感更有效,而 MLLMs 虽然在区分和识别人类情感方面表现出一定效果,但在细粒度情感分析中仍未达到人类水平的准确性。

链接: https://arxiv.org/abs/2411.11235
作者: Yingjie Zhou,Zicheng Zhang,Jiezhang Cao,Jun Jia,Yanwei Jiang,Farong Wen,Xiaohong Liu,Xiongkuo Min,Guangtao Zhai
关键词-EN: Artificial Intelligence, virtual digital humans, embodied intelligence, demonstrated significant capabilities, Multimodal Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) has demonstrated significant capabilities in various fields, and in areas such as human-computer interaction (HCI), embodied intelligence, and the design and animation of virtual digital humans, both practitioners and users are increasingly concerned with AI’s ability to understand and express emotion. Consequently, the question of whether AI can accurately interpret human emotions remains a critical challenge. To date, two primary classes of AI models have been involved in human emotion analysis: generative models and Multimodal Large Language Models (MLLMs). To assess the emotional capabilities of these two classes of models, this study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions, generated by 12 Text-to-Image (T2I) models. Unlike previous works, MEMO-Bench provides a framework for evaluating both T2I models and MLLMs in the context of sentiment analysis. Additionally, a progressive evaluation approach is employed, moving from coarse-grained to fine-grained metrics, to offer a more detailed and comprehensive assessment of the sentiment analysis capabilities of MLLMs. The experimental results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones. Meanwhile, although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy, particularly in fine-grained emotion analysis. The MEMO-Bench will be made publicly available to support further research in this area.
摘要:人工智能(AI)在多个领域展示了显著的能力,特别是在人机交互(HCI)、具身智能以及虚拟数字人的设计和动画领域,从业者和用户越来越关注AI理解和表达情感的能力。因此,AI能否准确解读人类情感的问题仍然是一个关键挑战。迄今为止,参与人类情感分析的AI模型主要有两类:生成式模型和多模态大语言模型(MLLMs)。为了评估这两类模型的情感能力,本研究引入了MEMO-Bench,这是一个综合基准,包含由12个文本到图像(T2I)模型生成的7,145幅肖像,每幅肖像描绘了六种不同情感之一。与以往的工作不同,MEMO-Bench提供了一个框架,用于在情感分析的背景下评估T2I模型和MLLMs。此外,采用了一种渐进式评估方法,从粗粒度到细粒度指标,以提供对MLLMs情感分析能力的更详细和全面的评估。实验结果表明,现有的T2I模型在生成积极情感方面比生成消极情感更为有效。同时,尽管MLLMs在区分和识别人类情感方面显示出一定的有效性,但它们在细粒度情感分析方面仍未达到人类水平的准确性。MEMO-Bench将公开发布,以支持该领域的进一步研究。

[NLP-28] Capturing Sparks of Abstraction for the ARC Challenge

【速读】: 该论文试图解决在ARC Challenge中,现有大型语言模型(LLMs)在理解和解决复杂问题时遇到的瓶颈问题,特别是超过60%准确率的挑战。解决方案的关键在于通过提供完整的代码解决方案,并要求LLM从不同抽象层次解释任务的解决过程,从而提取出“抽象的火花”(Sparks of Abstraction)。具体方法包括:(a) 生成带注释的代码;(b) 将代码重构为可重用的功能块;(c) 提取问题解决步骤;(d) 识别高层次的问题解决策略。通过这种方法,论文展示了如何利用LLM的输出进行下游任务,特别是适用于参与ARC Prize的本地LLMs。

链接: https://arxiv.org/abs/2411.11206
作者: Martin Andrews
关键词-EN: solving ARC Challenge, ARC Challenge problems, Excellent progress, ARC Challenge, Challenge problems
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted as a paper entry for the 2024 ARC Prize

点击查看摘要

Abstract:Excellent progress has been made recently in solving ARC Challenge problems. However, it seems that new techniques may be required to push beyond 60% accuracy. Even commercial Large Language Models (LLMs) struggle to ‘understand’ many of the problems (when given the input and output grids), which makes discovering solutions by LLM-led program search somewhat futile. In this work, LLM ‘understanding’ is attempted from a stronger starting position: an LLM is given complete solutions to tasks in code, and then asked to explain how the task is being solved at various levels of abstraction. Specifically, the LLM was given code solutions implemented in arc-dsl-llm (an LLM-legible version of Hodel’s arc-dsl) to obtain: (a) commented code; (b) code refactored into reusable functional chunks; (c) problem solution steps; and (d) high-level problem-solving tactics. We demonstrate that ‘Sparks of Abstraction’ can be extracted from the LLM output - in a form that could be used in downstream tasks with Local LLMs eligible to enter the ARC Prize. Both the arc-dsl-llm DSL framework (with the re-engineered solutions) and the Gemini LLM-generated data (along with the generation code) are made Open Source.
摘要:近期在解决 ARC 挑战问题方面取得了显著进展。然而,似乎需要新的技术才能将准确率提升至 60% 以上。即使是商用大语言模型 (LLM) 在面对许多问题时(即使给出了输入和输出网格)也难以“理解”,这使得通过 LLM 主导的程序搜索来发现解决方案变得有些徒劳。在本研究中,我们尝试从更强的起点出发,让 LLM “理解”问题:给 LLM 提供任务的完整代码解决方案,然后要求其在不同抽象层次上解释任务是如何被解决的。具体而言,LLM 被提供了在 arc-dsl-llm(Hodel 的 arc-dsl 的 LLM 可读版本)中实现的代码解决方案,以获取:(a) 带注释的代码;(b) 重构为可重用功能块的代码;(c) 问题解决方案步骤;以及 (d) 高层次的问题解决策略。我们展示了可以从 LLM 输出中提取出“抽象的火花”——这种形式可以用于下游任务,并且适用于有资格进入 ARC 奖项的本地 LLM。arc-dsl-llm DSL 框架(包含重新设计的解决方案)和 Gemini LLM 生成的数据(以及生成代码)均已开源。

[NLP-29] LLäMmlein: Compact and Competitive German-Only Language Models from Scratch WWW

【速读】: 该论文旨在为德语自然语言处理(NLP)研究社区提供两个仅限德语的解码器模型,即LLäMmlein 120M和1B,并通过公开发布模型及其训练数据来促进透明度和可重复性。解决方案的关键在于从零开始创建这些模型,包括广泛的数据预处理、定制德语分词器(tokenizer)的开发、模型的训练以及在多个基准测试上的评估。通过使用SuperGLEBer基准对训练过程中的多个检查点进行分析,研究团队能够监控模型的学习动态,并发现这些模型在SuperGLEBer基准上与具有相似参数大小的最先进模型相比表现出色,甚至在某些任务上超越了这些模型。这一过程揭示了模型质量随规模增长的预期趋势,但也指出了在某些任务上性能提升的早期瓶颈,为未来模型开发的资源分配提供了宝贵见解。

链接: https://arxiv.org/abs/2411.11171
作者: Jan Pfister,Julia Wunderle,Andreas Hotho
关键词-EN: German NLP research, NLP research community, German-only decoder models, German NLP, create two German-only
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: first draft; this https URL

点击查看摘要

Abstract:We create two German-only decoder models, LLäMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. The model training involved several key steps, including extensive data preprocessing, the creation of a custom German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models’ learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models’ quality scales with size as expected, but performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.
摘要:我们创建了两个仅限德语的解码器模型,LLäMmlein 120M 和 1B,从零开始透明地构建并发布,连同训练数据一起,供德语自然语言处理(NLP)研究社区使用。模型训练涉及多个关键步骤,包括广泛的数据预处理、定制德语 Tokenizer 的创建、训练本身以及在各种基准上的最终模型评估。在整个训练过程中,保存了多个检查点,并使用 SuperGLEBer 基准进行分析,以监控模型的学习动态。与 SuperGLEBer 基准上的最先进模型相比,LLäMmlein 模型在竞争中表现出色,持续匹配或超越参数规模相似的模型。结果显示,模型的质量随着规模的增长而按预期扩展,但在某些任务上的性能改进早期就趋于平稳,为未来模型开发的资源分配提供了宝贵的见解。

[NLP-30] The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection

【速读】: 该论文试图解决媒体偏见检测任务中的大规模高质量数据集创建的高成本问题。解决方案的关键在于利用大型语言模型 (LLMs) 自动化数据标注过程,从而降低标注成本并保持数据质量。研究通过创建首个大规模合成标注的媒体偏见分类数据集 (annolexical),并在此基础上微调分类器,结果显示该分类器在马修斯相关系数 (MCC) 上优于所有标注LLMs,并在两个媒体偏见基准数据集 (BABE和BASIL) 上表现接近或优于基于人工标注数据训练的模型。这一方法显著降低了媒体偏见领域数据集创建的成本,并推动了分类器的发展,尽管在后续的行为压力测试中揭示了其当前的一些局限性和权衡。

链接: https://arxiv.org/abs/2411.11081
作者: Tomas Horych,Christoph Mandl,Terry Ruas,Andre Greiner-Petter,Bela Gipp,Akiko Aizawa,Timo Spinde
关键词-EN: High annotation costs, training reliable text, Large Language Models, High annotation, high-quality datasets needed
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:High annotation costs from hiring or crowdsourcing complicate the creation of large, high-quality datasets needed for training reliable text classifiers. Recent research suggests using Large Language Models (LLMs) to automate the annotation process, reducing these costs while maintaining data quality. LLMs have shown promising results in annotating downstream tasks like hate speech detection and political framing. Building on the success in these areas, this study investigates whether LLMs are viable for annotating the complex task of media bias detection and whether a downstream media bias classifier can be trained on such data. We create annolexical, the first large-scale dataset for media bias classification with over 48000 synthetically annotated examples. Our classifier, fine-tuned on this dataset, surpasses all of the annotator LLMs by 5-9 percent in Matthews Correlation Coefficient (MCC) and performs close to or outperforms the model trained on human-labeled data when evaluated on two media bias benchmark datasets (BABE and BASIL). This study demonstrates how our approach significantly reduces the cost of dataset creation in the media bias domain and, by extension, the development of classifiers, while our subsequent behavioral stress-testing reveals some of its current limitations and trade-offs.
摘要:雇佣或众包带来的高标注成本使得创建用于训练可靠文本分类器的大型高质量数据集变得复杂。近期研究提出利用大语言模型 (LLM) 来自动化标注过程,从而在保持数据质量的同时降低成本。LLM 在标注下游任务(如仇恨言论检测和政治框架分析)方面已显示出有前景的结果。基于在这些领域的成功,本研究探讨了 LLM 是否适用于标注复杂的媒体偏见检测任务,以及是否可以在这种数据上训练下游的媒体偏见分类器。我们创建了 annolexical,这是首个用于媒体偏见分类的大规模数据集,包含超过 48000 个合成标注的示例。我们基于此数据集微调的分类器在 Matthews 相关系数 (MCC) 上超越了所有标注 LLM 5-9 个百分点,并且在两个媒体偏见基准数据集(BABE 和 BASIL)上的评估结果接近或优于基于人工标注数据训练的模型。本研究展示了我们的方法如何显著降低媒体偏见领域数据集创建的成本,并由此降低了分类器开发的成本,同时我们的后续行为压力测试揭示了其当前的一些局限性和权衡。
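
摘要中以马修斯相关系数 (MCC) 作为比较标注 LLM 与微调分类器的主要指标;下面给出一个最小的评估草图(基于 scikit-learn,数据为虚构示例,仅说明该指标如何计算,与论文数据无关)。

```python
from sklearn.metrics import matthews_corrcoef, f1_score

# 虚构的二分类结果:1 表示"有媒体偏见",0 表示"无偏见"
y_true       = [1, 0, 1, 1, 0, 0, 1, 0]
y_classifier = [1, 0, 1, 0, 0, 0, 1, 1]   # 微调分类器的预测(示例)
y_llm        = [1, 1, 1, 0, 0, 1, 1, 1]   # 标注 LLM 直接判别的结果(示例)

for name, pred in [("fine-tuned classifier", y_classifier), ("annotator LLM", y_llm)]:
    print(name,
          "MCC =", round(matthews_corrcoef(y_true, pred), 3),
          "F1 =", round(f1_score(y_true, pred), 3))
```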

[NLP-31] Multilingual Large Language Models: A Systematic Survey

【速读】: 该论文旨在全面综述多语言大型语言模型(Multilingual Large Language Models, MLLMs)的最新研究进展,解决的问题是如何有效构建、评估和应用这些模型。解决方案的关键在于:1) 详细讨论MLLMs的架构和预训练目标,强调其多语言能力的关键组件和方法;2) 强调多语言预训练和数据对齐的重要性,特别是数据质量和多样性对模型性能的影响;3) 提供详细的评估框架,涵盖跨语言知识、推理、与人类价值观的对齐、安全性、可解释性及特定应用的评估;4) 探讨MLLMs的可解释性、跨语言迁移和语言偏见问题,以增强模型的透明度;5) 展示MLLMs在生物学、医学、计算机科学、数学和法律等多个领域的实际应用,强调其在推动专业领域创新和改进中的作用。

链接: https://arxiv.org/abs/2411.11072
作者: Shaolin Zhu,Supryadi,Shaoyang Xu,Haoran Sun,Leiyu Pan,Menglong Cui,Jiangcun Du,Renren Jin,António Branco,Deyi Xiong
关键词-EN: latest research, multilingual large language, MLLMs, large language models, multilingual
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper provides a comprehensive survey of the latest research on multilingual large language models (MLLMs). MLLMs not only are able to understand and generate language across linguistic boundaries, but also represent an important advancement in artificial intelligence. We first discuss the architecture and pre-training objectives of MLLMs, highlighting the key components and methodologies that contribute to their multilingual capabilities. We then discuss the construction of multilingual pre-training and alignment datasets, underscoring the importance of data quality and diversity in enhancing MLLM performance. An important focus of this survey is on the evaluation of MLLMs. We present a detailed taxonomy and roadmap covering the assessment of MLLMs’ cross-lingual knowledge, reasoning, alignment with human values, safety, interpretability and specialized applications. Specifically, we extensively discuss multilingual evaluation benchmarks and datasets, and explore the use of LLMs themselves as multilingual evaluators. To enhance MLLMs from black to white boxes, we also address the interpretability of multilingual capabilities, cross-lingual transfer and language bias within these models. Finally, we provide a comprehensive review of real-world applications of MLLMs across diverse domains, including biology, medicine, computer science, mathematics and law. We showcase how these models have driven innovation and improvements in these specialized fields while also highlighting the challenges and opportunities in deploying MLLMs within diverse language communities and applications. The papers related to this survey are listed and publicly available at this https URL.
摘要:本文对多语言大语言模型 (MLLMs) 的最新研究进行了全面综述。MLLMs 不仅能够跨越语言界限理解和生成语言,还代表了人工智能领域的重要进展。首先,我们讨论了 MLLMs 的架构和预训练目标,突出了构成其多语言能力的关键组件和方法。接着,我们探讨了多语言预训练和对齐数据集的构建,强调了数据质量和多样性对提升 MLLM 性能的重要性。

本综述的一个重要焦点是 MLLMs 的评估。我们提供了一个详细的分类和路线图,涵盖了 MLLMs 的跨语言知识、推理、与人类价值观的对齐、安全性、可解释性以及特定应用的评估。具体而言,我们广泛讨论了多语言评估基准和数据集,并探讨了使用大语言模型自身作为多语言评估工具的可能性。

为了将 MLLMs 从“黑箱”转变为“白箱”,我们还探讨了多语言能力的可解释性、跨语言迁移以及模型中的语言偏见。最后,我们全面回顾了 MLLMs 在生物学、医学、计算机科学、数学和法律等多个领域的实际应用。我们展示了这些模型如何在这些专业领域推动创新和改进,同时也指出了在多语言社区和应用中部署 MLLMs 所面临的挑战和机遇。本文相关论文已列出,并公开可用。

[NLP-32] Beyond Human-Like Processing: Large Language Models Perform Equivalently on Forward and Backward Scientific Text

【速读】: 该论文试图解决的问题是:大型语言模型(LLMs)在语言处理任务中的卓越表现是否意味着它们模拟了人类语言处理机制。论文提出,LLMs的成功更多地归因于其灵活的Transformer学习架构,而非对人类语言处理机制的模拟。解决方案的关键在于通过实验验证这一假设:研究者训练LLMs处理正向和反向的科学文本,发现无论文本顺序如何,LLMs在神经科学基准测试中的表现均优于人类专家。这一结果表明,LLMs能够从任何结构化的输入中提取预测模式,而非依赖于特定的语言结构。因此,论文建议在解释LLMs在语言任务中的成功时,应谨慎对待将其视为人类语言处理机制的证据。

链接: https://arxiv.org/abs/2411.11061
作者: Xiaoliang Luo,Michael Ramscar,Bradley C. Love
关键词-EN: large language models, human language processing, language processing, models, language models
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:The impressive performance of large language models (LLMs) has led to their consideration as models of human language processing. Instead, we suggest that the success of LLMs arises from the flexibility of the transformer learning architecture. To evaluate this conjecture, we trained LLMs on scientific texts that were either in a forward or backward format. Despite backward text being inconsistent with the structure of human languages, we found that LLMs performed equally well in either format on a neuroscience benchmark, eclipsing human expert performance for both forward and backward orders. Our results are consistent with the success of transformers across diverse domains, such as weather prediction and protein design. This widespread success is attributable to LLM’s ability to extract predictive patterns from any sufficiently structured input. Given their generality, we suggest caution in interpreting LLM’s success in linguistic tasks as evidence for human-like mechanisms.
摘要:大语言模型(Large Language Models, LLMs)的卓越表现使其被视为人类语言处理模型的候选。然而,我们认为LLMs的成功源于Transformer学习架构的灵活性。为验证这一猜想,我们在科学文本上训练了LLMs,这些文本要么以正向格式呈现,要么以反向格式呈现。尽管反向文本与人类语言结构不一致,我们发现LLMs在神经科学基准测试中,无论在正向还是反向格式下,表现均同样出色,甚至超越了人类专家的表现。我们的结果与Transformer在多个领域(如天气预测和蛋白质设计)的成功一致。这种广泛的成功归因于LLMs从任何足够结构化的输入中提取预测模式的能力。鉴于其通用性,我们建议在将LLMs在语言任务中的成功解读为人类类似机制的证据时应持谨慎态度。
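
摘要中"反向文本"实验的关键预处理只是把训练语料在 token 层面整体翻转;下面给出一个示意性的 Python 草图(与论文实际使用的分词器和语料无关,这里用空格分词与占位句子演示),说明正向与反向两种训练序列如何构造。

```python
def make_training_sequences(text: str, tokenize, backward: bool = False):
    """把一段文本切成 token 序列;backward=True 时整体反转(示意)。"""
    tokens = tokenize(text)
    if backward:
        tokens = tokens[::-1]
    return tokens

# 这里用空格分词做演示;实际模型使用其自身的子词分词器
simple_tokenize = str.split
sample = "neural populations encode stimulus identity in their firing rates"

print(make_training_sequences(sample, simple_tokenize, backward=False))
print(make_training_sequences(sample, simple_tokenize, backward=True))
```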

[NLP-33] FastDraft: How to Train Your Draft NEURIPS

【速读】: 该论文试图解决大语言模型(LLMs)在自回归推理过程中加速的问题,特别是由于词汇不兼容性导致的现有语言模型缺乏高效草稿模型的问题。解决方案的关键在于提出了FastDraft,一种新颖且高效的方法,通过结合高效的预训练和在目标模型生成的合成数据集上的微调,来预训练和校准草稿模型。FastDraft能够在24小时内使用8个Intel Gaudi 2加速器在一台服务器上生成约100亿个标记的草稿模型,显著提高了推理速度和内存效率,从而使得大语言模型在AI-PC和其他边缘设备上的推理成为可能。

链接: https://arxiv.org/abs/2411.11055
作者: Ofir Zafrir,Igor Margulis,Dorin Shteyman,Guy Boudoukh
关键词-EN: Speculative Decoding, Large Language Models, Large Language, Language Models, auto-regressive inference process
类目: Computation and Language (cs.CL)
备注: ENLSP NeurIPS Workshop 2024

点击查看摘要

Abstract:Speculative Decoding has gained popularity as an effective technique for accelerating the auto-regressive inference process of Large Language Models (LLMs). However, Speculative Decoding entirely relies on the availability of efficient draft models, which are often lacking for many existing language models due to a stringent constraint of vocabulary incompatibility. In this work we introduce FastDraft, a novel and efficient approach for pre-training and aligning a draft model to any large language model by incorporating efficient pre-training, followed by fine-tuning over synthetic datasets generated by the target model. We demonstrate FastDraft by training two highly parameter efficient drafts for the popular Phi-3-mini and Llama-3.1-8B models. Using FastDraft, we were able to produce a draft with approximately 10 billion tokens on a single server with 8 Intel® Gaudi® 2 accelerators in under 24 hours. Our results show that the draft model achieves impressive results in key metrics of acceptance rate, block efficiency and up to 3x memory bound speed up when evaluated on code completion and up to 2x in summarization, text completion and instruction tasks. We validate our theoretical findings through benchmarking on the latest Intel® Core™ Ultra, achieving a wall-clock time speedup of up to 2x, indicating a significant reduction in runtime. Due to its high quality, FastDraft unlocks large language models inference on AI-PC and other edge-devices.
摘要:推测性解码(Speculative Decoding)作为一种有效加速大语言模型(LLM)自回归推理过程的技术,近年来备受关注。然而,推测性解码完全依赖于高效的草稿模型的可用性,而由于词汇不兼容的严格限制,许多现有语言模型往往缺乏这样的草稿模型。在本研究中,我们提出了FastDraft,一种新颖且高效的方法,通过结合高效的预训练和在目标模型生成的合成数据集上的微调,来预训练和调整草稿模型以适应任何大语言模型。我们通过为流行的Phi-3-mini和Llama-3.1-8B模型训练两个高度参数高效的草稿模型来展示FastDraft的应用。使用FastDraft,我们能够在配备8个Intel® Gaudi®2加速器的单个服务器上,在不到24小时内生成约100亿个Token的草稿模型。我们的结果表明,在代码补全任务中,草稿模型在接受率、块效率等关键指标上取得了显著成果,内存带宽速度提升高达3倍;在摘要生成、文本补全和指令任务中,速度提升高达2倍。我们通过在最新的Intel® Core™ Ultra上的基准测试验证了理论发现,实现了高达2倍的实际运行时间加速,表明运行时显著减少。由于其高质量,FastDraft使得大语言模型推理在AI-PC和其他边缘设备上成为可能。
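
FastDraft 解决的是推测性解码中"草稿模型从何而来"的问题;为说明草稿模型在推理阶段的作用,下面给出一个极度简化的推测性解码验证循环草图(贪心一致性近似,draft_next 与 target_next 为假设的单步预测接口,这里用玩具规则代替真实模型,并非论文实现)。

```python
def draft_next(prefix):
    # 玩具草稿模型:简单重复最后一个 token(仅作占位,实际为小型草稿 LLM)
    return prefix[-1]

def target_next(prefix):
    # 玩具目标模型:按固定答案逐 token 输出(仅作占位,实际为大型目标 LLM)
    answer = ["the", "cat", "sat", "on", "the", "mat"]
    return answer[min(len(prefix), len(answer) - 1)]

def speculative_decode(prompt, steps=4, k=3):
    """草稿模型一次提议 k 个 token,目标模型逐个验证;不一致处用目标输出纠正并截断。"""
    out = list(prompt)
    for _ in range(steps):
        proposal, prefix = [], list(out)
        for _ in range(k):                      # 草稿模型快速提议 k 个 token
            tok = draft_next(prefix)
            proposal.append(tok)
            prefix.append(tok)
        accepted, prefix = [], list(out)
        for tok in proposal:                    # 目标模型验证(此处以贪心一致性近似)
            expected = target_next(prefix)
            if tok == expected:
                accepted.append(tok)
                prefix.append(tok)
            else:
                accepted.append(expected)       # 拒绝后改用目标模型的 token 并停止本轮
                break
        out.extend(accepted)
    return out

print(speculative_decode(["the"]))
```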

[NLP-34] SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Enhanced Code Generation

【速读】: 该论文试图解决大型语言模型在处理复杂问题时面临的推理和问题分解能力不足的问题。解决方案的关键在于提出了一个推理增强的数据生成过程,称为SRA-MCTS(Self-Reasoning-Augmented Monte Carlo Tree Search),该过程通过引导模型自主生成高质量的中间推理路径,形成一个正反馈循环,从而实现持续改进。该方法完全依赖模型自身,无需额外监督,通过合成自然语言推理路径并将其转化为可执行代码,确保分析的准确性,并提高解决复杂任务的成功率。实验结果表明,即使在无额外监督信号的情况下,该方法在不同模型规模上均能实现性能提升,显示出小模型自我改进的巨大潜力。此外,在传统思维链(Chain-of-Thought, CoT)方法性能下降时,该方法仍表现出鲁棒性,并在多样性指标如pass@10上有所提升。

链接: https://arxiv.org/abs/2411.11053
作者: Bin Xu,Yiguan Lin,Yinghao Li,Yang Gao
关键词-EN: Large language models, models demonstrate exceptional, demonstrate exceptional performance, Large language, demonstrate exceptional
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models demonstrate exceptional performance in simple code generation tasks but still face challenges in tackling complex problems. These challenges may stem from insufficient reasoning and problem decomposition capabilities. To address this issue, we propose a reasoning-augmented data generation process, SRA-MCTS, which guides the model to autonomously generate high-quality intermediate reasoning paths. This creates a positive feedback loop, enabling continuous improvement. Our method operates entirely through the model itself without requiring additional supervision. By synthesizing natural language reasoning paths and translating them into executable code, the approach ensures analytical accuracy and enhances the success rate in solving complex tasks. Experimental results show that, even without additional supervisory signals, our method achieves performance improvements across different model scales, demonstrating the significant potential of self-improvement in small models. Furthermore, the method remains robust when traditional Chain-of-Thought (CoT) approaches exhibit performance degradation, with notable improvements observed in diversity metrics such as pass@10. We encourage further exploration of reasoning processes within training data to enhance the ability of language models to address complex problems.
摘要:大语言模型在简单的代码生成任务中表现出色,但在应对复杂问题时仍面临挑战。这些挑战可能源于推理能力和问题分解能力的不足。为解决这一问题,我们提出了一种推理增强的数据生成过程,即SRA-MCTS,该过程引导模型自主生成高质量的中间推理路径。这形成了一个正反馈循环,实现了持续改进。我们的方法完全通过模型自身运行,无需额外的监督。通过合成自然语言推理路径并将其转化为可执行代码,该方法确保了分析的准确性,并提高了解决复杂任务的成功率。实验结果表明,即使在无额外监督信号的情况下,我们的方法在不同模型规模下均实现了性能提升,展示了小模型自我改进的巨大潜力。此外,当传统的思维链(Chain-of-Thought, CoT)方法出现性能下降时,该方法仍保持稳健,在多样性指标如pass@10上观察到显著改进。我们鼓励进一步探索训练数据中的推理过程,以增强语言模型解决复杂问题的能力。
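
为直观展示"用蒙特卡洛树搜索自主生成中间推理路径"的基本骨架,下面给出一个高度简化的 MCTS 草图(选择、扩展、模拟、回传四步;propose_steps 与 score_solution 为假设接口,实际中分别对应由模型生成候选推理步骤、以及把推理路径转成代码后用测试通过率打分,这里均以占位实现代替)。

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def propose_steps(state):
    # 假设:由模型针对当前部分推理路径生成若干候选下一步(此处用占位字符串)
    return [state + [f"step{len(state)}_{i}"] for i in range(2)]

def score_solution(state):
    # 假设:把推理路径翻译成代码并用测试通过率打分(此处用随机数占位)
    return random.random()

def ucb(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state, iterations=50, max_depth=4):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while node.children:                                  # 1. 选择
            node = max(node.children, key=lambda n: ucb(n, node.visits + 1))
        if len(node.state) < max_depth:                       # 2. 扩展
            node.children = [Node(s, node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        reward = score_solution(node.state)                   # 3. 模拟 / 打分
        while node:                                           # 4. 回传
            node.visits += 1
            node.value += reward
            node = node.parent
    best = max(root.children, key=lambda n: n.visits)
    return best.state

print(mcts(["problem: two-sum"]))
```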

[NLP-35] BianCang: A Traditional Chinese Medicine Large Language Model

【速读】: 该论文试图解决大型语言模型(LLMs)在中医(TCM)诊断和辨证分型方面的不足,主要原因是中医理论与现代医学理论的显著差异以及缺乏专门的、高质量的语料库。解决方案的关键在于提出了一个专门针对中医的LLM——BianCang,并通过两阶段的训练过程来解决这一问题:首先注入领域特定的知识,然后通过针对性的刺激进行对齐。具体实施包括构建预训练语料库、基于真实医院记录的指令对齐数据集以及从《中华人民共和国药典》衍生的ChP-TCM数据集。通过这些措施,论文构建了一个全面的中医和医学语料库,用于连续预训练和监督微调,从而提升模型对中医的理解和应用能力。

链接: https://arxiv.org/abs/2411.11027
作者: Sibo Wei,Xueping Peng,Yi-fei Wang,Jiasheng Si,Weiyu Zhang,Wenpeng Lu,Xiaoming Wu,Yinglong Wang
关键词-EN: including traditional Chinese, traditional Chinese medicine, driven significant progress, Chinese medicine, traditional Chinese
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of large language models (LLMs) has driven significant progress in medical applications, including traditional Chinese medicine (TCM). However, current medical LLMs struggle with TCM diagnosis and syndrome differentiation due to substantial differences between TCM and modern medical theory, and the scarcity of specialized, high-quality corpora. This paper addresses these challenges by proposing BianCang, a TCM-specific LLM, using a two-stage training process that first injects domain-specific knowledge and then aligns it through targeted stimulation. To enhance diagnostic and differentiation capabilities, we constructed pre-training corpora, instruction-aligned datasets based on real hospital records, and the ChP-TCM dataset derived from the Pharmacopoeia of the People’s Republic of China. We compiled extensive TCM and medical corpora for continuous pre-training and supervised fine-tuning, building a comprehensive dataset to refine the model’s understanding of TCM. Evaluations across 11 test sets involving 29 models and 4 tasks demonstrate the effectiveness of BianCang, offering valuable insights for future research. Code, datasets, and models are available at this https URL.
摘要:大语言模型 (LLM) 的兴起在医疗应用领域,包括传统中医 (TCM),推动了显著的进展。然而,当前的医疗 LLM 在处理 TCM 诊断和辨证分型时面临挑战,主要原因是 TCM 与现代医学理论之间存在显著差异,以及专业高质量语料库的稀缺。本文通过提出 BianCang,一个专门针对 TCM 的 LLM,来应对这些挑战。BianCang 采用两阶段训练过程,首先注入领域特定知识,然后通过针对性刺激进行对齐。为了增强诊断和辨证能力,我们构建了预训练语料库、基于真实医院记录的指令对齐数据集,以及从《中华人民共和国药典》衍生的 ChP-TCM 数据集。我们编译了广泛的 TCM 和医学语料库,用于持续预训练和监督微调,构建了一个全面的数据集,以精炼模型对 TCM 的理解。在涉及 29 个模型和 4 项任务的 11 个测试集上的评估结果显示了 BianCang 的有效性,为未来的研究提供了宝贵的见解。代码、数据集和模型可在以下链接获取:https URL。

[NLP-36] A Topic-aware Comparable Corpus of Chinese Variations

【速读】: 该论文旨在填补大陆普通话和台湾普通话在社交媒体上可比语料库的空白。解决方案的关键在于构建一个基于社交媒体的主题感知可比语料库,分别从中国大陆的新浪微博(Sina Weibo)和台湾的Dcard平台收集数据,以反映现代社交媒体上的语言使用情况,并确保语料库的定期更新。

链接: https://arxiv.org/abs/2411.10955
作者: Da-Chen Lian,Shu-Kai Hsieh
关键词-EN: Mainland Chinese Mandarin, China and Taiwan, Taiwanese Mandarin, Mainland Chinese, Mainland China
类目: Computation and Language (cs.CL)
备注: 4 pages, 4 figures, presented at APCLC2018: ASIA-PACIFIC CORPUS LINGUISTICS CONFERENCE 2018

点击查看摘要

Abstract:This study aims to fill the gap by constructing a topic-aware comparable corpus of Mainland Chinese Mandarin and Taiwanese Mandarin from the social media in Mainland China and Taiwan, respectively. Using Dcard for Taiwanese Mandarin and Sina Weibo for Mainland Chinese, we create a comparable corpus that updates regularly and reflects modern language use on social media.
摘要:本研究旨在通过构建一个主题感知的可比语料库,填补大陆普通话与台湾普通话之间的研究空白。我们分别从中国大陆和台湾的社交媒体中提取数据,使用台湾的Dcard平台和大陆的新浪微博平台,创建了一个定期更新的可比语料库,以反映现代社交媒体上的语言使用情况。

[NLP-37] Dialectal Toxicity Detection: Evaluating LLM-as-a-Judge Consistency Across Language Varieties

【速读】: 该论文试图解决现代大型语言模型(LLMs)在不同方言差异下的毒性检测能力问题,并探讨了将LLMs作为评估者(“LLM-as-a-judge”)时对方言细微差别的敏感性。解决方案的关键在于创建了一个多方言数据集,通过合成变换和人工辅助翻译覆盖了10个语言集群和60种方言,并在此基础上对三个LLMs进行了跨语言、方言和LLM-人类一致性的毒性评估。研究发现,LLMs在处理多语言和方言变体方面表现敏感,但在一致性排名中,LLM-人类一致性最弱,其次是方言一致性。

链接: https://arxiv.org/abs/2411.10954
作者: Fahim Faisal,Md Mushfiqur Rahman,Antonios Anastasopoulos
关键词-EN: differences affect toxicity, affect toxicity detection, dialectal differences affect, systematic study, differences affect
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:There has been little systematic study on how dialectal differences affect toxicity detection by modern LLMs. Furthermore, although using LLMs as evaluators (“LLM-as-a-judge”) is a growing research area, their sensitivity to dialectal nuances is still underexplored and requires more focused attention. In this paper, we address these gaps through a comprehensive toxicity evaluation of LLMs across diverse dialects. We create a multi-dialect dataset through synthetic transformations and human-assisted translations, covering 10 language clusters and 60 varieties. We then evaluated three LLMs on their ability to assess toxicity across multilingual, dialectal, and LLM-human consistency. Our findings show that LLMs are sensitive in handling both multilingual and dialectal variations. However, if we have to rank the consistency, the weakest area is LLM-human agreement, followed by dialectal consistency. Code repository: this https URL
摘要:关于方言差异如何影响现代大语言模型(LLM)的毒性检测,目前尚缺乏系统的研究。此外,尽管将大语言模型用作评估者(即“LLM-as-a-judge”)的研究领域正在不断扩展,但它们对方言细微差别的敏感性仍未得到充分探索,需要更多关注。本文通过对方言多样性下的大语言模型进行全面的毒性评估,填补了这一研究空白。我们通过合成转换和人工辅助翻译创建了一个多方言数据集,涵盖了10个语言群组和60种方言。随后,我们对三种大语言模型在多语言、方言及大语言模型与人类一致性方面的毒性评估能力进行了评估。研究结果表明,大语言模型在处理多语言和方言变体方面表现出敏感性。然而,若需对一致性进行排序,最弱的是大语言模型与人类的一致性,其次是方言一致性。代码仓库:\urlthis https URL

[NLP-38] Understanding Multimodal LLMs: The Mechanistic Interpretability of Llava in Visual Question Answering

【速读】: 该论文试图解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在视觉问答(Visual Question Answering, VQA)机制方面的理解不足问题。解决方案的关键在于应用机制可解释性方法分析Llava模型在颜色回答任务中的VQA机制,并与文本问答(Textual Question Answering, TQA)机制进行比较。研究发现,VQA机制类似于TQA中的上下文学习机制,视觉特征在投影视觉嵌入到嵌入空间时表现出显著的可解释性,且Llava在视觉指令调优过程中增强了对应文本大模型Vicuna的现有能力。基于这些发现,论文开发了一种可解释性工具,帮助用户和研究人员识别最终预测中重要的视觉位置,从而更好地理解视觉幻觉现象。该方法在速度和效果上均优于现有的可解释性方法。

链接: https://arxiv.org/abs/2411.10950
作者: Zeping Yu,Sophia Ananiadou
关键词-EN: Large Language Models, Multi-modal Large Language, designing improved models, Large Language, Language Models
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:Understanding the mechanisms behind Large Language Models (LLMs) is crucial for designing improved models and strategies. While recent studies have yielded valuable insights into the mechanisms of textual LLMs, the mechanisms of Multi-modal Large Language Models (MLLMs) remain underexplored. In this paper, we apply mechanistic interpretability methods to analyze the visual question answering (VQA) mechanisms in the first MLLM, Llava. We compare the mechanisms between VQA and textual QA (TQA) in color answering tasks and find that: a) VQA exhibits a mechanism similar to the in-context learning mechanism observed in TQA; b) the visual features exhibit significant interpretability when projecting the visual embeddings into the embedding space; and c) Llava enhances the existing capabilities of the corresponding textual LLM Vicuna during visual instruction tuning. Based on these findings, we develop an interpretability tool to help users and researchers identify important visual locations for final predictions, aiding in the understanding of visual hallucination. Our method demonstrates faster and more effective results compared to existing interpretability approaches. Code: \urlthis https URL
摘要:理解大语言模型(Large Language Models, LLMs)背后的机制对于设计改进的模型和策略至关重要。尽管最近的研究已经对文本型大语言模型的机制提供了宝贵的见解,但多模态大语言模型(Multi-modal Large Language Models, MLLMs)的机制仍未得到充分探索。本文中,我们应用机制可解释性方法来分析首个多模态大语言模型Llava中的视觉问答(Visual Question Answering, VQA)机制。我们在颜色回答任务中比较了VQA与文本问答(Textual Question Answering, TQA)的机制,并发现:a) VQA展示了一种类似于TQA中观察到的上下文学习机制;b) 当将视觉嵌入投影到嵌入空间时,视觉特征表现出显著的可解释性;c) Llava在视觉指令调优过程中增强了相应文本型大语言模型Vicuna的现有能力。基于这些发现,我们开发了一个可解释性工具,帮助用户和研究人员识别最终预测中重要的视觉位置,从而有助于理解视觉幻觉现象。与现有的可解释性方法相比,我们的方法展示了更快且更有效的结果。代码链接:\urlthis https URL

[NLP-39] Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry

【速读】: 该论文试图解决外科视觉问答(Surgical Visual Question Answering, Surgical VQA)中多对象推理和场景理解不足的问题。解决方案的关键在于提出了一个名为SCAN的简单而有效的记忆增强框架,该框架利用多模态大语言模型(Multimodal LLMs)通过自包含询问(Self-Contained Inquiry)来提升外科场景的理解能力。SCAN自主生成两种类型的记忆:直接记忆(Direct Memory, DM)和间接记忆(Indirect Memory, IM)。DM提供多个候选答案或提示,直接辅助回答问题;IM则包含自包含的问题-提示对,用于捕捉更广泛的场景上下文。通过在对象感知记忆上进行推理,SCAN能够准确解释图像并回答问题,从而在多个公开的Surgical VQA数据集上实现了最先进的性能,提高了在各种外科场景中的准确性和鲁棒性。

链接: https://arxiv.org/abs/2411.10937
作者: Wenjun Hou,Yi Cheng,Kaishuai Xu,Yan Hu,Wenjie Li,Jiang Liu
关键词-EN: Comprehensively understanding surgical, Visual Question Answering, Surgical Visual Question, Surgical Visual, Comprehensively understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Comprehensively understanding surgical scenes in Surgical Visual Question Answering (Surgical VQA) requires reasoning over multiple objects. Previous approaches address this task using cross-modal fusion strategies to enhance reasoning ability. However, these methods often struggle with limited scene understanding and question comprehension, and some rely on external resources (e.g., pre-extracted object features), which can introduce errors and generalize poorly across diverse surgical environments. To address these challenges, we propose SCAN, a simple yet effective memory-augmented framework that leverages Multimodal LLMs to improve surgical context comprehension via Self-Contained Inquiry. SCAN operates autonomously, generating two types of memory for context augmentation: Direct Memory (DM), which provides multiple candidates (or hints) to the final answer, and Indirect Memory (IM), which consists of self-contained question-hint pairs to capture broader scene context. DM directly assists in answering the question, while IM enhances understanding of the surgical scene beyond the immediate query. Reasoning over these object-aware memories enables the model to accurately interpret images and respond to questions. Extensive experiments on three publicly available Surgical VQA datasets demonstrate that SCAN achieves state-of-the-art performance, offering improved accuracy and robustness across various surgical scenarios.
摘要:全面理解手术场景在手术视觉问答(Surgical Visual Question Answering, Surgical VQA)中需要对多个对象进行推理。以往的方法通过跨模态融合策略来增强推理能力,但这些方法往往在场景理解和问题理解方面存在局限,且部分依赖外部资源(如预提取的对象特征),这可能导致错误并难以在多样化的手术环境中泛化。为解决这些问题,我们提出了 SCAN,一个简单而有效的记忆增强框架,利用多模态大语言模型(Multimodal LLMs)通过自包含探究(Self-Contained Inquiry)来提升手术情境理解。SCAN 自主运行,生成两种类型的记忆以增强上下文:直接记忆(Direct Memory, DM),提供多个候选答案(或提示),以及间接记忆(Indirect Memory, IM),由自包含的问题-提示对组成,以捕捉更广泛的场景上下文。DM 直接辅助回答问题,而 IM 则增强对手术场景的全面理解,超越了即时查询的范畴。通过对这些对象感知的记忆进行推理,模型能够准确解读图像并回应问题。在三个公开的 Surgical VQA 数据集上的广泛实验表明,SCAN 达到了最先进的性能,在各种手术场景中提供了更高的准确性和鲁棒性。

[NLP-40] Analyzing Pokemon and Mario Streamers Twitch Chat with LLM-based User Embeddings

【速读】: 该论文试图解决的问题是如何在数字人文领域中有效地表示和分类Twitch聊天室中的用户行为。解决方案的关键在于利用大型语言模型(LLM)生成用户嵌入(user embeddings),并通过亲和传播(affinity propagation)进行自动聚类,随后通过手动分析进一步细化聚类结果。研究结果表明,尽管每个主播的聊天室用户行为有所不同,但所有主播的聊天室中都存在两类主要用户:支持性观众(supportive viewers)和表情及反应发送者(emoji and reaction senders),而重复消息发送者(repetitive message spammers)则是两位主播共享的用户类别。

链接: https://arxiv.org/abs/2411.10934
作者: Mika Hämäläinen,Jack Rueter,Khalid Alnajjar
关键词-EN: large language model, digital humanities method, user embeddings created, language model, digital humanities
类目: Computation and Language (cs.CL)
备注: NLP4DH 2024

点击查看摘要

Abstract:We present a novel digital humanities method for representing our Twitch chatters as user embeddings created by a large language model (LLM). We cluster these embeddings automatically using affinity propagation and further narrow this clustering down through manual analysis. We analyze the chat of one stream by each Twitch streamer: SmallAnt, DougDoug and PointCrow. Our findings suggest that each streamer has their own type of chatters, however two categories emerge for all of the streamers: supportive viewers and emoji and reaction senders. Repetitive message spammers is a shared chatter category for two of the streamers.
摘要:我们提出了一种新颖的数字人文方法,通过大语言模型 (LLM) 生成 Twitch 聊天用户的用户嵌入 (user embeddings)。我们使用亲和传播 (affinity propagation) 自动对这些嵌入进行聚类,并通过手动分析进一步细化聚类结果。我们分析了每位 Twitch 主播(SmallAnt、DougDoug 和 PointCrow)的单次直播聊天记录。研究结果表明,每位主播都有其特有的聊天用户类型,但所有主播的聊天用户都可以归为两大类:支持性观众和表情符号及反应发送者。重复消息发送者是两位主播共有的聊天用户类别。
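
下面是一个示意性的 Python 草图,演示"LLM 用户嵌入 + 亲和传播聚类"这一流程的骨架(嵌入用随机向量代替真实的 LLM 输出,聚类部分使用 scikit-learn 的 AffinityPropagation,具体参数为假设,与论文设置无关)。

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)

# 假设:每个聊天用户的全部消息已被编码成一个固定维度的 LLM 嵌入向量
users = [f"user_{i}" for i in range(30)]
embeddings = rng.normal(size=(len(users), 64))   # 这里用随机向量占位

clusterer = AffinityPropagation(random_state=0)  # 亲和传播无需预先指定簇数
labels = clusterer.fit_predict(embeddings)

for cluster_id in np.unique(labels):
    members = [u for u, l in zip(users, labels) if l == cluster_id]
    print(f"cluster {cluster_id}: {len(members)} chatters")
# 后续再由人工检查各簇的代表性消息,归纳出"支持性观众""表情/反应发送者"等类别
```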

[NLP-41] Learn from Downstream and Be Yourself in Multimodal Large Language Model Fine-Tuning

【速读】: 该论文试图解决多模态大语言模型 (Multimodal Large Language Model, MLLM) 在微调过程中可能遗忘预训练阶段获得的知识,从而导致泛化能力下降的问题。解决方案的关键在于提出了一种基于参数重要性的权重分配策略,通过测量预训练权重的大小和微调过程中累积的梯度值来评估参数的重要性。论文进一步应用了重要性感知的权重更新策略,选择性地更新对下游任务相对重要的参数,从而在增强下游任务性能的同时,缓解了泛化能力的下降。实验结果表明,该方法在图像描述和视觉问答任务中显著提高了微调效率和性能。

链接: https://arxiv.org/abs/2411.10928
作者: Wenke Huang,Jian Liang,Zekun Shi,Didi Zhu,Guancheng Wan,He Li,Bo Du,Dacheng Tao,Mang Ye
关键词-EN: Large Language Model, Multimodal Large Language, Multimodal Large, Language Model, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Model (MLLM) have demonstrated strong generalization capabilities across diverse distributions and tasks, largely due to extensive pre-training datasets. Fine-tuning MLLM has become a common practice to improve performance on specific downstream tasks. However, during fine-tuning, MLLM often faces the risk of forgetting knowledge acquired during pre-training, which can result in a decline in generalization abilities. To balance the trade-off between generalization and specialization, we propose measuring the parameter importance for both pre-trained and fine-tuning distributions, based on frozen pre-trained weight magnitude and accumulated fine-tuning gradient values. We further apply an importance-aware weight allocation strategy, selectively updating relatively important parameters for downstream tasks. We conduct empirical evaluations on both image captioning and visual question-answering tasks using various MLLM architectures. The comprehensive experimental analysis demonstrates the effectiveness of the proposed solution, highlighting the efficiency of the crucial modules in enhancing downstream specialization performance while mitigating generalization degradation in MLLM Fine-Tuning.
摘要:多模态大语言模型 (Multimodal Large Language Model, MLLM) 在跨多种分布和任务中展示了强大的泛化能力,这主要归功于广泛的预训练数据集。微调 MLLM 已成为提升其在特定下游任务中性能的常见做法。然而,在微调过程中,MLLM 常常面临遗忘预训练阶段所获得知识的危险,这可能导致泛化能力的下降。为了平衡泛化与特化之间的权衡,我们提出基于冻结的预训练权重幅度和累积的微调梯度值,来衡量预训练和微调分布的参数重要性。我们进一步应用一种重要性感知的权重分配策略,选择性地更新对下游任务相对重要的参数。我们在图像描述和视觉问答任务上使用多种 MLLM 架构进行了实证评估。综合实验分析表明,所提出的解决方案是有效的,强调了关键模块在提升下游特化性能的同时,减轻 MLLM 微调中泛化能力下降的效率。
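
为说明"以预训练权重幅值与累积微调梯度估计参数重要性,并只选择性更新相对重要的参数"这一思路,下面给出一个示意性的 PyTorch 草图(重要性打分公式、保留比例 keep_ratio 等均为假设,与论文的具体实现无关)。

```python
import torch

def importance_scores(pretrained_param: torch.Tensor, grad_accum: torch.Tensor) -> torch.Tensor:
    # 假设:重要性 = 累积梯度幅值 与 (冻结的)预训练权重幅值 的乘积
    return grad_accum.abs() * pretrained_param.abs()

def masked_update(param: torch.Tensor, grad: torch.Tensor, score: torch.Tensor,
                  lr: float = 1e-4, keep_ratio: float = 0.3) -> None:
    """只更新重要性得分位于前 keep_ratio 的参数,其余保持原值不变(示意)。"""
    k = max(1, int(score.numel() * keep_ratio))
    threshold = score.flatten().topk(k).values.min()
    mask = (score >= threshold).float()
    with torch.no_grad():
        param -= lr * grad * mask

# 玩具示例
torch.manual_seed(0)
w_pre = torch.randn(4, 4)          # 冻结的预训练权重
w = w_pre.clone()                  # 正在微调的权重
g_accum = torch.randn(4, 4).abs()  # 训练中累积的梯度幅值(占位)
g_now = torch.randn(4, 4)          # 当前一步的梯度(占位)

masked_update(w, g_now, importance_scores(w_pre, g_accum))
print("被更新的参数个数:", int((w != w_pre).sum()))
```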

[NLP-42] Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation ACL NAACL2025

【速读】: 该论文试图解决第二语言学习者(L2)在发音时无意识地将不熟悉的L2音素替换为母语(L1)中相似音素的问题,这种现象导致L2发音偏离标准音系模式,增加了准确掌握L2发音的难度。解决方案的关键是提出了一种名为“跨语言音素合成”(Inter-linguistic Phonetic Composition, IPC)的新型计算方法,通过将L2音素重构为多个L1音素的复合音来最小化不正确的音系迁移。实验结果显示,使用IPC生成的复合音后,自动语音识别模型对目标L2音素的识别率提高了20%,且在较短时间内显示出快速掌握复合音的效果。

链接: https://arxiv.org/abs/2411.10927
作者: Jisang Park,Minu Kim,DaYoung Hong,Jongha Lee
关键词-EN: unconsciously substitute unfamiliar, native language, Inter-linguistic Phonetic Composition, substitute unfamiliar, distinct and non-interchangeable
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 10 pages, 6 Figures, submitted to ACL ARR October 2024 for NAACL 2025

点击查看摘要

Abstract:Learners of a second language (L2) often unconsciously substitute unfamiliar L2 phonemes with similar phonemes from their native language (L1), even though native speakers of the L2 perceive these sounds as distinct and non-interchangeable. This phonemic substitution leads to deviations from the standard phonological patterns of the L2, creating challenges for learners in acquiring accurate L2 pronunciation. To address this, we propose Inter-linguistic Phonetic Composition (IPC), a novel computational method designed to minimize incorrect phonological transfer by reconstructing L2 phonemes as composite sounds derived from multiple L1 phonemes. Tests with two automatic speech recognition models demonstrated that when L2 speakers produced IPC-generated composite sounds, the recognition rate of target L2 phonemes improved by 20% compared to when their pronunciation was influenced by original phonological transfer patterns. The improvement was observed within a relatively shorter time frame, demonstrating rapid acquisition of the composite sound.
摘要:第二语言(L2)学习者常常无意识地将不熟悉的 L2 音素替换为其母语(L1)中相似的音素,尽管 L2 的母语者认为这些音素是不同的且不可互换的。这种音素替换导致学习者的发音偏离了 L2 的标准音系模式,从而在学习准确 L2 发音时面临挑战。为解决这一问题,我们提出了跨语言音素合成(Inter-linguistic Phonetic Composition, IPC),这是一种新颖的计算方法,旨在通过将 L2 音素重构为由多个 L1 音素合成的复合音来最小化错误的音系迁移。通过对两个自动语音识别模型的测试表明,当 L2 说话者使用 IPC 生成的复合音时,目标 L2 音素的识别率比受原始音系迁移模式影响的发音提高了 20%。这种改进在相对较短的时间内观察到,表明复合音的快速习得。

[NLP-43] Bias in Large Language Models: Origin, Evaluation, and Mitigation

【速读】: 该论文试图解决大型语言模型(LLMs)中的偏见问题,这些偏见在自然语言处理任务中表现明显,且具有显著的挑战性。解决方案的关键在于全面审查偏见的来源、分类、评估方法以及缓解策略。论文将偏见分为内在偏见(intrinsic bias)和外在偏见(extrinsic bias),并分析了它们在不同NLP任务中的表现。评估方法包括数据级、模型级和输出级的方法,为研究人员提供了检测偏见的工具包。缓解策略则分为模型前、模型中和模型后技术,论文强调了这些策略的有效性和局限性。此外,论文还讨论了偏见在实际应用中的伦理和法律影响,特别是在医疗和刑事司法领域。通过综合当前关于LLMs偏见的知识,该论文旨在促进公平和负责任的AI系统的发展。

链接: https://arxiv.org/abs/2411.10915
作者: Yufei Guo,Muzhe Guo,Juntao Su,Zhou Yang,Mengqiu Zhu,Hongfei Li,Mengyang Qiu,Shuo Shuo Liu
关键词-EN: Large Language Models, natural language processing, revolutionized natural language, poses significant challenges, Large Language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing, but their susceptibility to biases poses significant challenges. This comprehensive review examines the landscape of bias in LLMs, from its origins to current mitigation strategies. We categorize biases as intrinsic and extrinsic, analyzing their manifestations in various NLP tasks. The review critically assesses a range of bias evaluation methods, including data-level, model-level, and output-level approaches, providing researchers with a robust toolkit for bias detection. We further explore mitigation strategies, categorizing them into pre-model, intra-model, and post-model techniques, highlighting their effectiveness and limitations. Ethical and legal implications of biased LLMs are discussed, emphasizing potential harms in real-world applications such as healthcare and criminal justice. By synthesizing current knowledge on bias in LLMs, this review contributes to the ongoing effort to develop fair and responsible AI systems. Our work serves as a comprehensive resource for researchers and practitioners working towards understanding, evaluating, and mitigating bias in LLMs, fostering the development of more equitable AI technologies.
摘要:大语言模型 (LLM) 已经彻底改变了自然语言处理领域,但其对偏见的敏感性带来了重大挑战。本综述全面探讨了 LLM 中偏见的现状,从其根源到当前的缓解策略。我们将偏见分为内在偏见和外在偏见,分析了它们在各种 NLP 任务中的表现。综述还批判性地评估了一系列偏见评估方法,包括数据层面的、模型层面的和输出层面的方法,为研究人员提供了一个强大的偏见检测工具包。我们进一步探讨了缓解策略,将其分为模型前、模型内和模型后技术,并强调了它们的效果和局限性。此外,我们还讨论了偏见 LLM 的伦理和法律影响,强调了在医疗和刑事司法等实际应用中可能造成的潜在危害。通过综合当前关于 LLM 中偏见的知识,本综述为开发公平和负责任的 AI 系统做出了贡献。我们的工作为致力于理解、评估和缓解 LLM 中偏见的研究人员和实践者提供了一个全面的资源,促进了更公平的 AI 技术的发展。

[NLP-44] BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment

【速读】: 该论文试图解决在强化学习与人类反馈 (Reinforcement Learning with Human Feedback, RLHF) 过程中,由于指令与响应数量不平衡导致的知识广度与深度学习不均衡的问题。解决方案的关键在于提出了平衡偏好优化 (Balanced Preference Optimization, BPO) 方法,该方法通过动态增强每个样本的知识深度,利用基于梯度的聚类技术来估计每个增强样本的知识信息量和有用性,从而在保持训练效率的同时,显著提升对齐调优的效果。

链接: https://arxiv.org/abs/2411.10914
作者: Sizhe Wang,Yongqi Tong,Hengyuan Zhang,Dawei Li,Xin Zhang,Tianlong Chen
关键词-EN: Human Feedback, large language models, Reinforcement Learning, recent years, success of large
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Human Feedback (RLHF) is the key to the success of large language models (LLMs) in recent years. In this work, we first introduce the concepts of knowledge breadth and knowledge depth, which measure the comprehensiveness and depth of an LLM or knowledge source respectively. We reveal that the imbalance in the number of prompts and responses can lead to a potential disparity in breadth and depth learning within alignment tuning datasets by showing that even a simple uniform method for balancing the number of instructions and responses can lead to significant improvements. Building on this, we further propose Balanced Preference Optimization (BPO), designed to dynamically augment the knowledge depth of each sample. BPO is motivated by the observation that the usefulness of knowledge varies across samples, necessitating tailored learning of knowledge depth. To achieve this, we introduce gradient-based clustering, estimating the knowledge informativeness and usefulness of each augmented sample based on the model’s optimization direction. Our experimental results across various benchmarks demonstrate that BPO outperforms other baseline methods in alignment tuning while maintaining training efficiency. Furthermore, we conduct a detailed analysis of each component of BPO, providing guidelines for future research in preference data optimization.
摘要:近年来,基于人类反馈的强化学习 (Reinforcement Learning with Human Feedback, RLHF) 是大语言模型 (Large Language Model, LLM) 取得成功的关键。在本研究中,我们首先引入了知识广度 (knowledge breadth) 和知识深度 (knowledge depth) 的概念,分别衡量一个 LLM 或知识源的全面性和深度。我们揭示了提示 (prompt) 和响应 (response) 数量不平衡可能导致对齐调优数据集中广度和深度学习的不均衡,并通过展示简单的均匀方法来平衡指令和响应的数量可以显著提升效果。在此基础上,我们进一步提出了平衡偏好优化 (Balanced Preference Optimization, BPO),旨在动态增强每个样本的知识深度。BPO 的动机在于观察到知识的有用性在不同样本间存在差异,因此需要针对性地学习知识深度。为此,我们引入了基于梯度的聚类方法,根据模型的优化方向估计每个增强样本的知识信息量和有用性。我们在多个基准测试上的实验结果表明,BPO 在对齐调优中优于其他基线方法,同时保持了训练效率。此外,我们对 BPO 的每个组成部分进行了详细分析,为未来在偏好数据优化方面的研究提供了指导。
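
为帮助理解"基于梯度的聚类来估计各增强样本的知识信息量"这一步,下面给出一个非常简化的草图(用每个样本的梯度特征向量做 KMeans 聚类,再以样本到簇中心的距离粗略近似信息量;特征构造与打分方式均为假设,并非论文原实现)。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# 假设:sample_grad_features[i] 是第 i 个增强样本在模型上引起的梯度方向特征(此处用随机数占位)
sample_grad_features = rng.normal(size=(100, 32))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample_grad_features)
centers = kmeans.cluster_centers_[kmeans.labels_]

# 假设:离簇中心越远,说明该样本携带的增量信息越多(信息量打分的一种粗略近似)
informativeness = np.linalg.norm(sample_grad_features - centers, axis=1)
top_samples = np.argsort(-informativeness)[:10]
print("优先用于加深知识深度的样本索引:", top_samples)
```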

[NLP-45] SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment

【速读】: 该论文试图解决大型语言模型(LLMs)在社会价值对齐过程中未能充分考虑多元群体价值差异的问题。解决方案的关键在于提出了SPICA框架,该框架通过三种设计来实现多元对齐:场景库(scenario banks)、群体感知度量(group-informed metrics)和上下文对齐提示(in-context alignment prompts)。SPICA在上下文示例检索过程中考虑了群体层面的差异,从而更准确地匹配不同群体的偏好。实验结果表明,SPICA对齐的模型在多元群体中的表现优于仅基于相似性检索的基线方法,所有群体均受益于这种对齐方式,而非仅限于某些群体。

链接: https://arxiv.org/abs/2411.10912
作者: Quan Ze Chen,K.J. Kevin Feng,Chan Young Park,Amy X. Zhang
关键词-EN: large language models, large language, Alignment, SPICA, groups
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Alignment of large language models (LLMs) to societal values should account for pluralistic values from diverse groups. One technique uses in-context learning for inference-time alignment, but only considers similarity when drawing few-shot examples, not accounting for cross-group differences in value prioritization. We propose SPICA, a framework for pluralistic alignment that accounts for group-level differences during in-context example retrieval. SPICA introduces three designs to facilitate pluralistic alignment: scenario banks, group-informed metrics, and in-context alignment prompts. From an evaluation of SPICA on an alignment task collecting inputs from four demographic groups ( n = 544 ), our metrics retrieve in-context examples that more closely match observed preferences, with the best prompt configuration using multiple contrastive responses to demonstrate examples. In an end-to-end evaluation ( n = 80 ), we observe that SPICA-aligned models are higher rated than a baseline similarity-only retrieval approach, with groups seeing up to a +0.16 point improvement on a 5 point scale. Additionally, gains from SPICA were more uniform, with all groups benefiting from alignment rather than only some. Finally, we find that while a group-agnostic approach can effectively align to aggregated values, it is not most suited for aligning to divergent groups.
摘要:大语言模型 (LLM) 与社会价值的对齐应考虑来自不同群体的多元价值观。一种技术使用上下文学习进行推理时对齐,但在抽取少样本示例时仅考虑相似性,未考虑跨群体在价值优先级上的差异。我们提出了 SPICA,这是一个考虑群体层面差异的多元对齐框架,用于上下文示例检索。SPICA 引入了三种设计以促进多元对齐:场景库、群体导向的度量标准和上下文对齐提示。通过对 SPICA 在一个收集了四个不同人口群体输入的对齐任务(n = 544)上的评估,我们的度量标准检索到的上下文示例更接近观察到的偏好,最佳提示配置使用多个对比响应来展示示例。在端到端评估(n = 80)中,我们观察到 SPICA 对齐的模型评分高于仅基于相似性检索的基线方法,各群体在 5 分制上最高提升了 0.16 分。此外,SPICA 带来的收益更为均匀,所有群体都从对齐中受益,而非仅部分群体。最后,我们发现尽管群体无关的方法可以有效对齐到聚合价值,但它并不最适合对齐到分歧较大的群体。
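
为说明"在上下文示例检索时引入群体层面差异"这一设计,下面给出一个示意性的打分草图(把语义相似度与目标群体对场景库中各示例的偏好评分做加权组合;嵌入与偏好分数均为虚构占位,权重 lam 为假设,并非论文的群体感知度量本身)。

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# 场景库:每个示例有一个嵌入向量,以及不同群体对其回应方式的偏好评分(虚构数据)
bank_embeddings = rng.normal(size=(50, 16))
group_preference = {"group_A": rng.uniform(size=50), "group_B": rng.uniform(size=50)}

def retrieve(query_emb, group, k=4, lam=0.5):
    """按 相似度 + lam * 群体偏好 的组合分数挑选 k 个上下文示例(示意)。"""
    scores = [cosine(query_emb, emb) + lam * group_preference[group][i]
              for i, emb in enumerate(bank_embeddings)]
    return np.argsort(scores)[-k:][::-1]

query = rng.normal(size=16)
print("group_A 检索到的示例:", retrieve(query, "group_A"))
print("group_B 检索到的示例:", retrieve(query, "group_B"))
```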

[NLP-46] BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization

【速读】: 该论文试图解决孟加拉语方言识别及多样化的孟加拉语口音转换为标准孟加拉语的问题。解决方案的关键在于构建一个端到端的处理流程,包括创建一个大规模多样化的方言语音数据集,并通过微调自动语音识别(ASR)模型和大型语言模型(LLM)来实现方言语音到方言文本的转录,以及方言文本到标准孟加拉语文本的翻译。具体来说,研究中微调了Whisper ASR模型以实现低字符错误率(CER)和词错误率(WER),同时微调了BanglaT5模型以实现高BLEU分数的方言到标准文本翻译。

链接: https://arxiv.org/abs/2411.10879
作者: Md. Nazmus Sadat Samin,Jawad Ibn Ahad,Tanjila Ahmed Medha,Fuad Rahman,Mohammad Ruhul Amin,Nabeel Mohammed,Shafin Rahman
关键词-EN: standardized formal Bengali, formal Bengali speech, recognizing Bangladeshi dialects, diverse Bengali accents, Bengali accents
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted in 2024 IEEE International Conference on Big Data (IEEE BigData)

点击查看摘要

Abstract:This study focuses on recognizing Bangladeshi dialects and converting diverse Bengali accents into standardized formal Bengali speech. Dialects, often referred to as regional languages, are distinctive variations of a language spoken in a particular location and are identified by their phonetics, pronunciations, and lexicon. Subtle changes in pronunciation and intonation are also influenced by geographic location, educational attainment, and socioeconomic status. Dialect standardization is needed to ensure effective communication, educational consistency, access to technology, economic opportunities, and the preservation of linguistic resources while respecting cultural diversity. Being the fifth most spoken language with around 55 distinct dialects spoken by 160 million people, addressing Bangla dialects is crucial for developing inclusive communication tools. However, limited research exists due to a lack of comprehensive datasets and the challenges of handling diverse dialects. With the advancement in multilingual Large Language Models (mLLMs), emerging possibilities have been created to address the challenges of dialectal Automated Speech Recognition (ASR) and Machine Translation (MT). This study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech. This investigation includes constructing a large-scale diverse dataset with dialectal speech signals that tailored the fine-tuning process in ASR and LLM for transcribing the dialect speech to dialect text and translating the dialect text to standard Bangla text. Our experiments demonstrated that fine-tuning the Whisper ASR model achieved a CER of 0.8% and WER of 1.5%, while the BanglaT5 model attained a BLEU score of 41.6% for dialect-to-standard text translation.
摘要:本研究专注于识别孟加拉语方言,并将多种孟加拉语口音转换为标准化的正式孟加拉语语音。方言,通常被称为地区语言,是特定地区内语言的独特变体,通过其语音、发音和词汇来识别。发音和语调的细微变化也受到地理位置、教育程度和社会经济地位的影响。方言标准化对于确保有效沟通、教育一致性、技术接入、经济机会以及在尊重文化多样性的同时保护语言资源至关重要。作为全球第五大使用语言,拥有约55种不同方言,由1.6亿人使用,解决孟加拉语方言问题对于开发包容性沟通工具至关重要。然而,由于缺乏全面的数据集和处理多样方言的挑战,相关研究有限。随着多语言大语言模型(mLLMs)的进步,出现了应对方言自动语音识别(ASR)和机器翻译(MT)挑战的新可能性。本研究提出了一种端到端的管道,用于将诺阿卡利方言语音转换为标准孟加拉语语音。该研究包括构建一个大规模多样化的数据集,其中包含方言语音信号,这些信号针对ASR和LLM的微调过程进行了定制,以将方言语音转录为方言文本,并将方言文本翻译为标准孟加拉语文本。我们的实验表明,微调Whisper ASR模型实现了0.8%的字符错误率(CER)和1.5%的词错误率(WER),而BanglaT5模型在方言到标准文本翻译中达到了41.6%的BLEU分数。
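
摘要中的 CER 与 WER 都是基于编辑距离的错误率;下面给出一个自包含的 Python 草图(不依赖论文的数据与模型,示例句子为任意占位文本),说明这两个指标在方言转写评测中如何计算。

```python
def edit_distance(ref, hyp):
    """经典的 Levenshtein 距离(插入/删除/替换各记 1)。"""
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[-1][-1]

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)   # 词级编辑距离 / 参考词数

def cer(ref: str, hyp: str) -> float:
    r, h = ref.replace(" ", ""), hyp.replace(" ", "")
    return edit_distance(r, h) / max(len(r), 1)   # 字符级编辑距离 / 参考字符数

ref = "ami bhat kheyechi"      # 占位参考转写
hyp = "ami bhaat kheyechi"     # 占位识别结果
print("WER =", round(wer(ref, hyp), 3), "CER =", round(cer(ref, hyp), 3))
```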

[NLP-47] Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis

【速读】: 该论文试图解决手动进行元分析(meta-analysis)过程中存在的劳动密集、耗时且易出错的问题。解决方案的关键在于利用大型语言模型(LLMs)进行自动化和优化。具体来说,研究通过在广泛的科学数据集上微调LLM,并结合检索增强生成(Retrieval Augmented Generation, RAG)技术,以及通过提示工程(prompt engineering)和新设计的损失度量——逆余弦距离(Inverse Cosine Distance, ICD),来高效生成结构化的元分析内容。这种方法显著提高了生成内容的准确性和相关性,使微调后的LLM在生成相关元分析摘要方面达到了87.6%的准确率,同时将不相关性从4.56%降低到1.9%。

链接: https://arxiv.org/abs/2411.10878
作者: Jawad Ibn Ahad,Rafeed Mohammad Sultan,Abraham Kaikobad,Fuad Rahman,Mohammad Ruhul Amin,Nabeel Mohammed,Shafin Rahman
关键词-EN: Retrieval Augmented Generation, meta-analysis, Inverse Cosine Distance, large language models, scientific documents
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted in 2024 IEEE International Conference on Big Data (IEEE BigData)

点击查看摘要

Abstract:This study investigates the automation of meta-analysis in scientific documents using large language models (LLMs). Meta-analysis is a robust statistical method that synthesizes the findings of multiple studies support articles to provide a comprehensive understanding. We know that a meta-article provides a structured analysis of several articles. However, conducting meta-analysis by hand is labor-intensive, time-consuming, and susceptible to human error, highlighting the need for automated pipelines to streamline the process. Our research introduces a novel approach that fine-tunes the LLM on extensive scientific datasets to address challenges in big data handling and structured data extraction. We automate and optimize the meta-analysis process by integrating Retrieval Augmented Generation (RAG). Tailored through prompt engineering and a new loss metric, Inverse Cosine Distance (ICD), designed for fine-tuning on large contextual datasets, LLMs efficiently generate structured meta-analysis content. Human evaluation then assesses relevance and provides information on model performance in key metrics. This research demonstrates that fine-tuned models outperform non-fine-tuned models, with fine-tuned LLMs generating 87.6% relevant meta-analysis abstracts. The relevance of the context, based on human evaluation, shows a reduction in irrelevancy from 4.56% to 1.9%. These experiments were conducted in a low-resource environment, highlighting the study’s contribution to enhancing the efficiency and reliability of meta-analysis automation.
摘要:本研究探讨了使用大语言模型 (LLM) 对科学文献进行元分析的自动化。元分析是一种稳健的统计方法,它综合了多篇研究支持文章的发现,以提供全面的理解。我们知道,元文章提供了对多篇文章的结构化分析。然而,手工进行元分析既费时又费力,且容易出现人为错误,这凸显了自动化流程以简化该过程的必要性。我们的研究引入了一种新颖的方法,即在广泛的科学数据集上对 LLM 进行微调,以应对大数据处理和结构化数据提取的挑战。通过集成检索增强生成 (RAG),我们自动化并优化了元分析过程。通过提示工程和一种新的损失度量——逆余弦距离 (ICD),该度量专为在大规模上下文数据集上的微调设计,LLM 能够高效生成结构化的元分析内容。随后,通过人工评估来评估相关性,并提供模型在关键指标上的性能信息。研究表明,经过微调的模型优于未经过微调的模型,经过微调的 LLM 生成的元分析摘要中有 87.6% 是相关的。基于人工评估的上下文相关性显示,不相关性从 4.56% 降低到 1.9%。这些实验在低资源环境下进行,突显了本研究对提高元分析自动化效率和可靠性的贡献。

[NLP-48] Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm

【速读】: 该论文试图解决交通控制系统中的优化问题,特别是通过利用大型语言模型 (Large Language Models, LLMs) 作为交通控制器来提高交通流量效率和实时反馈。解决方案的关键在于利用LLMs的逻辑推理、场景理解和决策能力,将传统的分散交通控制过程集中化,并整合来自不同来源的交通数据以提供上下文感知的决策。此外,LLMs能够通过无线信号和视觉等多种方式向驾驶员、基础设施和自动驾驶车辆提供定制化的输出。论文通过四阶段的评估方法,包括数据创建与环境初始化、提示工程、冲突识别和微调,验证了LLMs在交通控制中的有效性,特别是GPT-mini模型在冲突识别、决策制定、优先级分配和等待时间优化方面表现出色,达到了83%的准确率和0.84的F1-score。

链接: https://arxiv.org/abs/2411.10869
作者: Sari Masri,Huthaifa I. Ashqar,Mohammed Elhenawy
关键词-EN: Large Language Models, Large Language, Language Models, traffic control systems, traffic
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: The data and code that support the findings of this study are openly available in Zenodo at this https URL , reference number 14171745

点击查看摘要

Abstract:This study introduces a novel approach for traffic control systems by using Large Language Models (LLMs) as traffic controllers. The study utilizes their logical reasoning, scene understanding, and decision-making capabilities to optimize throughput and provide feedback based on traffic conditions in real-time. LLMs centralize traditionally disconnected traffic control processes and can integrate traffic data from diverse sources to provide context-aware decisions. LLMs can also deliver tailored outputs using various means such as wireless signals and visuals to drivers, infrastructures, and autonomous vehicles. To evaluate LLMs’ ability as traffic controllers, this study proposed a four-stage methodology. The methodology includes data creation and environment initialization, prompt engineering, conflict identification, and fine-tuning. We simulated multi-lane four-leg intersection scenarios and generated detailed datasets to enable conflict detection using LLMs and Python simulation as a ground truth. We used chain-of-thought prompts to lead LLMs in understanding the context, detecting conflicts, resolving them using traffic rules, and delivering context-sensitive traffic management solutions. We evaluated the performance of GPT-mini, Gemini, and Llama as traffic controllers. Results showed that the fine-tuned GPT-mini achieved 83% accuracy and an F1-score of 0.84. The GPT-mini model exhibited a promising performance in generating actionable traffic management insights, with high ROUGE-L scores across conflict identification of 0.95, decision-making of 0.91, priority assignment of 0.94, and waiting time optimization of 0.92. We demonstrated that LLMs can offer precise recommendations to drivers in real-time including yielding, slowing, or stopping based on vehicle dynamics.
摘要:本研究提出了一种利用大语言模型 (LLM) 作为交通控制器的新方法。研究利用其逻辑推理、场景理解和决策能力,实时优化交通流量并根据交通状况提供反馈。LLM 将传统上分散的交通控制过程集中化,并能整合来自不同来源的交通数据,以提供情境感知的决策。LLM 还可以通过无线信号和视觉等多种方式向驾驶员、基础设施和自动驾驶车辆提供定制化的输出。为了评估 LLM 作为交通控制器的能力,本研究提出了一种四阶段方法论。该方法论包括数据创建和环境初始化、提示工程、冲突识别和微调。我们模拟了多车道四向交叉口场景,并生成了详细的数据集,以使用 LLM 和 Python 模拟作为基准进行冲突检测。我们使用思维链提示引导 LLM 理解上下文、检测冲突、利用交通规则解决冲突,并提供情境敏感的交通管理解决方案。我们评估了 GPT-mini、Gemini 和 Llama 作为交通控制器的性能。结果显示,经过微调的 GPT-mini 达到了 83% 的准确率和 0.84 的 F1 分数。GPT-mini 模型在生成可操作的交通管理洞察方面表现出色,冲突识别、决策制定、优先级分配和等待时间优化的 ROUGE-L 分数分别为 0.95、0.91、0.94 和 0.92。我们证明了 LLM 可以实时向驾驶员提供精确的建议,包括让行、减速或停车,基于车辆动态。
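
To make the chain-of-thought prompting step more concrete, here is a minimal Python sketch of a prompt builder for intersection conflict reasoning. The field names, wording, and right-of-way phrasing are hypothetical; the paper's actual prompts and data schema may differ.

```python
# Hypothetical prompt builder; the schema and phrasing are illustrative, not the paper's.
def build_cot_prompt(vehicle_states: list[dict]) -> str:
    header = (
        "You are a traffic controller at a multi-lane four-leg intersection.\n"
        "Think step by step: (1) list each vehicle's approach and intended movement, "
        "(2) identify pairs whose paths conflict, (3) resolve conflicts using right-of-way "
        "rules, (4) output an instruction (yield/slow/stop/proceed) for every vehicle."
    )
    rows = "\n".join(
        f"- id={v['id']}, approach={v['approach']}, movement={v['movement']}, speed={v['speed_kmh']} km/h"
        for v in vehicle_states
    )
    return f"{header}\n\nVehicles:\n{rows}\n\nReasoning:"

print(build_cot_prompt([
    {"id": "A", "approach": "north", "movement": "left", "speed_kmh": 30},
    {"id": "B", "approach": "south", "movement": "through", "speed_kmh": 45},
]))
```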

[NLP-49] Large Vision-Language Models for Remote Sensing Visual Question Answering

【速读】: 该论文试图解决遥感视觉问答 (Remote Sensing Visual Question Answering, RSVQA) 任务中传统方法依赖于单独的视觉特征提取器和语言处理模型,导致计算量大且难以处理开放性问题的挑战。解决方案的关键在于提出了一种利用生成式大视觉-语言模型 (Large Vision-Language Model, LVLM) 的新方法,通过领域自适应预训练和基于提示的微调两步训练策略,使模型能够基于视觉和文本输入生成自然语言答案,无需预定义答案类别。该方法在RSVQAxBEN数据集上表现优于现有最先进基线,并在人类评估中显示出更高的准确性、相关性和流畅性。

链接: https://arxiv.org/abs/2411.10857
作者: Surasakdi Siripong,Apirak Chaiyapan,Thanakorn Phonchai
关键词-EN: Visual Question Answering, involves interpreting complex, interpreting complex satellite, complex satellite imagery, Question Answering
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Remote Sensing Visual Question Answering (RSVQA) is a challenging task that involves interpreting complex satellite imagery to answer natural language questions. Traditional approaches often rely on separate visual feature extractors and language processing models, which can be computationally intensive and limited in their ability to handle open-ended questions. In this paper, we propose a novel method that leverages a generative Large Vision-Language Model (LVLM) to streamline the RSVQA process. Our approach consists of a two-step training strategy: domain-adaptive pretraining and prompt-based finetuning. This method enables the LVLM to generate natural language answers by conditioning on both visual and textual inputs, without the need for predefined answer categories. We evaluate our model on the RSVQAxBEN dataset, demonstrating superior performance compared to state-of-the-art baselines. Additionally, a human evaluation study shows that our method produces answers that are more accurate, relevant, and fluent. The results highlight the potential of generative LVLMs in advancing the field of remote sensing analysis.
摘要:遥感视觉问答 (RSVQA) 是一项具有挑战性的任务,涉及解释复杂的卫星图像以回答自然语言问题。传统方法通常依赖于独立的视觉特征提取器和语言处理模型,这些方法在计算上较为密集,并且在处理开放式问题时能力有限。本文提出了一种新颖的方法,利用生成式大视觉-语言模型 (LVLM) 来简化 RSVQA 过程。我们的方法包括两步训练策略:领域自适应预训练和基于提示的微调。这种方法使 LVLM 能够在不需要预定义答案类别的情况下,通过视觉和文本输入生成自然语言答案。我们在 RSVQAxBEN 数据集上评估了我们的模型,结果显示其性能优于最先进的基线模型。此外,一项人类评估研究显示,我们的方法生成的答案在准确性、相关性和流畅性方面表现更佳。这些结果突显了生成式 LVLM 在推进遥感分析领域中的潜力。

[NLP-50] Information Anxiety in Large Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在处理高频实体查询时出现的“信息焦虑”问题,即模型在面对高度流行的查询时,难以从其参数化记忆中分离出基于不同关系的具体事实。解决方案的关键在于深入分析LLMs的内部推理和检索机制,特别是实体流行度、查询词汇变异敏感性以及隐藏状态表示在模型层间的演化。论文通过案例研究揭示了这些潜在问题,并强调需要对高频实体进行更全面的评估,以应对语言变异带来的对抗性挑战。

链接: https://arxiv.org/abs/2411.10813
作者: Prasoon Bajpai,Sarah Masud,Tanmoy Chakraborty
关键词-EN: Large Language Models, Large Language, demonstrated strong performance, Language Models, enabling models
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong performance as knowledge repositories, enabling models to understand user queries and generate accurate and context-aware responses. Extensive evaluation setups have corroborated the positive correlation between the retrieval capability of LLMs and the frequency of entities in their pretraining corpus. We take the investigation further by conducting a comprehensive analysis of the internal reasoning and retrieval mechanisms of LLMs. Our work focuses on three critical dimensions - the impact of entity popularity, the models’ sensitivity to lexical variations in query formulation, and the progression of hidden state representations across LLM layers. Our preliminary findings reveal that popular questions facilitate early convergence of internal states toward the correct answer. However, as the popularity of a query increases, retrieved attributes across lexical variations become increasingly dissimilar and less accurate. Interestingly, we find that LLMs struggle to disentangle facts, grounded in distinct relations, from their parametric memory when dealing with highly popular subjects. Through a case study, we explore these latent strains within LLMs when processing highly popular queries, a phenomenon we term information anxiety. The emergence of information anxiety in LLMs underscores the adversarial injection in the form of linguistic variations and calls for a more holistic evaluation of frequently occurring entities.
摘要:大语言模型 (LLM) 作为知识库展示了强大的性能,使模型能够理解用户查询并生成准确且上下文感知的响应。广泛的评估设置证实了 LLM 的检索能力与其预训练语料库中实体频率之间的正相关关系。我们进一步通过全面分析 LLM 的内部推理和检索机制来深入研究这一问题。我们的工作聚焦于三个关键维度——实体流行度的影响、模型对查询表述中词汇变化的敏感性以及 LLM 层间隐藏状态表示的演变。我们的初步发现表明,流行问题有助于内部状态向正确答案的早期收敛。然而,随着查询的流行度增加,跨词汇变化的检索属性变得愈发不相似且不准确。有趣的是,我们发现 LLM 在处理高度流行的主题时,难以从其参数化记忆中区分基于不同关系的具体事实。通过案例研究,我们探讨了 LLM 在处理高度流行查询时存在的潜在压力,这种现象我们称之为信息焦虑。LLM 中信息焦虑的出现突显了语言变体形式的对抗性注入,并呼吁对频繁出现的实体进行更全面的评估。
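
A simple way to probe how hidden-state representations evolve across layers, in the spirit of the analysis described above, is to compare each layer's last-token state with the final layer's state. The sketch below uses Hugging Face Transformers with GPT-2 as a small stand-in model; the similarity metric and the prompt are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch (assumed setup): inspect how hidden states evolve across layers for a factual query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in model; the paper studies larger LLMs
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Cosine similarity between each layer's last-token state and the final layer's state,
# a rough proxy for how early the model "converges" toward its answer representation.
final = out.hidden_states[-1][0, -1]
for i, h in enumerate(out.hidden_states):
    sim = torch.nn.functional.cosine_similarity(h[0, -1], final, dim=0)
    print(f"layer {i:2d}: cos-sim to final layer = {sim.item():.3f}")
```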

[NLP-51] Can Generic LLMs Help Analyze Child-adult Interactions Involving Children with Autism in Clinical Observation? NEURIPS2024 ALT

【速读】: 该论文试图解决在临床环境中,大型语言模型(Large Language Models, LLMs)在分析儿童与成人互动(特别是涉及自闭症谱系障碍儿童)方面的应用问题。解决方案的关键在于评估LLMs在四个任务中的表现:分类儿童与成人的对话、预测参与的活动、识别语言技能以及理解临床相关的特质。研究结果表明,LLMs在分析长且复杂的临床观察对话中表现出色,通常超过非专家人类评估者的表现,显示出其在分割感兴趣的互动、辅助语言技能评估、识别参与活动以及提供临床相关评估背景方面的潜力。

链接: https://arxiv.org/abs/2411.10761
作者: Tiantian Feng,Anfeng Xu,Rimita Lahiri,Helen Tager-Flusberg,So Hyun Kim,Somer Bishop,Catherine Lord,Shrikanth Narayanan
关键词-EN: Large Language Models, shown significant potential, Large Language, Language Models, shown significant
类目: Computation and Language (cs.CL)
备注: GenAI for Health Workshop, NeurIPS 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have shown significant potential in understanding human communication and interaction. However, their performance in the domain of child-inclusive interactions, including in clinical settings, remains less explored. In this work, we evaluate generic LLMs’ ability to analyze child-adult dyadic interactions in a clinically relevant context involving children with ASD. Specifically, we explore LLMs in performing four tasks: classifying child-adult utterances, predicting engaged activities, recognizing language skills and understanding traits that are clinically relevant. Our evaluation shows that generic LLMs are highly capable of analyzing long and complex conversations in clinical observation sessions, often surpassing the performance of non-expert human evaluators. The results show their potential to segment interactions of interest, assist in language skills evaluation, identify engaged activities, and offer clinical-relevant context for assessments.
摘要:大语言模型 (Large Language Models, LLMs) 在理解人类交流和互动方面展现了显著的潜力。然而,在涉及儿童的互动领域,尤其是在临床环境中,其表现尚未得到充分探索。在本研究中,我们评估了通用 LLMs 在分析涉及自闭症谱系障碍 (ASD) 儿童的儿童-成人二元互动中的能力。具体而言,我们探讨了 LLMs 在执行以下四项任务中的表现:分类儿童-成人话语、预测参与的活动、识别语言技能以及理解与临床相关的特质。我们的评估结果显示,通用 LLMs 在分析临床观察会话中的长而复杂的对话时,通常能超越非专家人类评估者的表现。这些结果表明,LLMs 具有分割感兴趣的互动、辅助语言技能评估、识别参与活动以及提供临床相关背景信息以进行评估的潜力。

[NLP-52] Chain-of-Programming (CoP): Empowering Large Language Models for Geospatial Code Generation

【速读】: 该论文试图解决大型语言模型(LLMs)在地理空间代码生成过程中由于用户需求不完整或不清晰以及对特定平台语法规则知识不足而导致的“代码幻觉”问题,即生成不可执行的代码。解决方案的关键在于提出了一个名为“编程链(Chain of Programming, CoP)”的框架,该框架将代码生成过程分解为需求分析、算法设计、代码实现、代码调试和代码注释五个步骤。CoP框架通过引入共享信息池、知识库检索和用户反馈机制,形成了一个从需求到代码的端到端生成流程,无需模型微调。该策略显著提高了生成代码的逻辑清晰性、语法正确性和可执行性,并通过实验验证了其在地理空间任务中的优越性和关键组件的合理性与必要性。

链接: https://arxiv.org/abs/2411.10753
作者: Shuyang Hou,Haoyue Jiao,Zhangxiao Shen,Jianyuan Liang,Anqi Zhao,Xiaopu Zhang,Jianxun Wang,Huayi Wu
关键词-EN: code generation, geospatial code generation, large language models, code generation technology, code generation process
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid growth of interdisciplinary demands for geospatial modeling and the rise of large language models (LLMs), geospatial code generation technology has seen significant advancements. However, existing LLMs often face challenges in the geospatial code generation process due to incomplete or unclear user requirements and insufficient knowledge of specific platform syntax rules, leading to the generation of non-executable code, a phenomenon known as “code hallucination.” To address this issue, this paper proposes a Chain of Programming (CoP) framework, which decomposes the code generation process into five steps: requirement analysis, algorithm design, code implementation, code debugging, and code annotation. The framework incorporates a shared information pool, knowledge base retrieval, and user feedback mechanisms, forming an end-to-end code generation flow from requirements to code without the need for model fine-tuning. Based on a geospatial problem classification framework and evaluation benchmarks, the CoP strategy significantly improves the logical clarity, syntactical correctness, and executability of the generated code, with improvements ranging from 3.0% to 48.8%. Comparative and ablation experiments further validate the superiority of the CoP strategy over other optimization approaches and confirm the rationality and necessity of its key components. Through case studies on building data visualization and fire data analysis, this paper demonstrates the application and effectiveness of CoP in various geospatial scenarios. The CoP framework offers a systematic, step-by-step approach to LLM-based geospatial code generation tasks, significantly enhancing code generation performance in geospatial tasks and providing valuable insights for code generation in other vertical domains.
摘要:随着地理空间建模跨学科需求的快速增长以及大语言模型(Large Language Models, LLMs)的兴起,地理空间代码生成技术取得了显著进展。然而,现有的 LLMs 在地理空间代码生成过程中常常面临用户需求不完整或不清晰,以及对特定平台语法规则了解不足的问题,导致生成的代码无法执行,这种现象被称为“代码幻觉”(code hallucination)。为解决这一问题,本文提出了一种编程链(Chain of Programming, CoP)框架,将代码生成过程分解为五个步骤:需求分析、算法设计、代码实现、代码调试和代码注释。该框架集成了共享信息池、知识库检索和用户反馈机制,形成了一个从需求到代码的端到端生成流程,无需模型微调。基于地理空间问题分类框架和评估基准,CoP 策略显著提升了生成代码的逻辑清晰度、语法正确性和可执行性,改进幅度从 3.0% 到 48.8% 不等。对比实验和消融实验进一步验证了 CoP 策略相对于其他优化方法的优越性,并确认了其关键组件的合理性和必要性。通过构建数据可视化和火灾数据分析的案例研究,本文展示了 CoP 在多种地理空间场景中的应用和有效性。CoP 框架为基于 LLM 的地理空间代码生成任务提供了一种系统化、逐步推进的方法,显著提升了地理空间任务中的代码生成性能,并为其他垂直领域的代码生成提供了有价值的参考。
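
The five-stage decomposition can be pictured as a simple pipeline in which each stage's output is appended to a shared context before the next stage runs. The sketch below is a minimal mock-up under that assumption; `call_llm`, the stage prompts, and the shared-pool structure are placeholders rather than the CoP implementation.

```python
# Minimal sketch of a five-stage pipeline in the spirit of CoP; all names are illustrative.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call (e.g., an HTTP request to a hosted model).
    return f"[LLM output for prompt: {prompt[:40]}...]"

@dataclass
class SharedPool:
    records: dict = field(default_factory=dict)

STAGES = ["requirement analysis", "algorithm design", "code implementation",
          "code debugging", "code annotation"]

def chain_of_programming(user_request: str) -> dict:
    pool = SharedPool()
    context = user_request
    for stage in STAGES:
        prompt = f"Stage: {stage}\nContext so far:\n{context}\nProduce the output of this stage."
        result = call_llm(prompt)
        pool.records[stage] = result
        context += f"\n\n[{stage}]\n{result}"   # each stage's result feeds the next stage
    return pool.records

print(chain_of_programming("Plot building footprints from a GeoJSON file with matplotlib"))
```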

[NLP-53] Comparison of Multilingual and Bilingual Models for Satirical News Detection of Arabic and English ALT

【速读】: 该论文试图解决讽刺新闻与真实新闻的区分问题,尤其是在不同文化和社交背景下,讽刺新闻容易被误解为虚假信息。解决方案的关键在于利用多语言讽刺检测方法,并通过零样本学习和链式思维(Chain-of-Thought, CoT)提示技术来提升语言模型在讽刺检测任务中的表现。研究结果表明,链式思维提示技术显著提升了Jais-chat模型在英语讽刺检测中的性能,F1-score达到80%,突显了结构化推理在复杂任务中的重要性。

链接: https://arxiv.org/abs/2411.10730
作者: Omar W. Abdalla,Aditya Joshi,Rahat Masood,Salil S. Kanhere
关键词-EN: real news combined, exaggerated content, humorous comment, comment or exaggerated, mimics the format
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: ALTA 2024 (Selected for publication)

点击查看摘要

Abstract:Satirical news is real news combined with a humorous comment or exaggerated content, and it often mimics the format and style of real news. However, satirical news is often misunderstood as misinformation, especially by individuals from different cultural and social backgrounds. This research addresses the challenge of distinguishing satire from truthful news by leveraging multilingual satire detection methods in English and Arabic. We explore both zero-shot and chain-of-thought (CoT) prompting using two language models, Jais-chat(13B) and LLaMA-2-chat(7B). Our results show that CoT prompting offers a significant advantage for the Jais-chat model over the LLaMA-2-chat model. Specifically, Jais-chat achieved the best performance, with an F1-score of 80% in English when using CoT prompting. These results highlight the importance of structured reasoning in CoT, which enhances contextual understanding and is vital for complex tasks like satire detection.
摘要:讽刺新闻是将真实新闻与幽默评论或夸张内容结合的产物,通常模仿真实新闻的格式和风格。然而,讽刺新闻往往被误解为虚假信息,尤其是在不同文化和社交背景的人群中。本研究通过利用英语和阿拉伯语的多语言讽刺检测方法,来应对区分讽刺新闻与真实新闻的挑战。我们探索了使用两种语言模型——Jais-chat(13B) 和 LLaMA-2-chat(7B)——进行零样本和思维链 (Chain-of-Thought, CoT) 提示的方法。结果显示,CoT 提示为 Jais-chat 模型提供了显著优势,超过了 LLaMA-2-chat 模型。具体而言,Jais-chat 在使用 CoT 提示时在英语中达到了最佳表现,F1 分数为 80%。这些结果突显了 CoT 中结构化推理的重要性,这种推理增强了上下文理解,对于讽刺检测这类复杂任务至关重要。

[NLP-54] HJ-Ky-0.1: an Evaluation Dataset for Kyrgyz Word Embeddings

【速读】: 该论文试图解决在现代应用计算语言学中构建词向量表示(word embeddings)的关键任务,特别是在吉尔吉斯语(Kyrgyz language)中的应用。解决方案的关键在于引入首个“银标准”数据集,用于评估词向量的质量,并通过专家评估的词相似性计算向量间的距离。论文还训练了相应的模型,并通过质量评估指标验证了数据集的适用性。

链接: https://arxiv.org/abs/2411.10724
作者: Anton Alekseev,Gulnara Kabaeva
关键词-EN: modern applied computational, applied computational linguistics, information extraction, address natural language, natural language processing
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:One of the key tasks in modern applied computational linguistics is constructing word vector representations (word embeddings), which are widely used to address natural language processing tasks such as sentiment analysis, information extraction, and more. To choose an appropriate method for generating these word embeddings, quality assessment techniques are often necessary. A standard approach involves calculating distances between vectors for words with expert-assessed ‘similarity’. This work introduces the first ‘silver standard’ dataset for such tasks in the Kyrgyz language, alongside training corresponding models and validating the dataset’s suitability through quality evaluation metrics.
摘要:现代应用计算语言学中的一个关键任务是构建词向量表示(word embeddings),这些表示广泛用于解决情感分析、信息提取等自然语言处理任务。为了选择合适的生成这些词向量的方法,通常需要进行质量评估。一种标准的方法是计算专家评估的“相似性”词语之间的向量距离。本文首次为吉尔吉斯语引入了此类任务的“银标准”数据集,并训练了相应的模型,通过质量评估指标验证了数据集的适用性。
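
The evaluation protocol described above is the standard word-similarity benchmark: compute cosine similarity between embedding pairs and correlate it with expert judgments. Below is a minimal sketch with made-up Kyrgyz-looking words, random vectors, and invented scores, purely to show the mechanics.

```python
# Sketch of the standard word-similarity evaluation: correlate model cosine
# similarities with human/expert scores. All word pairs and scores are made up.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings and expert similarity judgments (0-10 scale)
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["tash", "too", "suu", "kitep"]}
pairs = [("tash", "too", 7.5), ("suu", "kitep", 1.0), ("too", "suu", 3.0)]

model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [s for _, _, s in pairs]
rho, p = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```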

[NLP-55] A Regularized LSTM Method for Detecting Fake News Articles

【速读】: 该论文试图解决虚假新闻快速传播带来的问题,关键在于开发一种先进的机器学习解决方案来检测虚假新闻文章。解决方案的核心是通过利用一个包含23,502篇虚假新闻和21,417篇准确新闻的综合数据集,实现并评估三种机器学习模型。这些模型包括长短期记忆网络(LSTM)、经过正则化和超参数调优的改进模型,以及结合前两种模型优势并采用高级优化策略的最终模型。最终模型达到了98%的峰值准确率,展示了其在高精度识别虚假新闻方面的有效性。该研究强调了在自然语言处理和机器学习技术方面的显著进步,并为对抗虚假新闻提供了有价值的工具。

链接: https://arxiv.org/abs/2411.10713
作者: Tanjina Sultana Camelia,Faizur Rahman Fahim,Md. Musfique Anwar
关键词-EN: rapid diffusion, Nowadays, fake, articles, fake news articles
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 6 pages, 7 figures, 2024 IEEE International Conference on Signal Processing, Information, Communication and Systems (SPICSCON)

点击查看摘要

Abstract:Nowadays, the rapid diffusion of fake news poses a significant problem, as it can spread misinformation and confusion. This paper aims to develop an advanced machine learning solution for detecting fake news articles. Leveraging a comprehensive dataset of news articles, including 23,502 fake news articles and 21,417 accurate news articles, we implemented and evaluated three machine-learning models. Our dataset, curated from diverse sources, provides rich textual content categorized into title, text, subject, and Date features. These features are essential for training robust classification models to distinguish between fake and authentic news articles. The initial model employed a Long Short-Term Memory (LSTM) network, achieving an accuracy of 94%. The second model improved upon this by incorporating additional regularization techniques and fine-tuning hyperparameters, resulting in a 97% accuracy. The final model combined the strengths of previous architectures with advanced optimization strategies, achieving a peak accuracy of 98%. These results demonstrate the effectiveness of our approach in identifying fake news with high precision. Implementing these models showcases significant advancements in natural language processing and machine learning techniques, contributing valuable tools for combating misinformation. Our work highlights the potential for deploying such models in real-world applications, providing a reliable method for automated fake news detection and enhancing the credibility of news dissemination.
摘要:当前,虚假新闻的快速传播已成为一个重大问题,因为它能够散布错误信息和引起混乱。本文旨在开发一种先进的机器学习解决方案,用于检测虚假新闻文章。我们利用了一个包含新闻文章的综合数据集,其中包括23,502篇虚假新闻文章和21,417篇准确新闻文章,并实施和评估了三种机器学习模型。我们的数据集从多个来源精心筛选,提供了丰富的文本内容,分为标题、文本、主题和日期等特征。这些特征对于训练稳健的分类模型以区分虚假和真实新闻文章至关重要。初始模型采用了长短期记忆网络(LSTM),达到了94%的准确率。第二个模型通过引入额外的正则化技术和微调超参数,将准确率提升至97%。最终模型结合了先前架构的优势与先进的优化策略,达到了98%的峰值准确率。这些结果展示了我们方法在识别虚假新闻方面的高效性。实施这些模型展示了自然语言处理和机器学习技术的显著进步,为对抗错误信息提供了宝贵的工具。我们的工作突显了在实际应用中部署此类模型的潜力,提供了一种可靠的自动化虚假新闻检测方法,并增强了新闻传播的可信度。
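
For orientation, the sketch below shows a small Keras LSTM text classifier with dropout and L2 regularization, the same family of techniques the abstract describes. Layer sizes, vocabulary size, and regularization strengths are arbitrary choices, not the paper's configuration.

```python
# Illustrative regularized LSTM classifier for binary fake/real labels.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

vocab_size, max_len = 20000, 300

model = tf.keras.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, 128),
    layers.LSTM(64, dropout=0.3, recurrent_dropout=0.2,
                kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dense(32, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),   # fake vs. real
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```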

[NLP-56] Structured Dialogue System for Mental Health: An LLM Chatbot Leveraging the PM+ Guidelines

【速读】: 该论文试图解决在多轮心理咨询中,现有基于大型语言模型(LLM)的聊天机器人通常忽视咨询过程中动态阶段变化的问题。解决方案的关键在于提出了一个名为SuDoSys的结构化对话系统,该系统不仅利用LLM生成对话,还通过一个阶段感知的指令生成器、响应解包器、主题数据库和阶段控制器来管理对话流程。SuDoSys能够识别并适应咨询的不同阶段,存储关键信息以确保对话的连贯性和方向性,并通过模拟咨询客户与系统交互的自动评估技术来验证其性能。

链接: https://arxiv.org/abs/2411.10681
作者: Yixiang Chen,Xinyu Zhang,Jinran Wang,Xurong Xie,Nan Yan,Hui Chen,Lan Wang
关键词-EN: Large Language Model, innovative Large Language, Language Model, Large Language, based chatbot designed
类目: Computation and Language (cs.CL)
备注: Accepted to the 16th International Conference on Social Robotic (ICSR 2024)

点击查看摘要

Abstract:The Structured Dialogue System, referred to as SuDoSys, is an innovative Large Language Model (LLM)-based chatbot designed to provide psychological counseling. SuDoSys leverages the World Health Organization (WHO)'s Problem Management Plus (PM+) guidelines to deliver stage-aware multi-turn dialogues. Existing methods for employing an LLM in multi-turn psychological counseling typically involve direct fine-tuning using generated dialogues, often neglecting the dynamic stage shifts of counseling sessions. Unlike previous approaches, SuDoSys considers the different stages of counseling and stores essential information throughout the counseling process, ensuring coherent and directed conversations. The system employs an LLM, a stage-aware instruction generator, a response unpacker, a topic database, and a stage controller to maintain dialogue flow. In addition, we propose a novel technique that simulates counseling clients to interact with the evaluated system and evaluate its performance automatically. When assessed using both objective and subjective evaluations, SuDoSys demonstrates its effectiveness in generating logically coherent responses. The system’s code and program scripts for evaluation are open-sourced.
摘要:结构化对话系统,简称 SuDoSys,是一种基于大语言模型 (LLM) 的创新型聊天机器人,旨在提供心理咨询服务。SuDoSys 利用世界卫生组织 (WHO) 的问题管理加 (PM+) 指南,提供阶段感知的多次对话。现有的将 LLM 应用于多次心理咨询的方法通常涉及使用生成的对话进行直接微调,往往忽视了咨询过程中动态的阶段转换。与以往的方法不同,SuDoSys 考虑了咨询的不同阶段,并在整个咨询过程中存储关键信息,确保对话的连贯性和导向性。该系统采用 LLM、阶段感知指令生成器、响应解包器、主题数据库和阶段控制器来维持对话流程。此外,我们提出了一种新技术,模拟咨询客户与评估系统进行交互,并自动评估其性能。通过客观和主观评估,SuDoSys 展示了其在生成逻辑连贯响应方面的有效性。系统的代码和评估程序脚本已开源。

[NLP-57] IntentGPT: Few-shot Intent Discovery with Large Language Models ICLR2024

链接: https://arxiv.org/abs/2411.10670
作者: Juan A. Rodriguez,Nicholas Botzer,David Vazquez,Christopher Pal,Marco Pedersoli,Issam Laradji
关键词-EN:
类目: Computation and Language (cs.CL)
备注: ICLR 2024 Workshop on LLM Agents

点击查看摘要

[NLP-58] SAM Decoding: Speculative Decoding via Suffix Automaton

【速读】: 该论文试图解决大语言模型 (Large Language Models, LLMs) 在推理速度上的瓶颈问题,特别是由于其庞大的参数规模和自回归 (autoregressive) 特性导致的推理速度限制。解决方案的关键在于引入了一种基于检索的推测解码方法 (SAM-Decoding),该方法利用后缀自动机 (suffix automaton) 进行高效且准确的草稿生成。与现有的基于 n-gram 匹配的方法不同,SAM-Decoding 在生成文本和文本语料库中寻找最长的后缀匹配,从而在每个生成步骤中实现平均时间复杂度为 O(1)。SAM-Decoding 分别构建了静态和动态后缀自动机用于文本语料库和输入提示,从而实现快速且精确的草稿生成。此外,SAM-Decoding 设计为可与现有方法结合,能够根据匹配长度自适应选择草稿生成策略,从而提高 LLM 的推理速度。在结合 Token Recycling 和 EAGLE2 方法的评估中,SAM-Decoding 分别实现了 2.27 倍和 2.49 倍的加速,超越了所有当前的方法。

链接: https://arxiv.org/abs/2411.10666
作者: Yuxuan Hu,Ke Wang,Jing Zhang,Cuiping Li,Hong Chen
关键词-EN: Large Language Models, revolutionized natural language, natural language processing, large parameter sizes, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized natural language processing by unifying tasks into text generation, yet their large parameter sizes and autoregressive nature limit inference speed. SAM-Decoding addresses this by introducing a novel retrieval-based speculative decoding method that uses a suffix automaton for efficient and accurate draft generation. Unlike n-gram matching used by the existing method, SAM-Decoding finds the longest suffix match in the generated text and the text corpus, achieving an average time complexity of O(1) per generation step. SAM-Decoding constructs static and dynamic suffix automatons for the text corpus and input prompts, respectively, enabling fast and precise draft generation. Meanwhile, it is designed as an approach that can be combined with existing methods, allowing SAM-Decoding to adaptively select a draft generation strategy based on the matching length, thus increasing the inference speed of the LLM. When combined with Token Recycling, evaluations show SAM-Decoding outperforms existing model-free methods, achieving a speedup of 2.27× over autoregressive decoding on Spec-Bench. When combined with EAGLE2, it reaches a speedup of 2.49×, surpassing all current approaches. Our code is available at this https URL.
摘要:大语言模型 (LLMs) 通过将任务统一为文本生成,彻底改变了自然语言处理领域,但其庞大的参数规模和自回归特性限制了推理速度。SAM-解码通过引入一种新颖的基于检索的推测解码方法来解决这一问题,该方法使用后缀自动机进行高效且准确的草稿生成。与现有方法使用的 n-gram 匹配不同,SAM-解码在生成文本和文本语料库中找到最长后缀匹配,实现每个生成步骤的平均时间复杂度为 O(1)。SAM-解码分别构建了文本语料库和输入提示的静态和动态后缀自动机,从而实现快速且精确的草稿生成。同时,它被设计为一种可以与现有方法结合的策略,使得 SAM-解码能够根据匹配长度自适应选择草稿生成策略,从而提高大语言模型的推理速度。结合 Token 回收技术,评估显示 SAM-解码优于现有的无模型方法,在 Spec-Bench 上比自回归解码实现了 2.27 倍的加速。当与 EAGLE2 结合时,其加速效果达到 2.49 倍,超越了所有当前的方法。我们的代码可在以下链接获取:https URL。
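
The core data structure here is a suffix automaton over the corpus, queried for the longest suffix of the text generated so far that also occurs in the corpus. The sketch below builds a character-level automaton and exposes that query; it is a didactic stand-in (the matching scan is a naive loop rather than the amortized O(1) walk), and draft selection/verification against the LLM is omitted.

```python
# Compact character-level suffix automaton plus a helper that finds the longest
# suffix of a query occurring in the indexed corpus. Illustrative only.
class SuffixAutomaton:
    def __init__(self, text: str):
        self.next = [{}]          # outgoing transitions per state
        self.link = [-1]          # suffix links
        self.length = [0]         # length of the longest string ending in each state
        self.last = 0
        for ch in text:
            self._extend(ch)

    def _extend(self, ch: str) -> None:
        cur = len(self.next)
        self.next.append({})
        self.length.append(self.length[self.last] + 1)
        self.link.append(-1)
        p = self.last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                clone = len(self.next)
                self.next.append(dict(self.next[q]))
                self.length.append(self.length[p] + 1)
                self.link.append(self.link[q])
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = self.link[cur] = clone
        self.last = cur

    def longest_suffix_match(self, query: str) -> str:
        """Longest suffix of `query` that occurs as a substring of the corpus."""
        for start in range(len(query)):
            state, ok = 0, True
            for ch in query[start:]:
                if ch in self.next[state]:
                    state = self.next[state][ch]
                else:
                    ok = False
                    break
            if ok:
                return query[start:]
        return ""

sam = SuffixAutomaton("the quick brown fox jumps over the lazy dog")
print(sam.longest_suffix_match("hops over the"))  # -> "ps over the"
```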

[NLP-59] BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

链接: https://arxiv.org/abs/2411.10640
作者: Xudong Lu,Yinghao Chen,Cheng Chen,Hui Tan,Boheng Chen,Yina Xie,Rui Hu,Guanxin Tan,Renshou Wu,Yan Hu,Yi Zeng,Lei Wu,Liuyang Bian,Zhaoxiong Wang,Long Liu,Yanzhou Yang,Han Xiao,Aojun Zhou,Yafei Wen,Xiaoxin Chen,Shuai Ren,Hongsheng Li
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 21 pages

点击查看摘要

[NLP-60] MTA: Multimodal Task Alignment for BEV Perception and Captioning

链接: https://arxiv.org/abs/2411.10639
作者: Yunsheng Ma,Burhaneddin Yaman,Xin Ye,Feng Tao,Abhirup Mallik,Ziran Wang,Liu Ren
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages

点击查看摘要

[NLP-61] Gender Bias Mitigation for Bangla Classification Tasks

【速读】: 该论文试图解决孟加拉语预训练语言模型中的性别偏见问题,这是一个在低资源语言中尚未充分探索的领域。解决方案的关键在于通过性别名称互换技术创建了四个手动标注的任务特定数据集,用于情感分析、毒性检测、仇恨言论检测和讽刺检测,以确保这些数据集适合检测和缓解性别偏见。随后,论文提出了一种联合损失优化技术,用于在任务特定预训练模型中缓解性别偏见。该方法在评估中不仅有效地减少了偏见,而且在与其他基线方法相比时保持了竞争性的准确性。

链接: https://arxiv.org/abs/2411.10636
作者: Sajib Kumar Saha Joy,Arman Hassan Mahy,Meherin Sultana,Azizah Mamun Abha,MD Piyal Ahmmed,Yue Dong,G M Shahariar
关键词-EN: Bangla pretrained language, Bangla pretrained, low-resource languages, pretrained language models, largely under explored
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this study, we investigate gender bias in Bangla pretrained language models, a largely under explored area in low-resource languages. To assess this bias, we applied gender-name swapping techniques to existing datasets, creating four manually annotated, task-specific datasets for sentiment analysis, toxicity detection, hate speech detection, and sarcasm detection. By altering names and gender-specific terms, we ensured these datasets were suitable for detecting and mitigating gender bias. We then proposed a joint loss optimization technique to mitigate gender bias across task-specific pretrained models. Our approach was evaluated against existing bias mitigation methods, with results showing that our technique not only effectively reduces bias but also maintains competitive accuracy compared to other baseline approaches. To promote further research, we have made both our implementation and datasets publicly available this https URL
摘要:在本研究中,我们探讨了孟加拉语预训练语言模型中的性别偏见问题,这是一个在低资源语言领域中尚未充分探索的领域。为了评估这种偏见,我们采用了性别名称互换技术对现有数据集进行了处理,创建了四个手动标注的、针对特定任务的数据集,用于情感分析、毒性检测、仇恨言论检测和讽刺检测。通过更改名称和性别特定术语,我们确保这些数据集适合用于检测和缓解性别偏见。随后,我们提出了一种联合损失优化技术,以缓解特定任务预训练模型中的性别偏见。我们的方法在与现有偏见缓解方法的对比评估中表现出色,结果显示,我们的技术不仅有效减少了偏见,而且在与其他基线方法相比时,保持了竞争性的准确率。为了促进进一步的研究,我们已将我们的实现和数据集公开发布,详见此 https URL。
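
The counterfactual augmentation step, swapping gendered names and terms, can be sketched with a simple substitution table. The example below uses English pronouns and a hypothetical name pair purely for illustration; the paper applies this to Bangla text with curated word lists.

```python
# Toy sketch of gender-name/term swapping for counterfactual data augmentation.
SWAP_MAP = {
    "he": "she", "she": "he",
    "him": "her", "her": "him",
    "rahim": "rahima", "rahima": "rahim",   # hypothetical name pair
}

def gender_swap(sentence: str) -> str:
    out = []
    for tok in sentence.split():
        bare = tok.lower().strip(".,!?")
        if bare in SWAP_MAP:
            repl = SWAP_MAP[bare]
            if tok[0].isupper():
                repl = repl.capitalize()
            trail = tok[len(bare):]          # re-attach trailing punctuation
            out.append(repl + trail)
        else:
            out.append(tok)
    return " ".join(out)

print(gender_swap("Rahim said he liked the movie."))
# -> "Rahima said she liked the movie."
```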

[NLP-62] Leveraging large language models for efficient representation learning for entity resolution

链接: https://arxiv.org/abs/2411.10629
作者: Xiaowei Xu,Bi T. Foua,Xingqiao Wang,Vivek Gunasekaran,John R. Talburt
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages and 12 figures

点击查看摘要

[NLP-63] A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

链接: https://arxiv.org/abs/2411.10588
作者: Caspar Oesterheld,Emery Cooper,Miles Kodama,Linh Chi Nguyen,Ethan Perez
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 48 pages, 15 figures; code and data at this https URL

点击查看摘要

[NLP-64] On the Shortcut Learning in Multilingual Neural Machine Translation

链接: https://arxiv.org/abs/2411.10581
作者: Wenxuan Wang,Wenxiang Jiao,Jen-tse Huang,Zhaopeng Tu,Michael R. Lyu
关键词-EN:
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by Neurocomputing 2024

点击查看摘要

[NLP-65] Hysteresis Activation Function for Efficient Inference NEURIPS

链接: https://arxiv.org/abs/2411.10573
作者: Moshe Kimhi,Idan Kashani,Avi Mendelson,Chaim Baskin
关键词-EN:
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注: Accepted to 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024)

点击查看摘要

[NLP-66] MLAN: language-based instruction tuning improves zero-shot generalization of multimodal large language models

链接: https://arxiv.org/abs/2411.10557
作者: Jianhong Tu,Zhuohao Ni,Nicholas Crispino,Zihao Yu,Michael Bendersky,Beliz Gunel,Ruoxi Jia,Xin Liu,Lingjuan Lyu,Dawn Song,Chenguang Wang
关键词-EN:
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-67] Efficient Alignment of Large Language Models via Data Sampling

链接: https://arxiv.org/abs/2411.10545
作者: Amrit Khera,Rajat Ghosh,Debojyoti Dutta
关键词-EN:
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-68] SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism

链接: https://arxiv.org/abs/2411.10543
作者: Priyansh Bhatnagar,Linfeng Wen,Mingu Kang
关键词-EN:
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-69] Does Prompt Formatting Have Any Impact on LLM Performance? NAACL2025

链接: https://arxiv.org/abs/2411.10541
作者: Jia He,Mukund Rungta,David Koleczek,Arshdeep Sekhon,Franklin X Wang,Sadid Hasan
关键词-EN:
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Submitted to NAACL 2025

点击查看摘要

[NLP-70] “On the goals of linguistic theory”: Revisiting Chomskyan theories in the era of AI

链接: https://arxiv.org/abs/2411.10533
作者: Eva Portelance,Masoud Jasbi
关键词-EN:
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-71] Everything is a Video: Unifying Modalities through Next-Frame Prediction

链接: https://arxiv.org/abs/2411.10503
作者: G. Thomas Hudson,Dean Slack,Thomas Winterbottom,Jamie Sterling,Chenghao Xiao,Junjie Shentu,Noura Al Moubayed
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 10 figures

点击查看摘要

[NLP-72] Hateful Meme Detection through Context-Sensitive Prompting and Fine-Grained Labeling AAAI-25

链接: https://arxiv.org/abs/2411.10480
作者: Rongxin Ouyang,Kokil Jaidka,Subhayan Mukerjee,Guangyu Cui
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: AAAI-25 Student Abstract, Oral Presentation

点击查看摘要

[NLP-73] A Survey on Importance of Homophones Spelling Correction Model for Khmer Authors

链接: https://arxiv.org/abs/2411.10477
作者: Seanghort Born,Madeth May,Claudine Piau-Toffolon,Sébastien Iksal
关键词-EN:
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-74] PhDGPT: Introducing a psychometric and linguistic dataset about how large language models perceive graduate students and professors in psychology

链接: https://arxiv.org/abs/2411.10473
作者: Edoardo Sebastiano De Duro,Enrique Taietta,Riccardo Improta,Massimo Stella
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 20 pages, 8 figures. Edoardo Sebastiano De Duro and Enrique Taietta equally contributed to this work

点击查看摘要

[NLP-75] Debiasing Watermarks for Large Language Models via Maximal Coupling

链接: https://arxiv.org/abs/2411.11203
作者: Yangxinyu Xie,Xiang Li,Tanwi Mallick,Weijie J. Su,Ruixun Zhang
关键词-EN:
类目: Machine Learning (stat.ML); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

[NLP-76] Bilingual Text-dependent Speaker Verification with Pre-trained Models for TdSV Challenge 2024

链接: https://arxiv.org/abs/2411.10828
作者: Seyed Ali Farokh
关键词-EN:
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 5 pages, no figures

点击查看摘要

人工智能

[AI-0] LightFFDNets: Lightweight Convolutional Neural Networks for Rapid Facial Forgery Detection

链接: https://arxiv.org/abs/2411.11826
作者: Günel Jabbarlı,Murat Kurt
关键词-EN: issue of great, great importance, facial, Deep learning, facial imagery
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages, 6 figures, 10 tables

点击查看摘要

Abstract:Accurate and fast recognition of forgeries is an issue of great importance in the fields of artificial intelligence, image processing and object detection. Recognition of forgeries of facial imagery is the process of classifying and defining the faces in it by analyzing real-world facial images. This process is usually accomplished by extracting features from an image, using classifier algorithms, and correctly interpreting the results. Recognizing forgeries of facial imagery correctly can encounter many different challenges. For example, factors such as changing lighting conditions, viewing faces from different angles can affect recognition performance, and background complexity and perspective changes in facial images can make accurate recognition difficult. Despite these difficulties, significant progress has been made in the field of forgery detection. Deep learning algorithms, especially Convolutional Neural Networks (CNNs), have significantly improved forgery detection performance. This study focuses on image processing-based forgery detection using Fake-Vs-Real-Faces (Hard) [10] and 140k Real and Fake Faces [61] data sets. Both data sets consist of two classes containing real and fake facial images. In our study, two lightweight deep learning models are proposed to conduct forgery detection using these images. Additionally, 8 different pretrained CNN architectures were tested on both data sets and the results were compared with newly developed lightweight CNN models. It’s shown that the proposed lightweight deep learning models have minimum number of layers. It’s also shown that the proposed lightweight deep learning models detect forgeries of facial imagery accurately, and computationally efficiently. Although the data set consists only of face images, the developed models can also be used in other two-class object recognition problems.
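
As a point of reference for what a minimal-layer, two-class forgery classifier looks like, here is a tiny PyTorch CNN. This is not the LightFFDNets architecture; the layer counts and channel widths are illustrative assumptions.

```python
# Illustrative lightweight CNN for two-class (real vs. fake) face classification.
import torch
import torch.nn as nn

class TinyForgeryNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)

model = TinyForgeryNet()
print(model(torch.randn(1, 3, 128, 128)).shape)  # torch.Size([1, 2])
```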

[AI-1] COST CA20120 INTERACT Framework of Artificial Intelligence Based Channel Modeling

链接: https://arxiv.org/abs/2411.11798
作者: Ruisi He,Nicola D. Cicco,Bo Ai,Mi Yang,Yang Miao,Mate Boban
关键词-EN: Accurate channel models, Channel modeling, prerequisite for communication-theoretic, communication-theoretic investigations, channel
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注: to appear in IEEE Wireless Communications Magazine

点击查看摘要

Abstract:Accurate channel models are the prerequisite for communication-theoretic investigations as well as system design. Channel modeling generally relies on statistical and deterministic approaches. However, there are still significant limits for the traditional modeling methods in terms of accuracy, generalization ability, and computational complexity. The fundamental reason is that establishing a quantified and accurate mapping between physical environment and channel characteristics becomes increasing challenging for modern communication systems. Here, in the context of COST CA20120 Action, we evaluate and discuss the feasibility and implementation of using artificial intelligence (AI) for channel modeling, and explore where the future of this field lies. Firstly, we present a framework of AI-based channel modeling to characterize complex wireless channels. Then, we highlight in detail some major challenges and present the possible solutions: i) estimating the uncertainty of AI-based channel predictions, ii) integrating prior knowledge of propagation to improve generalization capabilities, and iii) interpretable AI for channel modeling. We present and discuss illustrative numerical results to showcase the capabilities of AI-based channel modeling.

[AI-2] Exploring the Requirements of Clinicians for Explainable AI Decision Support Systems in Intensive Care

链接: https://arxiv.org/abs/2411.11774
作者: Jeffrey N. Clark,Matthew Wragg,Emily Nielsen,Miquel Perello-Nieto,Nawid Keshtmand,Michael Ambler,Shiv Sharma,Christopher P. Bourdeaux,Amberly Brigden,Raul Santos-Rodriguez
关键词-EN: artificial intelligence, models become increasingly, understand how digital, increasingly complex, Abstract
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:There is a growing need to understand how digital systems can support clinical decision-making, particularly as artificial intelligence (AI) models become increasingly complex and less human-interpretable. This complexity raises concerns about trustworthiness, impacting safe and effective adoption of such technologies. Improved understanding of decision-making processes and requirements for explanations coming from decision support tools is a vital component in providing effective explainable solutions. This is particularly relevant in the data-intensive, fast-paced environments of intensive care units (ICUs). To explore these issues, group interviews were conducted with seven ICU clinicians, representing various roles and experience levels. Thematic analysis revealed three core themes: (T1) ICU decision-making relies on a wide range of factors, (T2) the complexity of patient state is challenging for shared decision-making, and (T3) requirements and capabilities of AI decision support systems. We include design recommendations from clinical input, providing insights to inform future AI systems for intensive care.

[AI-3] AdaptLIL: A Gaze-Adaptive Visualization for Ontology Mapping

链接: https://arxiv.org/abs/2411.11768
作者: Nicholas Chow,Bo Fu
关键词-EN: paper showcases AdaptLIL, primary input source, link-indented list ontology, adaptive link-indented list, showcases AdaptLIL
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper showcases AdaptLIL, a real-time adaptive link-indented list ontology mapping visualization that uses eye gaze as the primary input source. Through a multimodal combination of real-time systems, deep learning, and web development applications, this system uniquely curtails graphical overlays (adaptations) to pairwise mappings of link-indented list ontology visualizations for individual users based solely on their eye gaze.

[AI-4] QARM: Quantitative Alignment Multi-Modal Recommendation at Kuaishou

链接: https://arxiv.org/abs/2411.11739
作者: Xinchen Luo,Jiangxia Cao,Tianyu Sun,Jinkai Yu,Rui Huang,Wei Yuan,Hezheng Lin,Yichen Zheng,Shiyao Wang,Qigen Hu,Changqing Qiu,Jiaqi Zhang,Xu Zhang,Zhiheng Yan,Jingming Zhang,Simin Zhang,Mingxing Wen,Zhaojie Liu,Kun Gai,Guorui Zhou
关键词-EN: recommender researchers realized, user interest modeling, multi-modal large models, recent years, significant evolution
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Work in progress

点击查看摘要

Abstract:In recent years, with the significant evolution of multi-modal large models, many recommender researchers realized the potential of multi-modal information for user interest modeling. In industry, a widely used modeling architecture is a cascading paradigm: (1) first pre-training a multi-modal model to provide omnipotent representations for downstream services; (2) The downstream recommendation model takes the multi-modal representation as additional input to fit real user-item behaviours. Although such a paradigm achieves remarkable improvements, there still exist two problems that limit model performance: (1) Representation Unmatching: The pre-trained multi-modal model is always supervised by the classic NLP/CV tasks, while the recommendation models are supervised by real user-item interaction. As a result, the two fundamentally different tasks’ goals were relatively separate, and there was a lack of consistent objective on their representations; (2) Representation Unlearning: The generated multi-modal representations are always stored in a cache store and serve as extra fixed input of the recommendation model, thus they cannot be updated by the recommendation model’s gradients, which is further unfriendly to downstream training. Motivated by these two challenges in downstream-task usage, we introduce a quantitative multi-modal framework to customize the specialized and trainable multi-modal information for different downstream models.

[AI-5] WoodYOLO: A Novel Object Detector for Wood Species Detection in Microscopic Images

链接: https://arxiv.org/abs/2411.11738
作者: Lars Nieradzik,Henrike Stephani,Jördis Sieburg-Rockel,Stephanie Helmling,Andrea Olbrich,Stephanie Wrage,Janis Keuper
关键词-EN: species identification plays, Wood species identification, advancing ecological conservation, species identification, identification plays
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Wood species identification plays a crucial role in various industries, from ensuring the legality of timber products to advancing ecological conservation efforts. This paper introduces WoodYOLO, a novel object detection algorithm specifically designed for microscopic wood fiber analysis. Our approach adapts the YOLO architecture to address the challenges posed by large, high-resolution microscopy images and the need for high recall in localization of the cell type of interest (vessel elements). Our results show that WoodYOLO significantly outperforms state-of-the-art models, achieving performance gains of 12.9% and 6.5% in F2 score over YOLOv10 and YOLOv7, respectively. This improvement in automated wood cell type localization capabilities contributes to enhancing regulatory compliance, supporting sustainable forestry practices, and promoting biodiversity conservation efforts globally.

[AI-6] Lifted Model Construction without Normalisation: A Vectorised Approach to Exploit Symmetries in Factor Graphs

链接: https://arxiv.org/abs/2411.11730
作者: Malte Luttermann,Ralf Möller,Marcel Gehrke
关键词-EN: logical variables, respect to domain, domain sizes, sizes of logical, parametric factor graph
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to the Proceedings of the 3rd Learning on Graphs Conference (LoG 2024)

点击查看摘要

Abstract:Lifted probabilistic inference exploits symmetries in a probabilistic model to allow for tractable probabilistic inference with respect to domain sizes of logical variables. We found that the current state-of-the-art algorithm to construct a lifted representation in form of a parametric factor graph misses symmetries between factors that are exchangeable but scaled differently, thereby leading to a less compact representation. In this paper, we propose a generalisation of the advanced colour passing (ACP) algorithm, which is the state of the art to construct a parametric factor graph. Our proposed algorithm allows for potentials of factors to be scaled arbitrarily and efficiently detects more symmetries than the original ACP algorithm. By detecting strictly more symmetries than ACP, our algorithm significantly reduces online query times for probabilistic inference when the resulting model is applied, which we also confirm in our experiments.

[AI-7] Semantic-Geometric-Physical-Driven Robot Manipulation Skill Transfer via Skill Library and Tactile Representation

链接: https://arxiv.org/abs/2411.11714
作者: Mingchao Qi,Yuanjin Li,Xing Liu,Zhengxiong Liu,Panfeng Huang
关键词-EN: involves complex tasks, complex tasks characterized, Deploying robots, environments involves complex, necessitating efficient transfer
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deploying robots in open-world environments involves complex tasks characterized by long sequences and rich interactions, necessitating efficient transfer of robotic skills across diverse and complex scenarios. To address this challenge, we propose a skill library framework based on knowledge graphs, which endows robots with high-level skill awareness and spatial semantic understanding. The framework hierarchically organizes operational knowledge by constructing a “task graph” and a “scene graph” to represent task and scene semantic information, respectively. We introduce a “state graph” to facilitate interaction between high-level task planning and low-level scene information. Furthermore, we propose a hierarchical transfer framework for operational skills. At the task level, the framework integrates contextual learning and chain-of-thought prompting within a four-stage prompt paradigm, leveraging large language models’ (LLMs) reasoning and generalization capabilities to achieve task-level subtask sequence transfer. At the motion level, an adaptive trajectory transfer method is developed using the A* algorithm and the skill library, enabling motion-level adaptive trajectory transfer. At the physical level, we introduce an adaptive contour extraction and posture perception method based on tactile perception. This method dynamically obtains high-precision contour and posture information from visual-tactile texture data and adjusts transferred skills, such as contact positions and postures, to ensure effectiveness in new environments. Experimental results validate the effectiveness of the proposed methods. Project website:this https URL

[AI-8] MC-LLaVA: Multi-Concept Personalized Vision-Language Model

链接: https://arxiv.org/abs/2411.11706
作者: Ruichuan An,Sihan Yang,Ming Lu,Kai Zeng,Yulin Luo,Ying Chen,Jiajun Cao,Hao Liang,Qi She,Shanghang Zhang,Wentao Zhang
关键词-EN: Current vision-language models, show exceptional abilities, Current vision-language, visual question answering, tasks including visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Current vision-language models (VLMs) show exceptional abilities across diverse tasks including visual question answering. To enhance user experience in practical applications, recent studies investigate VLM personalization to understand user-provided concepts. However, existing studies mainly focus on single-concept personalization, neglecting the existence and interplay of multiple concepts, which limits the real-world applicability of personalized VLMs. In this paper, we propose the first multi-concept personalization method named MC-LLaVA along with a high-quality multi-concept personalization dataset. Specifically, MC-LLaVA uses a joint training strategy incorporating multiple concepts in a single training step, allowing VLMs to perform accurately in multi-concept personalization. To reduce the cost of joint training, MC-LLaVA leverages visual token information for concept token initialization, yielding improved concept representation and accelerating joint training. To advance multi-concept personalization research, we further contribute a high-quality dataset. We carefully collect images from various movies that contain multiple characters and manually generate the multi-concept question-answer samples. Our dataset features diverse movie types and question-answer types. We conduct comprehensive qualitative and quantitative experiments to demonstrate that MC-LLaVA can achieve impressive multi-concept personalized responses, paving the way for VLMs to become better user-specific assistants. The code and dataset will be publicly available at this https URL.

[AI-9] Conceptwm: A Diffusion Model Watermark for Concept Protection

链接: https://arxiv.org/abs/2411.11688
作者: Liangqi Lei,Keke Gai,Jing Yu,Liehuang Zhu,Qi Wu
关键词-EN: diffusion models succeed, succeed in generating, pose threats, diffusion models, generating specific concepts
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:The personalization techniques of diffusion models succeed in generating specific concepts but also pose threats to copyright protection and illegal use. Model Watermarking is an effective method to prevent the unauthorized use of subject-driven or style-driven image generation, safeguarding concept copyrights. However, under the goal of concept-oriented protection, current watermarking schemes typically add watermarks to all images rather than applying them in a refined manner targeted at specific concepts. Additionally, the personalization techniques of diffusion models can easily remove watermarks. Existing watermarking methods struggle to achieve fine-grained watermark embedding with a few images of specific concept and prevent removal of watermarks through personalized fine-tuning. Therefore, we introduce a novel concept-oriented watermarking framework that seamlessly embeds imperceptible watermarks into the concept of diffusion models. We conduct extensive experiments and ablation studies to verify our framework. Our code is available at this https URL.

[AI-10] TrojanRobot: Backdoor Attacks Against Robotic Manipulation in the Physical World

链接: https://arxiv.org/abs/2411.11683
作者: Xianlong Wang,Hewen Pan,Hangtao Zhang,Minghui Li,Shengshan Hu,Ziqi Zhou,Lulu Xue,Peijin Guo,Yichen Wang,Wei Wan,Aishan Liu,Leo Yu Zhang
关键词-EN: Robotic manipulation refers, autonomous handling, handling and interaction, objects using advanced, advanced techniques
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Initial version with preliminary results. We welcome any feedback or suggestions

点击查看摘要

Abstract:Robotic manipulation refers to the autonomous handling and interaction of robots with objects using advanced techniques in robotics and artificial intelligence. The advent of powerful tools such as large language models (LLMs) and large vision-language models (LVLMs) has significantly enhanced the capabilities of these robots in environmental perception and decision-making. However, the introduction of these intelligent agents has led to security threats such as jailbreak attacks and adversarial attacks. In this research, we take a further step by proposing a backdoor attack specifically targeting robotic manipulation and, for the first time, implementing backdoor attack in the physical world. By embedding a backdoor visual language model into the visual perception module within the robotic system, we successfully mislead the robotic arm’s operation in the physical world, given the presence of common items as triggers. Experimental evaluations in the physical world demonstrate the effectiveness of the proposed backdoor attack.

[AI-11] PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment

链接: https://arxiv.org/abs/2411.11681
作者: Jiawei Li,Xinyue Liang,Yizhe Yang,Chong Feng,Yang Gao
关键词-EN: large language models, Process supervision, Process supervision enhances, large language, advanced large language
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Process supervision enhances the performance of large language models in reasoning tasks by providing feedback at each step of chain-of-thought reasoning. However, due to the lack of effective process supervision methods, even advanced large language models are prone to logical errors and redundant reasoning. We claim that the effectiveness of process supervision significantly depends on both the accuracy and the length of reasoning chains. Moreover, we identify that these factors exhibit a nonlinear relationship with the overall reward score of the reasoning process. Inspired by these insights, we propose a novel process supervision paradigm, PSPO*, which systematically outlines the workflow from reward model training to policy optimization, and highlights the importance of nonlinear rewards in process supervision. Based on PSPO*, we develop the PSPO-WRS, which considers the number of reasoning steps in determining reward scores and utilizes an adjusted Weibull distribution for nonlinear reward shaping. Experimental results on six mathematical reasoning datasets demonstrate that PSPO-WRS consistently outperforms current mainstream models.
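
To illustrate nonlinear, length-aware reward shaping of the kind described, the sketch below weights an average step-level reward by a Weibull-shaped function of the number of reasoning steps. The shape/scale parameters and the exact combination rule are assumptions; the paper's PSPO-WRS formula may differ.

```python
# Illustrative Weibull-style weight over reasoning-chain length; parameters are assumed.
import math

def weibull_weight(num_steps: int, k: float = 1.5, lam: float = 6.0) -> float:
    """Weibull PDF evaluated at the step count, used as a multiplicative weight."""
    x = max(num_steps, 1e-6)
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

def shaped_reward(step_rewards: list[float]) -> float:
    base = sum(step_rewards) / len(step_rewards)        # average step-level reward
    return base * weibull_weight(len(step_rewards))     # penalise too-short / too-long chains

print(shaped_reward([0.9, 0.8, 1.0, 0.7]))
```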

[AI-12] Artificial Scientific Discovery

链接: https://arxiv.org/abs/2411.11672
作者: Antonio Norelli
关键词-EN: autonomously generate original, generate original research, fundamental concepts needed, AlphaGo Zero-like agent, discovers Othello knowledge
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: PhD thesis, 123 pages

点击查看摘要

Abstract:Rooted in the explosion of deep learning over the past decade, this thesis spans from AlphaGo to ChatGPT to empirically examine the fundamental concepts needed to realize the vision of an artificial scientist: a machine with the capacity to autonomously generate original research and contribute to the expansion of human knowledge. The investigation begins with Olivaw, an AlphaGo Zero-like agent that discovers Othello knowledge from scratch but is unable to communicate it. This realization leads to the development of the Explanatory Learning (EL) framework, a formalization of the problem faced by a scientist when trying to explain a new phenomenon to their peers. The effective EL prescriptions allow us to crack Zendo, a board game simulating the scientific endeavor. This success comes with a fundamental insight: an artificial scientist must develop its own interpretation of the language used to explain its findings. This perspective then leads us to see modern multimodal models as interpreters, and to devise a new way to build interpretable and cost-effective CLIP-like models: by coupling two unimodal models using little multimodal data and no further training. Finally, we discuss what ChatGPT and its siblings are still missing to become artificial scientists, and introduce Odeen, a benchmark about interpreting explanations that sees LLMs going no further than random chance while being instead fully solved by humans.

[AI-13] Dissecting Misalignment of Multimodal Large Language Models via Influence Function

链接: https://arxiv.org/abs/2411.11667
作者: Lijie Hu,Chenyang Ren,Huanyi Xie,Khouloud Saadi,Shu Yang,Jingfeng Zhang,Di Wang
关键词-EN: Multi-modal Large Language, Large Language models, Multi-modal Large, Large Language, mislabeled text-image pairs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 34 pages

点击查看摘要

Abstract:Multi-modal Large Language models (MLLMs) are always trained on data from diverse and unreliable sources, which may contain misaligned or mislabeled text-image pairs. This frequently causes robustness issues and hallucinations, leading to performance degradation. Data valuation is an efficient way to detect and trace these misalignments. Nevertheless, existing methods are computationally expensive for MLLMs. While computationally efficient, the classical influence functions are inadequate for contrastive learning models because they were originally designed for pointwise loss. Additionally, contrastive learning involves minimizing the distance between the modalities of positive samples and maximizing the distance between the modalities of negative samples. This requires us to evaluate the influence of samples from both perspectives. To tackle these challenges, we introduce the Extended Influence Function for Contrastive Loss (ECIF), an influence function crafted for contrastive loss. ECIF considers both positive and negative samples and provides a closed-form approximation of contrastive learning models, eliminating the need for retraining. Building upon ECIF, we develop a series of algorithms for data evaluation in MLLM, misalignment detection, and misprediction trace-back tasks. Experimental results demonstrate our ECIF advances the transparency and interpretability of MLLMs by offering a more accurate assessment of data impact and model alignment compared to traditional baseline methods.

[AI-14] No-regret Exploration in Shuffle Private Reinforcement Learning

链接: https://arxiv.org/abs/2411.11647
作者: Shaojie Bai,Mohammad Sadegh Talebi,Chengcheng Zhao,Peng Cheng,Jiming Chen
关键词-EN: formally address user, episodic reinforcement learning, Differential privacy, user privacy concerns, address user privacy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Differential privacy (DP) has recently been introduced into episodic reinforcement learning (RL) to formally address user privacy concerns in personalized services. Previous work mainly focuses on two trust models of DP: the central model, where a central agent is responsible for protecting users’ sensitive data, and the (stronger) local model, where the protection occurs directly on the user side. However, they either require a trusted central agent or incur a significantly higher privacy cost, making it unsuitable for many scenarios. This work introduces a trust model stronger than the central model but with a lower privacy cost than the local model, leveraging the emerging shuffle model of privacy. We present the first generic algorithm for episodic RL under the shuffle model, where a trusted shuffler randomly permutes a batch of users’ data before sending it to the central agent. We then instantiate the algorithm using our proposed shuffle Privatizer, relying on a shuffle private binary summation mechanism. Our analysis shows that the algorithm achieves a near-optimal regret bound comparable to that of the centralized model and significantly outperforms the local model in terms of privacy cost.
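
The shuffle-model pipeline for a private binary sum can be sketched in a few lines: each user applies randomized response locally, a trusted shuffler permutes the batch, and the server debiases the aggregate. The flip probability below is an arbitrary illustrative value, not the paper's calibrated privacy parameter.

```python
# Minimal sketch of the shuffle model for a private binary sum.
import random

def local_randomise(bit: int, p_flip: float = 0.25) -> int:
    # Randomized response: each user flips their own bit with probability p_flip.
    return bit ^ 1 if random.random() < p_flip else bit

def shuffle(reports: list[int]) -> list[int]:
    permuted = reports[:]
    random.shuffle(permuted)      # the shuffler removes any link to user identity/order
    return permuted

def debiased_sum(reports: list[int], p_flip: float = 0.25) -> float:
    # E[reported bit] = p_flip + (1 - 2*p_flip) * true_bit, so invert that mapping.
    n = len(reports)
    return (sum(reports) - n * p_flip) / (1 - 2 * p_flip)

true_bits = [1, 0, 1, 1, 0, 1, 0, 1]
noisy = [local_randomise(b) for b in true_bits]
print(debiased_sum(shuffle(noisy)))   # unbiased estimate of sum(true_bits)
```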

[AI-15] TSINR: Capturing Temporal Continuity via Implicit Neural Representations for Time Series Anomaly Detection KDD2025

链接: https://arxiv.org/abs/2411.11641
作者: Mengxuan Li,Ke Liu,Hongyang Chen,Jiajun Bu,Hongwei Wang,Haishuai Wang
关键词-EN: systems’ expected behavior, identify unusual patterns, Time series anomaly, series anomaly detection, Time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by SIGKDD 2025

点击查看摘要

Abstract:Time series anomaly detection aims to identify unusual patterns in data or deviations from systems’ expected behavior. The reconstruction-based methods are the mainstream in this task, which learn point-wise representation via unsupervised learning. However, the unlabeled anomaly points in training data may cause these reconstruction-based methods to learn and reconstruct anomalous data, resulting in the challenge of capturing normal patterns. In this paper, we propose a time series anomaly detection method based on implicit neural representation (INR) reconstruction, named TSINR, to address this challenge. Due to the property of spectral bias, TSINR enables prioritizing low-frequency signals and exhibiting poorer performance on high-frequency abnormal data. Specifically, we adopt INR to parameterize time series data as a continuous function and employ a transformer-based architecture to predict the INR of given data. As a result, the proposed TSINR method achieves the advantage of capturing the temporal continuity and thus is more sensitive to discontinuous anomaly data. In addition, we further design a novel form of INR continuous function to learn inter- and intra-channel information, and leverage a pre-trained large language model to amplify the intense fluctuations in anomalies. Extensive experiments demonstrate that TSINR achieves superior overall performance on both univariate and multivariate time series anomaly detection benchmarks compared to other state-of-the-art reconstruction-based methods. Our codes are available.
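
The underlying idea of an implicit neural representation is to fit a small network f(t) → value to the series and read anomalies off the reconstruction error. The toy PyTorch sketch below illustrates that idea on a synthetic signal; it does not reproduce the transformer-based TSINR architecture or its spectral-bias analysis.

```python
# Toy implicit neural representation fitted to a 1-D time series.
import torch
import torch.nn as nn

t = torch.linspace(0, 1, 200).unsqueeze(1)
series = torch.sin(12 * t) + 0.1 * torch.randn_like(t)   # synthetic signal

inr = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(inr.parameters(), lr=1e-3)

for step in range(500):
    loss = nn.functional.mse_loss(inr(t), series)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Large per-timestamp reconstruction error can then be flagged as a potential anomaly.
print("final reconstruction MSE:", loss.item())
```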

[AI-16] SP^3: Superpixel-propagated pseudo-label learning for weakly semi-supervised medical image segmentation

链接: https://arxiv.org/abs/2411.11636
作者: Shiman Li,Jiayue Zhao,Shaolei Liu,Xiaokun Dai,Chenxi Zhang,Zhijian Song
关键词-EN: Deep learning-based medical, medical image segmentation, learning-based medical image, Deep learning-based, semi-supervised medical image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages, 7 figures. Under Review

点击查看摘要

Abstract:Deep learning-based medical image segmentation helps assist diagnosis and accelerate the treatment process while the model training usually requires large-scale dense annotation datasets. Weakly semi-supervised medical image segmentation is an essential application because it only requires a small amount of scribbles and a large number of unlabeled data to train the model, which greatly reduces the clinician’s effort to fully annotate images. To handle the inadequate supervisory information challenge in weakly semi-supervised segmentation (WSSS), a SuperPixel-Propagated Pseudo-label (SP^3) learning method is proposed, using the structural information contained in superpixel for supplemental information. Specifically, the annotation of scribbles is propagated to superpixels and thus obtains a dense annotation for supervised training. Since the quality of pseudo-labels is limited by the low-quality annotation, the beneficial superpixels selected by dynamic thresholding are used to refine pseudo-labels. Furthermore, aiming to alleviate the negative impact of noise in pseudo-label, superpixel-level uncertainty is incorporated to guide the pseudo-label supervision for stable learning. Our method achieves state-of-the-art performance on both tumor and organ segmentation datasets under the WSSS setting, using only 3% of the annotation workload compared to fully supervised methods and attaining approximately 80% Dice score. Additionally, our method outperforms eight weakly and semi-supervised methods under both weakly supervised and semi-supervised settings. Results of extensive experiments validate the effectiveness and annotation efficiency of our weakly semi-supervised segmentation, which can assist clinicians in achieving automated segmentation for organs or tumors quickly and ultimately benefit patients.
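
A minimal sketch of the scribble-to-superpixel propagation step, assuming scikit-image's slic for superpixel generation; the image, scribble, and class indices are hypothetical, and the paper's dynamic thresholding and uncertainty weighting are omitted.

```python
import numpy as np
from skimage.segmentation import slic

def propagate_scribbles_to_superpixels(image, scribbles, n_segments=200):
    """Expand sparse scribble labels into dense pseudo-labels via superpixels.

    scribbles: int array, 0 = unlabeled, 1..K = class of a scribbled pixel.
    Every pixel of a superpixel touched by a scribble receives that
    superpixel's majority scribble class.
    """
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    pseudo = np.zeros_like(scribbles)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        labels = scribbles[mask]
        labels = labels[labels > 0]
        if labels.size:                              # superpixel touched by a scribble
            values, counts = np.unique(labels, return_counts=True)
            pseudo[mask] = values[np.argmax(counts)]
    return pseudo

# Usage with a random RGB image and a toy class-1 scribble (hypothetical data).
rng = np.random.default_rng(0)
img = rng.random((128, 128, 3))
scrib = np.zeros((128, 128), dtype=int)
scrib[60:64, 60:100] = 1
dense = propagate_scribbles_to_superpixels(img, scrib)
print("pseudo-labeled pixels:", int((dense > 0).sum()))
```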

[AI-17] ST-Tree with Interpretability for Multivariate Time Series Classification

链接: https://arxiv.org/abs/2411.11620
作者: Mingsen Du,Yanxuan Wei,Yingxia Tang,Xiangwei Zheng,Shoushui Wei,Cun Ji
关键词-EN: Multivariate time series, time series classification, time series, Multivariate time, series classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted on May 15, 2024, major revisions on Aug 31, 2024

点击查看摘要

Abstract:Multivariate time series classification is of great importance in practical applications and is a challenging task. However, deep neural network models such as Transformers exhibit high accuracy in multivariate time series classification but lack interpretability and fail to provide insights into the decision-making process. On the other hand, traditional approaches based on decision tree classifiers offer clear decision processes but relatively lower accuracy. Swin Transformer (ST) addresses these issues by leveraging self-attention mechanisms to capture both fine-grained local patterns and global patterns. It can also model multi-scale feature representation learning, thereby providing a more comprehensive representation of time series features. To tackle the aforementioned challenges, we propose ST-Tree with interpretability for multivariate time series classification. Specifically, the ST-Tree model combines ST as the backbone network with an additional neural tree model. This integration allows us to fully leverage the advantages of ST in learning time series context while providing interpretable decision processes through the neural tree. This enables researchers to gain clear insights into the model’s decision-making process and extract meaningful interpretations. Through experimental evaluations on 10 UEA datasets, we demonstrate that the ST-Tree model improves accuracy in multivariate time series classification tasks and provides interpretability through visualizing the decision-making process across different datasets.

[AI-18] Signaling and Social Learning in Swarms of Robots

链接: https://arxiv.org/abs/2411.11616
作者: Leo Cazenille,Maxime Toquebiau,Nicolas Lobato-Dauzier,Alessia Loi,Loona Macabre,Nathanael Aubert-Kato,Anthony Genot,Nicolas Bredeche
关键词-EN: execution occur simultaneously, decentralized manner, improving coordination, execution occur, occur simultaneously
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 17 pages, 3 Figures

点击查看摘要

Abstract:This paper investigates the role of communication in improving coordination within robot swarms, focusing on a paradigm where learning and execution occur simultaneously in a decentralized manner. We highlight the role communication can play in addressing the credit assignment problem (individual contribution to the overall performance), and how it can be influenced by it. We propose a taxonomy of existing and future works on communication, focusing on information selection and physical abstraction as principal axes for classification: from low-level lossless compression with raw signal extraction and processing to high-level lossy compression with structured communication models. The paper reviews current research from evolutionary robotics, multi-agent (deep) reinforcement learning, language models, and biophysics models to outline the challenges and opportunities of communication in a collective of robots that continuously learn from one another through local message exchanges, illustrating a form of social learning.

[AI-19] Topology-aware Preemptive Scheduling for Co-located LLM Workloads

链接: https://arxiv.org/abs/2411.11560
作者: Ping Zhang,Lei Su,Jinjie Yang,Xin Chen
关键词-EN: Hosting diverse large, diverse large language, large language model, Hosting diverse, language model workloads
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: 17 Pages, 11 Figures, 5 Tables

点击查看摘要

Abstract:Hosting diverse large language model workloads in a unified resource pool through co-location is cost-effective. For example, long-running chat services generally follow diurnal traffic patterns, which inspire co-location of batch jobs to fulfill resource valleys between successive peaks, and thus to saturate resource allocation in cluster-wide scope. These heterogeneous workloads often have different business priorities, and therefore preemption can be leveraged for resource elasticity. However, workloads often have distinct topology preferences as well. The resources released by lower-priority instances may fail to meet the requirements of high-priority online services which are usually latency-sensitive. The root cause behind such a mismatch is a lack of topology awareness in the resource scheduler, especially during preemption. To bridge this gap, we develop a fine-grained topology-aware method for preemptive scheduling of hybrid workloads. The method ensures that the resources freed by preempted tasks adhere to the topological affinity needs of high-priority preemptors in a guaranteed or best-effort manner. This dynamic alignment significantly increases the efficiency of preemption and improves overall scheduled performance for LLM workloads by 55%.
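
The toy sketch below illustrates topology-aware victim selection: freed GPUs only count if they sit on a single node that can satisfy the preemptor's co-location requirement. It is a simplification of the general idea, not the paper's scheduler; all names and numbers are hypothetical.

```python
from collections import defaultdict

def pick_preemption_victims(low_prio_tasks, demand_gpus, required_node=None):
    """Select low-priority victims whose freed GPUs are co-located on one node.

    low_prio_tasks: list of dicts like {"id": ..., "node": ..., "gpus": ...}.
    Returns the victims to preempt, or None if no single node can satisfy demand.
    """
    by_node = defaultdict(list)
    for task in low_prio_tasks:
        by_node[task["node"]].append(task)

    candidate_nodes = [required_node] if required_node else sorted(by_node)
    for node in candidate_nodes:
        victims, freed = [], 0
        # Prefer preempting the smallest tasks first on each candidate node.
        for task in sorted(by_node.get(node, []), key=lambda t: t["gpus"]):
            victims.append(task)
            freed += task["gpus"]
            if freed >= demand_gpus:
                return victims
    return None

batch_jobs = [
    {"id": "batch-a", "node": "node-1", "gpus": 2},
    {"id": "batch-b", "node": "node-2", "gpus": 4},
    {"id": "batch-c", "node": "node-2", "gpus": 2},
]
# A latency-sensitive online service needs 4 GPUs co-located on one node.
print(pick_preemption_victims(batch_jobs, demand_gpus=4))
```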

[AI-20] Real-Time Fitness Exercise Classification and Counting from Video Frames

链接: https://arxiv.org/abs/2411.11548
作者: Riccardo Riccio
关键词-EN: Long Short-Term Memory, Bidirectional Long Short-Term, Bidirectional Long, Short-Term Memory, Long Short-Term
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel method for real-time exercise classification using a Bidirectional Long Short-Term Memory (BiLSTM) neural network. Existing exercise recognition approaches often rely on synthetic datasets, raw coordinate inputs sensitive to user and camera variations, and fail to fully exploit the temporal dependencies in exercise movements. These issues limit their generalizability and robustness in real-world conditions, where lighting, camera angles, and user body types vary. To address these challenges, we propose a BiLSTM-based model that leverages invariant features, such as joint angles, alongside raw coordinates. By using both angles and (x, y, z) coordinates, the model adapts to changes in perspective, user positioning, and body differences, improving generalization. Training on 30-frame sequences enables the BiLSTM to capture the temporal context of exercises and recognize patterns evolving over time. We compiled a dataset combining synthetic data from the InfiniteRep dataset and real-world videos from Kaggle and other sources. This dataset includes four common exercises: squat, push-up, shoulder press, and bicep curl. The model was trained and validated on these diverse datasets, achieving an accuracy of over 99% on the test set. To assess generalizability, the model was tested on 2 separate test sets representative of typical usage conditions. Comparisons with the previous approach from the literature are present in the result section showing that the proposed model is the best-performing one. The classifier is integrated into a web application providing real-time exercise classification and repetition counting without manual exercise selection. Demo and datasets are available at the following GitHub Repository: this https URL.
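
A minimal PyTorch sketch of a BiLSTM classifier over 30-frame pose-feature sequences in the spirit of the described model; the feature dimension and class count are placeholders, and pose extraction, angle computation, and repetition counting are omitted.

```python
import torch
import torch.nn as nn

class ExerciseBiLSTM(nn.Module):
    """Toy BiLSTM over 30-frame sequences of pose features
    (joint angles concatenated with (x, y, z) keypoint coordinates)."""
    def __init__(self, feat_dim=111, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):              # x: (batch, 30, feat_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # logits from the last time step

model = ExerciseBiLSTM()
dummy = torch.randn(8, 30, 111)        # 8 clips of 30 frames each (placeholder dims)
print(model(dummy).shape)              # torch.Size([8, 4])
```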

[AI-21] Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment

链接: https://arxiv.org/abs/2411.11543
作者: Zhendong Liu,Yuanbi Nie,Yingshui Tan,Xiangyu Yue,Qiushi Cui,Chongjun Wang,Xiaoyong Zhu,Bo Zheng
关键词-EN: form Vision Language, LLMs form Vision, Large Language Models, Vision Language Models, pre-trained visual encoder
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs form Vision Language Models (VLMs). However, recent research shows that the visual modality in VLMs is highly vulnerable, allowing attackers to bypass safety alignment in LLMs through visually transmitted content, launching harmful attacks. To address this challenge, we propose a progressive concept-based alignment strategy, PSA-VLM, which incorporates safety modules as concept bottlenecks to enhance visual modality safety alignment. By aligning model predictions with specific safety concepts, we improve defenses against risky images, enhancing explainability and controllability while minimally impacting general performance. Our method is obtained through two-stage training. The low computational cost of the first stage brings very effective performance improvement, and the fine-tuning of the language model in the second stage further improves the safety performance. Our method achieves state-of-the-art results on popular VLM safety benchmark.

[AI-22] A Pre-Trained Graph-Based Model for Adaptive Sequencing of Educational Documents NEURIPS2024

链接: https://arxiv.org/abs/2411.11520
作者: Jean Vassoyan(CB),Anan Schütt(UNIA),Jill-Jênn Vie(SODA),Arun-Balajiee Lekshmi-Narayanan(PITT),Elisabeth André(UNIA),Nicolas Vayatis(CB)
关键词-EN: http URL opens, http URL approaches, http URL path, http URL method, http URL experiments
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: NeurIPS 2024 Workshop on Large Foundation Models for Educational Assessment (FM-Assess), Dec 2024, Vancouver, Canada

点击查看摘要

Abstract:Massive Open Online Courses (MOOCs) have greatly contributed to making education more accessible. However, many MOOCs maintain a rigid, one-size-fits-all structure that fails to address the diverse needs and backgrounds of individual learners. Learning path personalization aims to address this limitation, by tailoring sequences of educational content to optimize individual student learning outcomes. Existing approaches, however, often require either massive student interaction data or extensive expert annotation, limiting their broad applicability. In this study, we introduce a novel data-efficient framework for learning path personalization that operates without expert annotation. Our method employs a flexible recommender system pre-trained with reinforcement learning on a dataset of raw course materials. Through experiments on semi-synthetic data, we show that this pre-training stage substantially improves data-efficiency in a range of adaptive learning scenarios featuring new educational materials. This opens up new perspectives for the design of foundation models for adaptive learning.

[AI-23] Structure learning with Temporal Gaussian Mixture for model-based Reinforcement Learning

链接: https://arxiv.org/abs/2411.11511
作者: Théophile Champion,Marek Grześ,Howard Bowman
关键词-EN: Model-based reinforcement learning, Model-based reinforcement, reinforcement learning refers, Gaussian Mixture Model, model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Model-based reinforcement learning refers to a set of approaches capable of sample-efficient decision making, which create an explicit model of the environment. This model can subsequently be used for learning optimal policies. In this paper, we propose a temporal Gaussian Mixture Model composed of a perception model and a transition model. The perception model extracts discrete (latent) states from continuous observations using a variational Gaussian mixture likelihood. Importantly, our model constantly monitors the collected data searching for new Gaussian components, i.e., the perception model performs a form of structure learning (Smith et al., 2020; Friston et al., 2018; Neacsu et al., 2022) as it learns the number of Gaussian components in the mixture. Additionally, the transition model learns the temporal transition between consecutive time steps by taking advantage of the Dirichlet-categorical conjugacy. Both the perception and transition models are able to forget part of the data points, while integrating the information they provide within the prior, which ensure fast variational inference. Finally, decision making is performed with a variant of Q-learning which is able to learn Q-values from beliefs over states. Empirically, we have demonstrated the model’s ability to learn the structure of several mazes: the model discovered the number of states and the transition probabilities between these states. Moreover, using its learned Q-values, the agent was able to successfully navigate from the starting position to the maze’s exit.

[AI-24] Closed-loop multi-step planning with innate physics knowledge

链接: https://arxiv.org/abs/2411.11510
作者: Giulia Lafratta,Bernd Porr,Christopher Chandler,Alice Miller
关键词-EN: input control problem, solve robot planning, present a hierarchical, control problem, closed control loops
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We present a hierarchical framework to solve robot planning as an input control problem. At the lowest level are temporary closed control loops (“tasks”), each representing a behaviour, contingent on a specific sensory input and therefore temporary. At the highest level, a supervising “Configurator” directs task creation and termination. Here resides “core” knowledge as a physics engine, where sequences of tasks can be simulated. The Configurator encodes and interprets simulation results, based on which it can choose a sequence of tasks as a plan. We implement this framework on a real robot and test it in an overtaking scenario as proof-of-concept.

[AI-25] Alien Recombination: Exploring Concept Blends Beyond Human Cognitive Availability in Visual Art NEURIPS2024

链接: https://arxiv.org/abs/2411.11494
作者: Alejandro Hernandez,Levin Brinkmann,Ignacio Serna,Nasim Rahaman,Hassan Abu Alhaija,Hiromu Yakura,Mar Canet Sola,Bernhard Schölkopf,Iyad Rahwan
关键词-EN: demonstrated remarkable capabilities, art remains debated, game strategy, remains debated, demonstrated remarkable
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: NeurIPS 2024 Workshop on Creativity Generative AI, 13 pages, 11 figures

点击查看摘要

Abstract:While AI models have demonstrated remarkable capabilities in constrained domains like game strategy, their potential for genuine creativity in open-ended domains like art remains debated. We explore this question by examining how AI can transcend human cognitive limitations in visual art creation. Our research hypothesizes that visual art contains a vast unexplored space of conceptual combinations, constrained not by inherent incompatibility, but by cognitive limitations imposed by artists’ cultural, temporal, geographical and social contexts. To test this hypothesis, we present the Alien Recombination method, a novel approach utilizing fine-tuned large language models to identify and generate concept combinations that lie beyond human cognitive availability. The system models and deliberately counteracts human availability bias, the tendency to rely on immediately accessible examples, to discover novel artistic combinations. This system not only produces combinations that have never been attempted before within our dataset but also identifies and generates combinations that are cognitively unavailable to all artists in the domain. Furthermore, we translate these combinations into visual representations, enabling the exploration of subjective perceptions of novelty. Our findings suggest that cognitive unavailability is a promising metric for optimizing artistic novelty, outperforming merely temperature scaling without additional evaluation criteria. This approach uses generative models to connect previously unconnected ideas, providing new insight into the potential of framing AI-driven creativity as a combinatorial problem.

[AI-26] Robust Markov Decision Processes: A Place Where AI and Formal Methods Meet

链接: https://arxiv.org/abs/2411.11451
作者: Marnix Suilen,Thom Badings,Eline M. Bovy,David Parker,Nils Jansen
关键词-EN: Markov decision processes, sequential decision-making problems, Markov decision, decision processes, scientific areas
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Markov decision processes (MDPs) are a standard model for sequential decision-making problems and are widely used across many scientific areas, including formal methods and artificial intelligence (AI). MDPs do, however, come with the restrictive assumption that the transition probabilities need to be precisely known. Robust MDPs (RMDPs) overcome this assumption by instead defining the transition probabilities to belong to some uncertainty set. We present a gentle survey on RMDPs, providing a tutorial covering their fundamentals. In particular, we discuss RMDP semantics and how to solve them by extending standard MDP methods such as value iteration and policy iteration. We also discuss how RMDPs relate to other models and how they are used in several contexts, including reinforcement learning and abstraction techniques. We conclude with some challenges for future work on RMDPs.
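
To make the robust Bellman update the survey discusses concrete, here is a small numpy sketch of robust value iteration with a crude L1-ball adversary that shifts probability mass from the most valuable next state to the least valuable one. The general inner minimization sorts all states by value, so treat this as an illustration under simplifying assumptions rather than a full RMDP solver.

```python
import numpy as np

def robust_value_iteration(P_hat, R, radius, gamma=0.9, iters=200):
    """Robust VI around nominal transitions P_hat with an L1-style uncertainty set.

    P_hat: (S, A, S) nominal transition probabilities, R: (S, A) rewards.
    For each (s, a), the adversary moves up to radius/2 of probability mass
    from the best next state to the worst next state (a simplified adversary).
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                p = P_hat[s, a].copy()
                worst, best = np.argmin(V), np.argmax(V)
                shift = min(radius / 2, p[best])   # adversarial re-allocation
                p[best] -= shift
                p[worst] += shift
                Q[s, a] = R[s, a] + gamma * p @ V
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))          # 3 states, 2 actions
R = rng.random((3, 2))
V, policy = robust_value_iteration(P, R, radius=0.2)
print("robust values:", np.round(V, 3), "policy:", policy)
```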

[AI-27] Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting

链接: https://arxiv.org/abs/2411.11448
作者: Hongjun Wang,Jiyuan Chen,Lingyu Zhang,Renhe Jiang,Xuan Song
关键词-EN: Graph Neural Networks, Neural Networks, shown significant promise, Spatiotemporal Graph Neural, effectively modeling temporal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have shown significant promise in traffic forecasting by effectively modeling temporal and spatial correlations. However, rapid urbanization in recent years has led to dynamic shifts in traffic patterns and travel demand, posing major challenges for accurate long-term traffic prediction. The generalization capability of ST-GNNs in extended temporal scenarios and cross-city applications remains largely unexplored. In this study, we evaluate state-of-the-art models on an extended traffic benchmark and observe substantial performance degradation in existing ST-GNNs over time, which we attribute to their limited inductive capabilities. Our analysis reveals that this degradation stems from an inability to adapt to evolving spatial relationships within urban environments. To address this limitation, we reconsider the design of adaptive embeddings and propose a Principal Component Analysis (PCA) embedding approach that enables models to adapt to new scenarios without retraining. We incorporate PCA embeddings into existing ST-GNN and Transformer architectures, achieving marked improvements in performance. Notably, PCA embeddings allow for flexibility in graph structures between training and testing, enabling models trained on one city to perform zero-shot predictions on other cities. This adaptability demonstrates the potential of PCA embeddings in enhancing the robustness and generalization of spatiotemporal models.
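
A minimal sketch of how PCA-based node embeddings of the kind described could be computed from historical readings, so that embeddings for an unseen city or a new period can be derived without retraining the forecasting model; the exact construction in the paper may differ.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_node_embeddings(history, dim=16):
    """Derive adaptive node embeddings from historical observations.

    history: (num_timesteps, num_nodes) traffic readings.
    Each node is described by its centered time series; PCA yields a
    low-dimensional embedding that can be recomputed for new graphs.
    """
    node_profiles = history.T                        # (num_nodes, num_timesteps)
    node_profiles = node_profiles - node_profiles.mean(axis=1, keepdims=True)
    return PCA(n_components=dim).fit_transform(node_profiles)

rng = np.random.default_rng(0)
train_city = rng.random((2000, 207))                 # placeholder sensor readings
emb = pca_node_embeddings(train_city, dim=16)
print(emb.shape)                                     # (207, 16)
```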

[AI-28] Implicit Regularization for Multi-label Feature Selection

链接: https://arxiv.org/abs/2411.11436
作者: Dou El Kefel Mansouri,Khalid Benabdeslem,Seif-Eddine Benkabou
关键词-EN: feature selection, address the problem, based on implicit, MCP or SCAD, label embedding
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, 7 figures, My paper is currently under review at TPAMI journal

点击查看摘要

Abstract:In this paper, we address the problem of feature selection in the context of multi-label learning, by using a new estimator based on implicit regularization and label embedding. Unlike the sparse feature selection methods that use a penalized estimator with explicit regularization terms such as the l_{2,1}-norm, MCP or SCAD, we propose a simple alternative method via Hadamard product parameterization. In order to guide the feature selection process, a latent semantic of multi-label information method is adopted, as a label embedding. Experimental results on some known benchmark datasets suggest that the proposed estimator suffers much less from extra bias, and may lead to benign overfitting.
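
The Hadamard product parameterization can be illustrated on a plain sparse regression toy: gradient descent on the unpenalized loss with w = u * v and small initialization behaves like an implicitly sparsity-promoting estimator. This is a single-label sketch of the general idea, not the paper's multi-label estimator with label embedding.

```python
import numpy as np

def hadamard_regression(X, y, lr=1e-2, steps=5000, seed=0):
    """Gradient descent on unpenalized least squares with w = u * v;
    the small-initialization over-parameterization implicitly biases
    the solution toward sparse weight vectors."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    u = 1e-3 * rng.standard_normal(d)
    v = 1e-3 * rng.standard_normal(d)
    for _ in range(steps):
        w = u * v
        grad_w = X.T @ (X @ w - y) / len(y)
        grad_u, grad_v = grad_w * v, grad_w * u   # chain rule through w = u * v
        u -= lr * grad_u
        v -= lr * grad_v
    return u * v

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.5, 1.0]                     # only three relevant features
y = X @ w_true + 0.01 * rng.standard_normal(200)
w_hat = hadamard_regression(X, y)
print("recovered support:", np.where(np.abs(w_hat) > 0.1)[0])
```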

[AI-29] IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos NEURIPS2024

链接: https://arxiv.org/abs/2411.11409
作者: Yunong Liu,Cristobal Eyzaguirre,Manling Li,Shubh Khanna,Juan Carlos Niebles,Vineeth Ravi,Saumitra Mishra,Weiyu Liu,Jiajun Wu
关键词-EN: IKEA Video Manuals, Video Manuals, Shape assembly, assembly, IKEA Video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: NeurIPS 2024 Datasets and Benchmarks Track

点击查看摘要

Abstract:Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences.

[AI-30] The GECo algorithm for Graph Neural Networks Explanation

链接: https://arxiv.org/abs/2411.11391
作者: Salvatore Calderaro,Domenico Amato,Giosuè Lo Bosco,Riccardo Rizzo,Filippo Vella
关键词-EN: manage complex data, complex data sources, Graph Neural Networks, Neural Networks, Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are powerful models that can manage complex data sources and their interconnection links. One of GNNs’ main drawbacks is their lack of interpretability, which limits their application in sensitive fields. In this paper, we introduce a new methodology involving graph communities to address the interpretability of graph classification problems. The proposed method, called GECo, exploits the idea that if a community is a subset of graph nodes densely connected, this property should play a role in graph classification. This is reasonable, especially if we consider the message-passing mechanism, which is the basic mechanism of GNNs. GECo analyzes the contribution to the classification result of the communities in the graph, building a mask that highlights graph-relevant structures. GECo is tested for Graph Convolutional Networks on six artificial and four real-world graph datasets and is compared to the main explainability methods such as PGMExplainer, PGExplainer, GNNExplainer, and SubgraphX using four different metrics. The obtained results outperform the other methods for artificial graph datasets and most real-world datasets.
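
A GECo-style occlusion probe can be sketched as follows: detect communities, mask each community's node features, and rank communities by how much the prediction for the target class drops. The model below is a trivial stand-in, and the real method's masking and scoring details may differ.

```python
import networkx as nx
import numpy as np
from networkx.algorithms.community import greedy_modularity_communities

def community_importance(graph, features, predict_proba, target_class):
    """Score each community by the prediction drop caused by occluding it.

    predict_proba(graph, features) -> class-probability vector (placeholder model).
    """
    base = predict_proba(graph, features)[target_class]
    scores = []
    for community in greedy_modularity_communities(graph):
        masked = features.copy()
        masked[list(community)] = 0.0            # occlude the community's features
        drop = base - predict_proba(graph, masked)[target_class]
        scores.append((sorted(community), float(drop)))
    return sorted(scores, key=lambda s: -s[1])

# Hypothetical stand-in for a trained graph classifier: score by mean feature value.
def toy_model(graph, features):
    p1 = min(max(float(features.mean()), 0.0), 1.0)
    return np.array([1 - p1, p1])

G = nx.karate_club_graph()
X = np.random.default_rng(0).random((G.number_of_nodes(), 1))
print(community_importance(G, X, toy_model, target_class=1)[:2])
```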

[AI-31] Continual Task Learning through Adaptive Policy Self-Composition

链接: https://arxiv.org/abs/2411.11364
作者: Shengchao Hu,Yuhang Zhou,Ziqing Fan,Jifeng Hu,Li Shen,Ya Zhang,Dacheng Tao
关键词-EN: Training a generalizable, current offline reinforcement, continually learn, natural requirement, requirement for long-lived
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 21 pages, 8 figures

点击查看摘要

Abstract:Training a generalizable agent to continually learn a sequence of tasks from offline trajectories is a natural requirement for long-lived agents, yet remains a significant challenge for current offline reinforcement learning (RL) algorithms. Specifically, an agent must be able to rapidly adapt to new tasks using newly collected trajectories (plasticity), while retaining knowledge from previously learned tasks (stability). However, systematic analyses of this setting are scarce, and it remains unclear whether conventional continual learning (CL) methods are effective in continual offline RL (CORL) scenarios. In this study, we develop the Offline Continual World benchmark and demonstrate that traditional CL methods struggle with catastrophic forgetting, primarily due to the unique distribution shifts inherent to CORL scenarios. To address this challenge, we introduce CompoFormer, a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network. Upon encountering a new task, CompoFormer leverages semantic correlations to selectively integrate relevant prior policies alongside newly trained parameters, thereby enhancing knowledge sharing and accelerating the learning process. Our experiments reveal that CompoFormer outperforms conventional CL methods, particularly in longer task sequences, showcasing a promising balance between plasticity and stability.

[AI-32] A comprehensive survey of oracle character recognition: challenges, benchmarks and beyond

链接: https://arxiv.org/abs/2411.11354
作者: Jing Li,Xueke Chi,Qiufeng Wang,Dahan Wang,Kaizhu Huang,Yongge Liu,Cheng-lin Liu
关键词-EN: Chinese inscriptions found, ancient Chinese inscriptions, historical cultural studies, oracle character recognition, Chinese inscriptions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Oracle character recognition-an analysis of ancient Chinese inscriptions found on oracle bones-has become a pivotal field intersecting archaeology, paleography, and historical cultural studies. Traditional methods of oracle character recognition have relied heavily on manual interpretation by experts, which is not only labor-intensive but also limits broader accessibility to the general public. With recent breakthroughs in pattern recognition and deep learning, there is a growing movement towards the automation of oracle character recognition (OrCR), showing considerable promise in tackling the challenges inherent to these ancient scripts. However, a comprehensive understanding of OrCR still remains elusive. Therefore, this paper presents a systematic and structured survey of the current landscape of OrCR research. We commence by identifying and analyzing the key challenges of OrCR. Then, we provide an overview of the primary benchmark datasets and digital resources available for OrCR. A review of contemporary research methodologies follows, in which their respective efficacies, limitations, and applicability to the complex nature of oracle characters are critically highlighted and examined. Additionally, our review extends to ancillary tasks associated with OrCR across diverse disciplines, providing a broad-spectrum analysis of its applications. We conclude with a forward-looking perspective, proposing potential avenues for future investigations that could yield significant advancements in the field.

[AI-33] Syllabus: Portable Curricula for Reinforcement Learning Agents

链接: https://arxiv.org/abs/2411.11318
作者: Ryan Sullivan,Ryan Pégoud,Ameen Ur Rahmen,Xinchen Yang,Junyun Huang,Aayush Verma,Nistha Mitra,John P. Dickerson
关键词-EN: Curriculum learning, Curriculum, learning, quiet yet crucial, high-profile successes
类目: Artificial Intelligence (cs.AI)
*备注: Preprint

点击查看摘要

Abstract:Curriculum learning has been a quiet yet crucial component of many of the high-profile successes of reinforcement learning. Despite this, none of the major reinforcement learning libraries directly support curriculum learning or include curriculum learning implementations. These methods can improve the capabilities and robustness of RL agents, but often require significant, complex changes to agent training code. We introduce Syllabus, a library for training RL agents with curriculum learning, as a solution to this problem. Syllabus provides a universal API for curriculum learning algorithms, implementations of popular curriculum learning methods, and infrastructure for easily integrating them with distributed training code written in nearly any RL library. Syllabus provides a minimal API for each of the core components of curriculum learning, dramatically simplifying the process of designing new algorithms and applying existing algorithms to new environments. We demonstrate that the same Syllabus code can be used to train agents written in multiple different RL libraries on numerous domains. In doing so, we present the first examples of curriculum learning in NetHack and Neural MMO, two of the premier challenges for single-agent and multi-agent RL respectively, achieving strong results compared to state of the art baselines.

[AI-34] Study of the Performance of CEEMDAN in Underdetermined Speech Separation

链接: https://arxiv.org/abs/2411.11312
作者: Rawad Melhem,Riad Hamadeh,Assef Jafar
关键词-EN: analysis of non-stationary, modern methods, non-stationary signals, CEEMDAN, speech
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: in Arabic language

点击查看摘要

Abstract:The CEEMDAN algorithm is one of the modern methods used in the analysis of non-stationary signals. This research studies the effectiveness of this method in audio source separation in order to determine the limits of its applicability. It derives two conditions on the frequencies and amplitudes of mixed signals that must hold for them to be separable by CEEMDAN. The performance of the algorithm in separating noise from speech and in separating speech signals from each other is studied. The research concludes that CEEMDAN can remove some types of noise from speech (speech enhancement), but it cannot separate speech signals from each other (the cocktail party problem). Simulation is done using the Matlab environment and the Noizeus database.
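
For readers who want to try CEEMDAN in Python rather than Matlab, the sketch below assumes the PyEMD package (installed as EMD-signal) and its CEEMDAN.ceemdan method; the synthetic signal and the keep-the-low-frequency-IMFs denoising heuristic are purely illustrative.

```python
import numpy as np
from PyEMD import CEEMDAN   # pip install EMD-signal

fs = 1000
t = np.arange(0, 1.0, 1.0 / fs)
clean = np.sin(2 * np.pi * 50 * t)                       # stand-in "speech" component
noisy = clean + 0.3 * np.random.default_rng(0).standard_normal(t.size)

imfs = CEEMDAN().ceemdan(noisy)                          # (n_imfs, n_samples)
# Illustrative denoising: discard the first, highest-frequency IMFs that mostly
# carry noise, and rebuild the signal from the remaining modes.
denoised = imfs[2:].sum(axis=0)
print("number of IMFs:", imfs.shape[0])
print("MSE noisy vs clean:   ", float(np.mean((noisy - clean) ** 2)))
print("MSE denoised vs clean:", float(np.mean((denoised - clean) ** 2)))
```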

[AI-35] TP-UNet: Temporal Prompt Guided UNet for Medical Image Segmentation

链接: https://arxiv.org/abs/2411.11305
作者: Ranmin Wang,Limin Zhuang,Hongkun Chen,Boyan Xu,Ruichu Cai
关键词-EN: deep learning techniques, medical image segmentation, image segmentation techniques, adoption of deep, improve the accuracy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancement of medical image segmentation techniques has been propelled by the adoption of deep learning techniques, particularly UNet-based approaches, which exploit semantic information to improve the accuracy of segmentations. However, the order of organs in scanned images has been disregarded by current medical image segmentation approaches based on UNet. Furthermore, the inherent network structure of UNet does not provide direct capabilities for integrating temporal information. To efficiently integrate temporal information, we propose TP-UNet that utilizes temporal prompts, encompassing organ-construction relationships, to guide the segmentation UNet model. Specifically, our framework is featured with cross-attention and semantic alignment based on unsupervised contrastive learning to combine temporal prompts and image features effectively. Extensive evaluations on two medical image segmentation datasets demonstrate the state-of-the-art performance of TP-UNet. Our implementation will be open-sourced after acceptance.
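
A generic sketch of cross-attention fusion between prompt tokens and flattened encoder features, which is one plausible reading of the fusion described; the module names and dimensions are placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Fuse temporal prompt embeddings with UNet encoder features via cross-attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, prompt_tokens):
        # image_tokens: (B, H*W, dim) flattened feature map; prompt_tokens: (B, P, dim)
        fused, _ = self.attn(query=image_tokens, key=prompt_tokens, value=prompt_tokens)
        return self.norm(image_tokens + fused)

fusion = PromptCrossAttention()
img_feats = torch.randn(2, 16 * 16, 256)     # encoder features of a 16x16 map
prompts = torch.randn(2, 4, 256)             # embedded temporal prompt tokens
print(fusion(img_feats, prompts).shape)      # torch.Size([2, 256, 256])
```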

[AI-36] Recurrent Stochastic Configuration Networks with Incremental Blocks

链接: https://arxiv.org/abs/2411.11303
作者: Gang Dang,Dianhui Wang
关键词-EN: Recurrent stochastic configuration, order uncertainty due, stochastic configuration networks, strong approximation capability, Recurrent stochastic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recurrent stochastic configuration networks (RSCNs) have shown promise in modelling nonlinear dynamic systems with order uncertainty due to their advantages of easy implementation, less human intervention, and strong approximation capability. This paper develops the original RSCNs with block increments, termed block RSCNs (BRSCNs), to further enhance the learning capacity and efficiency of the network. BRSCNs can simultaneously add multiple reservoir nodes (subreservoirs) during the construction. Each subreservoir is configured with a unique structure in the light of a supervisory mechanism, ensuring the universal approximation property. The reservoir feedback matrix is appropriately scaled to guarantee the echo state property of the network. Furthermore, the output weights are updated online using a projection algorithm, and the persistent excitation conditions that facilitate parameter convergence are also established. Numerical results over a time series prediction, a nonlinear system identification task, and two industrial data predictive analyses demonstrate that the proposed BRSCN performs favourably in terms of modelling efficiency, learning, and generalization performance, highlighting their significant potential for coping with complex dynamics.

[AI-37] Towards Personalized Brain-Computer Interface Application Based on Endogenous EEG Paradigms

链接: https://arxiv.org/abs/2411.11302
作者: Heon-Gyu Kwak,Gi-Hwan Shin,Yeon-Woo Choi,Dong-Hoon Lee,Yoo-In Jeon,Jun-Su Kang,Seong-Whan Lee
关键词-EN: including motor imagery, personalized brain-computer interface, enhanced user experience, speech imagery, paradigms including motor
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Submission version for IEEE International BCI Winter Conference 2025

点击查看摘要

Abstract:In this paper, we propose a conceptual framework for personalized brain-computer interface (BCI) applications, which can offer an enhanced user experience by customizing services to individual preferences and needs, based on endogenous electroencephalography (EEG) paradigms including motor imagery (MI), speech imagery (SI), and visual imagery. The framework includes two essential components: user identification and intention classification, which enable personalized services by identifying individual users and recognizing their intended actions through EEG signals. We validate the feasibility of our framework using a private EEG dataset collected from eight subjects, employing the ShallowConvNet architecture to decode EEG features. The experimental results demonstrate that user identification achieved an average classification accuracy of 0.995, while intention classification achieved 0.47 accuracy across all paradigms, with MI demonstrating the best performance. These findings indicate that EEG signals can effectively support personalized BCI applications, offering robust identification and reliable intention decoding, especially for MI and SI.

[AI-38] Zero-Shot Automatic Annotation and Instance Segmentation using LLM-Generated Datasets: Eliminating Field Imaging and Manual Annotation for Deep Learning Model Development

链接: https://arxiv.org/abs/2411.11285
作者: Ranjan Sapkota,Achyut Paudel,Manoj Karkee
关键词-EN: process involving extensive, involving extensive field, presenting significant logistical, deep learning-based instance, labor-intensive process involving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Currently, deep learning-based instance segmentation for various applications (e.g., Agriculture) is predominantly performed using a labor-intensive process involving extensive field data collection using sophisticated sensors, followed by careful manual annotation of images, presenting significant logistical and financial challenges to researchers and organizations. The process also slows down the model development and training process. In this study, we presented a novel method for deep learning-based instance segmentation of apples in commercial orchards that eliminates the need for labor-intensive field data collection and manual annotation. Utilizing a Large Language Model (LLM), we synthetically generated orchard images and automatically annotated them using the Segment Anything Model (SAM) integrated with a YOLO11 base model. This method significantly reduces reliance on physical sensors and manual data processing, presenting a major advancement in “Agricultural AI”. The synthetic, auto-annotated dataset was used to train the YOLO11 model for Apple instance segmentation, which was then validated on real orchard images. The results showed that the automatically generated annotations achieved a Dice Coefficient of 0.9513 and an IoU of 0.9303, validating the accuracy and overlap of the mask annotations. All YOLO11 configurations, trained solely on these synthetic datasets with automated annotations, accurately recognized and delineated apples, highlighting the method’s efficacy. Specifically, the YOLO11m-seg configuration achieved a mask precision of 0.902 and a mask mAP@50 of 0.833 on test images collected from a commercial orchard. Additionally, the YOLO11l-seg configuration outperformed other models in validation on 40 LLM-generated images, achieving the highest mask precision and mAP@50 metrics. Keywords: YOLO, SAM, SAMv2, YOLO11, YOLOv11, Segment Anything, YOLO-SAM

[AI-39] Multi-Hyperbolic Space-based Heterogeneous Graph Attention Network ICDM2024

链接: https://arxiv.org/abs/2411.11283
作者: Jongmin Park,Seunghoon Han,Jong-Ryul Lee,Sungsu Lim
关键词-EN: constant negative curvature, heterogeneous graph embedding, diverse power-law structures, heterogeneous graphs, exponentially increasing space
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted in IEEE ICDM 2024

点击查看摘要

Abstract:To leverage the complex structures within heterogeneous graphs, recent studies on heterogeneous graph embedding use a hyperbolic space, characterized by a constant negative curvature and exponentially increasing space, which aligns with the structural properties of heterogeneous graphs. However, despite heterogeneous graphs inherently possessing diverse power-law structures, most hyperbolic heterogeneous graph embedding models use a single hyperbolic space for the entire heterogeneous graph, which may not effectively capture the diverse power-law structures within the heterogeneous graph. To address this limitation, we propose Multi-hyperbolic Space-based heterogeneous Graph Attention Network (MSGAT), which uses multiple hyperbolic spaces to effectively capture diverse power-law structures within heterogeneous graphs. We conduct comprehensive experiments to evaluate the effectiveness of MSGAT. The experimental results demonstrate that MSGAT outperforms state-of-the-art baselines in various graph machine learning tasks, effectively capturing the complex structures of heterogeneous graphs.

[AI-40] Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning for Imbalanced Multiclassification of Whole Slide Image

链接: https://arxiv.org/abs/2411.11262
作者: Yonghuang Wu,Xuan Xie,Xinyuan Niu,Chengqian Zhao,Jinhua Yu
关键词-EN: dramatically improved pathologists’, improved pathologists’ workflow, diagnostic decision-making processes, Pathology computing, decision-making processes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Pathology computing has dramatically improved pathologists’ workflow and diagnostic decision-making processes. Although computer-aided diagnostic systems have shown considerable value in whole slide image (WSI) analysis, the problem of multi-classification under sample imbalance remains an intractable challenge. To address this, we propose learning fine-grained information by generating sub-bags with feature distributions similar to the original WSIs. Additionally, we utilize a pseudo-bag generation algorithm to further leverage the abundant and redundant information in WSIs, allowing efficient training in unbalanced-sample multi-classification tasks. Furthermore, we introduce an affinity-based sample selection and curriculum contrastive learning strategy to enhance the stability of model representation learning. Unlike previous approaches, our framework transitions from learning bag-level representations to understanding and exploiting the feature distribution of multi-instance bags. Our method demonstrates significant performance improvements on three datasets, including tumor classification and lymph node metastasis. On average, it achieves a 4.39-point improvement in F1 score compared to the second-best method across the three tasks, underscoring its superior performance.

[AI-41] EXCON: Extreme Instance-based Contrastive Representation Learning of Severely Imbalanced Multivariate Time Series for Solar Flare Prediction

链接: https://arxiv.org/abs/2411.11249
作者: Onur Vural,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi
关键词-EN: Earth infrastructure substantially, systems and Earth, Earth infrastructure, solar flare prediction, multivariate time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This work has been accepted at the 2024 IEEE International Conference on Big Data (IEEE BigData 2024) on October 27, 2024, as a main conference paper

点击查看摘要

Abstract:In heliophysics research, predicting solar flares is crucial due to their potential to impact both space-based systems and Earth’s infrastructure substantially. Magnetic field data from solar active regions, recorded by solar imaging observatories, are transformed into multivariate time series to enable solar flare prediction using temporal window-based analysis. In the realm of multivariate time series-driven solar flare prediction, addressing severe class imbalance with effective strategies for multivariate time series representation learning is key to developing robust predictive models. Traditional methods often struggle with overfitting to the majority class in prediction tasks where major solar flares are infrequent. This work presents EXCON, a contrastive representation learning framework designed to enhance classification performance amidst such imbalances. EXCON operates through four stages: obtaining core features from multivariate time series data; selecting distinctive contrastive representations for each class to maximize inter-class separation; training a temporal feature embedding module with a custom extreme reconstruction loss to minimize intra-class variation; and applying a classifier to the learned embeddings for robust classification. The proposed method leverages contrastive learning principles to map similar instances closer in the feature space while distancing dissimilar ones, a strategy not extensively explored in solar flare prediction tasks. This approach not only addresses class imbalance but also offers a versatile solution applicable to univariate and multivariate time series across binary and multiclass classification problems. Experimental results, including evaluations on the benchmark solar flare dataset and multiple time series archive datasets with binary and multiclass labels, demonstrate EXCON’s efficacy in enhancing classification performance.

[AI-42] MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

链接: https://arxiv.org/abs/2411.11217
作者: Shiyi Cao,Shu Liu,Tyler Griggs,Peter Schafhalter,Xiaoxuan Liu,Ying Sheng,Joseph E. Gonzalez,Matei Zaharia,Ion Stoica
关键词-EN: Mixture of Experts, presents significant challenges, resource-constrained platforms presents, platforms presents significant, significant challenges
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficient deployment of large language models, particularly Mixture of Experts (MoE), on resource-constrained platforms presents significant challenges, especially in terms of computational efficiency and memory utilization. The MoE architecture, renowned for its ability to increase model capacity without a proportional increase in inference cost, greatly reduces the token generation latency compared with dense models. However, the large model size makes MoE models inaccessible to individuals without high-end GPUs. In this paper, we propose MoE-Lightning, a high-throughput MoE batch inference system that significantly outperforms past work. MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization, and a performance model, HRM, based on a Hierarchical Roofline Model we introduce to help find policies with higher throughput than existing systems. MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB). When the theoretical system throughput is bounded by the GPU memory, MoE-Lightning can reach the throughput upper bound with 2-3x less CPU memory, significantly increasing resource utilization. MoE-Lightning also supports efficient batch inference for much larger MoEs (e.g., Mixtral 8x22B and DBRX) on multiple low-cost GPUs (e.g., 2-4 T4).

[AI-43] Making Sigmoid-MSE Great Again: Output Reset Challenges Softmax Cross-Entropy in Neural Network Classification

链接: https://arxiv.org/abs/2411.11213
作者: Kanishka Tyagi,Chinmay Rane,Ketaki Vaidya,Jeshwanth Challgundla,Soumitro Swapan Auddy,Michael Manry
关键词-EN: Squared Error, objective functions, Softmax Cross-Entropy, study presents, presents a comparative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This study presents a comparative analysis of two objective functions, Mean Squared Error (MSE) and Softmax Cross-Entropy (SCE) for neural network classification tasks. While SCE combined with softmax activation is the conventional choice for transforming network outputs into class probabilities, we explore an alternative approach using MSE with sigmoid activation. We introduce the Output Reset algorithm, which reduces inconsistent errors and enhances classifier robustness. Through extensive experiments on benchmark datasets (MNIST, CIFAR-10, and Fashion-MNIST), we demonstrate that MSE with sigmoid activation achieves comparable accuracy and convergence rates to SCE, while exhibiting superior performance in scenarios with noisy data. Our findings indicate that MSE, despite its traditional association with regression tasks, serves as a viable alternative for classification problems, challenging conventional wisdom about neural network training strategies.
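
The objective comparison can be reproduced in miniature: train one linear head with softmax cross-entropy and another with sigmoid outputs under MSE on the same features. The paper's Output Reset step is omitted here; the random features and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(256, 20)
labels = torch.randint(0, 10, (256,))
one_hot = nn.functional.one_hot(labels, 10).float()

sce_head = nn.Linear(20, 10)   # softmax + cross-entropy objective
mse_head = nn.Linear(20, 10)   # sigmoid + mean-squared-error objective
opt = torch.optim.SGD(list(sce_head.parameters()) + list(mse_head.parameters()), lr=0.1)

for _ in range(500):
    opt.zero_grad()
    sce_loss = nn.functional.cross_entropy(sce_head(features), labels)
    mse_loss = nn.functional.mse_loss(torch.sigmoid(mse_head(features)), one_hot)
    (sce_loss + mse_loss).backward()
    opt.step()

sce_acc = (sce_head(features).argmax(1) == labels).float().mean().item()
mse_acc = (mse_head(features).argmax(1) == labels).float().mean().item()
print(f"SCE accuracy: {sce_acc:.2f}  MSE-sigmoid accuracy: {mse_acc:.2f}")
```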

[AI-44] PickScan: Object discovery and reconstruction from handheld interactions IROS2024

链接: https://arxiv.org/abs/2411.11196
作者: Vincent van der Brugge,Marc Pollefeys,Joshua B. Tenenbaum,Ayush Tewari,Krishna Murthy Jatavallabhula
关键词-EN: highly desirable capability, Reconstructing compositional, augmented reality, highly desirable, desirable capability
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 7 pages, 8 figures, published in the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

点击查看摘要

Abstract:Reconstructing compositional 3D representations of scenes, where each object is represented with its own 3D model, is a highly desirable capability in robotics and augmented reality. However, most existing methods rely heavily on strong appearance priors for object discovery, therefore only working on those classes of objects on which the method has been trained, or do not allow for object manipulation, which is necessary to scan objects fully and to guide object discovery in challenging scenarios. We address these limitations with a novel interaction-guided and class-agnostic method based on object displacements that allows a user to move around a scene with an RGB-D camera, hold up objects, and finally outputs one 3D model per held-up object. Our main contribution to this end is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects. On a custom-captured dataset, our pipeline discovers manipulated objects with 78.3% precision at 100% recall and reconstructs them with a mean chamfer distance of 0.90cm. Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a reduction in chamfer distance of 73% while detecting 99% fewer false positives.

[AI-45] Improving User Experience in Preference-Based Optimization of Reward Functions for Assistive Robots

链接: https://arxiv.org/abs/2411.11182
作者: Nathaniel Dennler,Zhonghao Shi,Stefanos Nikolaidis,Maja Matarić
关键词-EN: Assistive robots interact, Assistive robots, interact with humans, users’ preferences, preference learning process
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted to ISRR

点击查看摘要

Abstract:Assistive robots interact with humans and must adapt to different users’ preferences to be effective. An easy and effective technique to learn non-expert users’ preferences is through rankings of robot behaviors, for example, robot movement trajectories or gestures. Existing techniques focus on generating trajectories for users to rank that maximize the outcome of the preference learning process. However, the generated trajectories do not appear to reflect the user’s preference over repeated interactions. In this work, we design an algorithm to generate trajectories for users to rank that we call Covariance Matrix Adaptation Evolution Strategies with Information Gain (CMA-ES-IG). CMA-ES-IG prioritizes the user’s experience of the preference learning process. We show that users find our algorithm more intuitive and easier to use than previous approaches across both physical and social robot tasks. This project’s code is hosted at this http URL

[AI-46] Enhanced Anime Image Generation Using USE-CMHSA-GAN

链接: https://arxiv.org/abs/2411.11179
作者: J. Lu
关键词-EN: important research topic, high-quality anime character, anime character images, generating high-quality anime, anime character
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the growing popularity of ACG (Anime, Comics, and Games) culture, generating high-quality anime character images has become an important research topic. This paper introduces a novel Generative Adversarial Network model, USE-CMHSA-GAN, designed to produce high-quality anime character images. The model builds upon the traditional DCGAN framework, incorporating USE and CMHSA modules to enhance feature extraction capabilities for anime character images. Experiments were conducted on the anime-face-dataset, and the results demonstrate that USE-CMHSA-GAN outperforms other benchmark models, including DCGAN, VAE-GAN, and WGAN, in terms of FID and IS scores, indicating superior image quality. These findings suggest that USE-CMHSA-GAN is highly effective for anime character image generation and provides new insights for further improving the quality of generative models.

[AI-47] RPN 2: On Interdependence Function Learning Towards Unifying and Advancing CNN RNN GNN and Transformer

链接: https://arxiv.org/abs/2411.11162
作者: Jiawei Zhang
关键词-EN: Reconciled Polynomial Network, Polynomial Network, Reconciled Polynomial, previous work, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: 105 pages, 37 figures, 6 tables, preprint version

点击查看摘要

Abstract:This paper builds upon our previous work on the Reconciled Polynomial Network (RPN). The original RPN model was designed under the assumption of input data independence, presuming the independence among both individual instances within data batches and attributes in each data instance. However, this assumption often proves invalid for function learning tasks involving complex, interdependent data such as language, images, time series, and graphs. Ignoring such data interdependence may inevitably lead to significant performance degradation. To overcome these limitations, we introduce the new Reconciled Polynomial Network (version 2), namely RPN 2, in this paper. By incorporating data and structural interdependence functions, RPN 2 explicitly models data interdependence via new component functions in its architecture. This enhancement not only significantly improves RPN 2’s learning performance but also substantially expands its unifying potential, enabling it to encompass a broader range of contemporary dominant backbone models within its canonical representation. These backbones include, but are not limited to, convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph neural networks (GNNs), and Transformers. Our analysis reveals that the fundamental distinctions among these backbone models primarily stem from their diverse approaches to defining the interdependence functions. Furthermore, this unified representation opens up new opportunities for designing innovative architectures with the potential to surpass the performance of these dominant backbones.

[AI-48] MPLite: Multi-Aspect Pretraining for Mining Clinical Health Records

链接: https://arxiv.org/abs/2411.11161
作者: Eric Yang,Pengfei Hu,Xiaoxue Han,Yue Ning
关键词-EN: vast electronic health, machine learning methods, offering valuable data, electronic health records, offering valuable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The adoption of digital systems in healthcare has resulted in the accumulation of vast electronic health records (EHRs), offering valuable data for machine learning methods to predict patient health outcomes. However, single-visit records of patients are often neglected in the training process due to the lack of annotations of next-visit information, thereby limiting the predictive and expressive power of machine learning models. In this paper, we present a novel framework MPLite that utilizes Multi-aspect Pretraining with Lab results through a light-weight neural network to enhance medical concept representation and predict future health outcomes of individuals. By incorporating both structured medical data and additional information from lab results, our approach fully leverages patient admission records. We design a pretraining module that predicts medical codes based on lab results, ensuring robust prediction by fusing multiple aspects of features. Our experimental evaluation using both MIMIC-III and MIMIC-IV datasets demonstrates improvements over existing models in diagnosis prediction and heart failure prediction tasks, achieving a higher weighted-F1 and recall with MPLite. This work reveals the potential of integrating diverse aspects of data to advance predictive modeling in healthcare.

[AI-49] TabDeco: A Comprehensive Contrastive Framework for Decoupled Representations in Tabular Data

链接: https://arxiv.org/abs/2411.11148
作者: Suiyao Chen,Jing Wu,Yunxiao Wang,Cheng Ji,Tianpei Xie,Daniel Cociorva,Michael Sharps,Cecile Levasseur,Hakan Brunzell
关键词-EN: modern artificial intelligence, driving substantial improvements, artificial intelligence, driving substantial, diverse applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Representation learning is a fundamental aspect of modern artificial intelligence, driving substantial improvements across diverse applications. While self-supervised contrastive learning has led to significant advancements in fields like computer vision and natural language processing, its adaptation to tabular data presents unique challenges. Traditional approaches often prioritize optimizing model architecture and loss functions but may overlook the crucial task of constructing meaningful positive and negative sample pairs from various perspectives like feature interactions, instance-level patterns and batch-specific contexts. To address these challenges, we introduce TabDeco, a novel method that leverages attention-based encoding strategies across both rows and columns and employs a contrastive learning framework to effectively disentangle feature representations at multiple levels, including features, instances and data batches. With the innovative feature decoupling hierarchies, TabDeco consistently surpasses existing deep learning methods and leading gradient boosting algorithms, including XGBoost, CatBoost, and LightGBM, across various benchmark tasks, underscoring its effectiveness in advancing tabular data representation learning.

[AI-50] CLMIA: Membership Inference Attacks via Unsupervised Contrastive Learning

链接: https://arxiv.org/abs/2411.11144
作者: Depeng Chen,Xiao Liu,Jie Cui,Hong Zhong(School of Computer Science and Technology, Anhui University)
关键词-EN: trained multiple times, limited data set, machine learning model, trained multiple, multiple times
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Since a machine learning model is often trained on a limited data set, the model sees the same data samples multiple times, which causes it to memorize most of the training set data. Membership Inference Attacks (MIAs) exploit this feature to determine whether a data sample is used for training a machine learning model. However, in realistic scenarios, it is difficult for the adversary to obtain enough qualified samples that mark accurate identity information, especially since most samples are non-members in real-world applications. To address this limitation, in this paper, we propose a new attack method called CLMIA, which uses unsupervised contrastive learning to train an attack model without using extra membership status information. Meanwhile, in CLMIA, we require only a small amount of data with known membership status to fine-tune the attack model. Experimental results demonstrate that CLMIA performs better than existing attack methods for different datasets and model structures, especially with data with less marked identity information. In addition, we experimentally find that the attack performs differently for different proportions of labeled identity information for member and non-member data. Further analysis shows that our attack method performs better with less labeled identity information, which applies to more realistic scenarios.

[AI-51] Label Sharing Incremental Learning Framework for Independent Multi-Label Segmentation Tasks

链接: https://arxiv.org/abs/2411.11105
作者: Deepa Anand,Bipul Das,Vyshnav Dangeti,Antony Jerald,Rakesh Mullick,Uday Patil,Pakhi Sharma,Prasad Sudhakar
关键词-EN: label, segmentation, label sets, labels, multiple datasets
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In a setting where segmentation models have to be built for multiple datasets, each with its own corresponding label set, a straightforward way is to learn one model for every dataset and its labels. Alternatively, multi-task architectures with shared encoders and multiple segmentation heads or shared weights with compound labels can also be made use of. This work proposes a novel label sharing framework where a shared common label space is constructed and each of the individual label sets are systematically mapped to the common labels. This transforms multiple datasets with disparate label sets into a single large dataset with shared labels, and therefore all the segmentation tasks can be addressed by learning a single model. This eliminates the need for task specific adaptations in network architectures and also results in parameter and data efficient models. Furthermore, label sharing framework is naturally amenable for incremental learning where segmentations for new datasets can be easily learnt. We experimentally validate our method on various medical image segmentation datasets, each involving multi-label segmentation. Furthermore, we demonstrate the efficacy of the proposed method in terms of performance and incremental learning ability vis-a-vis alternative methods.
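
To make the mapping step concrete, here is a minimal sketch of the label sharing idea; the shared label space and the dataset-specific label names below are hypothetical examples rather than the paper's actual ontology.

```python
# A minimal sketch of label sharing, assuming two hypothetical CT datasets with
# disjoint label sets that are mapped into one shared space so a single
# segmentation model can be trained across both.
SHARED_LABELS = {"background": 0, "large_organ": 1, "small_structure": 2, "vessel": 3}

DATASET_TO_SHARED = {
    "liver_ct":  {"background": "background", "liver": "large_organ", "hepatic_vessel": "vessel"},
    "kidney_ct": {"background": "background", "kidney": "large_organ", "tumor": "small_structure"},
}

def to_shared_ids(dataset_name, label_names):
    """Translate a dataset's private label names into shared integer ids."""
    mapping = DATASET_TO_SHARED[dataset_name]
    return [SHARED_LABELS[mapping[name]] for name in label_names]

print(to_shared_ids("liver_ct", ["background", "liver", "hepatic_vessel"]))   # [0, 1, 3]
print(to_shared_ids("kidney_ct", ["background", "kidney", "tumor"]))          # [0, 1, 2]
```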

[AI-52] Different Horses for Different Courses: Comparing Bias Mitigation Algorithms in ML NEURIPS2024

链接: https://arxiv.org/abs/2411.11101
作者: Prakhar Ganeesh,Usman Gohar,Lu Cheng,Golnoosh Farnadi
关键词-EN: bias mitigation techniques, attention in Machine, concerns gaining significant, gaining significant attention, fairness concerns gaining
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: To appear at AFME@NeurIPS 2024

点击查看摘要

Abstract:With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed, often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, etc., meaning that comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in fairness achieved by several algorithms and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance, given the freedom to perform hyperparameter optimization, suggesting that the choice of the evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and trends that guide the selection of appropriate algorithms.

[AI-53] Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2411.11099
作者: Ting Zhu,Yue Jin,Jeremie Houssineau,Giovanni Montana
关键词-EN: decentralized multi-agent reinforcement, multi-agent reinforcement learning, relative over-generalization, decentralized multi-agent, multi-agent reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Published in Transactions on Machine Learning Research (11/2024)

点击查看摘要

Abstract:In decentralized multi-agent reinforcement learning, agents learning in isolation can lead to relative over-generalization (RO), where optimal joint actions are undervalued in favor of suboptimal ones. This hinders effective coordination in cooperative tasks, as agents tend to choose actions that are individually rational but collectively suboptimal. To address this issue, we introduce MaxMax Q-Learning (MMQ), which employs an iterative process of sampling and evaluating potential next states, selecting those with maximal Q-values for learning. This approach refines approximations of ideal state transitions, aligning more closely with the optimal joint policy of collaborating agents. We provide theoretical analysis supporting MMQ’s potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.
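
A rough tabular sketch of the MaxMax idea described above; the candidate-next-state sampling and all names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mmq_update(Q, state, action, reward, candidate_next_states, alpha=0.1, gamma=0.99):
    """Toy MaxMax-style update: bootstrap from the best Q-value among several
    sampled candidate next states instead of a single observed one, which
    counteracts the undervaluation of optimal joint actions."""
    best_next = max(np.max(Q[s]) for s in candidate_next_states)
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
    return Q

Q = np.zeros((5, 2))                      # 5 states, 2 actions
Q = mmq_update(Q, state=0, action=1, reward=1.0, candidate_next_states=[1, 2, 3])
```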

[AI-54] Reinforcing Competitive Multi-Agents for Playing So Long Sucker

链接: https://arxiv.org/abs/2411.11057
作者: Medant Sharan,Chandranath Adak
关键词-EN: Long Sucker, deep reinforcement learning, classical deep reinforcement, Dueling DQN, diplomacy-driven game defined
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper examines the use of classical deep reinforcement learning (DRL) algorithms, DQN, DDQN, and Dueling DQN, in the strategy game So Long Sucker (SLS), a diplomacy-driven game defined by coalition-building and strategic betrayal. SLS poses unique challenges due to its blend of cooperative and adversarial dynamics, making it an ideal platform for studying multi-agent learning and game theory. The study’s primary goal is to teach autonomous agents the game’s rules and strategies using classical DRL methods. To support this effort, the authors developed a novel, publicly available implementation of SLS, featuring a graphical user interface (GUI) and benchmarking tools for DRL algorithms. Experimental results reveal that while considered basic by modern DRL standards, DQN, DDQN, and Dueling DQN agents achieved roughly 50% of the maximum possible game reward. This suggests a baseline understanding of the game’s mechanics, with agents favoring legal moves over illegal ones. However, a significant limitation was the extensive training required, around 2000 games, for agents to reach peak performance, compared to human players who grasp the game within a few rounds. Even after prolonged training, agents occasionally made illegal moves, highlighting both the potential and limitations of these classical DRL methods in semi-complex, socially driven games. The findings establish a foundational benchmark for training agents in SLS and similar negotiation-based environments while underscoring the need for advanced or hybrid DRL approaches to improve learning efficiency and adaptability. Future research could incorporate game-theoretic strategies to enhance agent decision-making in dynamic multi-agent contexts.

[AI-55] Knowledge-enhanced Transformer for Multivariate Long Sequence Time-series Forecasting

链接: https://arxiv.org/abs/2411.11046
作者: Shubham Tanaji Kakde,Rony Mitra,Jasashwi Mandal,Manoj Kumar Tiwari
关键词-EN: Long Sequence Time-series, Sequence Time-series Forecasting, Sequence Time-series, Multivariate Long Sequence, Long Sequence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Multivariate Long Sequence Time-series Forecasting (LSTF) has been a critical task across various real-world applications. Recent advancements focus on the application of transformer architectures attributable to their ability to capture temporal patterns effectively over extended periods. However, these approaches often overlook the inherent relationships and interactions between the input variables that could be drawn from their characteristic properties. In this paper, we aim to bridge this gap by integrating information-rich Knowledge Graph Embeddings (KGE) with state-of-the-art transformer-based architectures. We introduce a novel approach that encapsulates conceptual relationships among variables within a well-defined knowledge graph, forming dynamic and learnable KGEs for seamless integration into the transformer architecture. We investigate the influence of this integration into seminal architectures such as PatchTST, Autoformer, Informer, and Vanilla Transformer. Furthermore, we thoroughly investigate the performance of these knowledge-enhanced architectures along with their original implementations for long forecasting horizons and demonstrate significant improvement in the benchmark results. This enhancement empowers transformer-based architectures to address the inherent structural relation between variables. Our knowledge-enhanced approach improves the accuracy of multivariate LSTF by capturing complex temporal and relational dynamics across multiple domains. To substantiate the validity of our model, we conduct comprehensive experiments using Weather and Electric Transformer Temperature (ETT) datasets.

[AI-56] Wafer Map Defect Classification Using Autoencoder-Based Data Augmentation and Convolutional Neural Network

链接: https://arxiv.org/abs/2411.11029
作者: Yin-Yin Bao,Er-Chao Li,Hong-Qiang Yang,Bin-Bin Jia
关键词-EN: enhancing process yields, revealing critical defect, critical defect patterns, semiconductor manufacturing, play a crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*备注: 26 pages, 11 figures, including dataset preprocessing, proposed methods, and experimental results

点击查看摘要

Abstract:In semiconductor manufacturing, wafer defect maps (WDMs) play a crucial role in diagnosing issues and enhancing process yields by revealing critical defect patterns. However, accurately categorizing WDM defects presents significant challenges due to noisy data, unbalanced defect classes, and the complexity of failure modes. To address these challenges, this study proposes a novel method combining an autoencoder-based data augmentation technique with a convolutional neural network (CNN). By introducing noise into the latent space, the autoencoder enhances data diversity and mitigates class imbalance, thereby improving the model’s generalization capabilities. The augmented dataset is subsequently used to train the CNN, enabling it to deliver precise classification of both common and rare defect patterns. Experimental results on the WM-811K dataset demonstrate that the proposed method achieves a classification accuracy of 98.56%, surpassing Random Forest, SVM, and Logistic Regression by 19%, 21%, and 27%, respectively. These findings highlight the robustness and effectiveness of the proposed approach, offering a reliable solution for wafer defect detection and classification.
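
The latent-space augmentation idea can be sketched as follows; the tiny autoencoder, the 64x64 input size, and the noise level are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class WaferAutoencoder(nn.Module):
    """Tiny illustrative autoencoder; the real model and image size differ."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 64 * 64), nn.Sigmoid())

    def augment(self, x, noise_std=0.1):
        z = self.encoder(x)
        z = z + noise_std * torch.randn_like(z)        # inject noise in latent space
        return self.decoder(z).view(-1, 1, 64, 64)     # decode to synthetic wafer maps

ae = WaferAutoencoder()
synthetic = ae.augment(torch.rand(8, 1, 64, 64))       # extra samples for rare defect classes
```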

[AI-57] Time Step Generating: A Universal Synthesized Deepfake Image Detector CVPR2025

链接: https://arxiv.org/abs/2411.11016
作者: Ziyue Zeng,Haoyuan Liu,Dingjie Peng,Luoxu Jing,Hiroshi Watanabe
关键词-EN: accelerating pace, pre-trained generative model, diffusion model, pre-trained diffusion model, model
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Submitted to CVPR 2025, 9 pages, 7 figures

点击查看摘要

Abstract:Currently, high-fidelity text-to-image models are being developed at an accelerating pace. Among them, Diffusion Models have led to a remarkable improvement in the quality of image generation, making it very challenging to distinguish between real and synthesized images. This simultaneously raises serious concerns regarding privacy and security. Some methods have been proposed to distinguish diffusion-model-generated images through reconstruction. However, the inversion and denoising processes are time-consuming and heavily reliant on the pre-trained generative model. Consequently, if the pre-trained generative model encounters out-of-domain data, the detection performance declines. To address this issue, we propose a universal synthetic image detector, Time Step Generating (TSG), which does not rely on pre-trained models’ reconstructing ability, specific datasets, or sampling algorithms. Our method utilizes a pre-trained diffusion model’s network as a feature extractor to capture fine-grained details, focusing on the subtle differences between real and synthetic images. By controlling the time step t of the network input, we can effectively extract these distinguishing detail features. Then, those features can be passed through a classifier (i.e., a ResNet), which efficiently detects whether an image is synthetic or real. We test the proposed TSG on the large-scale GenImage benchmark and it achieves significant improvements in both accuracy and generalizability.

[AI-58] BackdoorMBTI: A Backdoor Learning Multimodal Benchmark Tool Kit for Backdoor Defense Evaluation

链接: https://arxiv.org/abs/2411.11006
作者: Haiyang Yu,Tian Xie,Jiaping Gui,Pengyang Wang,Ping Yi,Yue Wu
关键词-EN: backdoor learning toolkit, toolkit and benchmark, benchmark designed, eleven commonly, backdoor learning
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce BackdoorMBTI, the first backdoor learning toolkit and benchmark designed for multimodal evaluation across three representative modalities from eleven commonly used datasets. BackdoorMBTI provides a systematic backdoor learning pipeline, encompassing data processing, data poisoning, backdoor training, and evaluation. The generated poison datasets and backdoor models enable detailed evaluation of backdoor defense methods. Given the diversity of modalities, BackdoorMBTI facilitates systematic evaluation across different data types. Furthermore, BackdoorMBTI offers a standardized approach to handling practical factors in backdoor learning, such as issues related to data quality and erroneous labels. We anticipate that BackdoorMBTI will expedite future research in backdoor defense methods within a multimodal context. Code is available at this https URL.

[AI-59] Unveiling the Hidden: Online Vectorized HD Map Construction with Clip-Level Token Interaction and Propagation NEURIPS2024

链接: https://arxiv.org/abs/2411.11002
作者: Nayeon Kim,Hongje Seong,Daehyun Ji,Sujin Jang
关键词-EN: Predicting and constructing, lane lines, safe autonomous driving, constructing road geometric, crucial task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 18 pages, 9 figures, NeurIPS 2024

点击查看摘要

Abstract:Predicting and constructing road geometric information (e.g., lane lines, road markers) is a crucial task for safe autonomous driving, while such static map elements can be repeatedly occluded by various dynamic objects on the road. Recent studies have shown significantly improved vectorized high-definition (HD) map construction performance, but there has been insufficient investigation of temporal information across adjacent input frames (i.e., clips), which may lead to inconsistent and suboptimal prediction results. To tackle this, we introduce a novel paradigm of clip-level vectorized HD map construction, MapUnveiler, which explicitly unveils the occluded map elements within a clip input by relating dense image representations with efficient clip tokens. Additionally, MapUnveiler associates inter-clip information through clip token propagation, effectively utilizing long-term temporal map information. MapUnveiler runs efficiently with the proposed clip-level pipeline by avoiding redundant computation with temporal stride while building a global map relationship. Our extensive experiments demonstrate that MapUnveiler achieves state-of-the-art performance on both the nuScenes and Argoverse2 benchmark datasets. We also showcase that MapUnveiler significantly outperforms state-of-the-art approaches in a challenging setting, achieving +10.7% mAP improvement in heavily occluded driving road scenes. The project page can be found at this https URL.

[AI-60] Modulating Reservoir Dynamics via Reinforcement Learning for Efficient Robot Skill Synthesis

链接: https://arxiv.org/abs/2411.10991
作者: Zahra Koulaeizadeh,Erhan Oztop
关键词-EN: recurrent neural network, random recurrent neural, encode task goals, Learning, neural network
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:A random recurrent neural network, called a reservoir, can be used to learn robot movements conditioned on context inputs that encode task goals. Learning is achieved by mapping the random dynamics of the reservoir modulated by context to desired trajectories via linear regression. This makes the reservoir computing (RC) approach computationally efficient as no iterative gradient descent learning is needed. In this work, we propose a novel RC-based Learning from Demonstration (LfD) framework that not only learns to generate the demonstrated movements but also allows online modulation of the reservoir dynamics to generate movement trajectories that are not covered by the initial demonstration set. This is made possible by using a Reinforcement Learning (RL) module that learns a policy to output context as its actions based on the robot state. Considering that the context dimension is typically low, learning with the RL module is very efficient. We show the validity of the proposed model with systematic experiments on a 2 degrees-of-freedom (DOF) simulated robot that is taught to reach targets, encoded as context, with and without an obstacle avoidance constraint. The initial data set includes a set of reaching demonstrations which are learned by the reservoir system. To enable reaching out-of-distribution targets, the RL module is engaged in learning a policy to generate dynamic contexts so that the generated trajectory achieves the desired goal without any learning in the reservoir system. Overall, the proposed model uses an initial learned motor primitive set to efficiently generate diverse motor behaviors guided by the designed reward function. Thus the model can be used as a flexible and effective LfD system where the action repertoire can be extended without new data collection.
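
A minimal echo-state-style sketch of the reservoir idea, assuming a fixed random recurrent network whose readout is fitted by ordinary least squares; the dimensions and the toy demonstration trajectory are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, n_ctx, T = 200, 2, 100
W_res = rng.normal(scale=1.0 / np.sqrt(n_res), size=(n_res, n_res))  # fixed random recurrence
W_in = rng.normal(size=(n_res, n_ctx))                                # fixed input weights

def run_reservoir(context):
    h, states = np.zeros(n_res), []
    for _ in range(T):
        h = np.tanh(W_res @ h + W_in @ context)    # context-modulated random dynamics
        states.append(h.copy())
    return np.asarray(states)                      # (T, n_res)

context = np.array([0.5, -0.3])                                  # e.g. an encoded target
demo_traj = np.linspace(0, 1, T)[:, None] * context              # demonstrated trajectory
S = run_reservoir(context)
W_out, *_ = np.linalg.lstsq(S, demo_traj, rcond=None)            # readout via linear regression
reproduced = S @ W_out                                           # reservoir-generated movement
```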

[AI-61] VidComposition: Can MLLM s Analyze Compositions in Compiled Videos?

链接: https://arxiv.org/abs/2411.10979
作者: Yunlong Tang,Junjia Guo,Hang Hua,Susan Liang,Mingqian Feng,Xinyang Li,Rui Mao,Chao Huang,Jing Bi,Zeliang Zhang,Pooyan Fazli,Chenliang Xu
关键词-EN: Multimodal Large Language, Large Language Models, Large Language, Multimodal Large, analyze video content
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. The leaderboard and evaluation code are available at this https URL.

[AI-62] SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

链接: https://arxiv.org/abs/2411.10958
作者: Jintao Zhang,Haofeng Huang,Pengle Zhang,Jia Wei,Jun Zhu,Jianfei Chen
关键词-EN: process remains limited, attention process remains, matrix multiplication, remains limited, application to accelerate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. SageAttention utilizes 8-bit matrix multiplication, 16-bit matrix multiplication with a 16-bit accumulator, and precision-enhancing methods, implementing an accurate kernel with a 2x speedup over FlashAttention2. To further enhance the efficiency of attention computation while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize the matrices (Q, K) to INT4 at warp-level granularity and quantize the matrices (P̃, V) to FP8. Second, we propose a method to smooth Q and V, enhancing the accuracy of attention with INT4 QK and FP8 PV. Third, we analyze the quantization accuracy across timesteps and layers, then propose an adaptive quantization method to ensure the end-to-end metrics over various models. The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 5x on RTX4090, respectively. Comprehensive experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation. The codes are available at this https URL.
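
A toy sketch of the INT4 quantization idea for the attention scores; per-row scaling here stands in for the paper's warp-level granularity, and the smoothing and adaptive quantization steps are omitted.

```python
import torch

def quantize_int4(x):
    """Symmetric per-row INT4 quantization (values clipped to [-7, 7])."""
    scale = x.abs().amax(dim=-1, keepdim=True) / 7.0 + 1e-8
    return torch.clamp(torch.round(x / scale), -7, 7), scale

def int4_attention_scores(Q, K):
    q_int, q_scale = quantize_int4(Q)
    k_int, k_scale = quantize_int4(K)
    # The integer matmul is emulated in floating point here; the paper's kernels
    # run it on INT4 tensor cores and add smoothing / adaptive quantization.
    logits = (q_int @ k_int.transpose(-2, -1)) * (q_scale * k_scale.transpose(-2, -1))
    return torch.softmax(logits / Q.shape[-1] ** 0.5, dim=-1)

scores = int4_attention_scores(torch.randn(2, 8, 16, 64), torch.randn(2, 8, 16, 64))
```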

[AI-63] IMPaCT GNN: Imposing invariance with Message Passing in Chronological split Temporal Graphs

链接: https://arxiv.org/abs/2411.10957
作者: Sejun Park,Joo Young Park,Hyunwoo Park
关键词-EN: paper addresses domain, paper addresses, Semi-Supervised Node Classification, domain adaptation challenges, addresses domain adaptation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 11 pages (without appendix), 35 pages (with appendix), 14 figures

点击查看摘要

Abstract:This paper addresses domain adaptation challenges in graph data resulting from chronological splits. In a transductive graph learning setting, where each node is associated with a timestamp, we focus on the task of Semi-Supervised Node Classification (SSNC), aiming to classify recent nodes using labels of past nodes. Temporal dependencies in node connections create domain shifts, causing significant performance degradation when applying models trained on historical data into recent data. Given the practical relevance of this scenario, addressing domain adaptation in chronological split data is crucial, yet underexplored. We propose Imposing invariance with Message Passing in Chronological split Temporal Graphs (IMPaCT), a method that imposes invariant properties based on realistic assumptions derived from temporal graph structures. Unlike traditional domain adaptation approaches which rely on unverifiable assumptions, IMPaCT explicitly accounts for the characteristics of chronological splits. The IMPaCT is further supported by rigorous mathematical analysis, including a derivation of an upper bound of the generalization error. Experimentally, IMPaCT achieves a 3.8% performance improvement over current SOTA method on the ogbn-mag graph dataset. Additionally, we introduce the Temporal Stochastic Block Model (TSBM), which replicates temporal graphs under varying conditions, demonstrating the applicability of our methods to general spatial GNNs.

[AI-64] Hyperspectral Imaging-Based Grain Quality Assessment With Limited Labelled Data

链接: https://arxiv.org/abs/2411.10924
作者: Priyabrata Karmakar,Manzur Murshed,Shyh Wei Teng
关键词-EN: gained research attention, Recently hyperspectral imaging, Recently hyperspectral, research attention, gained research
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:Recently, hyperspectral imaging (HSI)-based grain quality assessment has gained research attention. However, unlike other imaging modalities, HSI data lacks sufficient labelled samples required to effectively train deep convolutional neural network (DCNN)-based classifiers. In this paper, we present a novel approach to grain quality assessment using HSI combined with few-shot learning (FSL) techniques. Traditional methods for grain quality evaluation, while reliable, are invasive, time-consuming, and costly. HSI offers a non-invasive, real-time alternative by capturing both spatial and spectral information. However, a significant challenge in applying DCNNs for HSI-based grain classification is the need for large labelled databases, which are often difficult to obtain. To address this, we explore the use of FSL, which enables models to perform well with limited labelled data, making it a practical solution for real-world applications where rapid deployment is required. We also explored the application of FSL for the classification of hyperspectral images of bulk grains to enable rapid quality assessment at various receival points in the grain supply chain. We evaluated the performance of few-shot classifiers in two scenarios: first, classification of grain types seen during training, and second, generalisation to unseen grain types, a crucial feature for real-world applications. In the first scenario, we introduce a novel approach using pre-computed collective class prototypes (CCPs) to enhance inference efficiency and robustness. In the second scenario, we assess the model’s ability to classify novel grain types using limited support examples. Our experimental results show that despite using very limited labelled data for training, our FSL classifiers’ accuracy is comparable to that of a fully trained classifier trained using a significantly larger labelled database.
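
The prototype-based classification step can be sketched as follows; the simple per-class mean below only approximates the paper's collective class prototypes, and the embedding dimensions are arbitrary.

```python
import numpy as np

def build_prototypes(support_embeddings, support_labels):
    """One prototype per class: the mean of that class's support embeddings."""
    return {c: support_embeddings[support_labels == c].mean(axis=0)
            for c in np.unique(support_labels)}

def classify(query_embedding, prototypes):
    """Assign the query to the class whose prototype is nearest."""
    return min(prototypes, key=lambda c: np.linalg.norm(query_embedding - prototypes[c]))

emb = np.random.randn(10, 128)                 # embeddings from a (hypothetical) HSI backbone
labels = np.array([0] * 5 + [1] * 5)           # two grain types, five shots each
protos = build_prototypes(emb, labels)
print(classify(np.random.randn(128), protos))
```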

[AI-65] Multi-Modal Self-Supervised Learning for Surgical Feedback Effectiveness Assessment ALT

链接: https://arxiv.org/abs/2411.10919
作者: Arushi Gupta,Rafal Kocielnik,Jiayun Wang,Firdavs Nasriddinov,Cherine Yang,Elyssa Wong,Anima Anandkumar,Andrew Hung
关键词-EN: long-term skill acquisition, enhancing long-term skill, skill acquisition, important for preventing, preventing errors
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted as a spotlight proceedings paper at Machine Learning for Health 2024

点击查看摘要

Abstract:During surgical training, real-time feedback from trainers to trainees is important for preventing errors and enhancing long-term skill acquisition. Accurately predicting the effectiveness of this feedback, specifically whether it leads to a change in trainee behavior, is crucial for developing methods for improving surgical training and education. However, relying on human annotations to assess feedback effectiveness is laborious and prone to biases, underscoring the need for an automated, scalable, and objective method. Creating such an automated system poses challenges, as it requires an understanding of both the verbal feedback delivered by the trainer and the visual context of the real-time surgical scene. To address this, we propose a method that integrates information from transcribed verbal feedback and corresponding surgical video to predict feedback effectiveness. Our findings show that both transcribed feedback and surgical video are individually predictive of trainee behavior changes, and their combination achieves an AUROC of 0.70+/-0.02, improving prediction accuracy by up to 6.6%. Additionally, we introduce self-supervised fine-tuning as a strategy for enhancing surgical video representation learning, which is scalable and further enhances prediction performance. Our results demonstrate the potential of multi-modal learning to advance the automated assessment of surgical feedback.

[AI-66] LLM -assisted Physical Invariant Extraction for Cyber-Physical Systems Anomaly Detection

链接: https://arxiv.org/abs/2411.10918
作者: Danial Abshari,Chenglong Fu,Meera Sridhar
关键词-EN: Modern industrial infrastructures, potentially catastrophic effects, industrial infrastructures rely, infrastructures rely heavily, Modern industrial
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Modern industrial infrastructures rely heavily on Cyber-Physical Systems (CPS), but these are vulnerable to cyber-attacks with potentially catastrophic effects. To reduce these risks, anomaly detection methods based on physical invariants have been developed. However, these methods often require domain-specific expertise to manually define invariants, making them costly and difficult to scale. To address this limitation, we propose a novel approach to extract physical invariants from CPS testbeds for anomaly detection. Our insight is that CPS design documentation often contains semantically rich descriptions of physical procedures, which can profile inter-correlated dynamics among system components. Leveraging the built-in physics and engineering knowledge of recent generative AI models, we aim to automate this traditionally manual process, improving scalability and reducing costs. This work focuses on designing and optimizing a Retrieval-Augmented-Generation (RAG) workflow with a customized prompting system tailored for CPS documentation, enabling accurate extraction of semantic information and inference of physical invariants from complex, multimodal content. Then, rather than directly applying the inferred invariants for anomaly detection, we introduce an innovative statistics-based learning approach that integrates these invariants into the training dataset. This method addresses limitations such as hallucination and concept drift, enhancing the reliability of the model. We evaluate our approach on real-world public CPS security dataset which contains 86 data points and 58 attacking cases. The results show that our approach achieves a high precision of 0.923, accurately detecting anomalies while minimizing false alarms.

[AI-67] Evolution of IVR building techniques: from code writing to AI-powered automation

链接: https://arxiv.org/abs/2411.10895
作者: Khushbu Mehboob Shaikh,Georgios Giannakopoulos
关键词-EN: Interactive Voice Response, undergone significant transformation, user-friendly approaches leveraging, approaches leveraging widgets, Interactive Voice
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:Interactive Voice Response (IVR) systems have undergone significant transformation in recent years, moving from traditional code-based development to more user-friendly approaches leveraging widgets and, most recently, harnessing the power of Artificial Intelligence (AI) for automated IVR flow creation. This paper explores the evolution of IVR building techniques, highlighting the industry’s revolution and shaping the future of IVR systems. The authors delve into the historical context, current trends, and future prospects of IVR development, elucidating the impact of AI on simplifying IVR creation processes and enhancing customer experiences.

[AI-68] MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation

链接: https://arxiv.org/abs/2411.10886
作者: Ansh Shah,K Madhava Krishna
关键词-EN: Recovering metric depth, single image remains, Recovering metric, metric depth, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
*备注: arXiv admin note: substantial text overlap with arXiv:2312.02145 by other authors

点击查看摘要

Abstract:Recovering metric depth from a single image remains a fundamental challenge in computer vision, requiring both scene understanding and accurate scaling. While deep learning has advanced monocular depth estimation, current models often struggle with unfamiliar scenes and layouts, particularly in zero-shot scenarios and when predicting scale-ergodic metric depth. We present MetricGold, a novel approach that harnesses generative diffusion model’s rich priors to improve metric depth estimation. Building upon recent advances in MariGold, DDVM and Depth Anything V2 respectively, our method combines latent diffusion, log-scaled metric depth representation, and synthetic data training. MetricGold achieves efficient training on a single RTX 3090 within two days using photo-realistic synthetic data from HyperSIM, VirtualKitti, and TartanAir. Our experiments demonstrate robust generalization across diverse datasets, producing sharper and higher quality metric depth estimates compared to existing approaches.

[AI-69] Developer Perspectives on Licensing and Copyright Issues Arising from Generative AI for Coding

链接: https://arxiv.org/abs/2411.10877
作者: Trevor Stalnaker,Nathan Wintersgill,Oscar Chaparro,Laura A. Heymann,Massimiliano Di Penta,Daniel M German,Denys Poshyvanyk
关键词-EN: software development practices, transform software development, started to transform, development practices, Generative
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative AI (GenAI) tools have already started to transform software development practices. Despite their utility in tasks such as writing code, the use of these tools raises important legal questions and potential risks, particularly those associated with copyright law. In the midst of this uncertainty, this paper presents a study jointly conducted by software engineering and legal researchers that surveyed 574 GitHub developers who use GenAI tools for development activities. The survey and follow-up interviews probed the developers’ opinions on emerging legal issues as well as their perception of copyrightability, ownership of generated code, and related considerations. We also investigate potential developer misconceptions, the impact of GenAI on developers’ work, and developers’ awareness of licensing/copyright risks. Qualitative and quantitative analysis showed that developers’ opinions on copyright issues vary broadly and that many developers are aware of the nuances these legal questions involve. We provide: (1) a survey of 574 developers on the licensing and copyright aspects of GenAI for coding, (2) a snapshot of practitioners’ views at a time when GenAI and perceptions of it are rapidly evolving, and (3) an analysis of developers’ views, yielding insights and recommendations that can inform future regulatory decisions in this evolving field.

[AI-70] ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models

链接: https://arxiv.org/abs/2411.10867
作者: Vipula Rawte,Sarthak Jain,Aarush Sinha,Garv Kaushik,Aman Bansal,Prathiksha Rumale Vishwanath,Samyak Rajesh Jain,Aishwarya Naresh Reganti,Vinija Jain,Aman Chadha,Amit P. Sheth,Amitava Das
关键词-EN: Large Multimodal Models, Large Multimodal, include video understanding, Multimodal Models, Latest developments
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Latest developments in Large Multimodal Models (LMMs) have broadened their capabilities to include video understanding. Specifically, Text-to-video (T2V) models have made significant progress in quality, comprehension, and duration, excelling at creating videos from simple textual prompts. Yet, they still frequently produce hallucinated content that clearly signals the video is AI-generated. We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos annotated by humans into these five categories. ViBe offers a unique resource for evaluating the reliability of T2V models and provides a foundation for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present various ensemble classifier configurations, with the TimeSFormer + CNN combination yielding the best performance, achieving 0.345 accuracy and 0.342 F1 score. This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.

[AI-71] See-Saw Generative Mechanism for Scalable Recursive Code Generation with Generative AI

链接: https://arxiv.org/abs/2411.10861
作者: Ruslan Idelfonso Magaña Vsevolodovna
关键词-EN: iterative refinement requirements, models presents challenges, presents challenges due, refinement requirements, models presents
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:The generation of complex, large-scale code projects using generative AI models presents challenges due to token limitations, dependency management, and iterative refinement requirements. This paper introduces the See-Saw generative mechanism, a novel methodology for dynamic and recursive code generation. The proposed approach alternates between main code updates and dependency generation to ensure alignment and functionality. By dynamically optimizing token usage and incorporating key elements of the main code into the generation of dependencies, the method enables efficient and scalable code generation for projects requiring hundreds of interdependent files. The mechanism ensures that all code components are synchronized and functional, enabling scalable and efficient project generation. Experimental validation demonstrates the method’s capability to manage dependencies effectively while maintaining coherence and minimizing computational overhead.

[AI-72] CODECLEANER: Elevating Standards with A Robust Data Contamination Mitigation Toolkit

链接: https://arxiv.org/abs/2411.10842
作者: Jialun Cao,Songqiang Chen,Wuqi Zhang,Hau Ching Lo,Shing-Chi Cheung
关键词-EN: critical barrier preventing, advanced software engineering, barrier preventing widespread, critical barrier, barrier preventing
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data contamination presents a critical barrier preventing widespread industrial adoption of advanced software engineering techniques that leverage code language models (CLMs). This phenomenon occurs when evaluation data inadvertently overlaps with the public code repositories used to train CLMs, severely undermining the credibility of performance evaluations. For software companies considering the integration of CLM-based techniques into their development pipeline, this uncertainty about true performance metrics poses an unacceptable business risk. Code refactoring, which comprises code restructuring and variable renaming, has emerged as a promising measure to mitigate data contamination. It provides a practical alternative to the resource-intensive process of building contamination-free evaluation datasets, which would require companies to collect, clean, and label code created after the CLMs’ training cutoff dates. However, the lack of automated code refactoring tools and scientifically validated refactoring techniques has hampered widespread industrial implementation. To bridge the gap, this paper presents the first systematic study to examine the efficacy of code refactoring operators at multiple scales (method-level, class-level, and cross-class level) and in different programming languages. In particular, we develop an open-sourced toolkit, CODECLEANER, which includes 11 operators for Python, with nine method-level, one class-level, and one cross-class-level operator. A drop of 65% overlap ratio is found when applying all operators in CODECLEANER, demonstrating their effectiveness in addressing data contamination. Additionally, we migrate four operators to Java, showing their generalizability to another language. We make CODECLEANER online available to facilitate further studies on mitigating CLM data contamination.
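
As an illustration of what a method-level refactoring operator might look like (not CODECLEANER's actual implementation), the sketch below renames locally assigned variables with Python's ast module.

```python
import ast   # requires Python 3.9+ for ast.unparse

class RenameLocals(ast.NodeTransformer):
    """Rename variables assigned inside the snippet, leaving parameters and
    external names untouched, so surface token overlap with training data drops."""
    def __init__(self, assigned):
        self.mapping = {name: f"var_{i}" for i, name in enumerate(sorted(assigned))}

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

def rename_locals(source):
    tree = ast.parse(source)
    assigned = {n.id for n in ast.walk(tree)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    return ast.unparse(RenameLocals(assigned).visit(tree))

print(rename_locals("def area(w, h):\n    result = w * h\n    return result"))
```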

[AI-73] Adaptive Learning of Design Strategies over Non-Hierarchical Multi-Fidelity Models via Policy Alignment

链接: https://arxiv.org/abs/2411.10841
作者: Akash Agrawal(1),Christopher McComb(1) ((1) Carnegie Mellon University)
关键词-EN: Multi-fidelity Reinforcement Learning, Reinforcement Learning, Multi-fidelity Reinforcement, frameworks significantly enhance, computational costs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 48 pages, 20 figures

点击查看摘要

Abstract:Multi-fidelity Reinforcement Learning (RL) frameworks significantly enhance the efficiency of engineering design by leveraging analysis models with varying levels of accuracy and computational costs. The prevailing methodologies, characterized by transfer learning, human-inspired strategies, control variate techniques, and adaptive sampling, predominantly depend on a structured hierarchy of models. However, this reliance on a model hierarchy overlooks the heterogeneous error distributions of models across the design space, extending beyond mere fidelity levels. This work proposes ALPHA (Adaptively Learned Policy with Heterogeneous Analyses), a novel multi-fidelity RL framework to efficiently learn a high-fidelity policy by adaptively leveraging an arbitrary set of non-hierarchical, heterogeneous, low-fidelity models alongside a high-fidelity model. Specifically, low-fidelity policies and their experience data are dynamically used for efficient targeted learning, guided by their alignment with the high-fidelity policy. The effectiveness of ALPHA is demonstrated in analytical test optimization and octocopter design problems, utilizing two low-fidelity models alongside a high-fidelity one. The results highlight ALPHA’s adaptive capability to dynamically utilize models across time and design space, eliminating the need for scheduling models as required in a hierarchical framework. Furthermore, the adaptive agents find more direct paths to high-performance solutions, showing superior convergence behavior compared to hierarchical agents.

[AI-74] One-Layer Transformer Provably Learns One-Nearest Neighbor In Context

链接: https://arxiv.org/abs/2411.10830
作者: Zihao Li,Yuan Cao,Cheng Gao,Yihan He,Han Liu,Jason M. Klusowski,Jianqing Fan,Mengdi Wang
关键词-EN: achieved great success, recent years, achieved great, great success, success in recent
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Transformers have achieved great success in recent years. Interestingly, transformers have shown particularly strong in-context learning capability – even without fine-tuning, they are still able to solve unseen tasks well purely based on task-specific prompts. In this paper, we study the capability of one-layer transformers in learning one of the most classical nonparametric estimators, the one-nearest neighbor prediction rule. Under a theoretical framework where the prompt contains a sequence of labeled training data and unlabeled test data, we show that, although the loss function is nonconvex when trained with gradient descent, a single softmax attention layer can successfully learn to behave like a one-nearest neighbor classifier. Our result gives a concrete example of how transformers can be trained to implement nonparametric machine learning algorithms, and sheds light on the role of softmax attention in transformer models.
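
A small numerical illustration of the claim, assuming a squared-distance attention score with a sharpness parameter beta; as beta grows, the softmax-weighted prediction collapses onto the nearest neighbor's label.

```python
import numpy as np

def attention_predict(X_train, y_train, x_query, beta=50.0):
    """Softmax attention over (x_i, y_i) pairs; with a sharp enough score scale
    beta, the prediction approaches the label of the nearest training point."""
    scores = -beta * np.sum((X_train - x_query) ** 2, axis=1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ y_train

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y = np.array([0.0, 1.0, 0.0])
print(attention_predict(X, y, np.array([0.9, 1.1])))   # ~1.0, the nearest neighbor's label
```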

[AI-75] Integrated Machine Learning and Survival Analysis Modeling for Enhanced Chronic Kidney Disease Risk Stratification ML4H ALT

链接: https://arxiv.org/abs/2411.10754
作者: Zachary Dana,Ahmed Ammar Naseer,Botros Toro,Sumanth Swaminathan
关键词-EN: public health challenge, significant public health, Chronic kidney disease, end-stage renal disease, Chronic kidney
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 19 pages

点击查看摘要

Abstract:Chronic kidney disease (CKD) is a significant public health challenge, often progressing to end-stage renal disease (ESRD) if not detected and managed early. Early intervention, warranted by silent disease progression, can significantly reduce associated morbidity, mortality, and financial burden. In this study, we propose a novel approach to modeling CKD progression using a combination of machine learning techniques and classical statistical models. Building on the work of Liu et al. (2023), we evaluate linear models, tree-based methods, and deep learning models to extract novel predictors for CKD progression, with feature importance assessed using Shapley values. These newly identified predictors, integrated with established clinical features from the Kidney Failure Risk Equation, are then applied within the framework of Cox proportional hazards models to predict CKD progression.
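
A minimal sketch of the final survival-modeling stage using the lifelines library; the column names, toy records, and penalizer value are hypothetical placeholders rather than the study's data or settings.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical toy records: two established clinical features plus one
# ML-derived risk score feeding a Cox proportional hazards model.
df = pd.DataFrame({
    "egfr":          [55, 42, 30, 68, 50, 35, 60, 45],
    "acr":           [3.0, 30.0, 300.0, 1.2, 20.0, 150.0, 5.0, 80.0],
    "ml_risk_score": [0.2, 0.6, 0.9, 0.1, 0.4, 0.7, 0.3, 0.5],
    "time_to_esrd":  [1200, 800, 300, 1500, 900, 400, 1300, 700],   # days of follow-up
    "event":         [0, 1, 1, 0, 1, 1, 0, 0],                      # 1 = progressed to ESRD
})

cph = CoxPHFitter(penalizer=0.5)          # penalized fit for this tiny illustrative sample
cph.fit(df, duration_col="time_to_esrd", event_col="event")
cph.print_summary()
```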

[AI-76] LTCXNet: Advancing Chest X-Ray Analysis with Solutions for Long-Tailed Multi-Label Classification and Fairness Challenges

链接: https://arxiv.org/abs/2411.10746
作者: Chin-Wei Huang,Mu-Yi Shen,Kuan-Chang Shih,Shih-Chih Lin,Chi-Yu Chen,Po-Chih Kuo
关键词-EN: Chest X-rays, disparate class frequencies, class frequencies, multi-label data, display various diseases
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Chest X-rays (CXRs) often display various diseases with disparate class frequencies, leading to a long-tailed, multi-label data distribution. In response to this challenge, we explore the Pruned MIMIC-CXR-LT dataset, a curated collection derived from the MIMIC-CXR dataset, specifically designed to represent a long-tailed and multi-label data scenario. We introduce LTCXNet, a novel framework that integrates the ConvNeXt model, ML-Decoder, and strategic data augmentation, further enhanced by an ensemble approach. We demonstrate that LTCXNet improves the performance of CXR interpretation across all classes, especially enhancing detection in rarer classes like 'Pneumoperitoneum' and 'Pneumomediastinum' by 79% and 48%, respectively. Beyond performance metrics, our research extends into evaluating fairness, highlighting that some methods, while improving model accuracy, could inadvertently affect fairness across different demographic groups negatively. This work contributes to advancing the understanding and management of long-tailed, multi-label data distributions in medical imaging, paving the way for more equitable and effective diagnostic tools.

[AI-77] MetaLA: Unified Optimal Linear Approximation to Softmax Attention Map

链接: https://arxiv.org/abs/2411.10741
作者: Yuhong Chou,Man Yao,Kexin Wang,Yuqi Pan,Ruijie Zhu,Yiran Zhong,Yu Qiao,Jibin Wu,Bo Xu,Guoqi Li
关键词-EN: State Space Model, State Space, Transformer structures, Linear RNN, Linear Transformer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Various linear complexity models, such as Linear Transformer (LinFormer), State Space Model (SSM), and Linear RNN (LinRNN), have been proposed to replace the conventional softmax attention in Transformer structures. However, the optimal design of these linear models is still an open question. In this work, we attempt to answer this question by finding the best linear approximation to softmax attention from a theoretical perspective. We start by unifying existing linear complexity models as the linear attention form and then identify three conditions for the optimal linear attention design: 1) Dynamic memory ability; 2) Static approximation ability; 3) Least parameter approximation. We find that none of the current linear models meet all three conditions, resulting in suboptimal performance. Instead, we propose Meta Linear Attention (MetaLA) as a solution that satisfies these conditions. Our experiments on Multi-Query Associative Recall (MQAR) task, language modeling, image classification, and Long-Range Arena (LRA) benchmark demonstrate that MetaLA is more effective than the existing linear models.
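
For reference, a generic linear-attention form of the kind the paper unifies (an illustrative sketch, not MetaLA itself), with an ELU-based positive feature map.

```python
import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, phi=lambda x: F.elu(x) + 1):
    """Generic linear-attention form: softmax(QK^T)V is replaced by a positive
    feature map phi and an O(N) key-value summary."""
    Qf, Kf = phi(Q), phi(K)
    kv = torch.einsum('bnd,bne->bde', Kf, V)               # summed key-value outer products
    z = Kf.sum(dim=1)                                       # normalizer
    num = torch.einsum('bnd,bde->bne', Qf, kv)
    den = torch.einsum('bnd,bd->bn', Qf, z).unsqueeze(-1) + 1e-6
    return num / den

out = linear_attention(torch.randn(2, 16, 32), torch.randn(2, 16, 32), torch.randn(2, 16, 32))
```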

[AI-78] HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization

链接: https://arxiv.org/abs/2411.10696
作者: Huaqin Zhao,Jiaxi Li,Yi Pan,Shizhe Liang,Xiaofeng Yang,Wei Liu,Xiang Li,Fei Dou,Tianming Liu,Jin Lu
关键词-EN: demands extensive resources, back-propagation process demands, process demands extensive, poses significant memory, significant memory challenges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) poses significant memory challenges, as the back-propagation process demands extensive resources, especially with growing model sizes. Recent work, MeZO, addresses this issue using a zeroth-order (ZO) optimization method, which reduces memory consumption by matching the usage to the inference phase. However, MeZO experiences slow convergence due to varying curvatures across model parameters. To overcome this limitation, we introduce HELENE, a novel scalable and memory-efficient optimizer that integrates annealed A-GNB gradients with a diagonal Hessian estimation and layer-wise clipping, serving as a second-order pre-conditioner. This combination allows for faster and more stable convergence. Our theoretical analysis demonstrates that HELENE improves convergence rates, particularly for models with heterogeneous layer dimensions, by reducing the dependency on the total parameter space dimension. Instead, the method scales with the largest layer dimension, making it highly suitable for modern LLM architectures. Experimental results on RoBERTa-large and OPT-1.3B across multiple tasks show that HELENE achieves up to a 20x speedup compared to MeZO, with average accuracy improvements of 1.5%. Furthermore, HELENE remains compatible with both full parameter tuning and parameter-efficient fine-tuning (PEFT), outperforming several state-of-the-art optimizers. The codes will be released after reviewing.
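
For context, a minimal sketch of a MeZO-style zeroth-order step, the baseline HELENE builds on; the diagonal Hessian pre-conditioning and layer-wise clipping described above are not reproduced here.

```python
import torch

def zo_step(params, loss_fn, lr=1e-4, eps=1e-3):
    """One MeZO-style zeroth-order step: estimate a directional gradient from two
    forward passes along a random direction, then update in that direction."""
    z = [torch.randn_like(p) for p in params]
    with torch.no_grad():
        for p, zi in zip(params, z): p.add_(eps * zi)
        loss_plus = loss_fn()
        for p, zi in zip(params, z): p.sub_(2 * eps * zi)
        loss_minus = loss_fn()
        for p, zi in zip(params, z): p.add_(eps * zi)        # restore parameters
        g = (loss_plus - loss_minus) / (2 * eps)              # projected gradient estimate
        for p, zi in zip(params, z): p.sub_(lr * g * zi)

w = torch.zeros(3)
x, y = torch.randn(32, 3), torch.randn(32)
zo_step([w], lambda: ((x @ w - y) ** 2).mean())
```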

[AI-79] DEBUG-HD: Debugging TinyML models on-device using Hyper-Dimensional computing NEURIPS2024

链接: https://arxiv.org/abs/2411.10692
作者: Nikhil P Ghanathe,Steven J E Wilton
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted at the Machine Learning for Systems Workshop at NeurIPS 2024

点击查看摘要

[AI-80] Exploring Feature-based Knowledge Distillation For Recommender System: A Frequency Perspective

链接: https://arxiv.org/abs/2411.10676
作者: Zhangchi Zhu,Wei Zhang
关键词-EN:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-81] Pluralistic Alignment Over Time NEURIPS2024

链接: https://arxiv.org/abs/2411.10654
作者: Toryn Q. Klassen,Parand A. Alamdari,Sheila A. McIlraith
关键词-EN: system makes decisions, group of stakeholders, makes decisions, evaluate how aligned, temporally extended preferences
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Pluralistic Alignment Workshop at NeurIPS 2024

点击查看摘要

Abstract:If an AI system makes decisions over time, how should we evaluate how aligned it is with a group of stakeholders (who may have conflicting values and preferences)? In this position paper, we advocate for consideration of temporal aspects including stakeholders’ changing levels of satisfaction and their possibly temporally extended preferences. We suggest how a recent approach to evaluating fairness over time could be applied to a new form of pluralistic alignment: temporal pluralism, where the AI system reflects different stakeholders’ values at different times.

[AI-82] Understanding Learning with Sliced-Wasserstein Requires Rethinking Informative Slices

链接: https://arxiv.org/abs/2411.10651
作者: Huy Tran,Yikun Bai,Ashkan Shahbazi,John R. Hershey,Soheil Kolouri
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

[AI-83] Is thermography a viable solution for detecting pressure injuries in dark skin patients?

链接: https://arxiv.org/abs/2411.10627
作者: Miriam Asare-Baiden,Kathleen Jordan,Andrew Chung,Sharon Eve Sonenblum,Joyce C. Ho
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 9 pages

点击查看摘要

[AI-84] Weak Permission is not Well-Founded, Grounded and Stable

链接: https://arxiv.org/abs/2411.10624
作者: Guido Governatori
关键词-EN:
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-85] Attraction-Repulsion Swarming: A Generalized Framework of t-SNE via Force Normalization and Tunable Interactions

链接: https://arxiv.org/abs/2411.10617
作者: Jingcheng Lu,Jeff Calder
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Classical Analysis and ODEs (math.CA); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:

点击查看摘要

[AI-86] Being Considerate as a Pathway Towards Pluralistic Alignment for Agentic AI NEURIPS2024

链接: https://arxiv.org/abs/2411.10613
作者: Parand A. Alamdari,Toryn Q. Klassen,Rodrigo Toro Icarte,Sheila A. McIlraith
关键词-EN:
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Pluralistic Alignment Workshop at NeurIPS 2024

点击查看摘要

[AI-87] AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment NEURIPS2024

链接: https://arxiv.org/abs/2411.10606
作者: Yonggan Fu,Zhongzhi Yu,Junwei Li,Jiayi Qian,Yongan Zhang,Xiangchi Yuan,Dachuan Shi,Roman Yakunin,Yingyan Celine Lin
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at NeurIPS 2024

点击查看摘要

[AI-88] Generating Energy-efficient code with LLMs

链接: https://arxiv.org/abs/2411.10599
作者: Tom Cappendijk,Pepijn de Reus,Ana Oprescu
关键词-EN:
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-89] Vision Eagle Attention: A New Lens for Advancing Image Classification

链接: https://arxiv.org/abs/2411.10564
作者: Mahmudul Hasan
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 7 pages, 2 figures, 3 tables

点击查看摘要

[AI-90] Low-Rank Optimal Transport through Factor Relaxation with Latent Coupling NEURIPS2024

链接: https://arxiv.org/abs/2411.10555
作者: Peter Halmos,Xinhao Liu,Julian Gold,Benjamin J Raphael
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 53 pages, 13 figures, NeurIPS 2024. Comments welcome!

点击查看摘要

[AI-91] Chain of Alignment: Integrating Public Will with Expert Intelligence for Language Model Alignment NEURIPS2024

链接: https://arxiv.org/abs/2411.10534
作者: Andrew Konya,Aviv Ovadya,Kevin Feng,Quan Ze Chen,Lisa Schirch,Colin Irwin,Amy X. Zhang
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Pluralistic Alignment Workshop at NeurIPS 2024

点击查看摘要

[AI-92] USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting

链接: https://arxiv.org/abs/2411.10504
作者: Kang Chen,Jiyuan Zhang,Zecheng Hao,Yajing Zheng,Tiejun Huang,Zhaofei Yu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-93] Edge-Only Universal Adversarial Attacks in Distributed Learning

链接: https://arxiv.org/abs/2411.10500
作者: Giulio Rossolini,Tommaso Baldi,Alessandro Biondi,Giorgio Buttazzo
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-94] Guided Learning: Lubricating End-to-End Modeling for Multi-stage Decision-making

链接: https://arxiv.org/abs/2411.10496
作者: Jian Guo,Saizhuo Wang,Yiyan Qi
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

[AI-95] AI-Spectra: A Visual Dashboard for Model Multiplicity to Enhance Informed and Transparent Decision-Making

链接: https://arxiv.org/abs/2411.10490
作者: Gilles Eerlings,Sebe Vanbrabant,Jori Liesenborgs,Gustavo Rovelo Ruiz,Davy Vanacken,Kris Luyten
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Accepted for publication in an LNCS Volume “Engineering Interactive Computer Systems - EICS 2024 - International Workshops and Doctoral Consortium, Selected Papers”

点击查看摘要

[AI-96] Biometrics in Extended Reality: A Review

链接: https://arxiv.org/abs/2411.10489
作者: Ayush Agarwal,Raghavendra Ramachandra,Sushma Venkatesh,S. R. Mahadeva Prasanna
关键词-EN:
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-97] The Future of Skill: What Is It to Be Skilled at Work?

链接: https://arxiv.org/abs/2411.10488
作者: Axel Niklasson,Sean Rintel,Stephann Makri,Alex Taylor
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

[AI-98] Artificial Intelligence for Infectious Disease Prediction and Prevention: A Comprehensive Review

链接: https://arxiv.org/abs/2411.10486
作者: Selestine Melchane,Youssef Elmir,Farid Kacimi,Larbi Boubchir
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Populations and Evolution (q-bio.PE)
*备注: 31 pages, 5 figures, this manuscript has been accepted for publication in ACTA UNIVERSITATIS SAPIENTIAE, Informatica

点击查看摘要

[AI-99] Large Language Models for Constructing and Optimizing Machine Learning Workflows: A Survey

链接: https://arxiv.org/abs/2411.10478
作者: Yang Gu,Hengyu You,Jian Cao,Muran Yu
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-100] Beyond object identification: How train drivers evaluate the risk of collision

链接: https://arxiv.org/abs/2411.10475
作者: Romy Müller,Judith Schmidt
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

[AI-101] Detecting Student Disengagement in Online Classes Using Deep Learning: A Review

链接: https://arxiv.org/abs/2411.10464
作者: Ahmed Mohamed,Mostafa Ali,Shahd Ahmed,Nouran Hani,Mohammed Hisham,Meram Mahmoud
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-102] Unexploited Information Value in Human-AI Collaboration

链接: https://arxiv.org/abs/2411.10463
作者: Ziyang Guo,Yifan Wu,Jason Hartline,Jessica Hullman
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

[AI-103] Utilizing Human Behavior Modeling to Manipulate Explanations in AI-Assisted Decision Making: The Good, the Bad, and the Scary NEURIPS2024

链接: https://arxiv.org/abs/2411.10461
作者: Zhuoyan Li,Ming Yin
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024

点击查看摘要

[AI-104] Biotic Browser: Applying StreamingLLM as a Persistent Web Browsing Co-Pilot

链接: https://arxiv.org/abs/2411.10454
作者: Kevin F. Dunnell,Andrew P. Stoddard
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Written December 2023

点击查看摘要

[AI-105] Towards Geometry-Preserving Reductions Between Constraint Satisfaction Problems (and other problems in NP)

链接: https://arxiv.org/abs/2411.10453
作者: Gabriel Istrate
关键词-EN:
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
*备注: In Proceedings FROM 2024, arXiv:2410.23020 . An extended version is under preparation and will also be posted on arXiv

点击查看摘要

[AI-106] Love in Action: Gamifying Public Video Cameras for Fostering Social Relationships in Real World

链接: https://arxiv.org/abs/2411.10449
作者: Zhang Zhang,Da Li,Geng Wu,Yaoning Li,Xiaobing Sun,Liang Wang
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: accepted as a main track paper by EAI-ArtsIT 2024

点击查看摘要

[AI-107] Goetterfunke: Creativity in Machinae Sapiens. About the Qualitative Shift in Generative AI with a Focus of Text-To-Image

链接: https://arxiv.org/abs/2411.10448
作者: Jens Knappe
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 3 figures (images), 33 pages

点击查看摘要

[AI-108] Backdoor Attack Against Vision Transformers via Attention Gradient-Based Image Erosion

链接: https://arxiv.org/abs/2410.22678
作者: Ji Guo,Hongwei Li,Wenbo Jiang,Guoming Lu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Accepted by IEEE GLOBECOM 2024

点击查看摘要

[AI-109] Backdoor Attacks against Image-to-Image Networks

链接: https://arxiv.org/abs/2407.10445
作者: Wenbo Jiang,Hongwei Li,Jiaming He,Rui Zhang,Guowen Xu,Tianwei Zhang,Rongxing Lu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

[AI-110] Character-level Tokenizations as Powerful Inductive Biases for RNA Foundational Models

链接: https://arxiv.org/abs/2411.11808
作者: Adrián Morales-Pastor,Raquel Vázquez-Reza,Miłosz Wieczór,Clàudia Valverde,Manel Gil-Sorribes,Bertran Miquel-Oliver,Álvaro Ciudad,Alexis Molina
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: First version. Work in progress

点击查看摘要

[AI-111] Edge-Enhanced Dilated Residual Attention Network for Multimodal Medical Image Fusion

链接: https://arxiv.org/abs/2411.11799
作者: Meng Zhou,Yuxuan Zhang,Xiaolan Xu,Jiayi Wang,Farzad Khalvati
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: An extended version of the paper accepted at IEEE BIBM 2024

点击查看摘要

[AI-112] Exploring adversarial robustness of JPEG AI: methodology comparison and new methods

链接: https://arxiv.org/abs/2411.11795
作者: Egor Kovalev,Georgii Bychkov,Khaled Abud,Aleksandr Gushchin,Anna Chistyakova,Sergey Lavrushkin,Dmitriy Vatolin,Anastasia Antsiferova
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-113] Hybrid Data-Driven SSM for Interpretable and Label-Free mmWave Channel Prediction

链接: https://arxiv.org/abs/2411.11576
作者: Yiyong Sun,Jiajun He,Zhidi Lin,Wenqiang Pu,Feng Yin,Hing Cheung So
关键词-EN:
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

[AI-114] HistoEncoder: a digital pathology foundation model for prostate cancer

链接: https://arxiv.org/abs/2411.11458
作者: Joona Pohjonen,Abderrahim-Oussama Batouche,Antti Rannikko,Kevin Sandeman,Andrew Erickson,Esa Pitkanen,Tuomas Mirtti
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-115] Continuous K-space Recovery Network with Image Guidance for Fast MRI Reconstruction

链接: https://arxiv.org/abs/2411.11282
作者: Yucong Meng,Zhiwei Yang,Minghong Duan,Yonghong Shi,Zhijian Song
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[AI-116] MpoxVLM: A Vision-Language Model for Diagnosing Skin Lesions from Mpox Virus Infection ML4H2024

链接: https://arxiv.org/abs/2411.10888
作者: Xu Cao,Wenqian Ye,Kenny Moise,Megan Coffee
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ML4H 2024

点击查看摘要

[AI-117] A Novel Adaptive Hybrid Focal-Entropy Loss for Enhancing Diabetic Retinopathy Detection Using Convolutional Neural Networks

链接: https://arxiv.org/abs/2411.10843
作者: Pandiyaraju V,Santhosh Malarvannan,Shravan Venkatraman,Abeshek A,Priyadarshini B,Kannan A
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages,7 figures

点击查看摘要

[AI-118] MRI Parameter Mapping via Gaussian Mixture VAE: Breaking the Assumption of Independent Pixels NEURIPS2024

链接: https://arxiv.org/abs/2411.10772
作者: Moucheng Xu,Yukun Zhou,Tobias Goodwin-Allcock,Kimia Firoozabadi,Joseph Jacob,Daniel C. Alexander,Paddy J. Slator
关键词-EN:
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2024 Workshop in Machine Learning and the Physical Sciences

点击查看摘要

[AI-119] Digital-Analog Quantum Machine Learning

链接: https://arxiv.org/abs/2411.10744
作者: Lucas Lamata
关键词-EN:
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*备注: Invited Perspective for Advanced Intelligent Discovery

点击查看摘要

[AI-120] A minimalistic representation model for head direction system NEURIPS2024

链接: https://arxiv.org/abs/2411.10596
作者: Minglu Zhao,Dehong Xu,Deqian Kong,Wen-Hao Zhang,Ying Nian Wu
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: Workshop on Symmetry and Geometry in Neural Representations (NeurReps) at NeurIPS 2024, Extended Abstract Track

点击查看摘要

[AI-121] Pragmatic information of aesthetic appraisal

链接: https://arxiv.org/abs/2411.10561
作者: Peter beim Graben
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
*备注: 10 pages, 3 figures

点击查看摘要

[AI-122] Dataset Refinement for Improving the Generalization Ability of the EEG Decoding Model

链接: https://arxiv.org/abs/2411.10450
作者: Sung-Jin Kim,Dae-Hyeok Lee,Hyeon-Taek Han
关键词-EN: brain-computer interfaces due, understanding human intentions, EEG, human intentions, characteristics and convenience
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 4 pages, 1 figure, conference

点击查看摘要

Abstract:Electroencephalography (EEG) is a generally used neuroimaging approach in brain-computer interfaces due to its non-invasive characteristics and convenience, making it an effective tool for understanding human intentions. Therefore, recent research has focused on decoding human intentions from EEG signals utilizing deep learning methods. However, since EEG signals are highly susceptible to noise during acquisition, there is a high possibility of the existence of noisy data in the dataset. Although pioneer studies have generally assumed that the dataset is well-curated, this assumption is not always met in the EEG dataset. In this paper, we addressed this issue by designing a dataset refinement algorithm that can eliminate noisy data based on metrics evaluating data influence during the training process. We applied the proposed algorithm to two motor imagery EEG public datasets and three different models to perform dataset refinement. The results indicated that retraining the model with the refined dataset consistently led to better generalization performance compared to using the original dataset. Hence, we demonstrated that removing noisy data from the training dataset alone can effectively improve the generalization performance of deep learning models in the EEG domain.
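The abstract does not spell out the influence metric, so the sketch below only illustrates the general shape of such a refinement step: record a per-sample statistic during training (here simply the mean loss, which is an assumption, not the paper's metric) and drop the worst-scoring trials before retraining.

```python
import numpy as np

def refine_dataset(per_sample_losses, keep_ratio=0.9):
    """Toy refinement rule: treat EEG trials whose training loss stays
    persistently high as likely noisy and drop them."""
    influence_proxy = per_sample_losses.mean(axis=0)      # (n_samples,)
    cutoff = np.quantile(influence_proxy, keep_ratio)
    return np.where(influence_proxy <= cutoff)[0]         # indices of retained trials

# per_sample_losses: (n_epochs, n_samples) loss history recorded during training
losses = np.random.rand(20, 1000)
losses[:, :50] += 2.0          # simulate 50 noisy trials with persistently high loss
kept = refine_dataset(losses)
print(len(kept))               # ~900 trials kept; the simulated noisy ones are removed
```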

计算机视觉

[CV-0] UniHands: Unifying Various Wild-Collected Keypoints for Personalized Hand Reconstruction

链接: https://arxiv.org/abs/2411.11845
作者: Menghe Zhang,Joonyeoup Kim,Yangwen Liang,Shuangquan Wang,Kee-Bong Song
关键词-EN: Accurate hand motion, Accurate hand, hand motion capture, Accurate, hand-related tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Accurate hand motion capture and standardized 3D representation are essential for various hand-related tasks. Collecting keypoints-only data, while efficient and cost-effective, results in low-fidelity representations and lacks surface information. Furthermore, data inconsistencies across sources challenge their integration and use. We present UniHands, a novel method for creating standardized yet personalized hand models from wild-collected keypoints from diverse sources. Unlike existing neural implicit representation methods, UniHands uses the widely-adopted parametric models MANO and NIMBLE, providing a more scalable and versatile solution. It also derives unified hand joints from the meshes, which facilitates seamless integration into various hand-related tasks. Experiments on the FreiHAND and InterHand2.6M datasets demonstrate its ability to precisely reconstruct hand mesh vertices and keypoints, effectively capturing high-degree articulation motions. Empirical studies involving nine participants show a clear preference for our unified joints over existing configurations for accuracy and naturalism (p-value 0.016).

[CV-1] Generative World Explorer

链接: https://arxiv.org/abs/2411.11844
作者: Taiming Lu,Tianmin Shu,Alan Yuille,Daniel Khashabi,Jieneng Chen
关键词-EN: Planning with partial, textit, Generative World Explorer, world, Genex
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Website: this http URL

点击查看摘要

Abstract:Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. In contrast, humans can imagine unseen parts of the world through a mental exploration and revise their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions, without necessitating the physical exploration of the world at all times. To achieve this human-like ability, we introduce the Generative World Explorer (Genex), an egocentric world exploration framework that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train Genex, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) Genex can generate high-quality and consistent observations during long-horizon exploration of a large virtual physical world and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.

[CV-2] RoboGSim: A Real2Sim2Real Robotic Gaussian Splatting Simulator

链接: https://arxiv.org/abs/2411.11839
作者: Xinhai Li,Jialin Li,Ziheng Zhang,Rui Zhang,Fan Jia,Tiancai Wang,Haoqiang Fan,Kuo-Kun Tseng,Ruiping Wang
关键词-EN: increasingly critical, Efficient acquisition, Digital Twins Builder, real-world embodied data, efficient manner
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Efficient acquisition of real-world embodied data has been increasingly critical. However, large-scale demonstrations captured by remote operation tend to take extremely high costs and fail to scale up the data size in an efficient manner. Sampling the episodes under a simulated environment is a promising way for large-scale collection while existing simulators fail to high-fidelity modeling on texture and physics. To address these limitations, we introduce the RoboGSim, a real2sim2real robotic simulator, powered by 3D Gaussian Splatting and the physics engine. RoboGSim mainly includes four parts: Gaussian Reconstructor, Digital Twins Builder, Scene Composer, and Interactive Engine. It can synthesize the simulated data with novel views, objects, trajectories, and scenes. RoboGSim also provides an online, reproducible, and safe evaluation for different manipulation policies. The real2sim and sim2real transfer experiments show a high consistency in the texture and physics. Moreover, the effectiveness of synthetic data is validated under the real-world manipulated tasks. We hope RoboGSim serves as a closed-loop simulator for fair comparison on policy learning. More information can be found on our project page: this https URL.

[CV-3] Revitalizing Electoral Trust: Enhancing Transparency and Efficiency through Automated Voter Counting with Machine Learning

链接: https://arxiv.org/abs/2411.11740
作者: Mir Faris,Syeda Aynul Karim,Md. Juniadul Islam
关键词-EN: advanced image processing, image processing techniques, automated voter counting, manual vote counting, order to address
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 Pages, 4 Figures

点击查看摘要

Abstract:In order to address issues with manual vote counting during election procedures, this study intends to examine the viability of using advanced image processing techniques for automated voter counting. The study aims to shed light on how automated systems that utilize cutting-edge technologies like OpenCV, CVZone, and the MOG2 algorithm could greatly increase the effectiveness and openness of electoral operations. The empirical findings demonstrate how automated voter counting can enhance voting processes and rebuild public confidence in election outcomes, particularly in places where trust is low. The study also emphasizes how rigorous metrics, such as the F1 score, should be used to systematically compare the accuracy of automated systems against manual counting methods. This methodology enables a detailed comprehension of the differences in performance between automated and human counting techniques by providing a nuanced assessment. The incorporation of said measures serves to reinforce an extensive assessment structure, guaranteeing the legitimacy and dependability of automated voting systems inside the electoral sphere.
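Because the abstract names OpenCV and the MOG2 background-subtraction algorithm, the hedged sketch below shows what such a counting pipeline could look like. The video path, area threshold, and counting rule are assumptions, not details taken from the paper; a real system would also track blobs across frames (e.g., with CVZone) and count line crossings.

```python
import cv2

cap = cv2.VideoCapture("polling_station.mp4")   # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
peak_movers = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)                     # MOG2 foreground mask
    fg_mask = cv2.medianBlur(fg_mask, 5)                  # suppress speckle noise
    _, fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)  # drop shadow pixels
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    movers = [c for c in contours if cv2.contourArea(c) > 1500]       # area threshold is a guess
    peak_movers = max(peak_movers, len(movers))
cap.release()
print("peak simultaneous moving blobs:", peak_movers)
```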

[CV-4] Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

链接: https://arxiv.org/abs/2411.11727
作者: Ziyi Zhang,Li Shen,Sen Zhang,Deheng Ye,Yong Luo,Miaojing Shi,Bo Du,Dacheng Tao
关键词-EN: Aligning diffusion models, Aligning diffusion, few-step diffusion models, diffusion models, few-step diffusion
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Aligning diffusion models with downstream objectives is essential for their practical applications. However, standard alignment methods often struggle with step generalization when directly applied to few-step diffusion models, leading to inconsistent performance across different denoising step scenarios. To address this, we introduce Stepwise Diffusion Policy Optimization (SDPO), a novel alignment method tailored for few-step diffusion models. Unlike prior approaches that rely on a single sparse reward from only the final step of each denoising trajectory for trajectory-level optimization, SDPO incorporates dense reward feedback at every intermediate step. By learning the differences in dense rewards between paired samples, SDPO facilitates stepwise optimization of few-step diffusion models, ensuring consistent alignment across all denoising steps. To promote stable and efficient training, SDPO introduces an online reinforcement learning framework featuring several novel strategies designed to effectively exploit the stepwise granularity of dense rewards. Experimental results demonstrate that SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations, underscoring its robust step generalization capabilities. Code is available at this https URL.

[CV-5] RAWMamba: Unified sRGB-to-RAW De-rendering With State Space Model

链接: https://arxiv.org/abs/2411.11717
作者: Hongjun Chen,Wencheng Han,Huan Zheng,Jianbing Shen
关键词-EN: increasingly emphasized metadata-driven, emphasized metadata-driven approaches, Recent advancements, partial RAW information, supplemented by partial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in sRGB-to-RAW de-rendering have increasingly emphasized metadata-driven approaches to reconstruct RAW data from sRGB images, supplemented by partial RAW information. In image-based de-rendering, metadata is commonly obtained through sampling, whereas in video tasks, it is typically derived from the initial frame. The distinct metadata requirements necessitate specialized network architectures, leading to architectural incompatibilities that increase deployment complexity. In this paper, we propose RAWMamba, a Mamba-based unified framework developed for sRGB-to-RAW de-rendering across both image and video domains. The core of RAWMamba is the Unified Metadata Embedding (UME) module, which harmonizes diverse metadata types into a unified representation. In detail, a multi-perspective affinity modeling method is proposed to promote the extraction of reference information. In addition, we introduce the Local Tone-Aware Mamba (LTA-Mamba) module, which captures long-range dependencies to enable effective global propagation of metadata. Experimental results demonstrate that the proposed RAWMamba achieves state-of-the-art performance, yielding high-quality RAW data reconstruction.

[CV-6] From Spectra to Geography: Intelligent Mapping of RRUFF Mineral Data

链接: https://arxiv.org/abs/2411.11693
作者: Francesco Pappone,Federico Califano,Marco Tafani
关键词-EN: Accurately determining, applications in geology, material science, determining the geographic, geographic origin
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Accurately determining the geographic origin of mineral samples is pivotal for applications in geology, mineralogy, and material science. Leveraging the comprehensive Raman spectral data from the RRUFF database, this study introduces a novel machine learning framework aimed at geolocating mineral specimens at the country level. We employ a one-dimensional ConvNeXt1D neural network architecture to classify mineral spectra based solely on their spectral signatures. The processed dataset comprises over 32,900 mineral samples, predominantly natural, spanning 101 countries. Through five-fold cross-validation, the ConvNeXt1D model achieved an impressive average classification accuracy of 93%, demonstrating its efficacy in capturing geospatial patterns inherent in Raman spectra.
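The paper's ConvNeXt1D is not reproduced here; the sketch below is only a minimal 1D-CNN stand-in showing how a Raman spectrum (a 1xL signal) could be mapped to one of the 101 country classes mentioned in the abstract. The layer sizes and the spectrum length are assumptions.

```python
import torch
import torch.nn as nn

class Spectrum1DClassifier(nn.Module):
    """Toy 1D CNN over Raman spectra; a stand-in for the paper's ConvNeXt1D."""
    def __init__(self, n_countries=101, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3), nn.GELU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=7, padding=3), nn.GELU(),
            nn.AdaptiveAvgPool1d(1),           # pool over the wavenumber axis
        )
        self.head = nn.Linear(64, n_countries)

    def forward(self, x):                      # x: (batch, 1, spectrum_len)
        return self.head(self.features(x).squeeze(-1))

model = Spectrum1DClassifier()
logits = model(torch.randn(8, 1, 1024))        # 8 dummy spectra of length 1024
print(logits.shape)                            # torch.Size([8, 101])
```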

[CV-7] Towards Degradation-Robust Reconstruction in Generalizable NeRF

链接: https://arxiv.org/abs/2411.11691
作者: Chan Ho Park,Ka Leong Cheng,Zhicheng Wang,Qifeng Chen
关键词-EN: Neural Radiance Field, Generalizable Neural Radiance, Radiance Field, Neural Radiance, avoid per-scene optimization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Generalizable Neural Radiance Field (GNeRF) across scenes has been proven to be an effective way to avoid per-scene optimization by representing a scene with deep image features of source images. However, despite its potential for real-world applications, there has been limited research on the robustness of GNeRFs to different types of degradation present in the source images. The lack of such research is primarily attributed to the absence of a large-scale dataset fit for training a degradation-robust generalizable NeRF model. To address this gap and facilitate investigations into the degradation robustness of 3D reconstruction tasks, we construct the Objaverse Blur Dataset, comprising 50,000 images from over 1000 settings featuring multiple levels of blur degradation. In addition, we design a simple and model-agnostic module for enhancing the degradation robustness of GNeRFs. Specifically, by extracting 3D-aware features through a lightweight depth estimator and denoiser, the proposed module shows improvement on different popular methods in GNeRFs in terms of both quantitative and visual quality over varying degradation types and levels. Our dataset and code will be made publicly available.

[CV-8] FERT: Real-Time Facial Expression Recognition with Short-Range FMCW Radar

链接: https://arxiv.org/abs/2411.11619
作者: Sabri Mustafa Kahya,Muhammet Sami Yavuz,Eckehard Steinbach
关键词-EN: short-range Frequency-Modulated Continuous-Wave, utilizing short-range Frequency-Modulated, Frequency-Modulated Continuous-Wave, study proposes, images
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted at IEEE SENSORS 2024

点击查看摘要

Abstract:This study proposes a novel approach for real-time facial expression recognition utilizing short-range Frequency-Modulated Continuous-Wave (FMCW) radar equipped with one transmit (Tx), and three receive (Rx) antennas. The system leverages four distinct modalities simultaneously: Range-Doppler images (RDIs), micro range-Doppler Images (micro-RDIs), range azimuth images (RAIs), and range elevation images (REIs). Our innovative architecture integrates feature extractor blocks, intermediate feature extractor blocks, and a ResNet block to accurately classify facial expressions into smile, anger, neutral, and no-face classes. Our model achieves an average classification accuracy of 98.91% on the dataset collected using a 60 GHz short-range FMCW radar. The proposed solution operates in real-time in a person-independent manner, which shows the potential use of low-cost FMCW radars for effective facial expression recognition in various applications.

[CV-9] Leveraging Computational Pathology AI for Noninvasive Optical Imaging Analysis Without Retraining

链接: https://arxiv.org/abs/2411.11613
作者: Danny Barash,Emilie Manning,Aidan Van Vleck,Omri Hirsch,Kyi Lei Aye,Jingxi Li,Philip O. Scumpia,Aydogan Ozcan,Sumaira Aasi,Kerri E. Rieger,Kavita Y. Sarin,Oren Freifeld,Yonatan Winetraub
关键词-EN: clinically relevant data, time generate gigabytes, generate gigabytes, gigabytes of clinically, clinically relevant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Noninvasive optical imaging modalities can probe patient’s tissue in 3D and over time generate gigabytes of clinically relevant data per sample. There is a need for AI models to analyze this data and assist clinical workflow. The lack of expert labelers and the large dataset required (100,000 images) for model training and tuning are the main hurdles in creating foundation models. In this paper we introduce FoundationShift, a method to apply any AI model from computational pathology without retraining. We show our method is more accurate than state of the art models (SAM, MedSAM, SAM-Med2D, CellProfiler, Hover-Net, PLIP, UNI and ChatGPT), with multiple imaging modalities (OCT and RCM). This is achieved without the need for model retraining or fine-tuning. Applying our method to noninvasive in vivo images could enable physicians to readily incorporate optical imaging modalities into their clinical practice, providing real time tissue analysis and improving patient care.

[CV-10] MSSIDD: A Benchmark for Multi-Sensor Denoising

链接: https://arxiv.org/abs/2411.11562
作者: Shibin Mei,Hang Wang,Bingbing Ni
关键词-EN: remains sufficient exploration, mobile terminals employ, domain denoising models, denoising models, raw domain denoising
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 15 pages,7 figures

点击查看摘要

Abstract:The cameras equipped on mobile terminals employ different sensors in different photograph modes, and the transferability of raw domain denoising models between these sensors is significant but remains sufficient exploration. Industrial solutions either develop distinct training strategies and models for different sensors or ignore the differences between sensors and simply extend existing models to new sensors, which leads to tedious training or unsatisfactory performance. In this paper, we introduce a new benchmark, the Multi-Sensor SIDD (MSSIDD) dataset, which is the first raw-domain dataset designed to evaluate the sensor transferability of denoising models. The MSSIDD dataset consists of 60,000 raw images of six distinct sensors, derived through the degeneration of sRGB images via different camera sensor parameters. Furthermore, we propose a sensor consistency training framework that enables denoising models to learn the sensor-invariant features, thereby facilitating the generalization of the consistent model to unseen sensors. We evaluate previous arts on the newly proposed MSSIDD dataset, and the experimental results validate the effectiveness of our proposed method. Our dataset is available at this https URL.

[CV-11] Reliable Poisoned Sample Detection against Backdoor Attacks Enhanced by Sharpness Aware Minimization

链接: https://arxiv.org/abs/2411.11525
作者: Mingda Zhang,Mingli Zhu,Zihao Zhu,Baoyuan Wu
关键词-EN: deep neural networks, backdoor attacks, Backdoor, neural networks, detection performance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Backdoor attack has been considered as a serious security threat to deep neural networks (DNNs). Poisoned sample detection (PSD) that aims at filtering out poisoned samples from an untrustworthy training dataset has shown very promising performance for defending against data poisoning based backdoor attacks. However, we observe that the detection performance of many advanced methods is likely to be unstable when facing weak backdoor attacks, such as low poisoning ratio or weak trigger strength. To further verify this observation, we make a statistical investigation among various backdoor attacks and poisoned sample detections, showing a positive correlation between backdoor effect and detection performance. It inspires us to strengthen the backdoor effect to enhance detection performance. Since we cannot achieve that goal via directly manipulating poisoning ratio or trigger strength, we propose to train one model using the Sharpness-Aware Minimization (SAM) algorithm, rather than the vanilla training algorithm. We also provide both empirical and theoretical analysis about how SAM training strengthens the backdoor effect. Then, this SAM trained model can be seamlessly integrated with any off-the-shelf PSD method that extracts discriminative features from the trained model for detection, called SAM-enhanced PSD. Extensive experiments on several benchmark datasets show the reliable detection performance of the proposed method against both weak and strong backdoor attacks, with significant improvements against various attacks (+34.38% TPR on average), over the conventional PSD methods (i.e., without SAM enhancement). Overall, this work provides new insights about PSD and proposes a novel approach that can complement existing detection methods, which may inspire more in-depth explorations in this field.
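Sharpness-Aware Minimization itself is a published, generic procedure (take the gradient, ascend to a nearby worst-case weight perturbation, then descend with the gradient computed there). The sketch below shows one SAM step in PyTorch under that standard formulation; it is not the paper's training code, and the model, optimizer, and rho value are placeholders.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_opt, rho=0.05):
    """One generic Sharpness-Aware Minimization step."""
    # 1) gradient at the current weights
    loss_fn(model(inputs), targets).backward()
    grads = [None if p.grad is None else p.grad.detach().clone()
             for p in model.parameters()]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads if g is not None]))
    # 2) ascend to the nearby "sharp" point: w <- w + rho * g / ||g||
    eps = []
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            e = None if g is None else rho * g / (grad_norm + 1e-12)
            if e is not None:
                p.add_(e)
            eps.append(e)
    model.zero_grad()
    # 3) gradient at the perturbed weights, then restore and take the real step
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()

# toy usage
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
sam_step(model, torch.nn.functional.cross_entropy, x, y, opt)
```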

[CV-12] Cascaded Diffusion Models for 2D and 3D Microscopy Image Synthesis to Enhance Cell Segmentation

链接: https://arxiv.org/abs/2411.11515
作者: Rüveyda Yilmaz,Kaan Keven,Yuli Wu,Johannes Stegmaier
关键词-EN: Automated cell segmentation, Automated cell, biomedical research, prone to error, essential for biomedical
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automated cell segmentation in microscopy images is essential for biomedical research, yet conventional methods are labor-intensive and prone to error. While deep learning-based approaches have proven effective, they often require large annotated datasets, which are scarce due to the challenges of manual annotation. To overcome this, we propose a novel framework for synthesizing densely annotated 2D and 3D cell microscopy images using cascaded diffusion models. Our method synthesizes 2D and 3D cell masks from sparse 2D annotations using multi-level diffusion models and NeuS, a 3D surface reconstruction approach. Following that, a pretrained 2D Stable Diffusion model is finetuned to generate realistic cell textures and the final outputs are combined to form cell populations. We show that training a segmentation model with a combination of our synthetic data and real data improves cell segmentation performance by up to 9% across multiple datasets. Additionally, the FID scores indicate that the synthetic data closely resembles real data. The code for our proposed approach will be available at this https URL_diffusion.

[CV-13] Learning a Neural Association Network for Self-supervised Multi-Object Tracking

链接: https://arxiv.org/abs/2411.11514
作者: Shuai Li,Michael Burke,Subramanian Ramamoorthy,Juergen Gall
关键词-EN: paper introduces, learn data association, neural network, multi-object tracking, neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces a novel framework to learn data association for multi-object tracking in a self-supervised manner. Fully-supervised learning methods are known to achieve excellent tracking performances, but acquiring identity-level annotations is tedious and time-consuming. Motivated by the fact that in real-world scenarios object motion can be usually represented by a Markov process, we present a novel expectation maximization (EM) algorithm that trains a neural network to associate detections for tracking, without requiring prior knowledge of their temporal correspondences. At the core of our method lies a neural Kalman filter, with an observation model conditioned on associations of detections parameterized by a neural network. Given a batch of frames as input, data associations between detections from adjacent frames are predicted by a neural network followed by a Sinkhorn normalization that determines the assignment probabilities of detections to states. Kalman smoothing is then used to obtain the marginal probability of observations given the inferred states, producing a training objective to maximize this marginal probability using gradient descent. The proposed framework is fully differentiable, allowing the underlying neural model to be trained end-to-end. We evaluate our approach on the challenging MOT17 and MOT20 datasets and achieve state-of-the-art results in comparison to self-supervised trackers using public detections. We furthermore demonstrate the capability of the learned model to generalize across datasets.
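To make the "Sinkhorn normalization that determines the assignment probabilities" concrete, here is a small log-space Sinkhorn routine that turns a detection-affinity matrix into an approximately doubly stochastic assignment matrix. The temperature and iteration count are assumptions rather than the paper's settings.

```python
import torch

def sinkhorn(scores, n_iters=20, temperature=0.1):
    """Alternately normalize rows and columns in log space so that each
    detection's assignment probabilities sum to (approximately) one."""
    log_p = scores / temperature
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # columns
    return log_p.exp()

affinity = torch.randn(5, 5)        # e.g., network-predicted affinities between detections
P = sinkhorn(affinity)
print(P.sum(dim=1), P.sum(dim=0))   # both close to vectors of ones
```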

[CV-14] SignEye: Traffic Sign Interpretation from Vehicle First-Person View

链接: https://arxiv.org/abs/2411.11507
作者: Chuang Yang,Xu Han,Tao Han,Yuejiao SU,Junyu Gao,Hongyuan Zhang,Yi Wang,Lap-Pui Chau
关键词-EN: providing navigation instructions, autonomous driving systems, assisting autonomous driving, driving systems, Traffic signs play
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Traffic signs play a key role in assisting autonomous driving systems (ADS) by enabling the assessment of vehicle behavior in compliance with traffic regulations and providing navigation instructions. However, current works are limited to basic sign understanding without considering the egocentric vehicle’s spatial position, which fails to support further regulation assessment and direction navigation. Following the above issues, we introduce a new task: traffic sign interpretation from the vehicle’s first-person view, referred to as TSI-FPV. Meanwhile, we develop a traffic guidance assistant (TGA) scenario application to re-explore the role of traffic signs in ADS as a complement to popular autonomous technologies (such as obstacle perception). Notably, TGA is not a replacement for electronic map navigation; rather, TGA can be an automatic tool for updating it and complementing it in situations such as offline conditions or temporary sign adjustments. Lastly, a spatial and semantic logic-aware stepwise reasoning pipeline (SignEye) is constructed to achieve the TSI-FPV and TGA, and an application-specific dataset (Traffic-CN) is built. Experiments show that TSI-FPV and TGA are achievable via our SignEye trained on Traffic-CN. The results also demonstrate that the TGA can provide complementary information to ADS beyond existing popular autonomous technologies.

[CV-15] LaVin-DiT: Large Vision Diffusion Transformer

链接: https://arxiv.org/abs/2411.11505
作者: Zhaoqing Wang,Xiaobo Xia,Runnan Chen,Dongdong Yu,Changhu Wang,Mingming Gong,Tongliang Liu
关键词-EN: Large Vision, Vision, paper presents, designed to tackle, Large Vision Diffusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 7 figures, 2 tables

点击查看摘要

Abstract:This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. Unlike existing large vision models directly adapted from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, in-context learning is implemented. Input-target pairs serve as task context, which guides the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set and test data as queries allow LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models, underscoring the promising potential of diffusion transformers. The code and models will be open-sourced.

[CV-16] Look a Group at Once: Multi-Slide Modeling for Survival Prediction

链接: https://arxiv.org/abs/2411.11487
作者: Xinyang Li,Yi Zhang,Yi Xie,Jianfei Yang,Xi Wang,Hao Chen,Haixian Zhang
关键词-EN: task in pathology, critical task, Survival prediction, Cancer Genome Atlas, clinical practice
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Survival prediction is a critical task in pathology. In clinical practice, pathologists often examine multiple cases, leveraging a broader spectrum of cancer phenotypes to enhance pathological assessment. Despite significant advancements in deep learning, current solutions typically model each slide as a sample, struggling to effectively capture comparable and slide-agnostic pathological features. In this paper, we introduce GroupMIL, a novel framework inspired by the clinical practice of collective analysis, which models multiple slides as a single sample and organizes groups of patches and slides sequentially to capture cross-slide prognostic features. We also present GPAMamba, a model designed to facilitate intra- and inter-slide feature interactions, effectively capturing local micro-environmental characteristics within slide-level graphs while uncovering essential prognostic patterns across an extended patch sequence within the group framework. Furthermore, we develop a dual-head predictor that delivers comprehensive survival risk and probability assessments for each patient. Extensive empirical evaluations demonstrate that our model significantly outperforms state-of-the-art approaches across five datasets from The Cancer Genome Atlas.

[CV-17] Exploring Emerging Trends and Research Opportunities in Visual Place Recognition ICRA

链接: https://arxiv.org/abs/2411.11481
作者: Antonios Gasteratos,Konstantinos A. Tsintotas,Tobias Fischer,Yiannis Aloimonos,Michael Milford
关键词-EN: image classification, Visual-based recognition, robotics communities, long-standing challenge, object detection
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 2 pages, 1 figure. 40th Anniversary of the IEEE Conference on Robotics and Automation (ICRA@40), Rotterdam, Netherlands, September 23-26, 2024

点击查看摘要

Abstract:Visual-based recognition, e.g., image classification, object detection, etc., is a long-standing challenge in computer vision and robotics communities. Concerning the roboticists, since the knowledge of the environment is a prerequisite for complex navigation tasks, visual place recognition is vital for most localization implementations or re-localization and loop closure detection pipelines within simultaneous localization and mapping (SLAM). More specifically, it corresponds to the system’s ability to identify and match a previously visited location using computer vision tools. Towards developing novel techniques with enhanced accuracy and robustness, while motivated by the success presented in natural language processing methods, researchers have recently turned their attention to vision-language models, which integrate visual and textual data.

[CV-18] SL-YOLO: A Stronger and Lighter Drone Target Detection Model

链接: https://arxiv.org/abs/2411.11477
作者: Defan Chen,Luchan Zhang
关键词-EN: daunting challenge due, Detecting small objects, Detecting small, small target detection, complex scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Detecting small objects in complex scenes, such as those captured by drones, is a daunting challenge due to the difficulty in capturing the complex features of small targets. While the YOLO family has achieved great success in large target detection, its performance is less than satisfactory when faced with small targets. Because of this, this paper proposes a revolutionary model SL-YOLO (Stronger and Lighter YOLO) that aims to break the bottleneck of small target detection. We propose the Hierarchical Extended Path Aggregation Network (HEPAN), a pioneering cross-scale feature fusion method that can ensure unparalleled detection accuracy even in the most challenging environments. At the same time, without sacrificing detection capabilities, we design the C2fDCB lightweight module and add the SCDown downsampling module to greatly reduce the model’s parameters and computational complexity. Our experimental results on the VisDrone2019 dataset reveal a significant improvement in performance, with mAP@0.5 jumping from 43.0% to 46.9% and mAP@0.5:0.95 increasing from 26.0% to 28.9%. At the same time, the model parameters are reduced from 11.1M to 9.6M, and the FPS can reach 132, making it an ideal solution for real-time small object detection in resource-constrained environments.

[CV-19] MVLight: Relightable Text-to-3D Generation via Light-conditioned Multi-View Diffusion

链接: https://arxiv.org/abs/2411.11475
作者: Dongseok Shim,Yichun Shi,Kejie Li,H. Jin Kim,Peng Wang
关键词-EN: Recent advancements, success of high-performance, richly textured, objects from textual, textual descriptions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in text-to-3D generation, building on the success of high-performance text-to-image generative models, have made it possible to create imaginative and richly textured 3D objects from textual descriptions. However, a key challenge remains in effectively decoupling light-independent and lighting-dependent components to enhance the quality of generated 3D models and their relighting performance. In this paper, we present MVLight, a novel light-conditioned multi-view diffusion model that explicitly integrates lighting conditions directly into the generation process. This enables the model to synthesize high-quality images that faithfully reflect the specified lighting environment across multiple camera views. By leveraging this capability to Score Distillation Sampling (SDS), we can effectively synthesize 3D models with improved geometric precision and relighting capabilities. We validate the effectiveness of MVLight through extensive experiments and a user study.

[CV-20] Generalizable Person Re-identification via Balancing Alignment and Uniformity NEURIPS2024

链接: https://arxiv.org/abs/2411.11471
作者: Yoonki Cho,Jaeyoon Kim,Woo Jae Kim,Junsik Jung,Sung-eui Yoon
关键词-EN: generalizable person re-identification, Domain generalizable person, learn discriminative representations, person re-identification, aims to learn
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Domain generalizable person re-identification (DG re-ID) aims to learn discriminative representations that are robust to distributional shifts. While data augmentation is a straightforward solution to improve generalization, certain augmentations exhibit a polarized effect in this task, enhancing in-distribution performance while deteriorating out-of-distribution performance. In this paper, we investigate this phenomenon and reveal that it leads to sparse representation spaces with reduced uniformity. To address this issue, we propose a novel framework, Balancing Alignment and Uniformity (BAU), which effectively mitigates this effect by maintaining a balance between alignment and uniformity. Specifically, BAU incorporates alignment and uniformity losses applied to both original and augmented images and integrates a weighting strategy to assess the reliability of augmented samples, further improving the alignment loss. Additionally, we introduce a domain-specific uniformity loss that promotes uniformity within each source domain, thereby enhancing the learning of domain-invariant features. Extensive experimental results demonstrate that BAU effectively exploits the advantages of data augmentation, which previous studies could not fully utilize, and achieves state-of-the-art performance without requiring complex training procedures. The code is available at this https URL.
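The alignment and uniformity losses that BAU balances follow a standard contrastive-representation formulation: positive pairs (an image and its augmentation) are pulled together, while all embeddings are encouraged to spread over the unit hypersphere. A minimal sketch of those two base losses is below; BAU's reliability weighting and domain-specific uniformity term are not reproduced here.

```python
import torch
import torch.nn.functional as F

def alignment_loss(z1, z2, alpha=2):
    """Alignment: embeddings of an image and its augmented view should be close."""
    return (z1 - z2).norm(dim=1).pow(alpha).mean()

def uniformity_loss(z, t=2):
    """Uniformity: embeddings should spread uniformly on the unit hypersphere."""
    sq_dists = torch.pdist(z, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()

# toy usage with L2-normalized embeddings of original / augmented images
z_orig = F.normalize(torch.randn(128, 256), dim=1)
z_aug = F.normalize(z_orig + 0.05 * torch.randn(128, 256), dim=1)
loss = alignment_loss(z_orig, z_aug) + uniformity_loss(z_orig)
print(loss.item())
```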

[CV-21] MGNiceNet: Unified Monocular Geometric Scene Understanding ACCV2024

链接: https://arxiv.org/abs/2411.11466
作者: Markus Schön,Michael Buchholz,Klaus Dietmayer
关键词-EN: geometric scene understanding, scene understanding combines, Monocular geometric scene, self-supervised depth estimation, panoptic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for ACCV 2024

点击查看摘要

Abstract:Monocular geometric scene understanding combines panoptic segmentation and self-supervised depth estimation, focusing on real-time application in autonomous vehicles. We introduce MGNiceNet, a unified approach that uses a linked kernel formulation for panoptic segmentation and self-supervised depth estimation. MGNiceNet is based on the state-of-the-art real-time panoptic segmentation method RT-K-Net and extends the architecture to cover both panoptic segmentation and self-supervised monocular depth estimation. To this end, we introduce a tightly coupled self-supervised depth estimation predictor that explicitly uses information from the panoptic path for depth prediction. Furthermore, we introduce a panoptic-guided motion masking method to improve depth estimation without relying on video panoptic segmentation annotations. We evaluate our method on two popular autonomous driving datasets, Cityscapes and KITTI. Our model shows state-of-the-art results compared to other real-time methods and closes the gap to computationally more demanding methods. Source code and trained models are available at this https URL.

[CV-22] The ADUULM-360 Dataset – A Multi-Modal Dataset for Depth Estimation in Adverse Weather ITSC

链接: https://arxiv.org/abs/2411.11455
作者: Markus Schön,Jona Ruof,Thomas Wodtko,Michael Buchholz,Klaus Dietmayer
关键词-EN: rich semantic information, semantic information captured, Depth estimation, Depth, projection of rich
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 2024 IEEE International Conference on Intelligent Transportation Systems (ITSC)

点击查看摘要

Abstract:Depth estimation is an essential task toward full scene understanding since it allows the projection of rich semantic information captured by cameras into 3D space. While the field has gained much attention recently, datasets for depth estimation lack scene diversity or sensor modalities. This work presents the ADUULM-360 dataset, a novel multi-modal dataset for depth estimation. The ADUULM-360 dataset covers all established autonomous driving sensor modalities, cameras, lidars, and radars. It covers a frontal-facing stereo setup, six surround cameras covering the full 360-degree, two high-resolution long-range lidar sensors, and five long-range radar sensors. It is also the first depth estimation dataset that contains diverse scenes in good and adverse weather conditions. We conduct extensive experiments using state-of-the-art self-supervised depth estimation methods under different training tasks, such as monocular training, stereo training, and full surround training. Discussing these results, we demonstrate common limitations of state-of-the-art methods, especially in adverse weather conditions, which hopefully will inspire future research in this area. Our dataset, development kit, and trained baselines are available at this https URL.

[CV-23] Relevance-guided Audio Visual Fusion for Video Saliency Prediction

链接: https://arxiv.org/abs/2411.11454
作者: Li Yu,Xuanzhe Sun,Pan Gao,Moncef Gabbouj
关键词-EN: audience visual attention, audio-visual saliency prediction, saliency prediction, visual features, video saliency prediction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Audio data, often synchronized with video frames, plays a crucial role in guiding the audience’s visual attention. Incorporating audio information into video saliency prediction tasks can enhance the prediction of human visual behavior. However, existing audio-visual saliency prediction methods often directly fuse audio and visual features, which ignore the possibility of inconsistency between the two modalities, such as when the audio serves as background music. To address this issue, we propose a novel relevance-guided audio-visual saliency prediction network dubbed AVRSP. Specifically, the Relevance-guided Audio-Visual feature Fusion module (RAVF) dynamically adjusts the retention of audio features based on the semantic relevance between audio and visual elements, thereby refining the integration process with visual features. Furthermore, the Multi-scale feature Synergy (MS) module integrates visual features from different encoding stages, enhancing the network’s ability to represent objects at various scales. The Multi-scale Regulator Gate (MRG) could transfer crucial fusion information to visual features, thus optimizing the utilization of multi-scale visual features. Extensive experiments on six audio-visual eye movement datasets have demonstrated that our AVRSP network achieves competitive performance in audio-visual saliency prediction.
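As a rough illustration of what "adjusting the retention of audio features based on semantic relevance" could mean, the toy function below gates the audio contribution by its cosine similarity to the visual features. This is an assumption-level simplification, not the paper's RAVF module.

```python
import torch
import torch.nn.functional as F

def relevance_gated_fusion(visual_feat, audio_feat):
    """Weight audio features by their similarity to the visual features,
    so irrelevant audio (e.g., background music) contributes little."""
    v = F.normalize(visual_feat, dim=-1)
    a = F.normalize(audio_feat, dim=-1)
    relevance = (v * a).sum(dim=-1, keepdim=True).clamp(min=0)  # cosine, clipped to [0, 1]
    return visual_feat + relevance * audio_feat

vis = torch.randn(4, 512)   # per-frame visual features (shapes are assumptions)
aud = torch.randn(4, 512)   # temporally aligned audio features
print(relevance_gated_fusion(vis, aud).shape)  # torch.Size([4, 512])
```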

[CV-24] GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts

链接: https://arxiv.org/abs/2411.11435
作者: Junwen He,Yifan Wang,Lijun Wang,Huchuan Lu,Jun-Yan He,Chenyang Li,Hanyuan Chen,Jin-Peng Lan,Bin Luo,Yifeng Geng
关键词-EN: design heavily relies, arranging element layouts, professional designers, important procedures, logo design heavily
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text logo design heavily relies on the creativity and expertise of professional designers, in which arranging element layouts is one of the most important procedures. However, little attention has been paid to this specific task, which needs to take precise textural details and user constraints into consideration; prior work focuses only on broader tasks such as document/poster layout generation. In this paper, we propose a VLM-based framework that generates content-aware text logo layouts by integrating multi-modal inputs with user constraints, supporting a more flexible and stable layout design in real-world applications. We introduce two model techniques to reduce the computation for processing multiple glyph images simultaneously without performance degradation. To support instruction-tuning of our model, we construct two extensive text logo datasets, which are 5x larger than the existing public dataset. Besides the geometric annotations (e.g., text masks and character recognition), we also complement them with comprehensive layout descriptions in natural language format, enabling more effective training of reasoning ability when dealing with complex layouts and custom user constraints. Experimental studies demonstrate the effectiveness of our proposed model and datasets when compared with previous methods on various benchmarks evaluating geometric aesthetics and human preferences. The code and datasets will be publicly available.

[CV-25] Towards fast DBSCAN via Spectrum-Preserving Data Compression

链接: https://arxiv.org/abs/2411.11421
作者: Yongyu Wang
关键词-EN: significantly accelerate DBSCAN, employing spectral data, paper introduces, significantly accelerate, accelerate DBSCAN
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces a novel method to significantly accelerate DBSCAN by employing spectral data compression. The proposed approach reduces the size of the data set by a factor of five while preserving the essential clustering characteristics through an innovative spectral compression technique. This enables DBSCAN to run substantially faster without any loss of accuracy. Experiments on real-world data sets, such as USPS, demonstrate the method’s capability to achieve this dramatic reduction in data size while maintaining clustering performance.
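
下面是该流程(先压缩为代表点、在代表点上跑 DBSCAN、再把标签映射回原始数据)的示意代码。注意:论文使用的是谱保持压缩,这里为便于演示用 KMeans 质心作为压缩步骤的替代,压缩倍数与参数均为假设。

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def compressed_dbscan(X, reduction_factor=5, eps=0.5, min_samples=5):
    """Sketch: run DBSCAN on a reduced point set, then propagate labels back.

    The real method uses spectrum-preserving compression; KMeans centroids
    are only a stand-in to illustrate the overall pipeline.
    """
    n_reps = max(len(X) // reduction_factor, 1)
    km = KMeans(n_clusters=n_reps, n_init=10, random_state=0).fit(X)
    reps = km.cluster_centers_                      # compressed data set
    rep_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reps)
    return rep_labels[km.labels_]                   # label of each original point

X = np.random.rand(5000, 2)
labels = compressed_dbscan(X)
print(labels.shape)  # (5000,)
```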

[CV-26] Stacking Brick by Brick: Aligned Feature Isolation for Incremental Face Forgery Detection

链接: https://arxiv.org/abs/2411.11396
作者: Jikang Cheng,Zhiyuan Yan,Ying Zhang,Li Hao,Jiaxin Ai,Qin Zou,Chen Li,Zhongyuan Wang
关键词-EN: face forgery techniques, Face Forgery Detection, Incremental Face Forgery, face forgery, rapid advancement
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The rapid advancement of face forgery techniques has introduced a growing variety of forgeries. Incremental Face Forgery Detection (IFFD), involving gradually adding new forgery data to fine-tune the previously trained model, has been introduced as a promising strategy to deal with evolving forgery methods. However, a naively trained IFFD model is prone to catastrophic forgetting when new forgeries are integrated, as treating all forgeries as a single "Fake" class in the Real/Fake classification can cause different forgery types overriding one another, thereby resulting in the forgetting of unique characteristics from earlier tasks and limiting the model’s effectiveness in learning forgery specificity and generality. In this paper, we propose to stack the latent feature distributions of previous and new tasks brick by brick, i.e., achieving aligned feature isolation. In this manner, we aim to preserve learned forgery information and accumulate new knowledge by minimizing distribution overriding, thereby mitigating catastrophic forgetting. To achieve this, we first introduce Sparse Uniform Replay (SUR) to obtain the representative subsets that could be treated as the uniformly sparse versions of the previous global distributions. We then propose a Latent-space Incremental Detector (LID) that leverages SUR data to isolate and align distributions. For evaluation, we construct a more advanced and comprehensive benchmark tailored for IFFD. The leading experimental results validate the superiority of our method.

[CV-27] LeC2O-NeRF: Learning Continuous and Compact Large-Scale Occupancy for Urban Scenes

链接: https://arxiv.org/abs/2411.11374
作者: Zhenxing Mi,Dan Xu
关键词-EN: occupancy, occupancy network, network, critical problem, unoccupied
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 13 pages

点击查看摘要

Abstract:In NeRF, a critical problem is to effectively estimate the occupancy to guide empty-space skipping and point sampling. Grid-based methods work well for small-scale scenes. However, on large-scale scenes, they are limited by predefined bounding boxes, grid resolutions, and high memory usage for grid updates, and thus struggle to speed up training for large-scale, irregularly bounded and complex urban scenes without sacrificing accuracy. In this paper, we propose to learn a continuous and compact large-scale occupancy network, which can classify 3D points as occupied or unoccupied points. We train this occupancy network end-to-end together with the radiance field in a self-supervised manner by three designs. First, we propose a novel imbalanced occupancy loss to regularize the occupancy network. It makes the occupancy network effectively control the ratio of unoccupied and occupied points, motivated by the prior that most of 3D scene points are unoccupied. Second, we design an imbalanced architecture containing a large scene network and a small empty space network to separately encode occupied and unoccupied points classified by the occupancy network. This imbalanced structure can effectively model the imbalanced nature of occupied and unoccupied regions. Third, we design an explicit density loss to guide the occupancy network, making the density of unoccupied points smaller. As far as we know, we are the first to learn a continuous and compact occupancy of large-scale NeRF by a network. In our experiments, our occupancy network can quickly learn more compact, accurate and smooth occupancy compared to the occupancy grid. With our learned occupancy as guidance for empty space skipping on challenging large-scale benchmarks, our method consistently obtains higher accuracy compared to the occupancy grid, and our method can speed up state-of-the-art NeRF methods without sacrificing accuracy.
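
下面是"非平衡占据损失"这一思想的示意性草图:将预测的占据比例正则化到一个很小的先验值。目标比例与具体形式均为假设,并非论文给出的原始定义。

```python
import torch

def imbalanced_occupancy_loss(occ_logits, target_occupied_ratio=0.1):
    """Sketch: push the mean predicted occupancy toward a small prior ratio,
    reflecting the prior that most 3D points in a scene are unoccupied."""
    occ_prob = torch.sigmoid(occ_logits)          # per-point occupancy probability
    predicted_ratio = occ_prob.mean()
    return (predicted_ratio - target_occupied_ratio) ** 2

logits = torch.randn(10000)                        # occupancy logits of sampled 3D points
print(imbalanced_occupancy_loss(logits))
```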

[CV-28] TL-CLIP: A Power-specific Multimodal Pre-trained Visual Foundation Model for Transmission Line Defect Recognition

链接: https://arxiv.org/abs/2411.11370
作者: Ke Zhang,Zhaoye Zheng,Yurong Guo,Jiacun Wang,Jiyuan Yang,Yangjie Xiao
关键词-EN: Transmission line defect, line defect recognition, Transmission line, general pre-trained weights, line defect
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Transmission line defect recognition models have traditionally used general pre-trained weights as the initial basis for their training. These models often suffer from weak generalization capability due to the lack of domain knowledge in the pre-training dataset. To address this issue, we propose a two-stage transmission-line-oriented contrastive language-image pre-training (TL-CLIP) framework, which lays a more effective foundation for transmission line defect recognition. The pre-training process employs a novel power-specific multimodal algorithm assisted by two power-specific pre-training tasks for better modeling the power-related semantic knowledge contained in the inspection data. To fine-tune the pre-trained model, we develop a transfer learning strategy, namely fine-tuning with pre-training objective (FTP), to alleviate the overfitting problem caused by limited inspection data. Experimental results demonstrate that the proposed method significantly improves the performance of transmission line defect recognition in both classification and detection tasks, indicating clear advantages over traditional pre-trained models in the scene of transmission line inspection.

[CV-29] GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views CVPR2024

链接: https://arxiv.org/abs/2411.11363
作者: Boyao Zhou,Shunyuan Zheng,Hanzhang Tu,Ruizhi Shao,Boning Liu,Shengping Zhang,Liqiang Nie,Yebin Liu
关键词-EN: recently shown promising, shown promising results, free-viewpoint video synthesis, Gaussian Splatting, generalizable Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Journal extension of CVPR 2024,Project page: this https URL

点击查看摘要

Abstract:Differentiable rendering techniques have recently shown promising results for free-viewpoint video synthesis of characters. However, such methods, either Gaussian Splatting or neural implicit rendering, typically necessitate per-subject optimization which does not meet the requirement of real-time rendering in an interactive application. We propose a generalizable Gaussian Splatting approach for high-resolution image rendering under a sparse-view camera setting. To this end, we introduce Gaussian parameter maps defined on the source views and directly regress Gaussian properties for instant novel view synthesis without any fine-tuning or optimization. We train our Gaussian parameter regression module on human-only data or human-scene data, jointly with a depth estimation module to lift 2D parameter maps to 3D space. The proposed framework is fully differentiable with both depth and rendering supervision or with only rendering supervision. We further introduce a regularization term and an epipolar attention mechanism to preserve geometry consistency between two source views, especially when neglecting depth supervision. Experiments on several datasets demonstrate that our method outperforms state-of-the-art methods while achieving a superior rendering speed.

[CV-30] Scalable Autoregressive Monocular Depth Estimation

链接: https://arxiv.org/abs/2411.11361
作者: Jinhong Wang,Jian Liu,Dongqi Tang,Weiqiang Wang,Wentong Li,Danny Chen,Jintai Chen,Jian Wu
关键词-EN: monocular depth estimator, depth, paper proposes, autoregressive, DAR
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper proposes a new autoregressive model as an effective and scalable monocular depth estimator. Our idea is simple: We tackle the monocular depth estimation (MDE) task with an autoregressive prediction paradigm, based on two core designs. First, our depth autoregressive model (DAR) treats the depth map of different resolutions as a set of tokens, and conducts the low-to-high resolution autoregressive objective with a patch-wise causal mask. Second, our DAR recursively discretizes the entire depth range into more compact intervals, and attains the coarse-to-fine granularity autoregressive objective in an ordinal-regression manner. By coupling these two autoregressive objectives, our DAR establishes new state-of-the-art (SOTA) on KITTI and NYU Depth v2 by clear margins. Further, our scalable approach allows us to scale the model up to 2.0B and achieve the best RMSE of 1.799 on the KITTI dataset (5% improvement) compared to 1.896 by the current SOTA (Depth Anything). DAR further showcases zero-shot generalization ability on unseen datasets. These results suggest that DAR yields superior performance with an autoregressive prediction paradigm, providing a promising approach to equip modern autoregressive large models (e.g., GPT-4o) with depth estimation capabilities.
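
摘要中的"序数回归式深度离散化"可以用下面的小例子直观说明:把深度范围划分为若干区间,对每个区间预测"真实深度是否超过该分界"。这里的分箱数量与均匀划分只是演示用的假设,并未复现论文的递归细化过程。

```python
import torch

def ordinal_depth_targets(depth, d_min=0.5, d_max=80.0, num_bins=64):
    """Sketch: convert a metric depth map into ordinal bin targets.

    target[..., k] = 1 if depth > edge_k, so predicting every bin with
    binary cross-entropy yields an ordinal-regression style objective.
    """
    edges = torch.linspace(d_min, d_max, num_bins)          # bin edges over the depth range
    return (depth.unsqueeze(-1) > edges).float()            # (H, W, num_bins)

depth = torch.rand(192, 640) * 80.0
targets = ordinal_depth_targets(depth)
print(targets.shape)  # torch.Size([192, 640, 64])
```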

[CV-31] CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset

链接: https://arxiv.org/abs/2411.11360
作者: Zhiming Wang,Mingze Wang,Sheng Xu,Yanjing Li,Baochang Zhang
关键词-EN: multi-temporal remote sensing, Remote Sensing Image, Image Change Captioning, Sensing Image Change, Remote Sensing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Remote Sensing Image Change Captioning (RSICC) aims to generate natural language descriptions of surface changes between multi-temporal remote sensing images, detailing the categories, locations, and dynamics of changed objects (e.g., additions or disappearances). Many current methods attempt to leverage the long-sequence understanding and reasoning capabilities of multimodal large language models (MLLMs) for this task. However, without comprehensive data support, these approaches often alter the essential feature transmission pathways of MLLMs, disrupting the intrinsic knowledge within the models and limiting their potential in RSICC. In this paper, we propose a novel model, CCExpert, based on a new, advanced multimodal large model framework. Firstly, we design a difference-aware integration module to capture multi-scale differences between bi-temporal images and incorporate them into the original image context, thereby enhancing the signal-to-noise ratio of differential features. Secondly, we constructed a high-quality, diversified dataset called CC-Foundation, containing 200,000 image pairs and 1.2 million captions, to provide substantial data support for continued pretraining in this domain. Lastly, we employed a three-stage progressive training process to ensure the deep integration of the difference-aware integration module with the pretrained MLLM. CCExpert achieved a notable performance of S^*_m = 81.80 on the LEVIR-CC benchmark, significantly surpassing previous state-of-the-art methods. The code and part of the dataset will soon be open-sourced at this https URL.

[CV-32] Text-guided Zero-Shot Object Localization

链接: https://arxiv.org/abs/2411.11357
作者: Jingjing Wang,Xinglin Piao,Zongzhi Gao,Bo Li,Yong Zhang,Baocai Yin
关键词-EN: computer vision area, Object localization, vision area, existing object localization, hot issue
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Object localization is a hot topic in the computer vision area, which aims to identify and determine the precise location of specific objects in an image or video. Most existing object localization methods heavily rely on extensive labeled data, which are costly to annotate and constrain their applicability. Therefore, we propose a new Zero-Shot Object Localization (ZSOL) framework for addressing the aforementioned challenges. In the proposed framework, we introduce the Contrastive Language Image Pre-training (CLIP) module which could integrate visual and linguistic information effectively. Furthermore, we design a Text Self-Similarity Matching (TSSM) module, which could improve the localization accuracy by enhancing the representation of text features extracted by the CLIP module. Hence, the proposed framework can be guided by prompt words to identify and locate specific objects in an image in the absence of labeled samples. The results of extensive experiments demonstrate that the proposed method could improve the localization performance significantly and establishes an effective benchmark for further research.

[CV-33] Superpixel-informed Implicit Neural Representation for Multi-Dimensional Data ECCV2024

链接: https://arxiv.org/abs/2411.11356
作者: Jiayi Li,Xile Zhao,Jianli Wang,Chao Wang,Min Wang
关键词-EN: implicit neural representations, attracted increasing attention, multi-dimensional data recovery, implicit neural, neural representations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ECCV 2024, 18 pages, 7 figures

点击查看摘要

Abstract:Recently, implicit neural representations (INRs) have attracted increasing attention for multi-dimensional data recovery. However, INRs simply map coordinates via a multi-layer perceptron (MLP) to corresponding values, ignoring the inherent semantic information of the data. To leverage semantic priors from the data, we propose a novel Superpixel-informed INR (S-INR). Specifically, we suggest utilizing generalized superpixel instead of pixel as an alternative basic unit of INR for multi-dimensional data (e.g., images and weather data). The coordinates of generalized superpixels are first fed into exclusive attention-based MLPs, and then the intermediate results interact with a shared dictionary matrix. The elaborately designed modules in S-INR allow us to ingeniously exploit the semantic information within and across generalized superpixels. Extensive experiments on various applications validate the effectiveness and efficacy of our S-INR compared to state-of-the-art INR methods.
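
"以超像素而非像素作为 INR 的基本单元"这一思路可以用现成的超像素算法来示意:用超像素质心坐标作为网络的查询点。下面的草图基于 SLIC 超像素和一个普通 MLP,属于假设性演示,并非论文中基于注意力的 S-INR 结构。

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.segmentation import slic
from skimage.data import astronaut

image = astronaut() / 255.0                              # (H, W, 3) test image
segments = slic(image, n_segments=200, compactness=10)   # superpixel label per pixel

# Centroid coordinate and mean colour for each superpixel: these become the
# basic units queried by the INR instead of raw pixel coordinates.
coords, colours = [], []
for s in np.unique(segments):
    ys, xs = np.nonzero(segments == s)
    coords.append([ys.mean() / image.shape[0], xs.mean() / image.shape[1]])
    colours.append(image[ys, xs].mean(axis=0))
coords = torch.tensor(coords, dtype=torch.float32)        # (S, 2) queries
colours = torch.tensor(np.array(colours), dtype=torch.float32)

mlp = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, 3))
loss = nn.functional.mse_loss(mlp(coords), colours)       # fit the INR to superpixel statistics
print(loss.item())
```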

[CV-34] Visual-Semantic Graph Matching Net for Zero-Shot Learning

链接: https://arxiv.org/abs/2411.11351
作者: Bowen Duan,Shiming Chen,Yufei Guo,Guo-Sen Xie,Weiping Ding,Yisong Wang
关键词-EN: Zero-shot learning, recognize unseen classes, Graph Matching Network, Graph Matching Net, unseen classes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Zero-shot learning (ZSL) aims to leverage additional semantic information to recognize unseen classes. To transfer knowledge from seen to unseen classes, most ZSL methods often learn a shared embedding space by simply aligning visual embeddings with semantic prototypes. However, methods trained under this paradigm often struggle to learn a robust embedding space because they align the two modalities in an isolated manner among classes, which ignores the crucial class relationship during the alignment process. To address the aforementioned challenges, this paper proposes a Visual-Semantic Graph Matching Net, termed as VSGMN, which leverages semantic relationships among classes to aid in visual-semantic embedding. VSGMN employs a Graph Build Network (GBN) and a Graph Matching Network (GMN) to achieve two-stage visual-semantic alignment. Specifically, GBN first utilizes an embedding-based approach to build visual and semantic graphs in the semantic space and align the embedding with its prototype for first-stage alignment. Additionally, to supplement unseen class relations in these graphs, GBN also builds the unseen class nodes based on semantic relationships. In the second stage, GMN continuously integrates neighbor and cross-graph information into the constructed graph nodes, and aligns the node relationships between the two graphs under the class relationship constraint. Extensive experiments on three benchmark datasets demonstrate that VSGMN achieves superior performance in both conventional and generalized ZSL scenarios. The implementation of our VSGMN and experimental results are available at github: this https URL

[CV-35] Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge

链接: https://arxiv.org/abs/2411.11343
作者: Qinglong Cao,Ding Wang,Xirui Li,Yuntian Chen,Chao Ma,Xiaokang Yang
关键词-EN: exhibited tremendous progress, physical, video generation tasks, Video diffusion models, exhibited tremendous
类目: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
*备注: 7 figures, 14 pages

点击查看摘要

Abstract:Video diffusion models have exhibited tremendous progress in various video generation tasks. However, existing models struggle to capture latent physical knowledge, failing to infer physical phenomena that are challenging to articulate with natural language. Generating videos following the fundamental physical laws is still an open challenge. To address this challenge, we propose a novel method to teach video diffusion models with latent physical phenomenon knowledge, enabling the accurate generation of physically informed phenomena. Specifically, we first pretrain Masked Autoencoders (MAE) to reconstruct the physical phenomena, resulting in output embeddings that encapsulate latent physical phenomenon knowledge. Leveraging these embeddings, we could generate the pseudo-language prompt features based on the aligned spatial relationships between CLIP vision and language encoders. Particularly, given that diffusion models typically use CLIP’s language encoder for text prompt embeddings, our approach integrates the CLIP visual features informed by latent physical knowledge into a quaternion hidden space. This enables the modeling of spatial relationships to produce physical knowledge-informed pseudo-language prompts. By incorporating these prompt features and fine-tuning the video diffusion model in a parameter-efficient manner, the physical knowledge-informed videos are successfully generated. We validate our method extensively through both numerical simulations and real-world observations of physical phenomena, demonstrating its remarkable performance across diverse scenarios.

[CV-36] Video-to-Task Learning via Motion-Guided Attention for Few-Shot Action Recognition

链接: https://arxiv.org/abs/2411.11335
作者: Hanyu Guo,Wanchuan Yu,Suzhou Que,Kaiwen Du,Yan Yan,Hanzi Wang
关键词-EN: achieved remarkable performance, spatio-temporal relationships, Motion-Guided Attention, task level, level
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, few-shot action recognition has achieved remarkable performance through spatio-temporal relation modeling. Although a wide range of spatial and temporal alignment modules have been proposed, they primarily address spatial or temporal misalignments at the video level, while the spatio-temporal relationships across different videos at the task level remain underexplored. Recent studies utilize class prototypes to learn task-specific features but overlook the spatio-temporal relationships across different videos at the task level, especially in the spatial dimension, where these relationships provide rich information. In this paper, we propose a novel Dual Motion-Guided Attention Learning method (called DMGAL) for few-shot action recognition, aiming to learn the spatio-temporal relationships from the video-specific to the task-specific level. To achieve this, we propose a carefully designed Motion-Guided Attention (MGA) method to identify and correlate motion-related region features from the video level to the task level. Specifically, the Self Motion-Guided Attention module (S-MGA) achieves spatio-temporal relation modeling at the video level by identifying and correlating motion-related region features between different frames within a video. The Cross Motion-Guided Attention module (C-MGA) identifies and correlates motion-related region features between frames of different videos within a specific task to achieve spatio-temporal relationships at the task level. This approach enables the model to construct class prototypes that fully incorporate spatio-temporal relationships from the video-specific level to the task-specific level. We validate the effectiveness of our DMGAL method by employing both fully fine-tuning and adapter-tuning paradigms. The models developed using these paradigms are termed DMGAL-FT and DMGAL-Adapter, respectively.

[CV-37] Color-Oriented Redundancy Reduction in Dataset Distillation NEURIPS2024

链接: https://arxiv.org/abs/2411.11329
作者: Bowen Yuan,Zijian Wang,Yadan Luo,Mahsa Baktashmotlagh,Yadan Luo,Zi Huang
关键词-EN: generate condensed representations, enhancing training efficiency, Dataset Distillation, designed to generate, generate condensed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Dataset Distillation (DD) is designed to generate condensed representations of extensive image datasets, enhancing training efficiency. Despite recent advances, there remains considerable potential for improvement, particularly in addressing the notable redundancy within the color space of distilled images. In this paper, we propose AutoPalette, a framework that minimizes color redundancy at the individual image and overall dataset levels, respectively. At the image level, we employ a palette network, a specialized neural network, to dynamically allocate colors from a reduced color space to each pixel. The palette network identifies essential areas in synthetic images for model training and consequently assigns more unique colors to them. At the dataset level, we develop a color-guided initialization strategy to minimize redundancy among images. Representative images with the least replicated color patterns are selected based on the information gain. A comprehensive performance study involving various datasets and evaluation scenarios is conducted, demonstrating the superior performance of our proposed color-aware DD compared to existing DD methods. The code is available at this https URL.
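
"为每张图像分配一个缩减调色板"的直觉,可以用经典的颜色量化来近似说明:把每个像素映射到一个小调色板中最近的颜色。下面的草图用 KMeans 代替论文中的调色板网络,调色板大小为假设值,仅作示意。

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.data import astronaut

def quantize_to_palette(image, palette_size=16, sample=20000):
    """Sketch: reduce an image to a small colour palette, a rough analogue of
    per-image colour redundancy reduction (the real method learns the palette
    with a neural network and an information-gain criterion)."""
    pixels = image.reshape(-1, 3).astype(np.float32)
    idx = np.random.choice(len(pixels), size=min(sample, len(pixels)), replace=False)
    km = KMeans(n_clusters=palette_size, n_init=10, random_state=0).fit(pixels[idx])
    palette = km.cluster_centers_
    return palette[km.predict(pixels)].reshape(image.shape)

img = astronaut()
quantized = quantize_to_palette(img, palette_size=16)
print(quantized.shape, len(np.unique(quantized.reshape(-1, 3), axis=0)))  # at most 16 colours
```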

[CV-38] Performance Evaluation of Geospatial Images based on Zarr and Tiff

链接: https://arxiv.org/abs/2411.11291
作者: Jaheer Khan,Swarup E,Rakshit Ramesh
关键词-EN: Tagged Image File, Zarr and TIFF, Image File Format, Traditional Tagged Image, distinct data storage
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This work evaluates the performance of geospatial image processing using two distinct data storage formats: Zarr and TIFF. Geospatial images feed numerous applications such as environmental monitoring, urban planning, and disaster management. The traditional Tagged Image File Format (TIFF) is widely used because it is simple and compatible, but it can suffer from performance limitations when working with large datasets. Zarr is a newer format designed for cloud systems that offers scalability and efficient storage through data chunking and compression techniques. This study compares the two formats in terms of storage efficiency, access speed, and computational performance during typical geospatial processing tasks. Through analysis on a range of geospatial datasets, it details the practical advantages and limitations of each format, helping users select the appropriate format for their specific needs and constraints.
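
这类对比实验的基本套路可以用下面的最小基准示意:把同一个栅格分别写成 TIFF 与分块 Zarr,再计时窗口读取。文件名、数组尺寸与分块大小均为演示用的假设。

```python
import time
import numpy as np
import tifffile
import zarr

data = (np.random.rand(8192, 8192) * 255).astype(np.uint8)   # synthetic geospatial raster

# Write the same raster in both formats.
tifffile.imwrite("scene.tif", data)
z = zarr.open("scene.zarr", mode="w", shape=data.shape, chunks=(1024, 1024), dtype=data.dtype)
z[:] = data

def timed(fn):
    t0 = time.perf_counter(); fn(); return time.perf_counter() - t0

# Windowed read: Zarr only touches the chunks overlapping the window,
# while the plain TIFF read here loads the full image first.
print("TIFF window :", timed(lambda: tifffile.imread("scene.tif")[:1024, :1024]))
print("Zarr window :", timed(lambda: zarr.open("scene.zarr", mode="r")[:1024, :1024]))
```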

[CV-39] Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition

链接: https://arxiv.org/abs/2411.11288
作者: Yang Chen,Jingcai Guo,Song Guo,Dacheng Tao
关键词-EN: Zero-shot skeleton action, Zero-shot skeleton, skeleton action recognition, requires robust unseen, robust unseen generalization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:Zero-shot skeleton action recognition is a non-trivial task that requires robust unseen generalization with prior knowledge from only seen classes and shared semantics. Existing methods typically build the skeleton-semantics interactions by uncontrollable mappings and conspicuous representations, and thereby can hardly capture the intricate and fine-grained relationship for effective cross-modal transferability. To address these issues, we propose a novel dyNamically Evolving dUal skeleton-semantic syneRgistic framework with the guidance of cOntext-aware side informatioN (dubbed Neuron), to explore more fine-grained cross-modal correspondence from micro to macro perspectives at both spatial and temporal levels, respectively. Concretely, 1) we first construct the spatial-temporal evolving micro-prototypes and integrate dynamic context-aware side information to capture the intricate and synergistic skeleton-semantic correlations step-by-step, progressively refining cross-modal alignment; and 2) we introduce the spatial compression and temporal memory mechanisms to guide the growth of spatial-temporal micro-prototypes, enabling them to absorb structure-related spatial representations and regularity-dependent temporal patterns. Notably, such processes are analogous to the learning and growth of neurons, equipping the framework with the capacity to generalize to novel unseen action categories. Extensive experiments on various benchmark datasets demonstrated the superiority of the proposed method.

[CV-40] Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets Techniques and Applications

链接: https://arxiv.org/abs/2411.11287
作者: Scarlett Raine,Frederic Maire,Niko Suenderhauf,Tobias Fischer
关键词-EN: informing management strategies, coral reef health, blue carbon stocks, estimating blue carbon, monitoring coral reef
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 70 pages, 20 figures

点击查看摘要

Abstract:Underwater surveys provide long-term data for informing management strategies, monitoring coral reef health, and estimating blue carbon stocks. Advances in broad-scale survey methods, such as robotic underwater vehicles, have increased the range of marine surveys but generate large volumes of imagery requiring analysis. Computer vision methods such as semantic segmentation aid automated image analysis, but typically rely on fully supervised training with extensive labelled data. While ground truth label masks for tasks like street scene segmentation can be quickly and affordably generated by non-experts through crowdsourcing services like Amazon Mechanical Turk, ecology presents greater challenges. The complexity of underwater images, coupled with the specialist expertise needed to accurately identify species at the pixel level, makes this process costly, time-consuming, and heavily dependent on domain experts. In recent years, some works have performed automated analysis of underwater imagery, and a smaller number of studies have focused on weakly supervised approaches which aim to reduce the expert-provided labelled data required. This survey focuses on approaches which reduce dependency on human expert input, while reviewing the prior and related approaches to position these works in the wider field of underwater perception. Further, we offer an overview of coastal ecosystems and the challenges of underwater imagery. We provide background on weakly and self-supervised deep learning and integrate these elements into a taxonomy that centres on the intersection of underwater monitoring, computer vision, and deep learning, while motivating approaches for weakly supervised deep learning with reduced dependency on domain expert data annotations. Lastly, the survey examines available datasets and platforms, and identifies gaps, barriers, and opportunities for automating underwater surveys.

[CV-41] Towards Open-Vocabulary Audio-Visual Event Localization

链接: https://arxiv.org/abs/2411.11278
作者: Jinxing Zhou,Dan Guo,Ruohao Guo,Yuxin Mao,Jingjing Hu,Yiran Zhong,Xiaojun Chang,Meng Wang
关键词-EN: Audio-Visual Event Localization, Event Localization, classify video events, audible and visible, aims to temporally
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Project page: this https URL

点击查看摘要

Abstract:The Audio-Visual Event Localization (AVEL) task aims to temporally locate and classify video events that are both audible and visible. Most research in this field assumes a closed-set setting, which restricts these models’ ability to handle test data containing event categories absent (unseen) during training. Recently, a few studies have explored AVEL in an open-set setting, enabling the recognition of unseen events as "unknown", but without providing category-specific semantics. In this paper, we advance the field by introducing the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) problem, which requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. To address this new task, we propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes (seen:unseen = 46:21), each with manual segment-level annotation. We also establish three evaluation metrics for this task. Moreover, we investigate two baseline approaches, one training-free and one using a further fine-tuning paradigm. Specifically, we utilize the unified multimodal space from the pretrained ImageBind model to extract audio, visual, and textual (event classes) features. The training-free baseline then determines predictions by comparing the consistency of audio-text and visual-text feature similarities. The fine-tuning baseline incorporates lightweight temporal layers to encode temporal relations within the audio and visual modalities, using OV-AVEBench training data for model fine-tuning. We evaluate these baselines on the proposed OV-AVEBench dataset and discuss potential directions for future work in this new field.
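
摘要中的 training-free 基线本质上是:对每个类别比较"音频-文本"与"视觉-文本"相似度的一致性。下面用通用的预提取特征(代替 ImageBind 特征)给出一个假设性的打分草图,评分方式并非论文的原始公式。

```python
import torch
import torch.nn.functional as F

def classify_segment(audio_emb, visual_emb, text_embs):
    """Sketch: score each event class by the agreement of audio-text and
    visual-text similarities; low agreement can indicate a background segment."""
    a_sim = F.cosine_similarity(audio_emb.unsqueeze(0), text_embs, dim=-1)   # (C,)
    v_sim = F.cosine_similarity(visual_emb.unsqueeze(0), text_embs, dim=-1)  # (C,)
    consistency = a_sim * v_sim            # high only when both modalities match the class
    return consistency.argmax().item(), consistency

audio_emb, visual_emb = torch.randn(1024), torch.randn(1024)
text_embs = torch.randn(67, 1024)          # one embedding per event class name
pred, scores = classify_segment(audio_emb, visual_emb, text_embs)
print(pred)
```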

[CV-42] Semantic or Covariate? A Study on the Intractable Case of Out-of-Distribution Detection

链接: https://arxiv.org/abs/2411.11254
作者: Xingming Long,Jie Zhang,Shiguang Shan,Xilin Chen
关键词-EN: post-hoc OOD detection, OOD detection methods, OOD, OOD detection, OOD samples
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: v1

点击查看摘要

Abstract:The primary goal of out-of-distribution (OOD) detection tasks is to identify inputs with semantic shifts, i.e., if samples from novel classes are absent in the in-distribution (ID) dataset used for training, we should reject these OOD samples rather than misclassifying them into existing ID classes. However, we find the current definition of “semantic shift” is ambiguous, which renders certain OOD testing protocols intractable for the post-hoc OOD detection methods based on a classifier trained on the ID dataset. In this paper, we offer a more precise definition of the Semantic Space and the Covariate Space for the ID distribution, allowing us to theoretically analyze which types of OOD distributions make the detection task intractable. To avoid the flaw in the existing OOD settings, we further define the “Tractable OOD” setting which ensures the distinguishability of OOD and ID distributions for the post-hoc OOD detection methods. Finally, we conduct several experiments to demonstrate the necessity of our definitions and validate the correctness of our theorems.

[CV-43] DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation

链接: https://arxiv.org/abs/2411.11252
作者: Tianyi Yan,Dongming Wu,Wencheng Han,Junpeng Jiang,Xia Zhou,Kun Zhan,Cheng-zhong Xu,Jianbing Shen
关键词-EN: actual road conditions, responsive feedback loops, closely replicate actual, replicate actual road, including real-world sensory
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: this https URL

点击查看摘要

Abstract:Autonomous driving evaluation requires simulation environments that closely replicate actual road conditions, including real-world sensory data and responsive feedback loops. However, many existing simulations need to predict waypoints along fixed routes on public datasets or synthetic photorealistic data, i.e., open-loop simulation usually lacks the ability to assess dynamic decision-making. While the recent efforts of closed-loop simulation offer feedback-driven environments, they cannot process visual sensor inputs or produce outputs that differ from real-world data. To address these challenges, we propose DrivingSphere, a realistic and closed-loop simulation framework. Its core idea is to build 4D world representation and generate real-life and controllable driving scenarios. In specific, our framework includes a Dynamic Environment Composition module that constructs a detailed 4D driving world with a format of occupancy equipping with static backgrounds and dynamic objects, and a Visual Scene Synthesis module that transforms this data into high-fidelity, multi-view video outputs, ensuring spatial and temporal consistency. By providing a dynamic and realistic simulation environment, DrivingSphere enables comprehensive testing and validation of autonomous driving algorithms, ultimately advancing the development of more reliable autonomous cars. The benchmark will be publicly released.

[CV-44] Noise Filtering Benchmark for Neuromorphic Satellites Observations

链接: https://arxiv.org/abs/2411.11233
作者: Sami Arja,Alexandre Marcireau,Nicholas Owen Ralph,Saeed Afshar,Gregory Cohen
关键词-EN: low power consumption, high temporal resolution, high dynamic range, offer high temporal, Space Situational Awareness
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 17 pages, 8 figures, 1 table

点击查看摘要

Abstract:Event cameras capture sparse, asynchronous brightness changes which offer high temporal resolution, high dynamic range, low power consumption, and sparse data output. These advantages make them ideal for Space Situational Awareness, particularly in detecting resident space objects moving within a telescope’s field of view. However, the output from event cameras often includes substantial background activity noise, which is known to be more prevalent in low-light conditions. This noise can overwhelm the sparse events generated by satellite signals, making detection and tracking more challenging. Existing noise-filtering algorithms struggle in these scenarios because they are typically designed for denser scenes, where losing some signal is acceptable. This limitation hinders the application of event cameras in complex, real-world environments where signals are extremely sparse. In this paper, we propose new event-driven noise-filtering algorithms specifically designed for very sparse scenes. We categorise the algorithms into logical-based and learning-based approaches and benchmark their performance against 11 state-of-the-art noise-filtering algorithms, evaluating how effectively they remove noise and hot pixels while preserving the signal. Their performance was quantified by measuring signal retention and noise removal accuracy, with results reported using ROC curves across the parameter space. Additionally, we introduce a new high-resolution satellite dataset with ground truth from a real-world platform under various noise conditions, which we have made publicly available. Code, dataset, and trained weights are available at this https URL.
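
这一领域最常见的"逻辑型"基线是背景活动滤波器:只有当邻域像素在最近的时间窗内也发生过事件时才保留当前事件。下面给出该经典滤波器的示意实现(窗口大小与时间常数为假设),它也直观说明了为什么此类滤波器在极稀疏的卫星场景中会丢失信号,这正是本文提出专用算法的动机。

```python
import numpy as np

def background_activity_filter(events, width, height, dt_us=10000):
    """Classic spatio-temporal correlation filter (sketch): keep an event only if
    some pixel in its 3x3 neighbourhood fired within the last dt_us microseconds."""
    last_ts = np.full((height, width), -np.inf)
    kept = []
    for t, x, y, p in events:                         # events assumed sorted by timestamp
        x0, x1 = max(0, x - 1), min(width, x + 2)
        y0, y1 = max(0, y - 1), min(height, y + 2)
        if (t - last_ts[y0:y1, x0:x1]).min() <= dt_us:  # a recent neighbour supports this event
            kept.append((t, x, y, p))
        last_ts[y, x] = t
    return kept

# Random (very sparse) events: almost everything is rejected, illustrating the problem.
events = [(i * 1000, np.random.randint(640), np.random.randint(480), 1) for i in range(1000)]
print(len(background_activity_filter(events, 640, 480)))
```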

[CV-45] BeautyBank: Encoding Facial Makeup in Latent Space

链接: https://arxiv.org/abs/2411.11231
作者: Qianwen Lu,Xingchao Yang,Takafumi Taketomi
关键词-EN: makeup, demonstrated their effectiveness, features, pattern features, Progressive Makeup Tuning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The advancement of makeup transfer, editing, and image encoding has demonstrated their effectiveness and superior quality. However, existing makeup works primarily focus on low-dimensional features such as color distributions and patterns, limiting their versatility across a wide range of makeup applications. Furthermore, existing high-dimensional latent encoding methods mainly target global features such as structure and style, and are less effective for tasks that require detailed attention to local color and pattern features of makeup. To overcome these limitations, we propose BeautyBank, a novel makeup encoder that disentangles pattern features of bare and makeup faces. Our method encodes makeup features into a high-dimensional space, preserving essential details necessary for makeup reconstruction and broadening the scope of potential makeup research applications. We also propose a Progressive Makeup Tuning (PMT) strategy, specifically designed to enhance the preservation of detailed makeup features while preventing the inclusion of irrelevant attributes. We further explore novel makeup applications, including facial image generation with makeup injection and makeup similarity measure. Extensive empirical experiments validate that our method offers superior task adaptability and holds significant potential for widespread application in various makeup-related fields. Furthermore, to address the lack of large-scale, high-quality paired makeup datasets in the field, we constructed the Bare-Makeup Synthesis Dataset (BMS), comprising 324,000 pairs of 512x512 pixel images of bare and makeup-enhanced faces.

[CV-46] Efficient Transfer Learning for Video-language Foundation Models

链接: https://arxiv.org/abs/2411.11223
作者: Haoxing Chen,Zizheng Huang,Yan Hong,Yanshuo Wang,Zhongcai Lyu,Zhuoer Xu,Jun Lan,Zhangxuan Gu
关键词-EN: vision-language models provide, Pre-trained vision-language models, provide a robust, robust foundation, foundation for efficient
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pre-trained vision-language models provide a robust foundation for efficient transfer learning across various downstream tasks. In the field of video action recognition, mainstream approaches often introduce additional parameter modules to capture temporal information. While the increased model capacity brought by these additional parameters helps better fit the video-specific inductive biases, existing methods require learning a large number of parameters and are prone to catastrophic forgetting of the original generalizable knowledge. In this paper, we propose a simple yet effective Multi-modal Spatio-Temporal Adapter (MSTA) to improve the alignment between representations in the text and vision branches, achieving a balance between general knowledge and task-specific knowledge. Furthermore, to mitigate over-fitting and enhance generalizability, we introduce a spatio-temporal description-guided consistency constraint. This constraint involves feeding template inputs (i.e., "a video of [cls]") into the trainable language branch, while LLM-generated spatio-temporal descriptions are input into the pre-trained language branch, enforcing consistency between the outputs of the two branches. This mechanism prevents over-fitting to downstream tasks and improves the distinguishability of the trainable branch within the spatio-temporal semantic space. We evaluate the effectiveness of our approach across four tasks: zero-shot transfer, few-shot learning, base-to-novel generalization, and fully-supervised learning. Compared to many state-of-the-art methods, our MSTA achieves outstanding performance across all evaluations, while using only 2-7% of the trainable parameters in the original model. Code will be available at this https URL.
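
摘要中的一致性约束,大意是让可训练文本分支(输入简单类别模板)与冻结文本分支(输入 LLM 生成的时空描述)的输出保持接近。下面是该损失的一个最小化示意,编码器与距离度量均为占位假设,并非 MSTA 的真实设计。

```python
import torch
import torch.nn.functional as F

def consistency_loss(trainable_text_encoder, frozen_text_encoder,
                     template_tokens, description_tokens):
    """Sketch: keep the trainable branch's output on 'a video of <cls>' consistent
    with the frozen branch's output on an LLM-generated spatio-temporal description."""
    z_trainable = trainable_text_encoder(template_tokens)       # tuned branch
    with torch.no_grad():
        z_frozen = frozen_text_encoder(description_tokens)      # pre-trained, frozen branch
    return 1.0 - F.cosine_similarity(z_trainable, z_frozen, dim=-1).mean()

# Toy stand-ins for the two encoders and their tokenized inputs.
enc_trainable = torch.nn.Linear(77, 512)
enc_frozen = torch.nn.Linear(77, 512)
loss = consistency_loss(enc_trainable, enc_frozen, torch.randn(8, 77), torch.randn(8, 77))
print(loss.item())
```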

[CV-47] The Sound of Water: Inferring Physical Properties from Pouring Liquids

链接: https://arxiv.org/abs/2411.11222
作者: Piyush Bagad,Makarand Tapaswi,Cees G. M. Snoek,Andrew Zisserman
关键词-EN: intriguing everyday activity, everyday activity, connection between audio-visual, audio-visual observations, mundane yet intriguing
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 25 pages, 17 figures. Project page at this https URL

点击查看摘要

Abstract:We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model with supervision from simulated data and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos for a systematic study; (iv) show that the trained model can indeed infer these physical properties for real data; and finally, (v) we demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work presents a keen understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning. It opens up applications to enhance multisensory perception in robotic pouring.
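
论文利用的物理原理可以用一个简短的算例说明:把液面以上的空气柱近似为一端封闭(液面)、一端开放(瓶口)的管,其基频约为 f ≈ c / (4L),所以容器越满、音调越高。下面的草图按此关系反推液面高度;圆柱形容器假设与数值均为演示用,并非论文训练的模型。

```python
# Estimate liquid level from the fundamental frequency of pouring sound,
# under a quarter-wavelength resonance approximation f ≈ c / (4 * L_air).
SPEED_OF_SOUND = 343.0  # m/s, air at ~20°C

def liquid_level_from_pitch(pitch_hz, container_height_m):
    """Sketch: air-column length from pitch, then liquid level for a
    cylindrical container (illustrative assumption)."""
    air_column = SPEED_OF_SOUND / (4.0 * pitch_hz)           # length of the resonating air column
    return max(0.0, min(container_height_m, container_height_m - air_column))

for pitch in [400.0, 800.0, 1600.0]:                          # pitch rises as the glass fills
    level = liquid_level_from_pitch(pitch, container_height_m=0.15)
    print(f"{pitch:6.0f} Hz -> level ≈ {level * 100:.1f} cm")
```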

[CV-48] Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition

链接: https://arxiv.org/abs/2411.11219
作者: Tiancheng Lin,Jinglei Zhang,Yi Xu,Kai Chen,Rui Zhang,Chang-Wen Chen
关键词-EN: achieved remarkable advancements, supervised scene text, leveraging semantic priors, scene text recognition, Context-aware methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Context-aware methods have achieved remarkable advancements in supervised scene text recognition by leveraging semantic priors from words. Considering the heterogeneity of text and background in STR, we propose that such contextual priors can be reinterpreted as the relations between textual elements, serving as effective self-supervised labels for representation learning. However, textual relations are restricted to the finite size of the dataset due to lexical dependencies, which causes an over-fitting problem, thus compromising the representation quality. To address this, our work introduces a unified framework of Relational Contrastive Learning and Masked Image Modeling for STR (RCMSTR), which explicitly models the enriched textual relations. For the RCL branch, we first introduce the relational rearrangement module to cultivate new relations on the fly. Based on this, we further conduct relational contrastive learning to model the intra- and inter-hierarchical relations for frames, sub-words and words. On the other hand, MIM can naturally boost the context information via masking, where we find that the block masking strategy is more effective for STR. For the effective integration of RCL and MIM, we also introduce a novel decoupling design aimed at mitigating the impact of masked images on contrastive learning. Additionally, to enhance the compatibility of MIM with CNNs, we propose the adoption of sparse convolutions and directly sharing the weights with dense convolutions in training. The proposed RCMSTR demonstrates superior performance in various evaluation protocols for different STR-related downstream tasks, outperforming the existing state-of-the-art self-supervised STR techniques. Ablation studies and qualitative experimental results further validate the effectiveness of our method. The code and pre-trained models will be available at this https URL.

[CV-49] DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery

链接: https://arxiv.org/abs/2411.11214
作者: Jaewoo Heo,George Hu,Zeyu Wang,Serena Yeung-Levy
关键词-EN: including motion capture, domains including motion, Human Mesh Recovery, human pose parameters, augmented reality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 5 figures, 3DV2025

点击查看摘要

Abstract:Human Mesh Recovery (HMR) is an important yet challenging problem with applications across various domains including motion capture, augmented reality, and biomechanics. Accurately predicting human pose parameters from a single image remains a challenging 3D computer vision task. In this work, we introduce DeforHMR, a novel regression-based monocular HMR framework designed to enhance the prediction of human pose parameters using deformable attention transformers. DeforHMR leverages a novel query-agnostic deformable cross-attention mechanism within the transformer decoder to effectively regress the visual features extracted from a frozen pretrained vision transformer (ViT) encoder. The proposed deformable cross-attention mechanism allows the model to attend to relevant spatial features more flexibly and in a data-dependent manner. Equipped with a transformer decoder capable of spatially-nuanced attention, DeforHMR achieves state-of-the-art performance for single-frame regression-based methods on the widely used 3D HMR benchmarks 3DPW and RICH. By pushing the boundary on the field of 3D human mesh recovery through deformable attention, we introduce a new, effective paradigm for decoding local spatial information from large pretrained vision encoders in computer vision.

[CV-50] BVI-CR: A Multi-View Human Dataset for Volumetric Video Compression

链接: https://arxiv.org/abs/2411.11199
作者: Ge Gao,Adrian Azzarelli,Ho Man Kwan,Nantheera Anantrasirichai,Fan Zhang,Oliver Moolan-Feroze,David Bull
关键词-EN: fine details, advances in immersive, immersive technologies, enabled the creation, creation of digital
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The advances in immersive technologies and 3D reconstruction have enabled the creation of digital replicas of real-world objects and environments with fine details. These processes generate vast amounts of 3D data, requiring more efficient compression methods to satisfy the memory and bandwidth constraints associated with data storage and transmission. However, the development and validation of efficient 3D data compression methods are constrained by the lack of comprehensive and high-quality volumetric video datasets, which typically require much more effort to acquire and consume increased resources compared to 2D image and video databases. To bridge this gap, we present an open multi-view volumetric human dataset, denoted BVI-CR, which contains 18 multi-view RGB-D captures and their corresponding textured polygonal meshes, depicting a range of diverse human actions. Each video sequence contains 10 views in 1080p resolution with durations between 10-15 seconds at 30FPS. Using BVI-CR, we benchmarked three conventional and neural coordinate-based multi-view video compression methods, following the MPEG MIV Common Test Conditions, and reported their rate quality performance based on various quality metrics. The results show the great potential of neural representation based methods in volumetric video compression compared to conventional video coding methods (with an up to 38% average coding gain in PSNR). This dataset provides a development and validation platform for a variety of tasks including volumetric reconstruction, compression, and quality assessment. The database will be shared publicly at \urlthis https URL.

[CV-51] Person Segmentation and Action Classification for Multi-Channel Hemisphere Field of View LiDAR Sensors

链接: https://arxiv.org/abs/2411.11151
作者: Svetlana Seliunina,Artem Otelepko,Raphael Memmesheimer,Sven Behnke
关键词-EN: surroundings for safety, Robots, person segmentation, person segmentation task, perceive persons
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 6 pages, 9 figures, 4 tables, accepted for publication at IEEE/SICE International Symposium on System Integration (SII), Munich, Germany, January 2025

点击查看摘要

Abstract:Robots need to perceive persons in their surroundings for safety and to interact with them. In this paper, we present a person segmentation and action classification approach that operates on 3D scans of hemisphere field of view LiDAR sensors. We recorded a data set with an Ouster OSDome-64 sensor consisting of scenes where persons perform three different actions and annotated it. We propose a method based on a MaskDINO model to detect and segment persons and to recognize their actions from combined spherical projected multi-channel representations of the LiDAR data with an additional positional encoding. Our approach demonstrates good performance for the person segmentation task and further performs well for the estimation of the person action states walking, waving, and sitting. An ablation study provides insights about the individual channel contributions for the person segmentation task. The trained models, code and dataset are made publicly available.

[CV-52] A Comprehensive Survey on Visual Question Answering Datasets and Algorithms

链接: https://arxiv.org/abs/2411.11150
作者: Raihan Kabir,Naznin Haque,Md Saiful Islam,Marium-E-Jannat
关键词-EN: correct natural language, natural language question, natural language, natural language answer, Visual question answering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual question answering (VQA) refers to the problem where, given an image and a natural language question about the image, a correct natural language answer has to be generated. A VQA model has to demonstrate both the visual understanding of the image and the semantic understanding of the question, demonstrating reasoning capability. Since the inception of this field, a plethora of VQA datasets and models have been published. In this article, we meticulously analyze the current state of VQA datasets and models, while cleanly dividing them into distinct categories and then summarizing the methodologies and characteristics of each category. We divide VQA datasets into four categories: (1) available datasets that contain a rich collection of authentic images, (2) synthetic datasets that contain only synthetic images produced through artificial means, (3) diagnostic datasets that are specially designed to test model performance in a particular area, e.g., understanding the scene text, and (4) KB (Knowledge-Based) datasets that are designed to measure a model’s ability to utilize outside knowledge. Concurrently, we explore six main paradigms of VQA models: fusion, where we discuss different methods of fusing information between visual and textual modalities; attention, the technique of using information from one modality to filter information from another; external knowledge base, where we discuss different models utilizing outside information; composition or reasoning, where we analyze techniques to answer advanced questions that require complex reasoning steps; explanation, which is the process of generating visual and textual descriptions to verify sound reasoning; and graph models, which encode and manipulate relationships through nodes in a graph. We also discuss some miscellaneous topics, such as scene text understanding, counting, and bias reduction.

[CV-53] Oscillation Inversion: Understand the structure of Large Flow Model through the Lens of Inversion Method

链接: https://arxiv.org/abs/2411.11135
作者: Yan Zheng,Zhenxiao Liang,Xiaoyan Cong,Lanqing guo,Yuehao Wang,Peihao Wang,Zhangyang Wang
关键词-EN: inversion methods applied, applied to large-scale, observed in inversion, Flux, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We explore the oscillatory behavior observed in inversion methods applied to large-scale text-to-image diffusion models, with a focus on the “Flux” model. By employing a fixed-point-inspired iterative approach to invert real-world images, we observe that the solution does not achieve convergence, instead oscillating between distinct clusters. Through both toy experiments and real-world diffusion models, we demonstrate that these oscillating clusters exhibit notable semantic coherence. We offer theoretical insights, showing that this behavior arises from oscillatory dynamics in rectified flow models. Building on this understanding, we introduce a simple and fast distribution transfer technique that facilitates image enhancement, stroke-based recoloring, as well as visual prompt-guided image editing. Furthermore, we provide quantitative results demonstrating the effectiveness of our method for tasks such as image enhancement, makeup transfer, reconstruction quality, and guided sampling quality. Higher-quality examples of videos and images are available at this https URL.
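
摘要所分析的"定点迭代式反演"可以用一个简单的迭代骨架示意:反复修正潜变量,并记录轨迹以观察迭代是收敛还是在若干簇之间振荡。下面的草图用一个玩具映射代替 rectified-flow 采样器,更新规则与容差均为假设。

```python
import torch

def fixed_point_inversion(sampler, target, num_iters=50, tol=1e-5):
    """Sketch: iterate z <- z + (target - sampler(z)) and record the trajectory,
    so oscillation between clusters (rather than convergence) can be detected."""
    z = target.clone()
    trajectory = []
    for _ in range(num_iters):
        z_next = z + (target - sampler(z))   # a fixed point satisfies sampler(z) = target
        trajectory.append(z_next)
        if (z_next - z).norm() < tol:
            break
        z = z_next
    return z, trajectory

toy_sampler = lambda z: torch.tanh(z)         # stand-in for a deterministic flow sampler
z_star, traj = fixed_point_inversion(toy_sampler, target=torch.randn(4))
print(len(traj), z_star)
```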

[CV-54] MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

链接: https://arxiv.org/abs/2411.11098
作者: Xi Fang,Jiankun Wang,Xiaochen Cai,Shangqian Chen,Shuwen Yang,Lin Yao,Linfeng Zhang,Guolin Ke
关键词-EN: recent decades, increased rapidly, chemistry publications, molecular image, molecular
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent decades, chemistry publications and patents have increased rapidly. A significant portion of key information is embedded in molecular structure figures, complicating large-scale literature searches and limiting the application of large language models in fields such as biology, chemistry, and pharmaceuticals. The automatic extraction of precise chemical structures is of critical importance. However, the presence of numerous Markush structures in real-world documents, along with variations in molecular image quality, drawing styles, and noise, significantly limits the performance of existing optical chemical structure recognition (OCSR) methods. We present MolParser, a novel end-to-end OCSR method that efficiently and accurately recognizes chemical structures from real-world documents, including difficult Markush structures. We use an extended SMILES encoding rule to annotate our training dataset. Under this rule, we build MolParser-7M, the largest annotated molecular image dataset to our knowledge. While utilizing a large amount of synthetic data, we employed active learning methods to incorporate substantial in-the-wild data, specifically samples cropped from real patents and scientific literature, into the training process. We trained an end-to-end molecular image captioning model, MolParser, using a curriculum learning approach. MolParser significantly outperforms classical and learning-based methods across most scenarios, with potential for broader downstream applications. The dataset is publicly available.

[CV-55] D-Cube: Exploiting Hyper-Features of Diffusion Model for Robust Medical Classification

链接: https://arxiv.org/abs/2411.11087
作者: Minhee Jang,Juheon Son,Thanaporn Viriyasaranon,Junho Kim,Jang-Hwan Choi
关键词-EN: high mortality rates, complex imaging characteristics, present significant diagnostic, significant diagnostic challenges, diagnostic challenges due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:The integration of deep learning technologies in medical imaging aims to enhance the efficiency and accuracy of cancer diagnosis, particularly for pancreatic and breast cancers, which present significant diagnostic challenges due to their high mortality rates and complex imaging characteristics. This paper introduces Diffusion-Driven Diagnosis (D-Cube), a novel approach that leverages hyper-features from a diffusion model combined with contrastive learning to improve cancer diagnosis. D-Cube employs advanced feature selection techniques that utilize the robust representational capabilities of diffusion models, enhancing classification performance on medical datasets under challenging conditions such as data imbalance and limited sample availability. The feature selection process optimizes the extraction of clinically relevant features, significantly improving classification accuracy and demonstrating resilience in imbalanced and limited data scenarios. Experimental results validate the effectiveness of D-Cube across multiple medical imaging modalities, including CT, MRI, and X-ray, showing superior performance compared to existing baseline models. D-Cube represents a new strategy in cancer detection, employing advanced deep learning techniques to achieve state-of-the-art diagnostic accuracy and efficiency.

[CV-56] STOP: Spatiotemporal Orthogonal Propagation for Weight-Threshold-Leakage Synergistic Training of Deep Spiking Neural Networks

链接: https://arxiv.org/abs/2411.11082
作者: Haoran Gao,Xichuan Zhou,Yingcheng Lin,Min Tian,Liyuan Liu,Cong Shi
关键词-EN: sparse binary activations, neuromorphic agents leveraging, agents leveraging brain-inspired, spatiotemporally sparse binary, edge computing paradigms
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages (exclude supplementary), 5 figures

点击查看摘要

Abstract:The prevalence of artificial intelligence-of-things calls for more energy-efficient edge computing paradigms, such as neuromorphic agents leveraging brain-inspired spiking neural network (SNN) models based on spatiotemporally sparse binary activations. However, the lack of efficient and high-accuracy deep SNN learning algorithms prevents them from practical edge deployments with a strictly bounded cost. In this paper, we propose a spatiotemporal orthogonal propagation (STOP) algorithm to tackle this challenge. Our algorithm enables fully synergistic learning of synaptic weights as well as firing thresholds and leakage factors in spiking neurons to improve SNN accuracy, while under a unified temporally-forward trace-based framework to mitigate the huge memory requirement for storing neural states of all time-steps in the forward pass. Characteristically, the spatially-backward neuronal errors and temporally-forward traces propagate orthogonally to and independently of each other, substantially reducing computational overhead. Our STOP algorithm obtained high recognition accuracies of 99.53%, 94.84%, 74.92%, 98.26% and 77.10% on the MNIST, CIFAR-10, CIFAR-100, DVS-Gesture and DVS-CIFAR10 datasets with adequate SNNs of intermediate scales from LeNet-5 to ResNet-18. Compared with other deep SNN training works, our method is more plausible for edge intelligence scenarios where resources are limited but high-accuracy in-situ learning is desired.
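As a rough point of reference for the "weight-threshold-leakage" idea, the sketch below implements a leaky integrate-and-fire layer whose firing threshold and leakage factor are trainable parameters alongside the synaptic weights, using a simple surrogate gradient. It is a generic SNN layer written for illustration, not the STOP algorithm or its orthogonal trace propagation.

```python
import torch

# Minimal LIF-neuron layer with a learnable firing threshold and leakage factor.
# Names and the rectangular surrogate gradient are illustrative, not the paper's scheme.

class SpikeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # rectangular surrogate gradient around the threshold
        return grad_out * (v.abs() < 0.5).float()

class LIFLayer(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.fc = torch.nn.Linear(in_features, out_features)
        self.threshold = torch.nn.Parameter(torch.ones(out_features))
        self.leak = torch.nn.Parameter(torch.full((out_features,), 0.9))

    def forward(self, x_seq):
        # x_seq: (T, B, in_features); membrane state is carried forward in time
        mem = torch.zeros(x_seq.shape[1], self.fc.out_features)
        spikes = []
        for t in range(x_seq.shape[0]):
            mem = self.leak.sigmoid() * mem + self.fc(x_seq[t])
            s = SpikeFn.apply(mem - self.threshold)
            mem = mem * (1.0 - s)          # hard reset after a spike
            spikes.append(s)
        return torch.stack(spikes)

layer = LIFLayer(10, 4)
out = layer(torch.randn(8, 2, 10))        # 8 time-steps, batch of 2
print(out.shape)                          # torch.Size([8, 2, 4])
```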

[CV-57] Electrostatic Force Regularization for Neural Structured Pruning

链接: https://arxiv.org/abs/2411.11079
作者: Abdesselam Ferdi,Abdelmalik Taleb-Ahmed,Amir Nakib,Youcef Ferdi
关键词-EN: applications remains substantial, deploying deep convolutional, deep convolutional neural, real-time applications remains, convolutional neural networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The demand for deploying deep convolutional neural networks (DCNNs) on resource-constrained devices for real-time applications remains substantial. However, existing state-of-the-art structured pruning methods often involve intricate implementations, require modifications to the original network architectures, and necessitate an extensive fine-tuning phase. To overcome these challenges, we propose a novel method that, for the first time, incorporates the concepts of charge and electrostatic force from physics into the training process of DCNNs. The magnitude of this force is directly proportional to the product of the charges of the convolution filter and the source filter, and inversely proportional to the square of the distance between them. We applied this electrostatic-like force to the convolution filters, either attracting filters with opposite charges toward non-zero weights or repelling filters with like charges toward zero weights. Consequently, filters subject to repulsive forces have their weights reduced to zero, enabling their removal, while the attractive forces preserve filters with significant weights that retain information. Unlike conventional methods, our approach is straightforward to implement, does not require any architectural modifications, and simultaneously optimizes weights and ranks filter importance, all without the need for extensive fine-tuning. We validated the efficacy of our method on modern DCNN architectures using the MNIST, CIFAR, and ImageNet datasets, achieving competitive performance compared to existing structured pruning approaches.
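The abstract's Coulomb analogy can be written down in a few lines. The sketch below computes an electrostatic-like force between flattened convolution filters, proportional to the product of filter "charges" and inversely proportional to their squared distance. How charges are assigned and how the force enters the training update are assumptions made here for illustration, not the paper's exact formulation.

```python
import torch

# Illustrative electrostatic-like interaction between convolution filters.
def electrostatic_forces(filters, charges, eps=1e-6):
    # filters: (N, D) flattened conv filters, charges: (N,) in {-1, +1}
    diff = filters.unsqueeze(1) - filters.unsqueeze(0)        # (N, N, D)
    dist2 = (diff ** 2).sum(-1) + eps                         # squared distances
    q = charges.unsqueeze(1) * charges.unsqueeze(0)           # charge products
    # Coulomb-like magnitude, directed along the line joining the filters;
    # diagonal terms vanish because diff[i, i] is the zero vector.
    force = (q / dist2).unsqueeze(-1) * diff / dist2.sqrt().unsqueeze(-1)
    return force.sum(dim=1)                                   # net force per filter

conv = torch.nn.Conv2d(3, 8, 3)
w = conv.weight.detach().flatten(1)                           # (8, 27)
charges = torch.tensor([1., -1., 1., 1., -1., -1., 1., -1.])  # assumed charge assignment
f = electrostatic_forces(w, charges)
w_nudged = w + 1e-3 * f                                       # hypothetical extra update term
print(f.shape, w_nudged.shape)
```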

[CV-58] Skeleton-Guided Spatial-Temporal Feature Learning for Video-Based Visible-Infrared Person Re-Identification

链接: https://arxiv.org/abs/2411.11069
作者: Wenjia Jiang,Xiaoke Zhu,Jiakang Gao,Di Liao
关键词-EN: Video-based visible-infrared person, visible-infrared person re-identification, Video-based visible-infrared, person re-identification, visible-infrared person
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video-based visible-infrared person re-identification (VVI-ReID) is challenging due to significant modality feature discrepancies. Spatial-temporal information in videos is crucial, but the accuracy of spatial-temporal information is often influenced by issues like low quality and occlusions in videos. Existing methods mainly focus on reducing modality differences, but pay limited attention to improving spatial-temporal features, particularly for infrared videos. To address this, we propose a novel Skeleton-guided spatial-Temporal feAture leaRning (STAR) method for VVI-ReID. By using skeleton information, which is robust to issues such as poor image quality and occlusions, STAR improves the accuracy of spatial-temporal features in videos of both modalities. Specifically, STAR employs two levels of skeleton-guided strategies: frame level and sequence level. At the frame level, the robust structured skeleton information is used to refine the visual features of individual frames. At the sequence level, we design a feature aggregation mechanism based on skeleton key points graph, which learns the contribution of different body parts to spatial-temporal features, further enhancing the accuracy of global features. Experiments on benchmark datasets demonstrate that STAR outperforms state-of-the-art methods. Code will be open source soon.

[CV-59] TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

链接: https://arxiv.org/abs/2411.11066
作者: Tingyu Qu,Mingxiao Li,Tinne Tuytelaars,Marie-Francine Moens
关键词-EN: multimodal Large Language, Large Language Models, Large Language, shown great success, Recent advances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: work in progress

点击查看摘要

Abstract:Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents. For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data. In contrast, paired image-text data are much easier to obtain, and there is substantial similarity between images and videos. Consequently, extending image LLMs for video understanding tasks presents an appealing alternative. Developing effective strategies for compressing visual tokens from multiple frames is a promising way to leverage the powerful pre-trained image LLM. In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM. The findings lead to our method TS-LLaVA, which constructs visual tokens through a Thumbnail-and-Sampling strategy. Given a video, we select a few equidistant frames from all input frames to construct a Thumbnail image as a detailed visual cue, complemented by Sampled visual tokens from all input frames. Our method establishes the new state-of-the-art performance among training-free video LLMs on various benchmarks. Notably, our 34B model outperforms GPT-4V on the MVBench benchmark, and achieves performance comparable to the 72B training-based video LLM, Video-LLaMA2, on the challenging MLVU benchmark. Code is available at this https URL.
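To make the Thumbnail-and-Sampling idea concrete, here is a minimal sketch that picks a few equidistant frames, tiles them into a single thumbnail image, and uniformly subsamples tokens from every frame. Grid size, token counts, and function names are illustrative choices rather than TS-LLaVA's actual configuration.

```python
import numpy as np

# Minimal sketch of a Thumbnail-and-Sampling style token construction.
def equidistant_indices(num_items, k):
    return np.linspace(0, num_items - 1, k).round().astype(int)

def build_thumbnail(frames, k=4):
    # frames: (T, H, W, C) -> 2x2 grid built from k equidistant frames
    idx = equidistant_indices(len(frames), k)
    picked = frames[idx]
    rows = [np.concatenate(picked[i:i + 2], axis=1) for i in range(0, k, 2)]
    return np.concatenate(rows, axis=0)

def sample_tokens(frame_tokens, per_frame=16):
    # frame_tokens: (T, N, D) visual tokens per frame -> uniform subsample
    idx = equidistant_indices(frame_tokens.shape[1], per_frame)
    return frame_tokens[:, idx, :].reshape(-1, frame_tokens.shape[-1])

video = np.random.rand(32, 224, 224, 3)            # 32 fake frames
tokens = np.random.rand(32, 576, 1024)             # e.g. 24x24 patch tokens per frame
thumb = build_thumbnail(video)                     # (448, 448, 3) detailed visual cue
sampled = sample_tokens(tokens)                    # (32*16, 1024) sampled tokens
print(thumb.shape, sampled.shape)
```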

[CV-60] StableV2V: Stablizing Shape Consistency in Video-to-Video Editing

链接: https://arxiv.org/abs/2411.11045
作者: Chang Liu,Rui Li,Kaidong Zhang,Yunwei Lan,Dong Liu
关键词-EN: significantly promoted content, promoted content creation, Recent advancements, advancements of generative, significantly promoted
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL , code: this https URL , model weights: this https URL , dataset (DAVIS-Edit): this https URL

点击查看摘要

Abstract:Recent advancements in generative AI have significantly promoted content creation and editing, where prevailing studies further extend this exciting progress to video editing. In doing so, these studies mainly transfer the inherent motion patterns from the source videos to the edited ones, where results with inferior consistency to user prompts are often observed, due to the lack of particular alignments between the delivered motions and edited contents. To address this limitation, we present a shape-consistent video editing method, namely StableV2V, in this paper. Our method decomposes the entire editing pipeline into several sequential procedures, where it edits the first video frame, then establishes an alignment between the delivered motions and user prompts, and eventually propagates the edited contents to all other frames based on such alignment. Furthermore, we curate a testing benchmark, namely DAVIS-Edit, for a comprehensive evaluation of video editing, considering various types of prompts and difficulties. Experimental results and analyses illustrate the superior performance, visual consistency, and inference efficiency of our method compared to existing state-of-the-art studies.

[CV-61] VeGaS: Video Gaussian Splatting

链接: https://arxiv.org/abs/2411.11024
作者: Weronika Smolak-Dyżewska,Dawid Malarz,Kornel Howil,Jan Kaczmarczyk,Marcin Mazur,Przemysław Spurek
关键词-EN: Implicit Neural Representations, employ neural networks, Implicit Neural, employ neural, Neural Representations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Implicit Neural Representations (INRs) employ neural networks to approximate discrete data as continuous functions. In the context of video data, such models can be utilized to transform the coordinates of pixel locations along with frame occurrence times (or indices) into RGB color values. Although INRs facilitate effective compression, they are unsuitable for editing purposes. One potential solution is to use a 3D Gaussian Splatting (3DGS) based model, such as the Video Gaussian Representation (VGR), which is capable of encoding video as a multitude of 3D Gaussians and is applicable for numerous video processing operations, including editing. Nevertheless, in this case, the capacity for modification is constrained to a limited set of basic transformations. To address this issue, we introduce the Video Gaussian Splatting (VeGaS) model, which enables realistic modifications of video data. To construct VeGaS, we propose a novel family of Folded-Gaussian distributions designed to capture nonlinear dynamics in a video stream and model consecutive frames by 2D Gaussians obtained as respective conditional distributions. Our experiments demonstrate that VeGaS outperforms state-of-the-art solutions in frame reconstruction tasks and allows realistic modifications of video data. The code is available at: this https URL.

[CV-62] CCi-YOLOv8n: Enhanced Fire Detection with CARAFE and Context-Guided Modules

链接: https://arxiv.org/abs/2411.11011
作者: Kunwei Lv
关键词-EN: forested areas pose, effective detection technologies, incidents in urban, urban and forested, forested areas
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages,7 figures

点击查看摘要

Abstract:Fire incidents in urban and forested areas pose serious threats, underscoring the need for more effective detection technologies. To address these challenges, we present CCi-YOLOv8n, an enhanced YOLOv8 model with targeted improvements for detecting small fires and smoke. The model integrates the CARAFE up-sampling operator and a context-guided module to reduce information loss during up-sampling and down-sampling, thereby retaining richer feature representations. Additionally, an inverted residual mobile block enhanced C2f module captures small targets and fine smoke patterns, a critical improvement over the original model’s detection capability. For validation, we introduce Web-Fire, a dataset curated for fire and smoke detection across diverse real-world scenarios. Experimental results indicate that CCi-YOLOv8n outperforms YOLOv8n in detection precision, confirming its effectiveness for robust fire detection tasks.

[CV-63] EROAM: Event-based Camera Rotational Odometry and Mapping in Real-time

链接: https://arxiv.org/abs/2411.11004
作者: Wanli Xing,Shijie Lin,Linhan Yang,Zeqing Zhang,Yanjun Du,Maolin Lei,Yipeng Pan,Jia Pan
关键词-EN: Iterative Closest Point, accurate camera rotation, camera rotation estimation, paper presents EROAM, event-based rotational odometry
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper presents EROAM, a novel event-based rotational odometry and mapping system that achieves real-time, accurate camera rotation estimation. Unlike existing approaches that rely on event generation models or contrast maximization, EROAM employs a spherical event representation by projecting events onto a unit sphere and introduces Event Spherical Iterative Closest Point (ES-ICP), a novel geometric optimization framework designed specifically for event camera data. The spherical representation simplifies rotational motion formulation while enabling continuous mapping for enhanced spatial resolution. Combined with parallel point-to-line optimization, EROAM achieves efficient computation without compromising accuracy. Extensive experiments on both synthetic and real-world datasets show that EROAM significantly outperforms state-of-the-art methods in terms of accuracy, robustness, and computational efficiency. Our method maintains consistent performance under challenging conditions, including high angular velocities and extended sequences, where other methods often fail or show significant drift. Additionally, EROAM produces high-quality panoramic reconstructions with preserved fine structural details.
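The spherical event representation mentioned above amounts to lifting pixel events to unit bearing vectors with the camera intrinsics. A minimal sketch, with made-up intrinsics and random events (ES-ICP itself is not reproduced here):

```python
import numpy as np

# Project pixel-level events onto the unit sphere, where pure camera rotations
# act directly on the bearing vectors. Intrinsics and events are synthetic.
K = np.array([[320., 0., 320.],
              [0., 320., 240.],
              [0., 0., 1.]])

def events_to_sphere(xy, K):
    # xy: (N, 2) pixel coordinates of events -> (N, 3) unit bearing vectors
    ones = np.ones((xy.shape[0], 1))
    rays = np.hstack([xy, ones]) @ np.linalg.inv(K).T
    return rays / np.linalg.norm(rays, axis=1, keepdims=True)

events = np.random.uniform([0, 0], [640, 480], size=(1000, 2))
sphere_pts = events_to_sphere(events, K)
print(np.allclose(np.linalg.norm(sphere_pts, axis=1), 1.0))   # True
```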

[CV-64] TeG: Temporal-Granularity Method for Anomaly Detection with Attention in Smart City Surveillance

链接: https://arxiv.org/abs/2411.11003
作者: Erkut Akdag,Egor Bondarev,Peter H. N. De With
关键词-EN: recently gained interest, recently gained, gained interest, Anomaly detection, research community
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Anomaly detection in video surveillance has recently gained interest from the research community. The temporal duration of anomalies varies within video streams, leading to complications in learning the temporal dynamics of specific events. This paper presents a temporal-granularity method for an anomaly detection model (TeG) in real-world surveillance, combining spatio-temporal features at different time-scales. The TeG model employs multi-head cross-attention blocks and multi-head self-attention blocks for this purpose. Additionally, we extend the UCF-Crime dataset with new anomaly types relevant to the Smart City research project. The TeG model is deployed and validated in a city surveillance system, achieving successful real-time results in industrial settings.

[CV-65] AppSign: Multi-level Approximate Computing for Real-Time Traffic Sign Recognition in Autonomous Vehicles

链接: https://arxiv.org/abs/2411.10988
作者: Fatemeh Omidian,Athena Abdi
关键词-EN: autonomous vehicles, traffic sign recognition, autonomous vehicles called, real-time traffic sign, approximate computing approach
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a multi-level approximate computing approach for real-time traffic sign recognition in autonomous vehicles called AppSign. Since autonomous vehicles are real-time systems, they must gather environmental information and process it instantaneously to respond properly. However, due to the limited resources of these systems, executing computation-intensive algorithms such as deep-learning schemes that lead to precise output is impossible and takes a long time. To tackle this, imprecise computation schemes trade some accuracy for reduced complexity and real-time operation. In this context, AppSign presents a multi-level approximate computing scheme to balance the accuracy and computation cost of computation-intensive schemes and make them appropriate for real-time applications. AppSign is applied to the CNN-based traffic sign recognition unit by approximating the convolution operation of the CNN, which is the primal solution for image processing applications. In AppSign, a novel approximate multiplication method called “TIRuD” is proposed that truncates the operations while keeping the accuracy acceptable. Moreover, it provides adaptive approximation of the underlying CNN by involving various levels of computation and considering different approximation methods. The efficiency of the proposed AppSign in real-time traffic sign recognition is evaluated through several experiments. Based on these experiments, our proposed TIRuD reduces the accuracy by about 10% while saving about 64% of execution time over exact multiplication, on average. Moreover, employing our proposed hierarchical approximation in various model layers outperforms exact computation by 27.78% in terms of “AoC”, a parameter that combines accuracy and computation cost.
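As a toy illustration of the truncation idea behind approximate multiplication, the snippet below drops the low-order bits of both operands before multiplying; the specific truncation rule is a generic stand-in, not the actual TIRuD operator.

```python
# Approximate multiplier that truncates low-order bits of its operands,
# trading accuracy for cheaper arithmetic. Generic illustration only.

def truncate(x, drop_bits):
    # zero out the `drop_bits` least-significant bits of a non-negative int
    return (x >> drop_bits) << drop_bits

def approx_mul(a, b, drop_bits=4):
    return truncate(a, drop_bits) * truncate(b, drop_bits)

exact = 173 * 229
approx = approx_mul(173, 229, drop_bits=4)
rel_err = abs(exact - approx) / exact
print(exact, approx, f"relative error = {rel_err:.3%}")
```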

[CV-66] Framework for developing and evaluating ethical collaboration between expert and machine ECAI

链接: https://arxiv.org/abs/2411.10983
作者: Ayan Banerjee,Payal Kamboj,Sandeep Gupta
关键词-EN: coronary artery disease, personalized intervention planning, accessible disease diagnosis, illnesses like Type, Precision medicine
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted in ECAI Workshop AIEB

点击查看摘要

Abstract:Precision medicine is a promising approach for accessible disease diagnosis and personalized intervention planning in high-mortality diseases such as coronary artery disease (CAD), drug-resistant epilepsy (DRE), and chronic illnesses like Type 1 diabetes (T1D). By leveraging artificial intelligence (AI), precision medicine tailors diagnosis and treatment solutions to individual patients by explicitly modeling variance in pathophysiology. However, the adoption of AI in medical applications faces significant challenges, including poor generalizability across centers, demographics, and comorbidities, limited explainability in clinical terms, and a lack of trust in ethical decision-making. This paper proposes a framework to develop and ethically evaluate expert-guided multi-modal AI, addressing these challenges in AI integration within precision medicine. We illustrate this framework with a case study on insulin management for T1D. To ensure ethical considerations and clinician engagement, we adopt a co-design approach where AI serves an assistive role, with final diagnoses or treatment plans emerging from collaboration between clinicians and AI.

[CV-67] V2X-Radar: A Multi-modal Dataset with 4D Radar for Cooperative Perception

链接: https://arxiv.org/abs/2411.10962
作者: Lei Yang,Xinyu Zhang,Jun Li,Chen Wang,Zhiying Song,Tong Zhao,Ziying Song,Li Wang,Mo Zhou,Yang Shen,Kai Wu,Chen Lv
关键词-EN: Modern autonomous vehicle, Modern autonomous, limited perception range, perception, Radar
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:Modern autonomous vehicle perception systems often struggle with occlusions and limited perception range. Previous studies have demonstrated the effectiveness of cooperative perception in extending the perception range and overcoming occlusions, thereby improving the safety of autonomous driving. In recent years, a series of cooperative perception datasets have emerged. However, these datasets only focus on camera and LiDAR, overlooking 4D Radar, a sensor employed in single-vehicle autonomous driving for robust perception in adverse weather conditions. In this paper, to bridge the gap of missing 4D Radar datasets in cooperative perception, we present V2X-Radar, the first large real-world multi-modal dataset featuring 4D Radar. Our V2X-Radar dataset is collected using a connected vehicle platform and an intelligent roadside unit equipped with 4D Radar, LiDAR, and multi-view cameras. The collected data includes sunny and rainy weather conditions, spanning daytime, dusk, and nighttime, as well as typical challenging scenarios. The dataset comprises 20K LiDAR frames, 40K camera images, and 20K 4D Radar data, with 350K annotated bounding boxes across five categories. To facilitate diverse research domains, we establish V2X-Radar-C for cooperative perception, V2X-Radar-I for roadside perception, and V2X-Radar-V for single-vehicle perception. We further provide comprehensive benchmarks of recent perception algorithms on the above three sub-datasets. The dataset and benchmark codebase will be available at this http URL.

[CV-68] Map-Free Trajectory Prediction with Map Distillation and Hierarchical Encoding

链接: https://arxiv.org/abs/2411.10961
作者: Xiaodong Liu,Yucheng Xing,Xin Wang
关键词-EN: Reliable motion forecasting, Reliable motion, motion forecasting, forecasting of surrounding, essential for ensuring
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Reliable motion forecasting of surrounding agents is essential for ensuring the safe operation of autonomous vehicles. Many existing trajectory prediction methods rely heavily on high-definition (HD) maps as strong driving priors. However, the availability and accuracy of these priors are not guaranteed due to substantial costs to build, localization errors of vehicles, or ongoing road constructions. In this paper, we introduce MFTP, a Map-Free Trajectory Prediction method that offers several advantages. First, it eliminates the need for HD maps during inference while still benefiting from map priors during training via knowledge distillation. Second, we present a novel hierarchical encoder that effectively extracts spatial-temporal agent features and aggregates them into multiple trajectory queries. Additionally, we introduce an iterative decoder that sequentially decodes trajectory queries to generate the final predictions. Extensive experiments show that our approach achieves state-of-the-art performance on the Argoverse dataset under the map-free setting.

[CV-69] TSFormer: A Robust Framework for Efficient UHD Image Restoration

链接: https://arxiv.org/abs/2411.10951
作者: Xin Su,Chen Wu,Zhuoran Zheng
关键词-EN: exceptional visual fidelity, applications demanding exceptional, demanding exceptional visual, visual fidelity, limiting their practical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Ultra-high-definition (UHD) image restoration is vital for applications demanding exceptional visual fidelity, yet existing methods often face a trade-off between restoration quality and efficiency, limiting their practical deployment. In this paper, we propose TSFormer, an all-in-one framework that integrates Trusted learning with Sparsification to boost both generalization capability and computational efficiency in UHD image restoration. The key is that only a small amount of token movement is allowed within the model. To efficiently filter tokens, we use Min-p with random matrix theory to quantify the uncertainty of tokens, thereby improving the robustness of the model. Our model can run a 4K image in real time (40fps) with 3.38 M parameters. Extensive experiments demonstrate that TSFormer achieves state-of-the-art restoration quality while enhancing generalization and reducing computational demands. In addition, our token filtering method can be applied to other image restoration models to effectively accelerate inference and maintain performance.
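A toy version of min-p style token filtering, keeping only tokens whose importance score reaches a fraction of the maximum, looks roughly like this; how TSFormer derives the scores and couples them with random matrix theory is not reproduced here.

```python
import torch

# Keep only tokens whose softmaxed importance is at least p * max importance,
# so few tokens "move" through the model. Scores here are random placeholders.
def min_p_filter(tokens, scores, p=0.1):
    # tokens: (N, D), scores: (N,) unnormalized importance
    probs = torch.softmax(scores, dim=0)
    keep = probs >= p * probs.max()
    return tokens[keep], keep

tokens = torch.randn(196, 64)
scores = torch.randn(196)
kept, mask = min_p_filter(tokens, scores, p=0.3)
print(f"kept {mask.sum().item()} / {tokens.shape[0]} tokens")
```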

[CV-70] Towards Accurate and Efficient Sub-8-Bit Integer Training

链接: https://arxiv.org/abs/2411.10948
作者: Wenjin Guo,Donglai Liu,Weiying Xie,Yunsong Li,Xuefei Ning,Zihan Meng,Shulin Zeng,Jie Lei,Zhenman Fang,Yu Wang
关键词-EN: training, integer training, normalization, Quantization, compute-intensive task
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Neural network training is a memory- and compute-intensive task. Quantization, which enables low-bitwidth formats in training, can significantly mitigate the workload. To reduce quantization error, recent methods have developed new data formats and additional pre-processing operations on quantizers. However, it remains quite challenging to achieve high accuracy and efficiency simultaneously. In this paper, we explore sub-8-bit integer training from its essence of gradient descent optimization. Our integer training framework includes two components: ShiftQuant to realize accurate gradient estimation, and L1 normalization to smoothen the loss landscape. ShiftQuant attains performance that approaches the theoretical upper bound of group quantization. Furthermore, it liberates group quantization from inefficient memory rearrangement. The L1 normalization facilitates the implementation of fully quantized normalization layers with impressive convergence accuracy. Our method frees sub-8-bit integer training from pre-processing and supports general devices. This framework achieves negligible accuracy loss across various neural networks and tasks (0.92% on 4-bit ResNets, 0.61% on 6-bit Transformers). The prototypical implementation of ShiftQuant achieves more than 1.85×/15.3% performance improvement on CPU/GPU compared to its FP16 counterparts, and 33.9% resource consumption reduction on FPGA compared to the FP16 counterparts. The proposed fully-quantized L1 normalization layers achieve more than 35.54% improvement in throughput on CPU compared to traditional L2 normalization layers. Moreover, theoretical analysis verifies the advancement of our method.
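The L1 normalization component can be sketched as a LayerNorm-like module that scales by the mean absolute deviation instead of the standard deviation, avoiding the squares and square roots that are awkward in low-bit integer arithmetic. This is a generic L1-norm layer for illustration, not the paper's quantized implementation:

```python
import torch

# LayerNorm-style normalization using the mean absolute deviation (L1 statistic).
class L1Norm(torch.nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.ones(dim))
        self.beta = torch.nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)
        mad = (x - mu).abs().mean(dim=-1, keepdim=True)   # mean absolute deviation
        return self.gamma * (x - mu) / (mad + self.eps) + self.beta

x = torch.randn(2, 16, 64)
print(L1Norm(64)(x).shape)   # torch.Size([2, 16, 64])
```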

[CV-71] Direct and Explicit 3D Generation from a Single Image

链接: https://arxiv.org/abs/2411.10947
作者: Haoyu Wu,Meher Gitika Karumuri,Chuhang Zou,Seungbae Bang,Yuelong Li,Dimitris Samaras,Sunil Hadap
关键词-EN: high computational costs, Stable Diffusion model, repurposed Stable Diffusion, approaches suffer, high-resolution outputs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 3DV 2025, Project page: this https URL

点击查看摘要

Abstract:Current image-to-3D approaches suffer from high computational costs and lack scalability for high-resolution outputs. In contrast, we introduce a novel framework to directly generate explicit surface geometry and texture using multi-view 2D depth and RGB images along with 3D Gaussian features using a repurposed Stable Diffusion model. We introduce a depth branch into U-Net for efficient and high quality multi-view, cross-domain generation and incorporate epipolar attention into the latent-to-pixel decoder for pixel-level multi-view consistency. By back-projecting the generated depth pixels into 3D space, we create a structured 3D representation that can be either rendered via Gaussian splatting or extracted to high-quality meshes, thereby leveraging additional novel view synthesis loss to further improve our performance. Extensive experiments demonstrate that our method surpasses existing baselines in geometry and texture quality while achieving significantly faster generation time.
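The back-projection step that turns predicted depth pixels into a structured 3D representation is standard pinhole geometry; a small sketch with synthetic intrinsics, pose, and depth:

```python
import numpy as np

# Back-project a depth map into 3D world points. Intrinsics, pose, and depth
# are synthetic placeholders for illustration.
def backproject(depth, K, cam_to_world):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # (HW, 3)
    cam_pts = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)       # camera frame
    cam_h = np.hstack([cam_pts, np.ones((cam_pts.shape[0], 1))])      # homogeneous
    return (cam_h @ cam_to_world.T)[:, :3]                            # world frame

K = np.array([[128., 0., 64.], [0., 128., 64.], [0., 0., 1.]])
depth = np.full((128, 128), 2.0)            # fake 2 m depth everywhere
pose = np.eye(4)                            # camera at the origin
points = backproject(depth, K, pose)
print(points.shape)                         # (16384, 3)
```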

[CV-72] Anomaly Detection for People with Visual Impairments Using an Egocentric 360-Degree Camera WACV2025

链接: https://arxiv.org/abs/2411.10945
作者: Inpyo Song,Sanghyeon Lee,Minjun Joo,Jangwon Lee
关键词-EN: Recent advancements, developing assistive technologies, visual impairments, assistive technologies, vision have led
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: WACV2025

点击查看摘要

Abstract:Recent advancements in computer vision have led to a renewed interest in developing assistive technologies for individuals with visual impairments. Although extensive research has been conducted in the field of computer vision-based assistive technologies, most of the focus has been on understanding contexts in images, rather than addressing their physical safety and security concerns. To address this challenge, we propose the first step towards detecting anomalous situations for visually impaired people by observing their entire surroundings using an egocentric 360-degree camera. We first introduce a novel egocentric 360-degree video dataset called VIEW360 (Visually Impaired Equipped with Wearable 360-degree camera), which contains abnormal activities that visually impaired individuals may encounter, such as shoulder surfing and pickpocketing. Furthermore, we propose a new architecture called the FDPN (Frame and Direction Prediction Network), which facilitates frame-level prediction of abnormal events and identification of their directions. Finally, we evaluate our approach on our VIEW360 dataset and the publicly available UCF-Crime and Shanghaitech datasets, demonstrating state-of-the-art performance.

[CV-73] A Monocular SLAM-based Multi-User Positioning System with Image Occlusion in Augmented Reality

链接: https://arxiv.org/abs/2411.10940
作者: Wei-Hsiang Lien,Benedictus Kent Chandra,Robin Fischer,Ya-Hui Tang,Shiann-Jang Wang,Wei-En Hsu,Li-Chen Fu
关键词-EN: multi-user collaborative experiences, recent years, augmented reality, increasing demand, collaborative experiences
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, with the rapid development of augmented reality (AR) technology, there is an increasing demand for multi-user collaborative experiences. Unlike single-user experiences, ensuring the spatial localization of every user and maintaining synchronization and consistency of positioning and orientation across multiple users is a significant challenge. In this paper, we propose a multi-user localization system based on ORB-SLAM2 using monocular RGB images as a development platform based on the Unity 3D game engine. This system not only performs user localization but also places a common virtual object on a planar surface (such as a table) in the environment so that every user holds a proper perspective view of the object. These generated virtual objects serve as reference points for multi-user position synchronization. The positioning information is passed among every user’s AR devices via a central server, based on which the relative position and movement of other users in the space of a specific user are presented via virtual avatars all with respect to these virtual objects. In addition, we use deep learning techniques to estimate the depth map from a single RGB image to solve occlusion problems in AR applications, making virtual objects appear more natural in AR scenes.

[CV-74] Iterative Camera-LiDAR Extrinsic Optimization via Surrogate Diffusion

链接: https://arxiv.org/abs/2411.10936
作者: Ni Ou,Zhuo Chen,Xinru Zhang,Junzheng Wang
关键词-EN: Cameras and LiDAR, autonomous vehicles, LiDAR are essential, Cameras, essential sensors
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Cameras and LiDAR are essential sensors for autonomous vehicles. Camera-LiDAR data fusion compensates for the deficiencies of stand-alone sensors but relies on precise extrinsic calibration. Many learning-based calibration methods predict extrinsic parameters in a single step. Driven by the growing demand for higher accuracy, a few approaches utilize multi-range models or integrate multiple methods to improve extrinsic parameter predictions, but these strategies incur extended training times and require additional storage for separate models. To address these issues, we propose a single-model iterative approach based on surrogate diffusion to significantly enhance the capacity of individual calibration methods. By applying a buffering technique that we propose, the inference time of our surrogate diffusion is 43.7% less than that of multi-range models. Additionally, we create a calibration network as our denoiser, featuring both projection-first and encoding-first branches for effective point feature extraction. Extensive experiments demonstrate that our diffusion model outperforms other single-model iterative methods and delivers competitive results compared to multi-range models. Our denoiser exceeds state-of-the-art calibration methods, reducing the rotation error by 24.5% compared to the second-best method. Furthermore, with the proposed diffusion applied, it achieves 20.4% less rotation error and 9.6% less translation error.

[CV-75] Constrained Diffusion with Trust Sampling NEURIPS

链接: https://arxiv.org/abs/2411.10932
作者: William Huang,Yifeng Jiang,Tom Van Wouwe,C. Karen Liu
关键词-EN: satisfy challenging constraints, demonstrated significant promise, struggle to satisfy, satisfy challenging, diffusion model
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 6 figures, NeurIPS

点击查看摘要

Abstract:Diffusion models have demonstrated significant promise in various generative tasks; however, they often struggle to satisfy challenging constraints. Our approach addresses this limitation by rethinking training-free loss-guided diffusion from an optimization perspective. We formulate a series of constrained optimizations throughout the inference process of a diffusion model. In each optimization, we allow the sample to take multiple steps along the gradient of the proxy constraint function until we can no longer trust the proxy, according to the variance at each diffusion level. Additionally, we estimate the state manifold of the diffusion model to allow for early termination when the sample starts to wander away from the state manifold at each diffusion step. Trust sampling effectively balances between following the unconditional diffusion model and adhering to the loss guidance, enabling more flexible and accurate constrained generation. We demonstrate the efficacy of our method through extensive experiments on complex tasks, and in drastically different domains of images and 3D motion generation, showing significant improvements over existing methods in terms of generation quality. Our implementation is available at this https URL.
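Schematically, training-free loss-guided sampling of this kind alternates an unconditional denoiser update with several gradient steps on a proxy constraint at each diffusion level. The loop below is a toy sketch: the denoiser, constraint, and fixed inner-step budget stand in for the paper's models and its variance-based trust criterion.

```python
import torch

# Toy loss-guided sampling loop: denoise, then optimize a proxy constraint
# for a bounded number of steps at each level (a stand-in for the trust rule).
def denoiser(x, t):
    return x * (1.0 - 0.05 * t)                  # fake denoiser for illustration

def constraint_loss(x):
    return ((x.sum(dim=-1) - 1.0) ** 2).mean()   # e.g. "coordinates sum to 1"

def guided_sample(steps=10, inner_steps=5, lr=0.1):
    x = torch.randn(4, 8)
    for t in reversed(range(steps)):
        x = denoiser(x, t)                       # unconditional model update
        for _ in range(inner_steps):             # constrained optimization at this level
            x = x.detach().requires_grad_(True)
            loss = constraint_loss(x)
            (grad,) = torch.autograd.grad(loss, x)
            x = (x - lr * grad).detach()
    return x

sample = guided_sample()
print(constraint_loss(sample).item())            # should be small after guidance
```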

[CV-76] Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection WACV2025

链接: https://arxiv.org/abs/2411.10922
作者: Wentao Bao,Kai Li,Yuxiao Chen,Deep Patel,Martin Renqiang Min,Yu Kong
关键词-EN: human actions spatially, action categories, recognize and localize, spatially and temporally, Action
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: WACV 2025 Accepted

点击查看摘要

Abstract:Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world where test videos inevitably come beyond the trained action categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem. It aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve such an open-vocabulary capability, we propose a novel method OpenMixer that exploits the inherent semantics and localizability of large vision-language models (VLM) within the family of query-based detection transformers (DETR). Specifically, the OpenMixer is developed by spatial and temporal OpenMixer blocks (S-OMB and T-OMB), and a dynamically fused alignment (DFA) module. The three components collectively enjoy the merits of strong generalization from pre-trained VLMs and end-to-end learning from DETR design. Moreover, we established OVAD benchmarks under various settings, and the experimental results show that the OpenMixer performs the best over baselines for detecting seen and unseen actions. We release the codes, models, and dataset splits at this https URL.

[CV-77] Distributed solar generation forecasting using attention-based deep neural networks for cloud movement prediction

链接: https://arxiv.org/abs/2411.10921
作者: Maneesha Perera,Julian De Hoog,Kasun Bandara,Saman Halgamuge
关键词-EN: Accurate forecasts, solar generation, reduce negative impacts, negative impacts resulting, distributed solar generation
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Accurate forecasts of distributed solar generation are necessary to reduce negative impacts resulting from the increased uptake of distributed solar photovoltaic (PV) systems. However, the high variability of solar generation over short time intervals (seconds to minutes) caused by cloud movement makes this forecasting task difficult. To address this, using cloud images, which capture the second-to-second changes in cloud cover affecting solar generation, has shown promise. Recently, deep neural networks with “attention” that focus on important regions of an image have been applied with success in many computer vision applications. However, their use for forecasting cloud movement has not yet been extensively explored. In this work, we propose an attention-based convolutional long short-term memory network to forecast cloud movement and apply an existing self-attention-based method previously proposed for video prediction to forecast cloud movement. We investigate and discuss the impact of cloud forecasts from attention-based methods towards forecasting distributed solar generation, compared to cloud forecasts from non-attention-based methods. We further provide insights into the different solar forecast performances that can be achieved for high and low altitude clouds. We find that for clouds at high altitudes, the cloud predictions obtained using attention-based methods result in solar forecast skill score improvements of 5.86% or more compared to non-attention-based methods.

[CV-78] Generating Compositional Scenes via Text-to-image RGBA Instance Generation NEURIPS2024

链接: https://arxiv.org/abs/2411.10913
作者: Alessandro Fontanella,Petru-Daniel Tudosiu,Yongxin Yang,Shifeng Zhang,Sarah Parisot
关键词-EN: tedious prompt engineering, object attributes, cost of tedious, control, attributes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space and scene manipulation abilities. In this work, we propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity. To ensure control over instance attributes, we devise a novel training paradigm to adapt a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components in realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse and high quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach allows building and manipulating images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.

[CV-79] Attention-based U-Net Method for Autonomous Lane Detection

链接: https://arxiv.org/abs/2411.10902
作者: Mohammadhamed Tangestanizadeh,Mohammad Dehghani Tezerjani,Saba Yousefian Jazi
关键词-EN: detection involves identifying, involves identifying lanes, Lane detection involves, location and shape, Lane detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Lane detection involves identifying lanes on the road and accurately determining their location and shape. This is a crucial technique for modern assisted and autonomous driving systems. However, several unique properties of lanes pose challenges for detection methods. The lack of distinctive features can cause lane detection algorithms to be confused by other objects with similar appearances. Additionally, the varying number of lanes and the diversity in lane line patterns, such as solid, broken, single, double, merging, and splitting lines, further complicate the task. To address these challenges, Deep Learning (DL) approaches can be employed in various ways. Merging DL models with an attention mechanism has recently surfaced as a new approach. In this context, two deep learning-based lane recognition methods are proposed in this study. The first method employs the Feature Pyramid Network (FPN) model, delivering an impressive 87.59% accuracy in detecting road lanes. The second method, which incorporates attention layers into the U-Net model, significantly boosts the performance of semantic segmentation tasks. The advanced model, achieving an extraordinary 98.98% accuracy and far surpassing the basic U-Net model, clearly showcases its superiority over existing methods in a comparative analysis. The groundbreaking findings of this research pave the way for the development of more effective and reliable road lane detection methods, significantly advancing the capabilities of modern assisted and autonomous driving systems.
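For context, an attention layer added to a U-Net skip connection is typically an attention gate of the following form, where decoder features weight the encoder skip features before they are reused. This is the generic construction, not necessarily the exact block trained in the paper:

```python
import torch
import torch.nn as nn

# Generic attention gate for a U-Net skip connection.
class AttentionGate(nn.Module):
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_skip = nn.Conv2d(skip_ch, inter_ch, 1)
        self.w_gate = nn.Conv2d(gate_ch, inter_ch, 1)
        self.psi = nn.Conv2d(inter_ch, 1, 1)

    def forward(self, skip, gate):
        # skip: encoder features, gate: upsampled decoder features (same H, W)
        a = torch.relu(self.w_skip(skip) + self.w_gate(gate))
        alpha = torch.sigmoid(self.psi(a))        # (B, 1, H, W) attention map
        return skip * alpha                       # re-weighted skip connection

skip = torch.randn(1, 64, 56, 56)
gate = torch.randn(1, 128, 56, 56)
out = AttentionGate(64, 128, 32)(skip, gate)
print(out.shape)                                  # torch.Size([1, 64, 56, 56])
```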

[CV-80] Targeting Negative Flips in Active Learning using Validation Sets

链接: https://arxiv.org/abs/2411.10896
作者: Ryan Benkert,Mohit Prabhushankar,Ghassan AlRegib
关键词-EN: negative flips, active learning, active learning algorithms, negative, flips
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Presented at the IEEE International Conference on Big Data 2024, Washington DC, USA

点击查看摘要

Abstract:The performance of active learning algorithms can be improved in two ways. The often used and intuitive way is by reducing the overall error rate within the test set. The second way is to ensure that correct predictions are not forgotten when the training set is increased in between rounds. The former is measured by the accuracy of the model and the latter is captured in negative flips between rounds. Negative flips are samples that are correctly predicted when trained with the previous/smaller dataset and incorrectly predicted after additional samples are labeled. In this paper, we discuss improving the performance of active learning algorithms both in terms of prediction accuracy and negative flips. The first observation we make in this paper is that negative flips and overall error rates are decoupled and reducing one does not necessarily imply that the other is reduced. Our observation is important as current active learning algorithms do not consider negative flips directly and implicitly assume the opposite. The second observation is that performing targeted active learning on subsets of the unlabeled pool has a significant impact on the behavior of the active learning algorithm and influences both negative flips and prediction accuracy. We then develop ROSE - a plug-in algorithm that utilizes a small labeled validation set to restrict arbitrary active learning acquisition functions to negative flips within the unlabeled pool. We show that integrating a validation set results in a significant performance boost in terms of accuracy, negative flip rate reduction, or both.
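The negative-flip metric the paper targets is easy to state in code: count the samples that the earlier-round model classified correctly and the later-round model gets wrong.

```python
import numpy as np

# Fraction of samples correctly predicted in the previous round but
# incorrectly predicted after more labels were added.
def negative_flip_rate(labels, preds_prev, preds_curr):
    prev_correct = preds_prev == labels
    curr_wrong = preds_curr != labels
    return np.mean(prev_correct & curr_wrong)

labels   = np.array([0, 1, 1, 2, 0, 2])
preds_r1 = np.array([0, 1, 0, 2, 0, 1])   # round-1 model
preds_r2 = np.array([0, 0, 1, 2, 0, 2])   # round-2 model after more labels
print(negative_flip_rate(labels, preds_r1, preds_r2))   # 1/6 ≈ 0.167
```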

[CV-81] Deep BI-RADS Network for Improved Cancer Detection from Mammograms

链接: https://arxiv.org/abs/2411.10894
作者: Gil Ben-Artzi,Feras Daragma,Shahar Mahpod
关键词-EN: enhanced diagnostic accuracy, detection leverage multi-view, cancer detection leverage, visual mammography data, breast cancer detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While state-of-the-art models for breast cancer detection leverage multi-view mammograms for enhanced diagnostic accuracy, they often focus solely on visual mammography data. However, radiologists document valuable lesion descriptors that contain additional information that can enhance mammography-based breast cancer screening. A key question is whether deep learning models can benefit from these expert-derived features. To address this question, we introduce a novel multi-modal approach that combines textual BI-RADS lesion descriptors with visual mammogram content. Our method employs iterative attention layers to effectively fuse these different modalities, significantly improving classification performance over image-only models. Experiments on the CBIS-DDSM dataset demonstrate substantial improvements across all metrics, demonstrating the contribution of handcrafted features to end-to-end models.

[CV-82] ChannelDropBack: Forward-Consistent Stochastic Regularization for Deep Networks

链接: https://arxiv.org/abs/2411.10891
作者: Evgeny Hershkovitch Neiterman,Gil Ben-Artzi
关键词-EN: Incorporating stochasticity, deep convolutional networks, reduce overfitting, overfitting and improve, deep convolutional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Incorporating stochasticity into the training process of deep convolutional networks is a widely used technique to reduce overfitting and improve regularization. Existing techniques often require modifying the architecture of the network by adding specialized layers, are effective only to specific network topologies or types of layers - linear or convolutional, and result in a trained model that is different from the deployed one. We present ChannelDropBack, a simple stochastic regularization approach that introduces randomness only into the backward information flow, leaving the forward pass intact. ChannelDropBack randomly selects a subset of channels within the network during the backpropagation step and applies weight updates only to them. As a consequence, it allows for seamless integration into the training process of any model and layers without the need to change its architecture, making it applicable to various network topologies, and the exact same network is deployed during training and inference. Experimental evaluations validate the effectiveness of our approach, demonstrating improved accuracy on popular datasets and models, including ImageNet and ViT. Code is available at \urlthis https URL.
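One way to realize this "backward-only" channel selection, consistent with the description above, is to mask the weight gradients of all but a random subset of output channels after backpropagation and before the optimizer step; the keep ratio and the placement of the masking are illustrative assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn as nn

# Forward pass untouched; after backward, zero the gradients of non-selected
# output channels so only the kept channels receive updates this step.
def channel_dropback_(conv, keep_ratio=0.5):
    out_ch = conv.weight.shape[0]
    keep = torch.rand(out_ch) < keep_ratio          # random channel subset
    mask = keep.float().view(-1, 1, 1, 1)
    conv.weight.grad.mul_(mask)                     # update only kept channels
    if conv.bias is not None and conv.bias.grad is not None:
        conv.bias.grad.mul_(keep.float())

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 8, 3, padding=1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 3, 32, 32), torch.randn(4, 8, 32, 32)
loss = ((model(x) - y) ** 2).mean()
loss.backward()
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        channel_dropback_(m)                        # mask grads before the step
opt.step()
```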

[CV-83] FIAS: Feature Imbalance-Aware Medical Image Segmentation with Dynamic Fusion and Mixing Attention

链接: https://arxiv.org/abs/2411.10881
作者: Xiwei Liu,Min Xu,Qirong Ho
关键词-EN: combine convolutional neural, convolutional neural networks, demonstrates competitive ability, transformers demonstrates competitive, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 2 figures, 3 tables

点击查看摘要

Abstract:With the growing application of transformers in computer vision, hybrid architectures that combine convolutional neural networks (CNNs) and transformers demonstrate competitive ability in medical image segmentation. However, direct fusion of features from CNNs and transformers often leads to feature imbalance and redundant information. To address these issues, we propose a Feature Imbalance-Aware Segmentation (FIAS) network, which incorporates a dual-path encoder and a novel Mixing Attention (MixAtt) decoder. The dual-branch encoder integrates a DilateFormer for long-range global feature extraction and a Depthwise Multi-Kernel (DMK) convolution for capturing fine-grained local details. A Context-Aware Fusion (CAF) block dynamically balances the contribution of these global and local features, preventing feature imbalance. The MixAtt decoder further enhances segmentation accuracy by combining self-attention and Monte Carlo attention, enabling the model to capture both small details and large-scale dependencies. Experimental results on the Synapse multi-organ and ACDC datasets demonstrate the strong competitiveness of our approach in medical image segmentation tasks.

[CV-84] Improvement in Facial Emotion Recognition using Synthetic Data Generated by Diffusion Model ICASSP2025

链接: https://arxiv.org/abs/2411.10863
作者: Arnab Kumar Roy,Hemant Kumar Kathania,Adhitiya Sharma
关键词-EN: personalized learning environments, mental health monitoring, Facial Emotion Recognition, Facial Emotion, affective computing
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
*备注: 5 pages, 4 tables, 4 figures, ICASSP 2025

点击查看摘要

Abstract:Facial Emotion Recognition (FER) plays a crucial role in computer vision, with significant applications in human-computer interaction, affective computing, and areas such as mental health monitoring and personalized learning environments. However, a major challenge in the FER task is the class imbalance commonly found in available datasets, which can hinder both model performance and generalization. In this paper, we tackle the issue of data imbalance by incorporating synthetic data augmentation and leveraging the ResEmoteNet model to enhance the overall performance on the facial emotion recognition task. We employed Stable Diffusion 2 and Stable Diffusion 3 Medium models to generate synthetic facial emotion data, augmenting the training sets of the FER2013 and RAF-DB benchmark datasets. Training ResEmoteNet with these augmented datasets resulted in substantial performance improvements, achieving accuracies of 96.47% on FER2013 and 99.23% on RAF-DB. These findings show an absolute improvement of 16.68% on FER2013 and 4.47% on RAF-DB, highlighting the efficacy of synthetic data augmentation in strengthening FER models and underscoring the potential of advanced generative models in FER research and applications. The source code for ResEmoteNet is available at this https URL

[CV-85] NeuroNURBS: Learning Efficient Surface Representations for 3D Solids

链接: https://arxiv.org/abs/2411.10848
作者: Jiajie Fan,Babak Gholami,Thomas Bäck,Hao Wang
关键词-EN: Computer-Aided Design, Boundary Representation, Non-Uniform Rational B-Splines, CAD, Boundary
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Boundary Representation (B-Rep) is the de facto representation of 3D solids in Computer-Aided Design (CAD). B-Rep solids are defined with a set of NURBS (Non-Uniform Rational B-Splines) surfaces forming a closed volume. To represent a surface, current works often employ the UV-grid approximation, i.e., sample points uniformly on the surface. However, the UV-grid method is not efficient in surface representation and sometimes lacks precision and regularity. In this work, we propose NeuroNURBS, a representation learning method to directly encode the parameters of NURBS surfaces. Our evaluation in solid generation and segmentation tasks indicates that NeuroNURBS performs comparably to, and in some cases better than, UV-grids, but with a significantly improved efficiency: for training the surface autoencoder, GPU consumption is reduced by 86.7%; memory requirement drops by 79.9% for storing 3D solids. Moreover, adapting BrepGen for solid generation with our NeuroNURBS improves the FID from 30.04 to 27.24, and resolves the undulating issue in generated surfaces.

[CV-86] Automatic Discovery and Assessment of Interpretable Systematic Errors in Semantic Segmentation

链接: https://arxiv.org/abs/2411.10845
作者: Jaisidh Singh,Sonam Singh,Amit Arvind Kale,Harsh K Gandhi
关键词-EN: systematic errors, errors, systematic, Berkeley Deep Drive, models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages main paper (without references), total 13 pages 9 figures

点击查看摘要

Abstract:This paper presents a novel method for discovering systematic errors in segmentation models. For instance, a systematic error in a segmentation model can be a sufficiently large number of instances of a target class, such as pedestrians, being misclassified as parking meters. With the rapid deployment of these models in critical applications such as autonomous driving, it is vital to detect and interpret these systematic errors. However, the key challenge is automatically discovering such failures on unlabelled data and forming interpretable semantic sub-groups for intervention. For this, we leverage multimodal foundation models to retrieve errors and use conceptual linkage along with erroneous nature to study the systematic nature of these errors. We demonstrate that such errors are present in SOTA segmentation models (UperNet ConvNeXt and UperNet Swin) trained on the Berkeley Deep Drive dataset and benchmark the approach qualitatively and quantitatively, showing its effectiveness by discovering coherent systematic errors for these models. Our work opens up the avenue to model analysis and intervention that have so far been underexplored in semantic segmentation.

[CV-87] AnimateAnything: Consistent and Controllable Animation for Video Generation

链接: https://arxiv.org/abs/2411.10836
作者: Guojun Lei,Chi Wang,Hong Li,Rong Zhang,Yikai Wang,Weiwei Xu
关键词-EN: including camera trajectories, user motion annotations, generation approach AnimateAnything, text prompts, unified controllable video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present a unified controllable video generation approach AnimateAnything that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully design a multi-scale control feature fusion network to construct a common motion representation for different conditions. It explicitly converts all control information into frame-by-frame optical flows. Then we incorporate the optical flows as motion priors to guide final video generation. In addition, to reduce the flickering issues caused by large-scale motion, we propose a frequency-based stabilization module. It can enhance temporal coherence by ensuring the video’s frequency domain consistency. Experiments demonstrate that our method outperforms the state-of-the-art approaches. For more details and videos, please refer to the webpage: this https URL.

[CV-88] ARM: Appearance Reconstruction Model for Relightable 3D Generation

链接: https://arxiv.org/abs/2411.10825
作者: Xiang Feng,Chang Yu,Zoubin Bi,Yintong Shang,Feng Gao,Hongzhi Wu,Kun Zhou,Chenfanfu Jiang,Yin Yang
关键词-EN: faithfully generate realistic, advanced geometry generation, greatly advanced geometry, reconstruction models, generate realistic appearance
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Recent image-to-3D reconstruction models have greatly advanced geometry generation, but they still struggle to faithfully generate realistic appearance. To address this, we introduce ARM, a novel method that reconstructs high-quality 3D meshes and realistic appearance from sparse-view images. The core of ARM lies in decoupling geometry from appearance, processing appearance within the UV texture space. Unlike previous methods, ARM improves texture quality by explicitly back-projecting measurements onto the texture map and processing them in a UV space module with a global receptive field. To resolve ambiguities between material and illumination in input images, ARM introduces a material prior that encodes semantic appearance information, enhancing the robustness of appearance decomposition. Trained on just 8 H100 GPUs, ARM outperforms existing methods both quantitatively and qualitatively.

[CV-89] FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations

链接: https://arxiv.org/abs/2411.10818
作者: Hmrishav Bandyopadhyay,Yi-Zhe Song
关键词-EN: professional studio productions, studio productions, offer a powerful, powerful medium, doodles to professional
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Code: this https URL

点击查看摘要

Abstract:Sketch animations offer a powerful medium for visual storytelling, from simple flip-book doodles to professional studio productions. While traditional animation requires teams of skilled artists to draw key frames and in-between frames, existing automation attempts still demand significant artistic effort through precise motion paths or keyframe specification. We present FlipSketch, a system that brings back the magic of flip-book animation – just draw your idea and describe how you want it to move! Our approach harnesses motion priors from text-to-video diffusion models, adapting them to generate sketch animations through three key innovations: (i) fine-tuning for sketch-style frame generation, (ii) a reference frame mechanism that preserves visual integrity of input sketch through noise refinement, and (iii) a dual-attention composition that enables fluid motion without losing visual consistency. Unlike constrained vector animations, our raster frames support dynamic sketch transformations, capturing the expressive freedom of traditional animation. The result is an intuitive system that makes sketch animation as simple as doodling and describing, while maintaining the artistic essence of hand-drawn animation.

[CV-90] DEAL: Decoupled Classifier with Adaptive Linear Modulation for Group Robust Early Diagnosis of MCI to AD Conversion

链接: https://arxiv.org/abs/2411.10814
作者: Donggyu Lee,Juhyeon Park,Taesup Moon
关键词-EN: learning-based Alzheimer disease, deep learning-based Alzheimer, made significant advancements, recently made significant, Alzheimer disease
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review

点击查看摘要

Abstract:While deep learning-based Alzheimer’s disease (AD) diagnosis has recently made significant advancements, particularly in predicting the conversion of mild cognitive impairment (MCI) to AD based on MRI images, there remains a critical gap in research regarding the group robustness of the diagnosis. Although numerous studies pointed out that deep learning-based classifiers may exhibit poor performance in certain groups by relying on unimportant attributes, this issue has been largely overlooked in the early diagnosis of MCI to AD conversion. In this paper, we present the first comprehensive investigation of the group robustness in the early diagnosis of MCI to AD conversion using MRI images, focusing on disparities in accuracy between groups, specifically sMCI and pMCI individuals divided by age. Our experiments reveal that standard classifiers consistently underperform for certain groups across different architectures, highlighting the need for more tailored approaches. To address this, we propose a novel method, dubbed DEAL (DEcoupled classifier with Adaptive Linear modulation), comprising two key components: (1) a linear modulation of features from the penultimate layer, incorporating easily obtainable age and cognitive indicative tabular features, and (2) a decoupled classifier that provides more tailored decision boundaries for each group, further improving performance. Through extensive experiments and evaluations across different architectures, we demonstrate the efficacy of DEAL in improving the group robustness of the MCI to AD conversion prediction.
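
To make the two DEAL components concrete, here is a minimal PyTorch-style sketch of what "adaptive linear modulation of penultimate features plus a decoupled, per-group classifier" could look like. The module names, dimensions, and the FiLM-style scale-and-shift formulation are our own illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the two DEAL ingredients described in the abstract.
import torch
import torch.nn as nn

class AdaptiveLinearModulation(nn.Module):
    """Scale-and-shift penultimate image features using tabular covariates (e.g., age, cognition)."""
    def __init__(self, feat_dim: int, tab_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(tab_dim, feat_dim)  # per-feature scale
        self.to_beta = nn.Linear(tab_dim, feat_dim)   # per-feature shift

    def forward(self, feats, tab):
        return self.to_gamma(tab) * feats + self.to_beta(tab)

class DecoupledClassifier(nn.Module):
    """One linear head per group (e.g., age bins); the head is selected per sample."""
    def __init__(self, feat_dim: int, num_classes: int, num_groups: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_groups)]
        )

    def forward(self, feats, group_ids):
        logits = torch.stack([head(feats) for head in self.heads], dim=1)  # (B, G, C)
        return logits[torch.arange(feats.size(0)), group_ids]             # (B, C)

# Toy usage: 4 samples, 512-d features, 2 tabular covariates, 2 age groups, sMCI vs pMCI.
feats, tab = torch.randn(4, 512), torch.randn(4, 2)
group_ids = torch.tensor([0, 1, 0, 1])
modulated = AdaptiveLinearModulation(512, 2)(feats, tab)
logits = DecoupledClassifier(512, 2, 2)(modulated, group_ids)
print(logits.shape)  # torch.Size([4, 2])
```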

[CV-91] Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model

链接: https://arxiv.org/abs/2411.10803
作者: Ting Liu,Liangtao Shi,Richang Hong,Yue Hu,Quanjun Yin,Linfeng Zhang
关键词-EN: multimodal large language, large language models, exhibit significant spatial, vision encoding stage, prefilling stage
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 4figures

点击查看摘要

Abstract:The vision tokens in multimodal large language models usually exhibit significant spatial and temporal redundancy and take up most of the input tokens, which harms their inference efficiency. To solve this problem, some recent works were introduced to drop the unimportant tokens during inference where the importance of each token is decided only by the information in either the vision encoding stage or the prefilling stage. In this paper, we propose Multi-stage Token Dropping (MustDrop) to measure the importance of each token from the whole lifecycle, including the vision encoding stage, prefilling stage, and decoding stage. Concretely, in the visual encoding stage, MustDrop merges spatially adjacent tokens with high similarity, and establishes a key token set to retain the most vision-critical tokens, preventing them from being discarded in later stages. In the prefilling stage, MustDrop further compresses vision tokens by the guidance of text semantics, with a dual-attention filtering strategy. In the decoding stage, an output-aware cache policy is proposed to further reduce the size of the KV cache. By leveraging tailored strategies in the multi-stage process, MustDrop can more precisely recognize the important and redundant tokens, thus achieving an optimal balance between performance and efficiency. For instance, MustDrop reduces about 88.5% FLOPs on LLaVA with a compression ratio of 92.2% while maintaining comparable accuracy. Our codes are available at this https URL.
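
The first stage (merging spatially adjacent, highly similar vision tokens) is easy to illustrate. The snippet below is a simplified sketch under our own assumptions (raster-ordered tokens, a cosine-similarity threshold, pairwise averaging); it is not the released MustDrop implementation.

```python
import torch
import torch.nn.functional as F

def merge_adjacent_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """tokens: (num_tokens, dim), assumed ordered so that neighbours are spatially adjacent."""
    merged, i = [], 0
    while i < tokens.size(0):
        if i + 1 < tokens.size(0):
            sim = F.cosine_similarity(tokens[i], tokens[i + 1], dim=0)
            if sim > threshold:                          # redundant neighbours -> merge
                merged.append((tokens[i] + tokens[i + 1]) / 2)
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return torch.stack(merged)

tokens = torch.randn(576, 1024)   # e.g., a 24x24 vision-token grid from a ViT encoder
print(merge_adjacent_tokens(tokens).shape)
```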

[CV-92] Test-time Conditional Text-to-Image Synthesis Using Diffusion Models

链接: https://arxiv.org/abs/2411.10800
作者: Tripti Shukla,Srikrishna Karanam,Balaji Vasan Srinivasan
关键词-EN: diffusion models, base diffusion model, diffusion model outputs, diffusion, model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We consider the problem of conditional text-to-image synthesis with diffusion models. Most recent works need to either finetune specific parts of the base diffusion model or introduce new trainable parameters, leading to deployment inflexibility due to the need for training. To address this gap in the current literature, we propose our method called TINTIN: Test-time Conditional Text-to-Image Synthesis using Diffusion Models which is a new training-free test-time only algorithm to condition text-to-image diffusion model outputs on conditioning factors such as color palettes and edge maps. In particular, we propose to interpret noise predictions during denoising as gradients of an energy-based model, leading to a flexible approach to manipulate the noise by matching predictions inferred from them to the ground truth conditioning input. This results in, to the best of our knowledge, the first approach to control model outputs with input color palettes, which we realize using a novel color distribution matching loss. We also show this test-time noise manipulation can be easily extensible to other types of conditioning, e.g., edge maps. We conduct extensive experiments using a variety of text prompts, color palettes, and edge maps and demonstrate significant improvement over the current state-of-the-art, both qualitatively and quantitatively.
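
The energy-gradient view of the noise prediction can be sketched in a few lines. The snippet below is a schematic illustration only: the toy per-channel color loss, the guidance scale, and the variable names are our assumptions rather than the paper's exact formulation.

```python
import torch

def color_matching_loss(x0_pred: torch.Tensor, palette_rgb: torch.Tensor) -> torch.Tensor:
    """Toy conditioning energy: match per-channel means of the predicted image to a palette color."""
    return ((x0_pred.mean(dim=(-1, -2)) - palette_rgb) ** 2).sum()

def guided_noise(eps_pred, x_t, alpha_bar_t, palette_rgb, scale: float = 1.0):
    eps_pred = eps_pred.detach()
    x_t = x_t.detach().requires_grad_(True)
    # Predicted clean image via the standard DDPM identity x0 = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar).
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
    grad = torch.autograd.grad(color_matching_loss(x0_pred, palette_rgb), x_t)[0]
    # Nudge the noise prediction along the energy gradient, classifier-guidance style.
    return eps_pred + scale * (1 - alpha_bar_t) ** 0.5 * grad
```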

[CV-93] Going Beyond Conventional OOD Detection

链接: https://arxiv.org/abs/2411.10794
作者: Sudarshan Regmi
关键词-EN: deep learning models, critical applications, deep learning, OOD, learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is critical to ensure the safe deployment of deep learning models in critical applications. Deep learning models can often misidentify OOD samples as in-distribution (ID) samples. This vulnerability worsens in the presence of spurious correlation in the training set. Likewise, in fine-grained classification settings, detection of fine-grained OOD samples becomes inherently challenging due to their high similarity to ID samples. However, current research on OOD detection has largely ignored these challenging scenarios, focusing instead on relatively easier (conventional) cases. In this work, we present a unified Approach to Spurious, fine-grained, and Conventional OOD Detection (ASCOOD). First, we propose synthesizing virtual outliers from ID data by approximating the destruction of invariant features. We identify invariant features with the pixel attribution method using the model being learned. This approach eliminates the burden of curating external OOD datasets. Then, we simultaneously incentivize ID classification and predictive uncertainty towards the virtual outliers leveraging standardized feature representation. Our approach effectively mitigates the impact of spurious correlations and encourages capturing fine-grained attributes. Extensive experiments across six datasets demonstrate the merit of ASCOOD in spurious, fine-grained, and conventional settings. The code is available at: this https URL
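
As a rough illustration of the virtual-outlier step, the sketch below uses a gradient-times-input attribution map from the model being trained to locate the most class-relevant (invariant) pixels and perturbs only those regions. The attribution choice, top-k fraction, and noise model are placeholders of ours, not the exact ASCOOD recipe.

```python
import torch

def virtual_outliers(model, images, labels, top_frac: float = 0.2, noise_std: float = 0.5):
    """Approximate 'destroying invariant features': perturb the most attributed pixels."""
    images = images.clone().requires_grad_(True)
    scores = model(images).gather(1, labels[:, None]).sum()        # ground-truth class logits
    grads = torch.autograd.grad(scores, images)[0]
    attribution = (grads * images).abs().sum(dim=1, keepdim=True)  # gradient x input, per pixel
    k = int(top_frac * attribution[0].numel())
    thresh = attribution.flatten(1).topk(k, dim=1).values[:, -1]   # per-sample k-th largest value
    mask = (attribution >= thresh.view(-1, 1, 1, 1)).float()       # 1 on the invariant regions
    return (images + noise_std * torch.randn_like(images) * mask).detach()
```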

[CV-94] Anatomy-Guided Radiology Report Generation with Pathology-Aware Regional Prompts

链接: https://arxiv.org/abs/2411.10789
作者: Yijian Gao,Dominic Marshall,Xiaodan Xing,Junzhi Ning,Giorgos Papanastasiou,Guang Yang,Matthieu Komorowski
关键词-EN: streamline medical care, Radiology reporting generative, alleviate clinical workloads, holds significant potential, medical care
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Radiology reporting generative AI holds significant potential to alleviate clinical workloads and streamline medical care. However, achieving high clinical accuracy is challenging, as radiological images often feature subtle lesions and intricate structures. Existing systems often fall short, largely due to their reliance on fixed size, patch-level image features and insufficient incorporation of pathological information. This can result in the neglect of such subtle patterns and inconsistent descriptions of crucial pathologies. To address these challenges, we propose an innovative approach that leverages pathology-aware regional prompts to explicitly integrate anatomical and pathological information of various scales, significantly enhancing the precision and clinical relevance of generated reports. We develop an anatomical region detector that extracts features from distinct anatomical areas, coupled with a novel multi-label lesion detector that identifies global pathologies. Our approach emulates the diagnostic process of radiologists, producing clinically accurate reports with comprehensive diagnostic capabilities. Experimental results show that our model outperforms previous state-of-the-art methods on most natural language generation and clinical efficacy metrics, with formal expert evaluations affirming its potential to enhance radiology practice.

[CV-95] C-DiffSET: Leveraging Latent Diffusion for SAR-to-EO Image Translation with Confidence-Guided Reliable Object Generation

链接: https://arxiv.org/abs/2411.10788
作者: Jeonghyeok Do,Jaehyup Lee,Munchurl Kim
关键词-EN: Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, patterns pose interpretation, pose interpretation challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Please visit our project page this https URL

点击查看摘要

Abstract:Synthetic Aperture Radar (SAR) imagery provides robust environmental and temporal coverage (e.g., during clouds, seasons, day-night cycles), yet its noise and unique structural patterns pose interpretation challenges, especially for non-experts. SAR-to-EO (Electro-Optical) image translation (SET) has emerged to make SAR images more perceptually interpretable. However, traditional approaches trained from scratch on limited SAR-EO datasets are prone to overfitting. To address these challenges, we introduce Confidence Diffusion for SAR-to-EO Translation, called C-DiffSET, a framework leveraging pretrained Latent Diffusion Model (LDM) extensively trained on natural images, thus enabling effective adaptation to the EO domain. Remarkably, we find that the pretrained VAE encoder aligns SAR and EO images in the same latent space, even with varying noise levels in SAR inputs. To further improve pixel-wise fidelity for SET, we propose a confidence-guided diffusion (C-Diff) loss that mitigates artifacts from temporal discrepancies, such as appearing or disappearing objects, thereby enhancing structural accuracy. C-DiffSET achieves state-of-the-art (SOTA) results on multiple datasets, significantly outperforming the very recent image-to-image translation methods and SET methods with large margins.

[CV-96] Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer

链接: https://arxiv.org/abs/2411.10781
作者: Shitong Shao,Zikai Zhou,Tian Ye,Lichen Bai,Zhiqiang Xu,Zeke Xie
关键词-EN: unprecedented pace, theoretical exploration, diffusion models, SOTA MGT Meissonic, MGT
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Text-to-image diffusion models (DMs) develop at an unprecedented pace, supported by thorough theoretical exploration and empirical analysis. Unfortunately, the discrepancy between DMs and autoregressive models (ARMs) complicates the path toward achieving the goal of unified vision and language generation. Recently, the masked generative Transformer (MGT) serves as a promising intermediary between DM and ARM by predicting randomly masked image tokens (i.e., masked image modeling), combining the efficiency of DM with the discrete token nature of ARM. However, we find that the comprehensive analyses regarding the inference for MGT are virtually non-existent, and thus we aim to present positive design choices to fill this gap. We modify and re-design a set of DM-based inference techniques for MGT and further elucidate their performance on MGT. We also discuss the approach to correcting token’s distribution to enhance inference. Extensive experiments and empirical analyses lead to concrete and effective design choices, and these design choices can be merged to achieve further performance gains. For instance, in terms of enhanced inference, we achieve winning rates of approximately 70% compared to vanilla sampling on HPS v2 with the recent SOTA MGT Meissonic. Our contributions have the potential to further enhance the capabilities and future development of MGTs.

[CV-97] TDSM: Triplet Diffusion for Skeleton-Text Matching in Zero-Shot Action Recognition

链接: https://arxiv.org/abs/2411.10745
作者: Jeonghyeok Do,Munchurl Kim
关键词-EN: diffusion-based action recognition, action recognition, firstly present, present a diffusion-based, skeleton-based action recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Please visit our project page at this https URL

点击查看摘要

Abstract:We firstly present a diffusion-based action recognition with zero-shot learning for skeleton inputs. In zero-shot skeleton-based action recognition, aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. Previous methods focus on direct alignment between skeleton and text latent spaces, but the modality gaps between these spaces hinder robust generalization learning. Motivated from the remarkable performance of text-to-image diffusion models, we leverage their alignment capabilities between different modalities mostly by focusing on the training process during reverse diffusion rather than using their generative power. Based on this, our framework is designed as a Triplet Diffusion for Skeleton-Text Matching (TDSM) method which aligns skeleton features with text prompts through reverse diffusion, embedding the prompts into the unified skeleton-text latent space to achieve robust matching. To enhance discriminative power, we introduce a novel triplet diffusion (TD) loss that encourages our TDSM to correct skeleton-text matches while pushing apart incorrect ones. Our TDSM significantly outperforms the very recent state-of-the-art methods with large margins of 2.36%-point to 13.05%-point, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
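
The triplet idea can be sketched abstractly as follows; the conditioning interface of the denoiser and the hinge form below are our assumptions, not the paper's formulation. The intuition is that the denoiser should reconstruct the noise better when conditioned on the matching text prompt than on a mismatched one.

```python
import torch
import torch.nn.functional as F

def triplet_diffusion_loss(denoiser, noisy_skel, noise, t, text_pos, text_neg, margin: float = 0.5):
    """denoiser(noisy_skeleton, timestep, text_embedding) -> predicted noise (assumed signature)."""
    err_pos = F.mse_loss(denoiser(noisy_skel, t, text_pos), noise)  # matched action prompt
    err_neg = F.mse_loss(denoiser(noisy_skel, t, text_neg), noise)  # mismatched prompt (e.g., shuffled in-batch)
    # Denoising term plus hinge: the matched error should undercut the mismatched error by `margin`.
    return err_pos + F.relu(margin + err_pos - err_neg)
```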

[CV-98] It Takes Two: Accurate Gait Recognition in the Wild via Cross-granularity Alignment ACM-MM2024

链接: https://arxiv.org/abs/2411.10742
作者: Jinkai Zheng,Xinchen Liu,Boyue Zhang,Chenggang Yan,Jiyong Zhang,Wu Liu,Yongdong Zhang
关键词-EN: Existing studies, recognition primarily utilized, primarily utilized sequences, gait recognition primarily, parsing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 9 figures; Accepted by ACM MM 2024

点击查看摘要

Abstract:Existing studies for gait recognition primarily utilized sequences of either binary silhouette or human parsing to encode the shapes and dynamics of persons during walking. Silhouettes exhibit accurate segmentation quality and robustness to environmental variations, but their low information entropy may result in sub-optimal performance. In contrast, human parsing provides fine-grained part segmentation with higher information entropy, but the segmentation quality may deteriorate due to the complex environments. To discover the advantages of silhouette and parsing and overcome their limitations, this paper proposes a novel cross-granularity alignment gait recognition method, named XGait, to unleash the power of gait representations of different granularity. To achieve this goal, the XGait first contains two branches of backbone encoders to map the silhouette sequences and the parsing sequences into two latent spaces, respectively. Moreover, to explore the complementary knowledge across the features of two representations, we design the Global Cross-granularity Module (GCM) and the Part Cross-granularity Module (PCM) after the two encoders. In particular, the GCM aims to enhance the quality of parsing features by leveraging global features from silhouettes, while the PCM aligns the dynamics of human parts between silhouette and parsing features using the high information entropy in parsing sequences. In addition, to effectively guide the alignment of two representations with different granularity at the part level, an elaborate-designed learnable division mechanism is proposed for the parsing features. Comprehensive experiments on two large-scale gait datasets not only show the superior performance of XGait with the Rank-1 accuracy of 80.5% on Gait3D and 88.3% CCPG but also reflect the robustness of the learned features even under challenging conditions like occlusions and cloth changes.

[CV-99] A Wearable Gait Monitoring System for 17 Gait Parameters Based on Computer Vision

链接: https://arxiv.org/abs/2411.10739
作者: Jiangang Chen,Yung-Hong Sun,Kristen Pickett,Barbara King,Yu Hen Hu,Hongrui Jiang
关键词-EN: including gait length, Force Sensitive Resistor, gait parameters, shoe-mounted gait monitoring, gait monitoring system
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
*备注: 13 pages, 14 figures. This paper was submitted for publication to the IEEE Transactions on Instrumentation and Measurement

点击查看摘要

Abstract:We developed a shoe-mounted gait monitoring system capable of tracking up to 17 gait parameters, including gait length, step time, stride velocity, and others. The system employs a stereo camera mounted on one shoe to track a marker placed on the opposite shoe, enabling the estimation of spatial gait parameters. Additionally, a Force Sensitive Resistor (FSR) affixed to the heel of the shoe, combined with a custom-designed algorithm, is utilized to measure temporal gait parameters. Through testing on multiple participants and comparison with the gait mat, the proposed gait monitoring system exhibited notable performance, with the accuracy of all measured gait parameters exceeding 93.61%. The system also demonstrated a low drift of 4.89% during long-distance walking. A gait identification task conducted on participants using a trained Transformer model achieved 95.7% accuracy on the dataset collected by the proposed system, demonstrating that our hardware has the potential to collect long-sequence gait data suitable for integration with current Large Language Models (LLMs). The system is cost-effective, user-friendly, and well-suited for real-life measurements.
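
For readers unfamiliar with these quantities, the arithmetic for a few of the spatio-temporal parameters is straightforward once per-foot heel-strike times (e.g., from the FSR) and heel positions (e.g., from the stereo camera) are available. The numbers below are made up for illustration and are not from the paper.

```python
import numpy as np

heel_strike_times = np.array([0.00, 1.05, 2.12, 3.20])    # s, consecutive contacts of the same foot
heel_positions = np.array([0.00, 1.30, 2.62, 3.96])        # m, along the walking direction

stride_times = np.diff(heel_strike_times)                  # temporal parameter (s)
stride_lengths = np.diff(heel_positions)                   # spatial parameter (m)
stride_velocity = stride_lengths / stride_times            # m/s

print(stride_times, stride_lengths, stride_velocity)
```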

[CV-100] EVT: Efficient View Transformation for Multi-Modal 3D Object Detection

链接: https://arxiv.org/abs/2411.10715
作者: Yongjin Lee,Hyeon-Mun Jeong,Yurim Jeon,Sanghyun Kim
关键词-EN: Multi-modal sensor fusion, BEV, sensor fusion, leading approach, EVT
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modal sensor fusion in bird’s-eye-view (BEV) representation has become the leading approach in 3D object detection. However, existing methods often rely on depth estimators or transformer encoders for view transformation, incurring substantial computational overhead. Furthermore, the lack of precise geometric correspondence between 2D and 3D spaces leads to spatial and ray-directional misalignments, restricting the effectiveness of BEV representations. To address these challenges, we propose a novel 3D object detector via efficient view transformation (EVT), which leverages a well-structured BEV representation to enhance accuracy and efficiency. EVT focuses on two main areas. First, it employs Adaptive Sampling and Adaptive Projection (ASAP), using LiDAR guidance to generate 3D sampling points and adaptive kernels. The generated points and kernels are then used to facilitate the transformation of image features into BEV space and refine the BEV features. Second, EVT includes an improved transformer-based detection framework, which contains a group-wise query initialization method and an enhanced query update framework. It is designed to effectively utilize the obtained multi-modal BEV features within the transformer decoder. By leveraging the geometric properties of object queries, this framework significantly enhances detection performance, especially in a multi-layer transformer decoder structure. EVT achieves state-of-the-art performance on the nuScenes test set with real-time inference speed.

[CV-101] Diagnostic Text-guided Representation Learning in Hierarchical Classification for Pathological Whole Slide Image

链接: https://arxiv.org/abs/2411.10709
作者: Jiawen Li,Qiehe Sun,Renao Yan,Yizhi Wang,Yuqiu Fu,Yani Wei,Tian Guan,Huijuan Shi,Yonghong He,Anjia Han
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 13 figures. Under Review

点击查看摘要

[CV-102] AllRestorer: All-in-One Transformer for Image Restoration under Composite Degradations

链接: https://arxiv.org/abs/2411.10708
作者: Jiawei Mao,Yu Yang,Xuesong Yin,Ling Shao,Hao Tang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 11 figures

点击查看摘要

[CV-103] Poster: Reliable 3D Reconstruction for Ad-hoc Edge Implementations

链接: https://arxiv.org/abs/2411.10705
作者: Md Nurul Absur,Swastik Brahma,Saptarshi Debroy
关键词-EN: Ad-hoc edge deployments, support real-time complex, real-time complex video, complex video processing, video processing applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 3 Pages, 2 figures, IEEE SEC 2024

点击查看摘要

Abstract:Ad-hoc edge deployments to support real-time complex video processing applications such as, multi-view 3D reconstruction often suffer from spatio-temporal system disruptions that greatly impact reconstruction quality. In this poster paper, we present a novel portfolio theory-inspired edge resource management strategy to ensure reliable multi-view 3D reconstruction by accounting for possible system disruptions.

[CV-104] Diffusion-based Layer-wise Semantic Reconstruction for Unsupervised Out-of-Distribution Detection

链接: https://arxiv.org/abs/2411.10701
作者: Ying Yang,De Cheng,Chaowei Fang,Yubiao Wang,Changzhe Jiao,Lechao Cheng,Nannan Wang
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 26 pages, 23 figures, published to Neurlps2024

点击查看摘要

[CV-105] Multi-perspective Contrastive Logit Distillation

链接: https://arxiv.org/abs/2411.10693
作者: Qi Wang,Jinjia Zhou
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 6 figures, 11 tabels, 9 formulas, including pseudo-code

点击查看摘要

[CV-106] MaskMedPaint: Masked Medical Image Inpainting with Diffusion Models for Mitigation of Spurious Correlations ML4H ALT

链接: https://arxiv.org/abs/2411.10686
作者: Qixuan Jin,Walter Gerych,Marzyeh Ghassemi
关键词-EN: lead image classifiers, class labels, labels can lead, classifiers to rely, rely on shortcuts
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 12 pages

点击查看摘要

Abstract:Spurious features associated with class labels can lead image classifiers to rely on shortcuts that don’t generalize well to new domains. This is especially problematic in medical settings, where biased models fail when applied to different hospitals or systems. In such cases, data-driven methods to reduce spurious correlations are preferred, as clinicians can directly validate the modified images. While Denoising Diffusion Probabilistic Models (Diffusion Models) show promise for natural images, they are impractical for medical use due to the difficulty of describing spurious medical features. To address this, we propose Masked Medical Image Inpainting (MaskMedPaint), which uses text-to-image diffusion models to augment training images by inpainting areas outside key classification regions to match the target domain. We demonstrate that MaskMedPaint enhances generalization to target domains across both natural (Waterbirds, iWildCam) and medical (ISIC 2018, Chest X-ray) datasets, given limited unlabeled target images.
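
A rough sketch of this augmentation recipe using an off-the-shelf inpainting pipeline is shown below. The checkpoint name, file paths, prompt, and mask convention are placeholders of ours; the paper's actual models and prompts may differ.

```python
import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"   # assumed generic inpainting checkpoint
)

image = Image.open("chest_xray.png").convert("RGB").resize((512, 512))
lesion_mask = np.load("lesion_mask.npy")     # 1 inside the classification-relevant region (placeholder)
# White pixels are repainted, so inpaint everything *outside* the key region and leave the lesion untouched.
outside_mask = Image.fromarray(((1 - lesion_mask) * 255).astype(np.uint8)).resize((512, 512))

augmented = pipe(
    prompt="a chest X-ray acquired at the target hospital",  # target-domain description (illustrative)
    image=image,
    mask_image=outside_mask,
).images[0]
augmented.save("chest_xray_target_domain.png")
```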

[CV-107] From Prototypes to General Distributions: An Efficient Curriculum for Masked Image Modeling

链接: https://arxiv.org/abs/2411.10685
作者: Jinhong Lin,Cheng-En Wu,Huanran Li,Jifan Zhang,Yu Hen Hu,Pedro Morgado
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-108] Underwater Image Enhancement with Cascaded Contrastive Learning

链接: https://arxiv.org/abs/2411.10682
作者: Yi Liu,Qiuping Jiang,Xinyi Wang,Ting Luo,Jingchun Zhou
关键词-EN: highly challenging task, challenging task due, Underwater image enhancement, UIE methods, color correction stage
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IEEE Transacitons on MultiMedia

点击查看摘要

Abstract:Underwater image enhancement (UIE) is a highly challenging task due to the complexity of underwater environment and the diversity of underwater image degradation. Due to the application of deep learning, current UIE methods have made significant progress. Most of the existing deep learning-based UIE methods follow a single-stage network which cannot effectively address the diverse degradations simultaneously. In this paper, we propose to address this issue by designing a two-stage deep learning framework and taking advantage of cascaded contrastive learning to guide the network training of each stage. The proposed method is called CCL-Net in short. Specifically, the proposed CCL-Net involves two cascaded stages, i.e., a color correction stage tailored to the color deviation issue and a haze removal stage tailored to improve the visibility and contrast of underwater images. To guarantee the underwater image can be progressively enhanced, we also apply contrastive loss as an additional constraint to guide the training of each stage. In the first stage, the raw underwater images are used as negative samples for building the first contrastive loss, ensuring the enhanced results of the first color correction stage are better than the original inputs. While in the second stage, the enhanced results rather than the raw underwater images of the first color correction stage are used as the negative samples for building the second contrastive loss, thus ensuring the final enhanced results of the second haze removal stage are better than the intermediate color corrected results. Extensive experiments on multiple benchmark datasets demonstrate that our CCL-Net can achieve superior performance compared to many state-of-the-art methods. The source code of CCL-Net will be released at this https URL.
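
A minimal sketch of the cascaded contrastive constraint is given below, in our own simplified form: the stage-1 output should sit closer to the clean reference than the raw input (its negative), and the stage-2 output closer than the stage-1 output. The paper likely computes this loss on deep features rather than raw pixels, so the pixel-space L1 form here is only for brevity.

```python
import torch
import torch.nn.functional as F

def contrastive_term(anchor, positive, negative, eps: float = 1e-8):
    """Distance ratio: small when the anchor is nearer the positive than the negative."""
    return F.l1_loss(anchor, positive) / (F.l1_loss(anchor, negative) + eps)

raw = torch.rand(1, 3, 256, 256)          # degraded underwater input
stage1_out = torch.rand(1, 3, 256, 256)   # color-corrected intermediate result
stage2_out = torch.rand(1, 3, 256, 256)   # final dehazed result
reference = torch.rand(1, 3, 256, 256)    # clean ground truth

loss_stage1 = contrastive_term(stage1_out, reference, negative=raw)
loss_stage2 = contrastive_term(stage2_out, reference, negative=stage1_out)
total_contrastive = loss_stage1 + loss_stage2
```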

[CV-109] SPDFusion: An Infrared and Visible Image Fusion Network Based on a Non-Euclidean Representation of Riemannian Manifolds

链接: https://arxiv.org/abs/2411.10679
作者: Huan Kang,Hui Li,Tianyang Xu,Rui Wang,Xiao-Jun Wu,Josef Kittler
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 12 figures

点击查看摘要

[CV-110] Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

链接: https://arxiv.org/abs/2411.10669
作者: Jinqiang Long,Yanqi Dai,Guoxing Yang,Hongpeng Lin,Nanyi Fei,Yizhao Gao,Zhiwu Lu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-111] Segmentation of Ink and Parchment in Dead Sea Scroll Fragments ICDAR

链接: https://arxiv.org/abs/2411.10668
作者: Berat Kurar-Barakat,Nachum Dershowitz
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
*备注: 17 pages, ICDAR-IJDAR-2025

点击查看摘要

[CV-112] Enhancing PTSD Outcome Prediction with Ensemble Models in Disaster Contexts

链接: https://arxiv.org/abs/2411.10661
作者: Ayesha Siddiqua,Atib Mohammad Oni,Abu Saleh Musa Miah,Jungpil Shin
关键词-EN: Post-traumatic stress disorder, Post-traumatic stress, affects individuals exposed, stress disorder, traumatic events
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Post-traumatic stress disorder (PTSD) is a significant mental health challenge that affects individuals exposed to traumatic events. Early detection and effective intervention for PTSD are crucial, as it can lead to long-term psychological distress if untreated. Accurate detection of PTSD is essential for timely and targeted mental health interventions, especially in disaster-affected populations. Existing research has explored machine learning approaches for classifying PTSD, but many face limitations in terms of model performance and generalizability. To address these issues, we implemented a comprehensive preprocessing pipeline. This included data cleaning, missing value treatment using the SimpleImputer, label encoding of categorical variables, data augmentation using SMOTE to balance the dataset, and feature scaling with StandardScaler. The dataset was split into 80% training and 20% testing. We developed an ensemble model using a majority voting technique among several classifiers, including Logistic Regression, Support Vector Machines (SVM), Random Forest, XGBoost, LightGBM, and a customized Artificial Neural Network (ANN). The ensemble model achieved an accuracy of 96.76% with a benchmark dataset, significantly outperforming individual models. The proposed method’s advantages include improved robustness through the combination of multiple models, enhanced ability to generalize across diverse data points, and increased accuracy in detecting PTSD. Additionally, the use of SMOTE for data augmentation ensured better handling of imbalanced datasets, leading to more reliable predictions. The proposed approach offers valuable insights for policymakers and healthcare providers by leveraging predictive analytics to address mental health issues in vulnerable populations, particularly those affected by disasters.
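
The preprocessing and majority-voting recipe listed above maps almost one-to-one onto standard scikit-learn / imbalanced-learn components. The sketch below is a plausible reconstruction rather than the authors' code: column names, the dataset path, hyperparameters, and the MLPClassifier stand-in for the customized ANN are all assumptions.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier

df = pd.read_csv("ptsd_survey.csv")                        # placeholder path
X, y = df.drop(columns=["ptsd_label"]), df["ptsd_label"]   # placeholder label column

# Cleaning: label-encode categoricals, impute missing values, balance with SMOTE, scale.
for col in X.select_dtypes(include="object"):
    X[col] = LabelEncoder().fit_transform(X[col].astype(str))
y = LabelEncoder().fit_transform(y)
X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)
X, y = SMOTE(random_state=0).fit_resample(X, y)
X = StandardScaler().fit_transform(X)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)  # 80/20 split

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("rf", RandomForestClassifier()),
        ("xgb", XGBClassifier()),
        ("lgbm", LGBMClassifier()),
        ("ann", MLPClassifier(max_iter=500)),              # stand-in for the customized ANN
    ],
    voting="hard",                                         # majority voting
)
ensemble.fit(X_tr, y_tr)
print("test accuracy:", ensemble.score(X_te, y_te))
```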

[CV-113] Deep Loss Convexification for Learning Iterative Models

链接: https://arxiv.org/abs/2411.10649
作者: Ziming Zhang,Yuping Shao,Yiqing Zhang,Fangzhou Lin,Haichong Zhang,Elke Rundensteiner
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 10 figures, accepted paper to Transactions on Pattern Analysis and Machine Intelligence. arXiv admin note: text overlap with arXiv:2303.11526

点击查看摘要

[CV-114] Voxel-Aggregated Feature Synthesis: Efficient Dense Mapping for Simulated 3D Reasoning CVPR2025

链接: https://arxiv.org/abs/2411.10616
作者: Owen Burns,Rizwan Qureshi
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 2 figures, CVPR 2025

点击查看摘要

[CV-115] FedAli: Personalized Federated Learning with Aligned Prototypes through Optimal Transport

链接: https://arxiv.org/abs/2411.10595
作者: Sannara Ek,Kaile Wang,François Portet,Philippe Lalanda,Jiannong Cao
关键词-EN:
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: Pre-print version 1

点击查看摘要

[CV-116] Creation and Evaluation of a Food Product Image Dataset for Product Property Extraction

链接: https://arxiv.org/abs/2411.10591
作者: Christoph Brosch,Alexander Bouwens,Sebastian Bast,Swen Haab,Rolf Krieger
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at 12th International Conference on Data Science, Technology and Applications (DATA 2023)

点击查看摘要

[CV-117] Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera

链接: https://arxiv.org/abs/2411.10582
作者: Jaewoo Heo,Kuan-Chieh Wang,Karen Liu,Serena Yeung-Levy
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 2 figures, submitted to TMLR

点击查看摘要

[CV-118] The Oxford Spires Dataset: Benchmarking Large-Scale LiDAR-Visual Localisation Reconstruction and Radiance Field Methods

链接: https://arxiv.org/abs/2411.10546
作者: Yifu Tao,Miguel Ángel Muñoz-Bañón,Lintong Zhang,Jiahao Wang,Lanke Frank Tarimo Fu,Maurice Fallon
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Website: this https URL

点击查看摘要

[CV-119] Advancing Autonomous Driving Perception: Analysis of Sensor Fusion and Computer Vision Techniques

链接: https://arxiv.org/abs/2411.10535
作者: Urvishkumar Bharti,Vikram Shahapur
关键词-EN:
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages

点击查看摘要

[CV-120] Any2Any: Incomplete Multimodal Retrieval with Conformal Prediction

链接: https://arxiv.org/abs/2411.10513
作者: Po-han Li,Yunhao Yang,Mohammad Omama,Sandeep Chinchali,Ufuk Topcu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注:

点击查看摘要

[CV-121] ESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding

链接: https://arxiv.org/abs/2411.10509
作者: Quang P. M. Pham,Khoi T. N. Nguyen,Lan C. Ngo,Dezhen Song,Truong Do,Truong Son Hy
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2407.00609

点击查看摘要

[CV-122] DR-BFR: Degradation Representation with Diffusion Models for Blind Face Restoration

链接: https://arxiv.org/abs/2411.10508
作者: Xinmin Qiu,Bonan Li,Zicheng Zhang,Congying Han,Tiande Guo
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-123] RedTest: Towards Measuring Redundancy in Deep Neural Networks Effectively

链接: https://arxiv.org/abs/2411.10507
作者: Yao Lu,Peixin Zhang,Jingyi Wang,Lei Ma,Xiaoniu Yang,Qi Xuan
关键词-EN:
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-124] OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models

链接: https://arxiv.org/abs/2411.10501
作者: Mathis Koroglu,Hugo Caselles-Dupré,Guillaume Jeanneret Sanmiguel,Matthieu Cord
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

[CV-125] FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on

链接: https://arxiv.org/abs/2411.10499
作者: Boyuan Jiang,Xiaobin Hu,Donghao Luo,Qingdong He,Chengming Xu,Jinlong Peng,Jiangning Zhang,Chengjie Wang,Yunsheng Wu,Yanwei Fu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project link: this https URL

点击查看摘要

[CV-126] Prompt-Guided Environmentally Consistent Adversarial Patch

链接: https://arxiv.org/abs/2411.10498
作者: Chaoqun Li,Huanqian Yan,Lifeng Zhou,Tairan Chen,Zhuodong Liu,Hang Su
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-127] Structure Tensor Representation for Robust Oriented Object Detection

链接: https://arxiv.org/abs/2411.10497
作者: Xavier Bou,Gabriele Facciolo,Rafael Grompone von Gioi,Jean-Michel Morel,Thibaud Ehret
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-128] Boundary Attention Constrained Zero-Shot Layout-To-Image Generation

链接: https://arxiv.org/abs/2411.10495
作者: Huancheng Chen,Jingtao Li,Weiming Zhuang,Haris Vikalo,Lingjuan Lyu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-129] MFP3D: Monocular Food Portion Estimation Leveraging 3D Point Clouds ICPR2024

链接: https://arxiv.org/abs/2411.10492
作者: Jinge Ma,Xiaoyan Zhang,Gautham Vinod,Siddeshwar Raghavan,Jiangpeng He,Fengqing Zhu
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 9th International Workshop on Multimedia Assisted Dietary Management, in conjunction with the 27th International Conference on Pattern Recognition (ICPR2024)

点击查看摘要

[CV-130] One Prompt to Verify Your Models: Black-Box Text-to-Image Models Verification via Non-Transferable Adversarial Attacks

链接: https://arxiv.org/abs/2410.22725
作者: Ji Guo,Wenbo Jiang,Rui Zhang,Guoming Lu,Hongwei Li
关键词-EN:
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

[CV-131] Equivariant spatio-hemispherical networks for diffusion MRI deconvolution FAST NEURIPS2024

链接: https://arxiv.org/abs/2411.11819
作者: Axel Elaldi,Guido Gerig,Neel Dey
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024. 24 pages with 13 figures. Code available at this https URL

点击查看摘要

[CV-132] Lung Disease Detection with Vision Transformers: A Comparative Study of Machine Learning Methods

链接: https://arxiv.org/abs/2411.11376
作者: Baljinnyam Dayan
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-133] DeepSPV: An Interpretable Deep Learning Pipeline for 3D Spleen Volume Estimation from 2D Ultrasound Images

链接: https://arxiv.org/abs/2411.11190
作者: Zhen Yuan,David Stojanovski,Lei Li,Alberto Gomez,Haran Jogeesvaran,Esther Puyol-Antón,Baba Inusa,Andrew P. King
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2308.08038

点击查看摘要

[CV-134] Freqformer: Frequency-Domain Transformer for 3-D Visualization and Quantification of Human Retinal Circulation

链接: https://arxiv.org/abs/2411.11189
作者: Lingyun Wang,Bingjie Wang,Jay Chhablani,Jose Alain Sahel,Shaohua Pi
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-135] DBF-Net: A Dual-Branch Network with Feature Fusion for Ultrasound Image Segmentation

链接: https://arxiv.org/abs/2411.11116
作者: Guoping Xu,Ximing Wu,Wentao Liao,Xinglong Wu,Qing Huang,Chang Li
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-136] Retinal Vessel Segmentation via Neuron Programming

链接: https://arxiv.org/abs/2411.11110
作者: Tingting Wu,Ruyi Min,Peixuan Song,Hengtao Guo,Tieyong Zeng,Feng-Lei Fan
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-137] Neighboring Slice Noise2Noise: Self-Supervised Medical Image Denoising from Single Noisy Image Volume

链接: https://arxiv.org/abs/2411.10831
作者: Langrui Zhou,Ziteng Zhou,Xinyu Huang,Xiangyu Zhang,Huiru Wang,Guang Li
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-138] Unveiling Hidden Details: A RAW Data-Enhanced Paradigm for Real-World Super-Resolution

链接: https://arxiv.org/abs/2411.10798
作者: Long Peng,Wenbo Li,Jiaming Guo,Xin Di,Haoze Sun,Yong Li,Renjing Pei,Yang Wang,Yang Cao,Zheng-Jun Zha
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-139] Beyond Feature Mapping GAP: Integrating Real HDRTV Priors for Superior SDRTV-to-HDRTV Conversion

链接: https://arxiv.org/abs/2411.10775
作者: Kepeng Xu,Li Xu,Gang He,Zhiqiang Zhang,Wenxin Yu,Shihao Wang,Dajiang Zhou,Yunsong Li
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 8 pages,4 figures

点击查看摘要

[CV-140] An End-to-End Real-World Camera Imaging Pipeline

链接: https://arxiv.org/abs/2411.10773
作者: Kepeng Xu,Zijia Ma,Li Xu,Gang He,Yunsong Li,Wenxin Yu,Taichu Han,Cheng Yang
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: accept by ACMMM 2024

点击查看摘要

[CV-141] Diffusion-Based Semantic Segmentation of Lumbar Spine MRI Scans of Lower Back Pain Patients ML4H ALT

链接: https://arxiv.org/abs/2411.10755
作者: Maria Monzon,Thomas Iff,Ender Konukoglu,Catherine R. Jutzeler
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 5 pages

点击查看摘要

[CV-142] Towards a Comprehensive Benchmark for Pathological Lymph Node Metastasis in Breast Cancer Sections

链接: https://arxiv.org/abs/2411.10752
作者: Xitong Ling,Yuanyuan Lei,Jiawen Li,Junru Cheng,Wenting Huang,Tian Guan,Jian Guan,Yonghong He
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

[CV-143] HIST-AID: Leveraging Historical Patient Reports for Enhanced Multi-Modal Automatic Diagnosis ALT

链接: https://arxiv.org/abs/2411.10684
作者: Haoxu Huang,Cem M. Deniz,Kyunghyun Cho,Sumit Chopra,Divyam Madaan
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: In Proceedings of Machine Learning for Health

点击查看摘要

[CV-144] Normative Modeling for AD Diagnosis and Biomarker Identification

链接: https://arxiv.org/abs/2411.10570
作者: Songlin Zhao,Rong Zhou,Yu Zhang,Yong Chen,Lifang He
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures

点击查看摘要

[CV-145] Efficient Denoising Method to Improve The Resolution of Satellite Images

链接: https://arxiv.org/abs/2411.10476
作者: Jhanavi Hegde
关键词-EN:
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

[CV-146] Relationships between the degrees of freedom in the affine Gaussian derivative model for visual receptive fields and 2-D affine image transformations with application to covariance properties of simple cells in the primary visual cortex

链接: https://arxiv.org/abs/2411.05673
作者: Tony Lindeberg
关键词-EN:
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
*备注: 21 pages, 9 figures

点击查看摘要

机器学习

[LG-0] Competing Bandits in Decentralized Large Contextual Matching Markets

链接: https://arxiv.org/abs/2411.11794
作者: Satush Parikh,Soumya Basu,Avishek Ghosh,Abishek Sankararaman
关键词-EN: multi-agent resource constrained, received significant interest, resource constrained matching, constrained matching market, Sequential learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sequential learning in a multi-agent resource constrained matching market has received significant interest in the past few years. We study decentralized learning in two-sided matching markets where the demand side (aka players or agents) competes for a `large' supply side (aka arms) with potentially time-varying preferences, to obtain a stable match. Despite a long line of work in the recent past, existing learning algorithms such as Explore-Then-Commit or Upper-Confidence-Bound remain inefficient for this problem. In particular, the per-agent regret achieved by these algorithms scales linearly with the number of arms, K. Motivated by the linear contextual bandit framework, we assume that for each agent an arm-mean can be represented by a linear function of a known feature vector and an unknown (agent-specific) parameter. Moreover, our setup captures the essence of a dynamic (non-stationary) matching market where the preferences over arms change over time. Our proposed algorithms achieve instance-dependent logarithmic regret, scaling independently of the number of arms, K.
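
For concreteness, the linear arm-mean assumption stated above can be written as follows (the notation is ours, not necessarily the paper's):

```latex
% Assumed linear reward model for agent i pulling arm a at round t:
%   \phi(a)    known feature vector of arm a
%   \theta_i   unknown agent-specific parameter
%   \eta_{i,t} zero-mean noise
\mu_i(a) = \langle \theta_i, \phi(a) \rangle, \qquad
r_{i,t}(a) = \mu_i(a) + \eta_{i,t}.
```

Under such a model, estimation error depends on the feature dimension rather than on the number of arms, which is the usual intuition for why regret can scale independently of K.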

[LG-1] A Potential Game Perspective in Federated Learning

链接: https://arxiv.org/abs/2411.11793
作者: Kang Liu,Ziqi Wang,Enrique Zuazua
关键词-EN: machine learning models, Federated learning, training machine learning, machine learning, learning models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is an emerging paradigm for training machine learning models across distributed clients. Traditionally, in FL settings, a central server assigns training efforts (or strategies) to clients. However, from a market-oriented perspective, clients may independently choose their training efforts based on rational self-interest. To explore this, we propose a potential game framework where each client’s payoff is determined by their individual efforts and the rewards provided by the server. The rewards are influenced by the collective efforts of all clients and can be modulated through a reward factor. Our study begins by establishing the existence of Nash equilibria (NEs), followed by an investigation of uniqueness in homogeneous settings. We demonstrate a significant improvement in clients’ training efforts at a critical reward factor, identifying it as the optimal choice for the server. Furthermore, we prove the convergence of the best-response algorithm to compute NEs for our FL game. Finally, we apply the training efforts derived from specific NEs to a real-world FL scenario, validating the effectiveness of the identified optimal reward factor.

[LG-2] LLM-IE: A Python Package for Generative Information Extraction with Large Language Models

链接: https://arxiv.org/abs/2411.11779
作者: Enshuo Hsu,Kirk Roberts
关键词-EN: information extraction, Python package
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objectives: Despite the recent adoption of large language models (LLMs) for biomedical information extraction, challenges in prompt engineering and algorithms persist, with no dedicated software available. To address this, we developed LLM-IE: a Python package for building complete information extraction pipelines. Our key innovation is an interactive LLM agent to support schema definition and prompt design. Materials and Methods: The LLM-IE supports named entity recognition, entity attribute extraction, and relation extraction tasks. We benchmarked on the i2b2 datasets and conducted a system evaluation. Results: The sentence-based prompting algorithm resulted in the best performance while requiring a longer inference time. System evaluation provided intuitive visualization. Discussion: LLM-IE was designed from practical NLP experience in healthcare and has been adopted in internal projects. It should hold great value to the biomedical NLP community. Conclusion: We developed a Python package, LLM-IE, that provides building blocks for robust information extraction pipeline construction.

[LG-3] Freezing of Gait Detection Using Gramian Angular Fields and Federated Learning from Wearable Sensors

链接: https://arxiv.org/abs/2411.11764
作者: Shovito Barua Soumma,S M Raihanul Alam,Rudmila Rahman,Umme Niraj Mahi,Sayyed Mostafa Mostafavi,Hassan Ghasemzadeh
关键词-EN: Parkinson disease, mobility and safety, impairs mobility, Gramian Angular Field, Freezing of gait
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Freezing of gait (FOG) is a debilitating symptom of Parkinson’s disease (PD) that impairs mobility and safety. Traditional detection methods face challenges due to intra and inter-patient variability, and most systems are tested in controlled settings, limiting their real-world applicability. Addressing these gaps, we present FOGSense, a novel FOG detection system designed for uncontrolled, free-living conditions. It uses Gramian Angular Field (GAF) transformations and federated deep learning to capture temporal and spatial gait patterns missed by traditional methods. We evaluated our FOGSense system using a public PD dataset, ‘tdcsfog’. FOGSense improves accuracy by 10.4% over a single-axis accelerometer, reduces failure points compared to multi-sensor systems, and demonstrates robustness to missing values. The federated architecture allows personalized model adaptation and efficient smartphone synchronization during off-peak hours, making it effective for long-term monitoring as symptoms evolve. Overall, FOGSense achieves a 22.2% improvement in F1-score compared to state-of-the-art methods, along with enhanced sensitivity for FOG episode detection. Code is available: this https URL.
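
The Gramian Angular Field step can be reproduced with a few lines of NumPy. The sketch below implements the standard summation-field (GASF) definition on a stand-in accelerometer window; windowing, sensor axes, and the downstream federated model are omitted.

```python
import numpy as np

def gramian_angular_field(x: np.ndarray) -> np.ndarray:
    """Map a 1-D signal to a GASF image: G[i, j] = cos(phi_i + phi_j)."""
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1   # rescale to [-1, 1]
    phi = np.arccos(np.clip(x_scaled, -1.0, 1.0))            # polar-coordinate angles
    return np.cos(phi[:, None] + phi[None, :])

window = np.sin(np.linspace(0, 8 * np.pi, 128))              # stand-in accelerometer window
gaf_image = gramian_angular_field(window)
print(gaf_image.shape)                                       # (128, 128)
```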

[LG-4] Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

链接: https://arxiv.org/abs/2411.11761
作者: Yannick Metz,David Lindner,Raphaël Baur,Mennatallah El-Assady
关键词-EN: Human feedback, train agentic machine, feedback, Human, Learning
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human feedback (RLHF) has become a powerful tool to fine-tune or train agentic machine learning models. Similar to how humans interact in social contexts, we can use many types of feedback to communicate our preferences, intentions, and knowledge to an RL agent. However, applications of human feedback in RL are often limited in scope and disregard human factors. In this work, we bridge the gap between machine learning and human-computer interaction efforts by developing a shared understanding of human feedback in interactive learning scenarios. We first introduce a taxonomy of feedback types for reward-based learning from human feedback based on nine key dimensions. Our taxonomy allows for unifying human-centered, interface-centered, and model-centered aspects. In addition, we identify seven quality metrics of human feedback influencing both the human ability to express feedback and the agent’s ability to learn from the feedback. Based on the feedback taxonomy and quality criteria, we derive requirements and design choices for systems learning from human feedback. We relate these requirements and design choices to existing work in interactive machine learning. In the process, we identify gaps in existing work and future research opportunities. We call for interdisciplinary collaboration to harness the full potential of reinforcement learning with data-driven co-adaptive modeling and varied interaction mechanics.

[LG-5] BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration HPCA2025

链接: https://arxiv.org/abs/2411.11745
作者: Yuzong Chen,Ahmed F. AbouElhamayed,Xilai Dai,Yang Wang,Marta Andronic,George A. Constantinides,Mohamed S. Abdelfattah
关键词-EN: Large language models, Large language, quantize LLM weights, machine learning tasks, LLM
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: HPCA 2025

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks. Yet the substantial memory footprint of LLMs significantly hinders their deployment. In this paper, we improve the accessibility of LLMs through BitMoD, an algorithm-hardware co-design solution that enables efficient LLM acceleration at low weight precision. On the algorithm side, BitMoD introduces fine-grained data type adaptation that uses a different numerical data type to quantize a group of (e.g., 128) weights. Through the careful design of these new data types, BitMoD is able to quantize LLM weights to very low precision (e.g., 4 bits and 3 bits) while maintaining high accuracy. On the hardware side, BitMoD employs a bit-serial processing element to easily support multiple numerical precisions and data types; our hardware design includes two key innovations: First, it employs a unified representation to process different weight data types, thus reducing the hardware cost. Second, it adopts a bit-serial dequantization unit to rescale the per-group partial sum with minimal hardware overhead. Our evaluation on six representative LLMs demonstrates that BitMoD significantly outperforms state-of-the-art LLM quantization and acceleration methods. For discriminative tasks, BitMoD can quantize LLM weights to 4-bit with <0.5% accuracy loss on average. For generative tasks, BitMoD is able to quantize LLM weights to 3-bit while achieving better perplexity than prior LLM quantization scheme. Combining the superior model performance with an efficient accelerator design, BitMoD achieves an average of 1.69\times and 1.48\times speedups compared to prior LLM accelerators ANT and OliVe, respectively.
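
The fine-grained, per-group data-type idea can be mimicked in software. The sketch below quantizes weights in groups of 128 and, for each group, keeps whichever of two toy 3-bit grids gives the smaller reconstruction error; the grids, grouping, and selection rule are illustrative stand-ins, not BitMoD's actual data types or hardware mapping.

```python
import torch

def quantize_group(w: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Round each weight to the nearest level of `grid`, after per-group absmax scaling."""
    scale = w.abs().max().clamp(min=1e-8) / grid.abs().max()
    idx = (w[:, None] / scale - grid[None, :]).abs().argmin(dim=1)
    return grid[idx] * scale

def mixed_datatype_quant(weights: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    grids = [
        torch.tensor([-4., -3., -2., -1., 0., 1., 2., 3.]),   # uniform INT3-like grid
        torch.tensor([-6., -4., -2., -1., 0., 1., 2., 4.]),   # toy non-uniform 3-bit grid
    ]
    flat = weights.flatten()
    out = torch.empty_like(flat)
    for start in range(0, flat.numel(), group_size):
        w = flat[start:start + group_size]
        candidates = [quantize_group(w, g) for g in grids]
        errors = torch.stack([torch.norm(w - c) for c in candidates])
        out[start:start + group_size] = candidates[int(errors.argmin())]  # best data type per group
    return out.reshape(weights.shape)

w = torch.randn(4096)
print((w - mixed_datatype_quant(w)).abs().mean())   # average quantization error
```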

[LG-6] FLMarket: Enabling Privacy-preserved Pre-training Data Pricing for Federated Learning

链接: https://arxiv.org/abs/2411.11713
作者: Zhenyu Wen,Wanglei Feng,Di Wu,Haozhen Hu,Chang Xu,Bin Qian,Zhen Hong,Cong Wang,Shouling Ji
关键词-EN: offers promising solutions, mainstream privacy-preserving machine, machine learning paradigm, privacy-preserving machine learning, Federated Learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL), as a mainstream privacy-preserving machine learning paradigm, offers promising solutions for privacy-critical domains such as healthcare and finance. Although extensive efforts have been dedicated from both academia and industry to improve the vanilla FL, little work focuses on the data pricing mechanism. In contrast to the straightforward in/post-training pricing techniques, we study a more difficult problem of pre-training pricing without direct information from the learning process. We propose FLMarket that integrates a two-stage, auction-based pricing mechanism with a security protocol to address the utility-privacy conflict. Through comprehensive experiments, we show that the client selection according to FLMarket can achieve more than 10% higher accuracy in subsequent FL training compared to state-of-the-art methods. In addition, it outperforms the in-training baseline with more than 2% accuracy increase and 3x run-time speedup.

[LG-7] Robust Reinforcement Learning under Diffusion Models for Data with Jumps

链接: https://arxiv.org/abs/2411.11697
作者: Chenyang Jiang,Donggyu Kim,Alejandra Quintos,Yazhen Wang
关键词-EN: Reinforcement Learning, stochastic differential equations, solving complex decision-making, complex decision-making tasks, Bipower Variation Error
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has proven effective in solving complex decision-making tasks across various domains, but challenges remain in continuous-time settings, particularly when state dynamics are governed by stochastic differential equations (SDEs) with jump components. In this paper, we address this challenge by introducing the Mean-Square Bipower Variation Error (MSBVE) algorithm, which enhances robustness and convergence in scenarios involving significant stochastic noise and jumps. We first revisit the Mean-Square TD Error (MSTDE) algorithm, commonly used in continuous-time RL, and highlight its limitations in handling jumps in state dynamics. The proposed MSBVE algorithm minimizes the mean-square quadratic variation error, offering improved performance over MSTDE in environments characterized by SDEs with jumps. Simulations and formal proofs demonstrate that the MSBVE algorithm reliably estimates the value function in complex settings, surpassing MSTDE’s performance when faced with jump processes. These findings underscore the importance of alternative error metrics to improve the resilience and effectiveness of RL algorithms in continuous-time frameworks.
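
The robustness argument rests on bipower variation being insensitive to jumps. The sketch below (not the paper's RL algorithm) simulates a jump-diffusion path and compares realized quadratic variation with realized bipower variation; all parameter values are placeholders chosen for illustration.

```python
# A minimal sketch contrasting realized quadratic variation with realized bipower
# variation on a simulated jump-diffusion path. Bipower variation stays close to the
# integrated (diffusive) variance even with jumps, which motivates the MSBVE objective.
import numpy as np

rng = np.random.default_rng(0)

def simulate_jump_diffusion(n_steps=10_000, T=1.0, sigma=0.2, jump_rate=5.0, jump_scale=0.05):
    dt = T / n_steps
    diffusion = sigma * np.sqrt(dt) * rng.standard_normal(n_steps)
    jumps = rng.binomial(1, jump_rate * dt, n_steps) * rng.normal(0.0, jump_scale, n_steps)
    return np.cumsum(diffusion + jumps)

def realized_variance(x):
    dx = np.diff(x)
    return np.sum(dx ** 2)                      # picks up both diffusion and jumps

def bipower_variation(x):
    dx = np.abs(np.diff(x))
    return (np.pi / 2.0) * np.sum(dx[1:] * dx[:-1])   # robust to (finitely many) jumps

path = simulate_jump_diffusion()
print("true integrated variance   :", 0.2 ** 2 * 1.0)
print("realized variance (w/ jumps):", realized_variance(path))
print("bipower variation (robust)  :", bipower_variation(path))
```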

[LG-8] Few-shot Model Extraction Attacks against Sequential Recommender Systems

链接: https://arxiv.org/abs/2411.11677
作者: Hui Zhang,Fu Liu
关键词-EN: few-shot model extraction, model extraction framework, model extraction, extraction attacks represent, model extraction attacks
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Among adversarial attacks against sequential recommender systems, model extraction attacks represent a method to attack sequential recommendation models without prior knowledge. Existing research has primarily concentrated on the adversary's execution of black-box attacks through data-free model extraction. However, a significant gap remains in the literature concerning the development of surrogate models by adversaries with access to few-shot raw data (10% or even less). That is, the challenge of how to construct a surrogate model with high functional similarity within the context of few-shot data scenarios remains an issue that requires attention. This study addresses this gap by introducing a novel few-shot model extraction framework against sequential recommenders, which is designed to construct a superior surrogate model with the utilization of few-shot data. The proposed few-shot model extraction framework is comprised of two components: an autoregressive augmentation generation strategy and a bidirectional repair loss-facilitated model distillation procedure. Specifically, to generate synthetic data that closely approximate the distribution of raw data, the autoregressive augmentation generation strategy integrates a probabilistic interaction sampler to extract inherent dependencies and a synthesis determinant signal module to characterize user behavioral patterns. Subsequently, the bidirectional repair loss, which targets the discrepancies between the recommendation lists, is designed as an auxiliary loss to rectify erroneous predictions from surrogate models, transferring knowledge from the victim model to the surrogate model effectively. Experiments on three datasets show that the proposed few-shot model extraction framework yields superior surrogate models.

[LG-9] Efficient and Robust Continual Graph Learning for Graph Classification in Biology

链接: https://arxiv.org/abs/2411.11668
作者: Ding Zhang,Jane Downer,Can Chen,Ren Wang
关键词-EN: complex biological systems, understanding complex biological, Continual Graph Learning, essential for understanding, understanding complex
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Graph classification is essential for understanding complex biological systems, where molecular structures and interactions are naturally represented as graphs. Traditional graph neural networks (GNNs) perform well on static tasks but struggle in dynamic settings due to catastrophic forgetting. We present Perturbed and Sparsified Continual Graph Learning (PSCGL), a robust and efficient continual graph learning framework for graph data classification, specifically targeting biological datasets. We introduce a perturbed sampling strategy to identify critical data points that contribute to model learning and a motif-based graph sparsification technique to reduce storage needs while maintaining performance. Additionally, our PSCGL framework inherently defends against graph backdoor attacks, which is crucial for applications in sensitive biological contexts. Extensive experiments on biological datasets demonstrate that PSCGL not only retains knowledge across tasks but also enhances the efficiency and robustness of graph classification models in biology.

[LG-10] Feature Selection for Network Intrusion Detection

链接: https://arxiv.org/abs/2411.11603
作者: Charles Westphal,Stephen Hailes,Mirco Musolesi
关键词-EN: Machine Learning, information security community, Network Intrusion Detection, relevant to Machine, Intrusion Detection
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Network Intrusion Detection (NID) remains a key area of research within the information security community, while also being relevant to Machine Learning (ML) practitioners. The latter generally aim to detect attacks using network features, which have been extracted from raw network data typically using dimensionality reduction methods, such as principal component analysis (PCA). However, PCA is not able to assess the relevance of features for the task at hand. Consequently, the features available are of varying quality, with some being entirely non-informative. From this, two major drawbacks arise. Firstly, trained and deployed models have to process large amounts of unnecessary data, therefore draining potentially costly resources. Secondly, the noise caused by the presence of irrelevant features can, in some cases, impede a model’s ability to detect an attack. In order to deal with these challenges, we present Feature Selection for Network Intrusion Detection (FSNID) a novel information-theoretic method that facilitates the exclusion of non-informative features when detecting network intrusions. The proposed method is based on function approximation using a neural network, which enables a version of our approach that incorporates a recurrent layer. Consequently, this version uniquely enables the integration of temporal dependencies. Through an extensive set of experiments, we demonstrate that the proposed method selects a significantly reduced feature set, while maintaining NID performance. Code will be made available upon publication.
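
As a much simpler stand-in for the neural estimator described above, the sketch below ranks features by scikit-learn's mutual-information estimate and keeps only the informative ones on synthetic data. The dataset and the selection threshold are assumptions; FSNID's neural, optionally recurrent estimator and its handling of temporal dependencies are not reproduced.

```python
# A simplified illustration (not the proposed FSNID estimator) of information-theoretic
# feature selection: rank features by estimated mutual information with the intrusion
# label and keep only the informative ones before training a detector.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for extracted network-flow features (many are uninformative).
X, y = make_classification(n_samples=2000, n_features=40, n_informative=8,
                           n_redundant=4, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)
keep = mi > 0.01                      # assumed threshold; the paper learns this instead
print(f"kept {keep.sum()} of {len(mi)} features")

clf = RandomForestClassifier(n_estimators=100, random_state=0)
full = cross_val_score(clf, X, y, cv=3).mean()
reduced = cross_val_score(clf, X[:, keep], y, cv=3).mean()
print(f"accuracy all features: {full:.3f} | selected features: {reduced:.3f}")
```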

[LG-11] Generative Spatio-temporal GraphNet for Transonic Wing Pressure Distribution Forecasting

链接: https://arxiv.org/abs/2411.11592
作者: Gabriele Immordino,Andrea Vaiuso,Andrea Da Ronch,Marcello Righi
关键词-EN: predicting unsteady transonic, transonic wing pressure, wing pressure distributions, study presents, graph-based temporal layers
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:This study presents a framework for predicting unsteady transonic wing pressure distributions, integrating an autoencoder architecture with graph convolutional networks and graph-based temporal layers to model time dependencies. The framework compresses high-dimensional pressure distribution data into a lower-dimensional latent space using an autoencoder, ensuring efficient data representation while preserving essential features. Within this latent space, graph-based temporal layers are employed to predict future wing pressures based on past data, effectively capturing temporal dependencies and improving predictive accuracy. This combined approach leverages the strengths of autoencoders for dimensionality reduction, graph convolutional networks for handling unstructured grid data, and temporal layers for modeling time-based sequences. The effectiveness of the proposed framework is validated through its application to the Benchmark Super Critical Wing test case, achieving accuracy comparable to computational fluid dynamics, while significantly reducing prediction time. This framework offers a scalable, computationally efficient solution for the aerodynamic analysis of unsteady phenomena.

[LG-12] GNN-Based Code Annotation Logic for Establishing Security Boundaries in C Code

链接: https://arxiv.org/abs/2411.11567
作者: Varun Gadey,Raphael Goetz,Christoph Sendner,Sampo Sovio,Alexandra Dmitrienko
关键词-EN: Securing sensitive operations, Trusted Execution Environments, Trusted Computing Base, today interconnected software, interconnected software landscape
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Submitted to the IEEE Symposium on Security and Privacy 2025

点击查看摘要

Abstract:Securing sensitive operations in today’s interconnected software landscape is crucial yet challenging. Modern platforms rely on Trusted Execution Environments (TEEs), such as Intel SGX and ARM TrustZone, to isolate security sensitive code from the main system, reducing the Trusted Computing Base (TCB) and providing stronger assurances. However, identifying which code should reside in TEEs is complex and requires specialized expertise, which is not supported by current automated tools. Existing solutions often migrate entire applications to TEEs, leading to suboptimal use and an increased TCB. To address this gap, we propose Code Annotation Logic (CAL), a pioneering tool that automatically identifies security sensitive components for TEE isolation. CAL analyzes codebases, leveraging a graph-based approach with novel feature construction and employing a custom graph neural network model to accurately determine which parts of the code should be isolated. CAL effectively optimizes TCB, reducing the burden of manual analysis and enhancing overall security. Our contributions include the definition of security sensitive code, the construction and labeling of a comprehensive dataset of source files, a feature rich graph based data preparation pipeline, and the CAL model for TEE integration. Evaluation results demonstrate CAL’s efficacy in identifying sensitive code with a recall of 86.05%, an F1 score of 81.56%, and an identification rate of 91.59% for security sensitive functions. By enabling efficient code isolation, CAL advances the secure development of applications using TEEs, offering a practical solution for developers to reduce attack vectors.

[LG-13] Hierarchical-Graph-Structured Edge Partition Models for Learning Evolving Community Structure

链接: https://arxiv.org/abs/2411.11536
作者: Xincan Yu,Sikun Yang
关键词-EN: capture evolving latent, latent communities, evolving latent communities, capture evolving, latent
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel dynamic network model to capture evolving latent communities within temporal networks. To achieve this, we decompose each observed dynamic edge between vertices using a Poisson-gamma edge partition model, assigning each vertex to one or more latent communities through nonnegative vertex-community memberships. Specifically, hierarchical transition kernels are employed to model the interactions between these latent communities in the observed temporal network. A hierarchical graph prior is placed on the transition structure of the latent communities, allowing us to model how they evolve and interact over time. Consequently, our dynamic network enables the inferred community structure to merge, split, and interact with one another, providing a comprehensive understanding of complex network dynamics. Experiments on various real-world network datasets demonstrate that the proposed model not only effectively uncovers interpretable latent structures but also surpasses other state-of-the-art dynamic network models in the tasks of link prediction and community detection.

[LG-14] SeqProFT: Applying LoRA Finetuning for Sequence-only Protein Property Predictions

链接: https://arxiv.org/abs/2411.11530
作者: Shuo Zhang,Jian K. Liu
关键词-EN: treating amino acid, amino acid sequences, self-supervised manner, capable of learning, learning the relationships
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Protein language models (PLMs) are capable of learning the relationships between protein sequences and functions by treating amino acid sequences as textual data in a self-supervised manner. However, fine-tuning these models typically demands substantial computational resources and time, with results that may not always be optimized for specific tasks. To overcome these challenges, this study employs the LoRA method to perform end-to-end fine-tuning of the ESM-2 model specifically for protein property prediction tasks, utilizing only sequence information. Additionally, a multi-head attention mechanism is integrated into the downstream network to combine sequence features with contact map information, thereby enhancing the model’s comprehension of protein sequences. Experimental results of extensive classification and regression tasks demonstrate that the fine-tuned model achieves strong performance and faster convergence across multiple regression and classification tasks.
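
For readers unfamiliar with LoRA, the sketch below shows the core mechanic assumed by the paper: a frozen linear layer augmented with a trainable low-rank update. The rank, scaling, and dimensions are placeholders, and the ESM-2 / contact-map pipeline from the paper is not reproduced.

```python
# A minimal sketch of the LoRA idea used for fine-tuning: a frozen linear layer plus a
# trainable low-rank update B @ A. Illustrative only; not the paper's ESM-2 setup.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)        # frozen "pretrained" weight
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(320, 320, rank=8)
x = torch.randn(4, 128, 320)                           # (batch, sequence, hidden)
out = layer(x)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, "trainable params:", trainable)
```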

[LG-15] Preempting Text Sanitization Utility in Resource-Constrained Privacy-Preserving LLM Interactions

链接: https://arxiv.org/abs/2411.11521
作者: Robin Carpentier,Benjamin Zi Hao Zhao,Hassan Jameel Asghar,Dali Kaafar
关键词-EN: online Large Language, Large Language Models, online Large, Large Language, Small Language Model
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Individuals have been increasingly interacting with online Large Language Models (LLMs), both in their work and personal lives. These interactions raise privacy issues as the LLMs are typically hosted by third-parties who can gather a variety of sensitive information about users and their companies. Text Sanitization techniques have been proposed in the literature and can be used to sanitize user prompts before sending them to the LLM. However, sanitization has an impact on the downstream task performed by the LLM, and often to such an extent that it leads to unacceptable results for the user. This is not just a minor annoyance, with clear monetary consequences as LLM services charge on a per use basis as well as great amount of computing resources wasted. We propose an architecture leveraging a Small Language Model (SLM) at the user-side to help estimate the impact of sanitization on a prompt before it is sent to the LLM, thus preventing resource losses. Our evaluation of this architecture revealed a significant problem with text sanitization based on Differential Privacy, on which we want to draw the attention of the community for further investigation.

[LG-16] Efficient Sample-optimal Learning of Gaussian Tree Models via Sample-optimal Testing of Gaussian Mutual Information

链接: https://arxiv.org/abs/2411.11516
作者: Sutanu Gayen,Sanket Kale,Sayantan Sen
关键词-EN: varepsilon, machine learning, Learning high-dimensional distributions, Learning, mutual information
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 47 pages, 16 figures, abstract shortened as per arXiv criteria

点击查看摘要

Abstract:Learning high-dimensional distributions is a significant challenge in machine learning and statistics. Classical research has mostly concentrated on asymptotic analysis of such data under suitable assumptions. While existing works [Bhattacharyya et al.: SICOMP 2023, Daskalakis et al.: STOC 2021, Choo et al.: ALT 2024] focus on discrete distributions, the current work addresses the tree structure learning problem for Gaussian distributions, providing efficient algorithms with solid theoretical guarantees. This is crucial as real-world distributions are often continuous and differ from the discrete scenarios studied in prior works. In this work, we design a conditional mutual information tester for Gaussian random variables that can test whether two Gaussian random variables are independent, or their conditional mutual information is at least \varepsilon, for some parameter \varepsilon \in (0,1), using \mathcal{O}(\varepsilon^{-1}) samples, which we show to be near-optimal. In contrast, an additive estimation would require \Omega(\varepsilon^{-2}) samples. Our upper bound technique uses linear regression on a pair of suitably transformed random variables. Importantly, we show that the chain rule of conditional mutual information continues to hold for the estimated (conditional) mutual information. As an application of such a mutual information tester, we give an efficient \varepsilon-approximate structure-learning algorithm for an n-variate Gaussian tree model that takes \widetilde{\Theta}(n\varepsilon^{-1}) samples, which we again show to be near-optimal. In contrast, when the underlying Gaussian model is not known to be tree-structured, we show that \widetilde{\Theta}(n^2\varepsilon^{-2}) samples are necessary and sufficient to output an \varepsilon-approximate tree structure. We perform extensive experiments that corroborate our theoretical convergence bounds.
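
For jointly Gaussian variables the mutual information has the closed form I(X;Y) = -0.5·log(1 - ρ²), so a crude plug-in tester can be built from the sample correlation. The sketch below illustrates only that quantity; it is not the paper's regression-based, sample-optimal tester, and the decision threshold is an assumption.

```python
# A simple plug-in illustration (not the paper's near-optimal tester) of Gaussian
# mutual information: for jointly Gaussian X, Y, I(X;Y) = -0.5 * log(1 - rho^2),
# so a test "is I(X;Y) >= eps?" can be driven by the sample correlation.
import numpy as np

def gaussian_mi_from_samples(x, y):
    rho = np.corrcoef(x, y)[0, 1]
    rho = np.clip(rho, -0.999999, 0.999999)
    return -0.5 * np.log(1.0 - rho ** 2)

def mi_tester(x, y, eps):
    """Declare 'dependent' if the plug-in MI estimate exceeds eps/2 (assumed threshold)."""
    return gaussian_mi_from_samples(x, y) > eps / 2

rng = np.random.default_rng(1)
n = 2000
x = rng.standard_normal(n)
y_indep = rng.standard_normal(n)
y_dep = 0.3 * x + np.sqrt(1 - 0.3 ** 2) * rng.standard_normal(n)

print("MI estimate (independent):", gaussian_mi_from_samples(x, y_indep))
print("MI estimate (dependent)  :", gaussian_mi_from_samples(x, y_dep))
print("tester eps=0.04 -> indep:", mi_tester(x, y_indep, 0.04),
      "| dep:", mi_tester(x, y_dep, 0.04))
```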

[LG-17] Physics Encoded Blocks in Residual Neural Network Architectures for Digital Twin Models

链接: https://arxiv.org/abs/2411.11497
作者: Muhammad Saad Zia,Ashiq Anjum,Lu Liu,Anthony Conway,Anasol Pena Rios
关键词-EN: Informed Machine Learning, Physics Informed Machine, digital twins, twins to generate, processes and behaviours
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Physics Informed Machine Learning has emerged as a popular approach in modelling and simulation for digital twins to generate accurate models of processes and behaviours of real-world systems. However, despite their success in generating accurate and reliable models, the existing methods either use simple regularizations in loss functions to offer limited physics integration or are too specific in architectural definitions to be generalized to a wide variety of physical systems. This paper presents a generic approach based on a novel physics-encoded residual neural network architecture to combine data-driven and physics-based analytical models to address these limitations. Our method combines physics blocks as mathematical operators from physics-based models with learning blocks comprising feed-forward layers. Intermediate residual blocks are incorporated for stable gradient flow as they train on physical system observation data. This way, the model learns to comply with the geometric and kinematic aspects of the physical system. Compared to conventional neural network-based methods, our method improves generalizability with substantially low data requirements and model complexity in terms of parameters, especially in scenarios where prior physics knowledge is either elementary or incomplete. We investigate our approach in two application domains. The first is a basic robotic motion model using Euler Lagrangian equations of motion as physics prior. The second application is a complex scenario of a steering model for a self-driving vehicle in a simulation. In both applications, our method outperforms both conventional neural network based approaches as well as state-of-the-art Physics Informed Machine Learning methods.

[LG-18] Graph Artificial Intelligence for Quantifying Compatibility Mechanisms in Traditional Chinese Medicine

链接: https://arxiv.org/abs/2411.11474
作者: Jingqi Zeng,Xiaobin Jia
关键词-EN: Traditional Chinese Medicine, involves complex compatibility, compatibility mechanisms characterized, complex compatibility mechanisms, Chinese Medicine
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 10 pages, 5 figures. Includes open-source dataset and code for reproducibility

点击查看摘要

Abstract:Traditional Chinese Medicine (TCM) involves complex compatibility mechanisms characterized by multi-component and multi-target interactions, which are challenging to quantify. To address this challenge, we applied graph artificial intelligence to develop a TCM multi-dimensional knowledge graph that bridges traditional TCM theory and modern biomedical science (this https URL ). Using feature engineering and embedding, we processed key TCM terminology and Chinese herbal pieces (CHP), introducing medicinal properties as virtual nodes and employing graph neural networks with attention mechanisms to model and analyze 6,080 Chinese herbal formulas (CHF). Our method quantitatively assessed the roles of CHP within CHF and was validated using 215 CHF designed for COVID-19 management. With interpretable models, open-source data, and code (this https URL ), this study provides robust tools for advancing TCM theory and drug discovery.

[LG-19] Physics meets Topology: Physics-informed topological neural networks for learning rigid body dynamics

链接: https://arxiv.org/abs/2411.11467
作者: Amaury Wei,Olga Fink
关键词-EN: unknown environmental factors, abrupt nonlinear nature, numerous scientific disciplines, environmental factors, fundamental to numerous
类目: Machine Learning (cs.LG)
*备注: 17 pages, 9 figures

点击查看摘要

Abstract:Rigid body interactions are fundamental to numerous scientific disciplines, but remain challenging to simulate due to their abrupt nonlinear nature and sensitivity to complex, often unknown environmental factors. These challenges call for adaptable learning-based methods capable of capturing complex interactions beyond explicit physical models and simulations. While graph neural networks can handle simple scenarios, they struggle with complex scenes and long-term predictions. We introduce a novel framework for modeling rigid body dynamics and learning collision interactions, addressing key limitations of existing graph-based methods. Our approach extends the traditional representation of meshes by incorporating higher-order topology complexes, offering a physically consistent representation. Additionally, we propose a physics-informed message-passing neural architecture, embedding physical laws directly in the model. Our method demonstrates superior accuracy, even during long rollouts, and exhibits strong generalization to unseen scenarios. Importantly, this work addresses the challenge of multi-entity dynamic interactions, with applications spanning diverse scientific and engineering domains.

[LG-20] Upside-Down Reinforcement Learning for More Interpretable Optimal Control

链接: https://arxiv.org/abs/2411.11457
作者: Juan Cardenas-Cartagena,Massimiliano Falzari,Marco Zullich,Matthia Sabatelli
关键词-EN: Model-Free Reinforcement Learning, Model-Free Reinforcement, Reinforcement Learning, Upside-Down Reinforcement Learning, expected rewards
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Model-Free Reinforcement Learning (RL) algorithms either learn how to map states to expected rewards or search for policies that can maximize a certain performance function. Model-Based algorithms instead, aim to learn an approximation of the underlying model of the RL environment and then use it in combination with planning algorithms. Upside-Down Reinforcement Learning (UDRL) is a novel learning paradigm that aims to learn how to predict actions from states and desired commands. This task is formulated as a Supervised Learning problem and has successfully been tackled by Neural Networks (NNs). In this paper, we investigate whether function approximation algorithms other than NNs can also be used within a UDRL framework. Our experiments, performed over several popular optimal control benchmarks, show that tree-based methods like Random Forests and Extremely Randomized Trees can perform just as well as NNs with the significant benefit of resulting in policies that are inherently more interpretable than NNs, therefore paving the way for more transparent, safe, and robust RL.
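
The supervised reformulation is easy to sketch: build (state, desired return, remaining horizon) → action pairs from logged episodes and fit any classifier. Below is a toy version with a Random Forest, in the spirit of the tree-based variants studied here; the environment, logged data, and hyper-parameters are random placeholders.

```python
# A toy sketch of the UDRL supervised-learning formulation with a tree-based model:
# fit a classifier mapping (state, desired_return, horizon) -> action from logged episodes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Fake logged episodes: states (4-dim), discrete actions, per-step rewards.
n_episodes, horizon, state_dim, n_actions = 200, 50, 4, 2
states = rng.standard_normal((n_episodes, horizon, state_dim))
actions = rng.integers(0, n_actions, size=(n_episodes, horizon))
rewards = rng.random((n_episodes, horizon))

X, y = [], []
for ep in range(n_episodes):
    returns_to_go = np.cumsum(rewards[ep][::-1])[::-1]   # return achieved from step t on
    for t in range(horizon):
        command = [returns_to_go[t], horizon - t]         # (desired return, time left)
        X.append(np.concatenate([states[ep, t], command]))
        y.append(actions[ep, t])

policy = RandomForestClassifier(n_estimators=100, random_state=0).fit(np.array(X), np.array(y))

# Acting: ask for a desired return over a remaining horizon and take the predicted action.
query = np.concatenate([rng.standard_normal(state_dim), [25.0, 50]]).reshape(1, -1)
print("action for commanded return 25 over 50 steps:", policy.predict(query)[0])
```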

[LG-21] Temporal and Spatial Reservoir Ensembling Techniques for Liquid State Machines

链接: https://arxiv.org/abs/2411.11414
作者: Anmol Biswas,Sharvari Ashok Medhe,Raghav Singhal,Udayan Ganguly
关键词-EN: Liquid State Machines, Echo State Networks, State Machines, Echo State, Liquid State
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Reservoir computing (RC) is a class of computational methods, such as Echo State Networks (ESN) and Liquid State Machines (LSM), that describes a generic way to perform pattern recognition and temporal analysis with any non-linear system. This is enabled by Reservoir Computing being a shallow network model with only Input, Reservoir, and Readout layers where input and reservoir weights are not learned (only the readout layer is trained). LSM is a special case of Reservoir computing inspired by the organization of neurons in the brain and generally refers to spike-based Reservoir computing approaches. LSMs have been successfully used to showcase decent performance on some neuromorphic vision and speech datasets but a common problem associated with LSMs is that since the model is more-or-less fixed, the main way to improve the performance is by scaling up the Reservoir size, but that only gives diminishing rewards despite a tremendous increase in model size and computation. In this paper, we propose two approaches for effectively ensembling LSM models - Multi-Length Scale Reservoir Ensemble (MuLRE) and Temporal Excitation Partitioned Reservoir Ensemble (TEPRE) - and benchmark them on Neuromorphic-MNIST (N-MNIST), Spiking Heidelberg Digits (SHD), and DVSGesture datasets, which are standard neuromorphic benchmarks. We achieve 98.1% test accuracy on N-MNIST with a 3600-neuron LSM model which is higher than any prior LSM-based approach and 77.8% test accuracy on the SHD dataset which is on par with a standard Recurrent Spiking Neural Network trained by Backprop Through Time (BPTT). We also propose receptive field-based input weights to the Reservoir to work alongside the Multi-Length Scale Reservoir ensemble model for vision tasks. Thus, we introduce effective means of scaling up the performance of LSM models and evaluate them against relevant neuromorphic benchmarks.

[LG-22] The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models

链接: https://arxiv.org/abs/2411.11407
作者: Xikang Yang,Xuehai Tang,Jizhong Han,Songlin Hu
关键词-EN: large language models, significant safety vulnerabilities, exposing significant safety, language models, safety vulnerabilities
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The widespread deployment of large language models (LLMs) across various domains has showcased their immense potential while exposing significant safety vulnerabilities. A major concern is ensuring that LLM-generated content aligns with human values. Existing jailbreak techniques reveal how this alignment can be compromised through specific prompts or adversarial suffixes. In this study, we introduce a new threat: LLMs' bias toward authority. While this inherent bias can improve the quality of outputs generated by LLMs, it also introduces a potential vulnerability, increasing the risk of producing harmful content. Notably, the bias in LLMs lies in the varying levels of trust given to different types of authoritative information in harmful queries. For example, malware development often favors trusting GitHub. To better reveal the risks with LLMs, we propose DarkCite, an adaptive authority citation matcher and generator designed for a black-box setting. DarkCite matches optimal citation types to specific risk types and generates authoritative citations relevant to harmful instructions, enabling more effective jailbreak attacks on aligned LLMs. Our experiments show that DarkCite achieves a higher attack success rate (e.g., LLama-2 at 76% versus 68%) than previous methods. To counter this risk, we propose an authenticity and harm verification defense strategy, raising the average defense pass rate (DPR) from 11% to 74%. More importantly, the ability to link citations to the content they encompass has become a foundational function in LLMs, amplifying the influence of LLMs' bias toward authority.

[LG-23] Bridging the Resource Gap: Deploying Advanced Imitation Learning Models onto Affordable Embedded Platforms

链接: https://arxiv.org/abs/2411.11406
作者: Haizhou Ge,Ruixiang Wang,Zhu-ang Xu,Hongrui Zhu,Ruichen Deng,Yuhang Dong,Zeyu Pang,Guyue Zhou,Junyu Zhang,Lu Shi
关键词-EN: Advanced imitation learning, advantages in robotics, transformer is increasingly, increasingly demonstrating, demonstrating its advantages
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Accepted by the 2024 IEEE International Conference on Robotics and Biomimetics (IEEE ROBIO 2024)

点击查看摘要

Abstract:Advanced imitation learning with structures like the transformer is increasingly demonstrating its advantages in robotics. However, deploying these large-scale models on embedded platforms remains a major challenge. In this paper, we propose a pipeline that facilitates the migration of advanced imitation learning algorithms to edge devices. The process is achieved via an efficient model compression method and a practical asynchronous parallel method Temporal Ensemble with Dropped Actions (TEDA) that enhances the smoothness of operations. To show the efficiency of the proposed pipeline, large-scale imitation learning models are trained on a server and deployed on an edge device to complete various manipulation tasks.

[LG-24] Extended Neural Contractive Dynamical Systems: On Multiple Tasks and Riemannian Safety Regions

链接: https://arxiv.org/abs/2411.11405
作者: Hadi Beik Mohammadi,Søren Hauberg,Georgios Arvanitidis,Gerhard Neumann,Leonel Rozo
关键词-EN: potentially harmful actions, fully autonomous robot, harmful actions, Contractive Dynamical Systems, Neural Contractive Dynamical
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Stability guarantees are crucial when ensuring that a fully autonomous robot does not take undesirable or potentially harmful actions. We recently proposed the Neural Contractive Dynamical Systems (NCDS), which is a neural network architecture that guarantees contractive stability. With this, learning-from-demonstrations approaches can trivially provide stability guarantees. However, our early work left several unanswered questions, which we here address. Beyond providing an in-depth explanation of NCDS, this paper extends the framework with more careful regularization, a conditional variant of the framework for handling multiple tasks, and an uncertainty-driven approach to latent obstacle avoidance. Experiments verify that the developed system has the flexibility of ordinary neural networks while providing the stability guarantees needed for autonomous robotics.

[LG-25] Graph Neural Networks on Graph Databases

链接: https://arxiv.org/abs/2411.11375
作者: Dmytro Lopushanskyy,Borun Shi
关键词-EN: graph neural networks, neural networks, networks on large, large datasets, datasets has long
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 14 pages, 8 figures

点击查看摘要

Abstract:Training graph neural networks on large datasets has long been a challenge. Traditional approaches include efficiently representing the whole graph in-memory, designing parameter efficient and sampling-based models, and graph partitioning in a distributed setup. Separately, graph databases with native graph storage and query engines have been developed, which enable time and resource efficient graph analytics workloads. We show how to directly train a GNN on a graph DB, by retrieving minimal data into memory and sampling using the query engine. Our experiments show resource advantages for single-machine and distributed training. Our approach opens up a new way of scaling GNNs as well as a new application area for graph DBs.

[LG-26] Zero-Shot Load Forecasting with Large Language Models

链接: https://arxiv.org/abs/2411.11350
作者: Wenlong Liao,Zhe Yang,Mengshuo Jia,Christian Rehtanz,Jiannong Fang,Fernando Porté-Agel
关键词-EN: Deep learning models, shown strong performance, generally require large, require large amounts, Chronos model
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 21 pages,5 figures

点击查看摘要

Abstract:Deep learning models have shown strong performance in load forecasting, but they generally require large amounts of data for model training before being applied to new scenarios, which limits their effectiveness in data-scarce scenarios. Inspired by the great success of pre-trained language models (LLMs) in natural language processing, this paper proposes a zero-shot load forecasting approach using an advanced LLM framework denoted as the Chronos model. By utilizing its extensive pre-trained knowledge, the Chronos model enables accurate load forecasting in data-scarce scenarios without the need for extensive data-specific training. Simulation results across five real-world datasets demonstrate that the Chronos model significantly outperforms nine popular baseline models for both deterministic and probabilistic load forecasting with various forecast horizons (e.g., 1 to 48 hours), even though the Chronos model is neither tailored nor fine-tuned to these specific load datasets. Notably, Chronos reduces root mean squared error (RMSE), continuous ranked probability score (CRPS), and quantile score (QS) by approximately 7.34%-84.30%, 19.63%-60.06%, and 22.83%-54.49%, respectively, compared to baseline models. These results highlight the superiority and flexibility of the Chronos model, positioning it as an effective solution in data-scarce scenarios.
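
The metrics quoted above can be reproduced from point and quantile forecasts. The sketch below computes RMSE, the quantile (pinball) score, and a quantile-based CRPS approximation on placeholder forecasts; it does not call the Chronos model itself, and all forecast values and levels are assumptions.

```python
# A minimal sketch of the evaluation metrics quoted above (RMSE, quantile score, CRPS),
# computed from placeholder forecasts rather than actual Chronos outputs. CRPS is
# approximated by averaging the pinball (quantile) loss over a grid of quantile levels.
import numpy as np
from scipy.stats import norm

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def quantile_score(y_true, y_q, q):
    """Pinball loss for a single quantile forecast y_q at level q."""
    diff = y_true - y_q
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

def crps_from_quantiles(y_true, quantile_forecasts, levels):
    """Approximate CRPS as 2 * mean pinball loss over the quantile levels."""
    scores = [quantile_score(y_true, quantile_forecasts[:, i], q)
              for i, q in enumerate(levels)]
    return 2.0 * np.mean(scores)

rng = np.random.default_rng(0)
y_true = 100 + 10 * rng.standard_normal(48)               # 48-hour load (placeholder)
levels = np.linspace(0.1, 0.9, 9)
point = y_true + 3 * rng.standard_normal(48)               # placeholder point forecast
quantiles = point[:, None] + 4 * norm.ppf(levels)[None, :] # placeholder quantile forecasts

print("RMSE  :", rmse(y_true, point))
print("QS@0.5:", quantile_score(y_true, quantiles[:, 4], 0.5))
print("CRPS  :", crps_from_quantiles(y_true, quantiles, levels))
```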

[LG-27] A Hybrid Loss Framework for Decomposition-based Time Series Forecasting Methods: Balancing Global and Component Errors

链接: https://arxiv.org/abs/2411.11340
作者: Ronghui Han,Duanyu Feng,Hongyu Du,Hao Wang
关键词-EN: Accurate time series, time series methods, time series, Accurate time, series methods
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Accurate time series forecasting, predicting future values based on past data, is crucial for diverse industries. Many current time series methods decompose time series into multiple sub-series, applying different model architectures and training with an end-to-end overall loss for forecasting. However, this raises a question: does this overall loss prioritize the importance of critical sub-series within the decomposition for the better performance? To investigate this, we conduct a study on the impact of overall loss on existing time series methods with sequence decomposition. Our findings reveal that overall loss may introduce bias in model learning, hindering the learning of the prioritization of more significant sub-series and limiting the forecasting performance. To address this, we propose a hybrid loss framework combining the global and component losses. This framework introduces component losses for each sub-series alongside the original overall loss. It employs a dual min-max algorithm to dynamically adjust weights between the overall loss and component losses, and within component losses. This enables the model to achieve better performance of current time series methods by focusing on more critical sub-series while still maintaining a low overall loss. We integrate our loss framework into several time series methods and evaluate the performance on multiple datasets. Results show an average improvement of 0.5-2% over existing methods without any modifications to the model architectures.
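
A rough sketch of the hybrid objective: combine the overall error with per-sub-series errors under adaptive weights. The multiplicative-weights update below is only a stand-in for the paper's dual min-max algorithm, and the two-component decomposition, shapes, and step size are all assumptions.

```python
# A hedged sketch of a hybrid loss combining the overall forecasting error with
# per-component (sub-series) errors under adaptive weights. The weight update is a
# simple multiplicative-weights step standing in for the paper's dual min-max algorithm.
import torch

def hybrid_loss(pred_components, true_components, weights, eta=0.05):
    """pred/true_components: lists of tensors, one per sub-series (e.g. trend, seasonal)."""
    overall_pred = sum(pred_components)
    overall_true = sum(true_components)
    overall = torch.mean((overall_pred - overall_true) ** 2)
    components = torch.stack([torch.mean((p - t) ** 2)
                              for p, t in zip(pred_components, true_components)])
    loss = overall + (weights * components).sum()

    # Multiplicative-weights update: put more weight on the worst-fit sub-series.
    with torch.no_grad():
        new_w = weights * torch.exp(eta * components)
        new_w = new_w / new_w.sum()
    return loss, new_w

# Toy usage: two sub-series (e.g. trend + seasonal) of a length-96 window.
trend_p = torch.randn(96, requires_grad=True)
season_p = torch.randn(96, requires_grad=True)
trend_t, season_t = torch.randn(96), torch.randn(96)
weights = torch.tensor([0.5, 0.5])
loss, weights = hybrid_loss([trend_p, season_p], [trend_t, season_t], weights)
loss.backward()
print("hybrid loss:", loss.item(), "| updated weights:", weights.tolist())
```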

[LG-28] Enhancing Decision Transformer with Diffusion-Based Trajectory Branch Generation

链接: https://arxiv.org/abs/2411.11327
作者: Zhihong Liu,Long Qian,Zeyang Liu,Lipeng Wan,Xingyu Chen,Xuguang Lan
关键词-EN: Decision Transformer, trajectory branch, diffusion model, return-to-go, offline reinforcement learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decision Transformer (DT) can learn effective policy from offline datasets by converting the offline reinforcement learning (RL) into a supervised sequence modeling task, where the trajectory elements are generated auto-regressively conditioned on the return-to-go (RTG). However, the sequence modeling learning approach tends to learn policies that converge on the sub-optimal trajectories within the dataset, for lack of bridging data to move to better trajectories, even if the condition is set to the highest RTG. To address this issue, we introduce Diffusion-Based Trajectory Branch Generation (BG), which expands the trajectories of the dataset with branches generated by a diffusion model. The trajectory branch is generated based on the segment of the trajectory within the dataset, and leads to trajectories with higher returns. We concatenate the generated branch with the trajectory segment as an expansion of the dataset. After expanding, DT has more opportunities to learn policies to move to better trajectories, preventing it from converging to the sub-optimal trajectories. Finally, after processing with BG, DT outperforms state-of-the-art sequence modeling methods on the D4RL benchmark, demonstrating the effectiveness of adding branches to the dataset without further modifications.

[LG-29] Cuvis.Ai: An Open-Source Low-Code Software Ecosystem for Hyperspectral Processing and Classification

链接: https://arxiv.org/abs/2411.11324
作者: Nathaniel Hanson,Philip Manke,Simon Birkholz,Maximilian Mühlbauer,Rene Heine,Arnd Brandes
关键词-EN: existing software solutions, analyzing high-dimension hyperspectral, inextensible research products, important tool, tool for analyzing
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: 5 pages, 2024 14th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS)

点击查看摘要

Abstract:Machine learning is an important tool for analyzing high-dimensional hyperspectral data; however, existing software solutions are either closed-source or inextensible research products. In this paper, we present cuvis.ai, an open-source and low-code software ecosystem for data acquisition, preprocessing, and model training. The package is written in Python and provides wrappers around common machine learning libraries, allowing both classical and deep learning models to be trained on hyperspectral data. The codebase abstracts processing interconnections and data dependencies between operations to minimize code complexity for users. This software package instantiates nodes in a directed acyclic graph to handle all stages of a machine learning ecosystem, from data acquisition, including live or static data sources, to final class assignment or property prediction. User-created models contain convenient serialization methods to ensure portability and increase sharing within the research community. All code and data are available online: this https URL

[LG-30] A Review on Machine Unlearning

链接: https://arxiv.org/abs/2411.11315
作者: Haibo Zhang,Toru Nakamura,Takamasa Isohara,Kouichi Sakurai
关键词-EN: machine learning, Data Protection Regulation, machine learning models, General Data Protection, machine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, an increasing number of laws have governed the useability of users’ privacy. For example, Article 17 of the General Data Protection Regulation (GDPR), the right to be forgotten, requires machine learning applications to remove a portion of data from a dataset and retrain it if the user makes such a request. Furthermore, from the security perspective, training data for machine learning models, i.e., data that may contain user privacy, should be effectively protected, including appropriate erasure. Therefore, researchers propose various privacy-preserving methods to deal with such issues as machine unlearning. This paper provides an in-depth review of the security and privacy concerns in machine learning models. First, we present how machine learning can use users’ private data in daily life and the role that the GDPR plays in this problem. Then, we introduce the concept of machine unlearning by describing the security threats in machine learning models and how to protect users’ privacy from being violated using machine learning platforms. As the core content of the paper, we introduce and analyze current machine unlearning approaches and several representative research results and discuss them in the context of the data lineage. Furthermore, we also discuss the future research challenges in this field.

[LG-31] oward Personalized Federated Node Classification in One-shot Communication

链接: https://arxiv.org/abs/2411.11304
作者: Guochen Yan,Xunkai Li,Luyuan Xie,Wentao Zhang,Qingni Shen,Yuejian Fang,Zhonghai Wu
关键词-EN: Federated Graph Learning, One-shot Federated Learning, private graph data, Federated Graph, Graph Learning
类目: Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Federated Graph Learning (FGL) has become a promising paradigm for collaborative training with distributed and private graph data. One-shot Federated Learning (OFL) enables collaboration in a single communication round to largely reduce communication costs and potential security concerns. However, existing OFL methods are not designed for graph data and existing FGL methods are ineffective within one communication round under both data and model heterogeneity. To mitigate this gap, we are the first to propose a one-shot personalized federated graph learning method for node classification, which is also compatible with the Secure Aggregation scheme. We estimate and aggregate the statistics of class-wise feature distribution to generate a global pseudo-graph on the server, which could be used to train a global graph model. Furthermore, we reveal the under-explored problem of existing personalized FGL methods that their personalized models are biased and neglect the ability to generalize to minorities. To achieve better personalization and generalization simultaneously, we propose a two-stage personalized training to adaptively utilize the personal information from local data and global information from the global pseudo-graph. Comprehensive experiments on 8 multi-scale graph datasets under different partitions with various settings demonstrate our superior performance over state-of-the-art baselines.

[LG-32] Steering Language Model Refusal with Sparse Autoencoders

链接: https://arxiv.org/abs/2411.11296
作者: Kyle O’Brien,David Majercak,Xavier Fernandes,Richard Edgar,Jingya Chen,Harsha Nori,Dean Carignan,Eric Horvitz,Forough Poursabzi-Sangde
关键词-EN: refuse answering prompts, Responsible practices, models include guiding, include guiding models, considered unsafe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Responsible practices for deploying language models include guiding models to recognize and refuse answering prompts that are considered unsafe, while complying with safe prompts. Achieving such behavior typically requires updating model weights, which is costly and inflexible. We explore opportunities to steer model activations at inference time, which does not require updating weights. Using sparse autoencoders, we identify and steer features in Phi-3 Mini that mediate refusal behavior. We find that feature steering can improve Phi-3 Mini's robustness to jailbreak attempts across various harms, including challenging multi-turn attacks. However, we discover that feature steering can adversely affect overall performance on benchmarks. These results suggest that identifying steerable mechanisms for refusal via sparse autoencoders is a promising approach for enhancing language model safety, but that more research is needed to mitigate feature steering's adverse effects on performance.
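
The steering mechanic itself is simple to sketch: add a scaled SAE decoder direction to a layer's activations with a forward hook. In the toy example below, the model, decoder weights, feature index, and strength are all random placeholders rather than the Phi-3 Mini refusal features found in the paper.

```python
# A hedged sketch of activation steering with a sparse-autoencoder feature direction:
# a forward hook adds alpha * (decoder column of the chosen feature) to a layer's output
# at inference time. Model, SAE weights, and feature index are random placeholders.
import torch
import torch.nn as nn

hidden_dim, sae_features = 64, 512
model = nn.Sequential(nn.Linear(16, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 8))
sae_decoder = torch.randn(hidden_dim, sae_features)    # placeholder SAE decoder weights
refusal_feature, alpha = 42, 4.0                        # assumed feature index and strength

direction = sae_decoder[:, refusal_feature]
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    return output + alpha * direction                   # shift activations along the feature

handle = model[0].register_forward_hook(steer_hook)     # steer the first layer's output
x = torch.randn(2, 16)
steered = model(x)
handle.remove()
unsteered = model(x)
print("mean output shift from steering:", (steered - unsteered).abs().mean().item())
```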

[LG-33] SADDE: Semi-supervised Anomaly Detection with Dependable Explanations

链接: https://arxiv.org/abs/2411.11293
作者: Yachao Yuan,Yu Huang,Yali Yuan,Jin Wang
关键词-EN: identifying anomaly patterns, anomaly detection, labeled samples poses, anomaly detection applications, network anomaly detection
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Semi-supervised learning holds a pivotal position in anomaly detection applications, yet identifying anomaly patterns with a limited number of labeled samples poses a significant challenge. Furthermore, the absence of interpretability poses major obstacles to the practical adoption of semi-supervised frameworks. The majority of existing interpretation techniques are tailored for supervised/unsupervised frameworks or non-security domains, falling short in providing dependable interpretations. In this research paper, we introduce SADDE, a general framework designed to accomplish two primary objectives: (1) to render the anomaly detection process interpretable and enhance the credibility of interpretation outcomes, and (2) to assign high-confidence pseudo labels to unlabeled samples, thereby boosting the performance of anomaly detection systems when supervised data is scarce. To achieve the first objective, we devise a cutting-edge interpretation method that utilizes both global and local interpreters to furnish trustworthy explanations. For the second objective, we conceptualize a novel two-stage semi-supervised learning framework tailored for network anomaly detection, ensuring that the model predictions of both stages align with specific constraints. We apply SADDE to two illustrative network anomaly detection tasks and conduct extensive evaluations in comparison with notable prior works. The experimental findings underscore that SADDE is capable of delivering precise detection results alongside dependable interpretations for semi-supervised network anomaly detection systems. The source code for SADDE is accessible at: this https URL.

[LG-34] Dual-Frequency Filtering Self-aware Graph Neural Networks for Homophilic and Heterophilic Graphs

链接: https://arxiv.org/abs/2411.11284
作者: Yachao Yang,Yanfeng Sun,Jipeng Guo,Junbin Gao,Shaofan Wang,Fujiao Ju,Baocai Yin
关键词-EN: Graph Neural Networks, significant research interest, attracting significant research, Neural Networks, handling graph-structured data
类目: Machine Learning (cs.LG)
*备注: 11pages,17figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have excelled in handling graph-structured data, attracting significant research interest. However, two primary challenges have emerged: interference between topology and attributes distorting node representations, and the low-pass filtering nature of most GNNs leading to the oversight of valuable high-frequency information in graph signals. These issues are particularly pronounced in heterophilic graphs. To address these challenges, we propose Dual-Frequency Filtering Self-aware Graph Neural Networks (DFGNN). DFGNN integrates low-pass and high-pass filters to extract smooth and detailed topological features, using frequency-specific constraints to minimize noise and redundancy in the respective frequency bands. The model dynamically adjusts filtering ratios to accommodate both homophilic and heterophilic graphs. Furthermore, DFGNN mitigates interference by aligning topological and attribute representations through dynamic correspondences between their respective frequency bands, enhancing overall model performance and expressiveness. Extensive experiments conducted on benchmark datasets demonstrate that DFGNN outperforms state-of-the-art methods in classification performance, highlighting its effectiveness in handling both homophilic and heterophilic graphs.

[LG-35] Effective Predictive Modeling for Emergency Department Visits and Evaluating Exogenous Variables Impact: Using Explainable Meta-learning Gradient Boosting

链接: https://arxiv.org/abs/2411.11275
作者: Mehdi Neshat,Michael Phipps,Nikhil Jha,Danial Khojasteh,Michael Tong,Amir Gandomi
关键词-EN: predict Emergency Department, Emergency Department, optimise resource distribution, predict Emergency, Meta-learning Gradient Booster
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Over an extensive duration, administrators and clinicians have endeavoured to predict Emergency Department (ED) visits with precision, aiming to optimise resource distribution. Despite the proliferation of diverse AI-driven models tailored for precise prognostication, this task persists as a formidable challenge, besieged by constraints such as restrained generalisability, susceptibility to overfitting and underfitting, scalability issues, and complex fine-tuning hyper-parameters. In this study, we introduce a novel Meta-learning Gradient Booster (Meta-ED) approach for precisely forecasting daily ED visits and leveraging a comprehensive dataset of exogenous variables, including socio-demographic characteristics, healthcare service use, chronic diseases, diagnosis, and climate parameters spanning 23 years from Canberra Hospital in ACT, Australia. The proposed Meta-ED consists of four foundational learners-Catboost, Random Forest, Extra Tree, and lightGBoost-alongside a dependable top-level learner, Multi-Layer Perceptron (MLP), by combining the unique capabilities of varied base models (sub-learners). Our study assesses the efficacy of the Meta-ED model through an extensive comparative analysis involving 23 models. The evaluation outcomes reveal a notable superiority of Meta-ED over the other models in accuracy at 85.7% (95% CI: 85.4%, 86.0%) and across a spectrum of 10 evaluation metrics. Notably, when compared with prominent techniques, XGBoost, Random Forest (RF), AdaBoost, LightGBoost, and Extra Tree (ExT), Meta-ED showcases substantial accuracy enhancements of 58.6%, 106.3%, 22.3%, 7.0%, and 15.7%, respectively. Furthermore, incorporating weather-related features demonstrates a 3.25% improvement in the prediction accuracy of visitors' numbers. The encouraging outcomes of our study underscore Meta-ED as a foundation model for the precise prediction of daily ED visitors.
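
The stacking structure can be sketched with scikit-learn alone: tree-ensemble base learners feeding an MLP meta-learner. The example below substitutes sklearn estimators for CatBoost/LightGBM and uses synthetic data, so it only illustrates the architecture, not the paper's data or reported results.

```python
# A hedged sketch of the stacking idea behind Meta-ED using scikit-learn only:
# tree-ensemble base learners feed a Multi-Layer Perceptron meta-learner. The base
# learners and data here are stand-ins for the paper's CatBoost/LightGBM and ED records.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, StackingRegressor)
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1500, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("ext", ExtraTreesRegressor(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0),
)
stack.fit(X_tr, y_tr)
print("test MAE:", mean_absolute_error(y_te, stack.predict(X_te)))
```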

[LG-36] GROOT: Effective Design of Biological Sequences with Limited Experimental Data

链接: https://arxiv.org/abs/2411.11265
作者: Thanh V. T. Tran,Nhat Khang Ngo,Viet Anh Nguyen,Truong Son Hy
关键词-EN: wet lab experiments, expensive black-box functions, high-dimensional biological sequences, maximize expensive black-box, designing discrete
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Latent space optimization (LSO) is a powerful method for designing discrete, high-dimensional biological sequences that maximize expensive black-box functions, such as wet lab experiments. This is accomplished by learning a latent space from available data and using a surrogate model to guide optimization algorithms toward optimal outputs. However, existing methods struggle when labeled data is limited, as training the surrogate model with few labeled data points can lead to subpar outputs, offering no advantage over the training data itself. We address this challenge by introducing GROOT, a Graph-based Latent Smoothing for Biological Sequence Optimization. In particular, GROOT generates pseudo-labels for neighbors sampled around the training latent embeddings. These pseudo-labels are then refined and smoothed by Label Propagation. Additionally, we theoretically and empirically justify our approach, demonstrate GROOT’s ability to extrapolate to regions beyond the training set while maintaining reliability within an upper bound of their expected distances from the training regions. We evaluate GROOT on various biological sequence design tasks, including protein optimization (GFP and AAV) and three tasks with exact oracles from Design-Bench. The results demonstrate that GROOT equalizes and surpasses existing methods without requiring access to black-box oracles or vast amounts of labeled data, highlighting its practicality and effectiveness. We release our code at this https URL

[LG-37] Graph Retention Networks for Dynamic Graphs

链接: https://arxiv.org/abs/2411.11259
作者: Qian Chang,Xia Li,Xiufeng Cheng
关键词-EN: Graph Retention Network, propose Graph Retention, Retention Network, Graph Retention, dynamic graph
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we propose Graph Retention Network as a unified architecture for deep learning on dynamic graphs. The GRN extends the core computational manner of retention to dynamic graph data as graph retention, which empowers the model with three key computational paradigms that enable training parallelism, O(1) low-cost inference, and long-term batch training. This architecture achieves an optimal balance of effectiveness, efficiency, and scalability. Extensive experiments conducted on benchmark datasets present the superior performance of the GRN in both edge-level prediction and node-level classification tasks. Our architecture achieves cutting-edge results while maintaining lower training latency, reduced GPU memory consumption, and up to an 86.7x improvement in inference throughput compared to baseline models. The GRNs have demonstrated strong potential to become a widely adopted architecture for dynamic graph learning tasks. Code will be available at this https URL.

[LG-38] Progressive Generalization Risk Reduction for Data-Efficient Causal Effect Estimation KDD’25

链接: https://arxiv.org/abs/2411.11256
作者: Hechuan Wen,Tong Chen,Guanhua Ye,Li Kheng Chai,Shazia Sadiq,Hongzhi Yin
关键词-EN: unobserved counterfactual outcome, Causal effect estimation, CEE, medical treatment effect, crucial tool
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by KDD’25

点击查看摘要

Abstract:Causal effect estimation (CEE) provides a crucial tool for predicting the unobserved counterfactual outcome for an entity. As CEE relaxes the requirement for ``perfect'' counterfactual samples (e.g., patients with identical attributes and only differ in treatments received) that are impractical to obtain and can instead operate on observational data, it is usually used in high-stake domains like medical treatment effect prediction. Nevertheless, in those high-stake domains, gathering a decently sized, fully labelled observational dataset remains challenging due to hurdles associated with costs, ethics, expertise and time needed, etc., of which medical treatment surveys are a typical example. Consequently, if the training dataset is small in scale, low generalization risks can hardly be achieved on any CEE algorithms. Unlike existing CEE methods that assume the constant availability of a dataset with abundant samples, in this paper, we study a more realistic CEE setting where the labelled data samples are scarce at the beginning, while more can be gradually acquired over the course of training – assuredly under a limited budget considering their expensive nature. Then, the problem naturally comes down to actively selecting the best possible samples to be labelled, e.g., identifying the next subset of patients to conduct the treatment survey. However, acquiring quality data for reducing the CEE risk under limited labelling budgets remains under-explored until now. To fill the gap, we theoretically analyse the generalization risk from an intriguing perspective of progressively shrinking its upper bound, and develop a principled label acquisition pipeline exclusively for CEE tasks. With our analysis, we propose the Model Agnostic Causal Active Learning (MACAL) algorithm for batch-wise label acquisition, which aims to reduce both the CEE model's uncertainty and the post-acquisition …

[LG-39] Mirror Descent on Reproducing Kernel Banach Spaces

链接: https://arxiv.org/abs/2411.11242
作者: Akash Kumar,Mikhail Belkin,Parthe Pandit
关键词-EN: Recent advances, reproducing kernel Hilbert, kernel Hilbert spaces, kernel Hilbert, Hilbert spaces
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 42 pages, 3 figures

点击查看摘要

Abstract:Recent advances in machine learning have led to increased interest in reproducing kernel Banach spaces (RKBS) as a more general framework that extends beyond reproducing kernel Hilbert spaces (RKHS). These works have resulted in the formulation of representer theorems under several regularized learning schemes. However, little is known about an optimization method that encompasses these results in this setting. This paper addresses a learning problem on Banach spaces endowed with a reproducing kernel, focusing on efficient optimization within RKBS. To tackle this challenge, we propose an algorithm based on mirror descent (MDA). Our approach involves an iterative method that employs gradient steps in the dual space of the Banach space using the reproducing kernel. We analyze the convergence properties of our algorithm under various assumptions and establish two types of results: first, we identify conditions under which a linear convergence rate is achievable, akin to optimization in the Euclidean setting, and provide a proof of the linear rate; second, we demonstrate a standard convergence rate in a constrained setting. Moreover, to instantiate this algorithm in practice, we introduce a novel family of RKBSs with $p$-norm ($p \neq 2$), characterized by both an explicit dual map and a kernel.

[LG-40] Reliable Learning of Halfspaces under Gaussian Marginals

链接: https://arxiv.org/abs/2411.11238
作者: Ilias Diakonikolas,Lisheng Ren,Nikos Zarifis
关键词-EN: reliable PAC model, study the problem, PAC learning halfspaces, model of Kalai, PAC model captures
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of PAC learning halfspaces in the reliable agnostic model of Kalai et al. (2012). The reliable PAC model captures learning scenarios where one type of error is costlier than the others. Our main positive result is a new algorithm for reliable learning of Gaussian halfspaces on $\mathbb{R}^d$ with sample and computational complexity $d^{O(\log(\min\{1/\alpha,\,1/\epsilon\}))}\min\big(2^{\log(1/\epsilon)^{O(\log(1/\alpha))}},\,2^{\mathrm{poly}(1/\epsilon)}\big)$, where $\epsilon$ is the excess error and $\alpha$ is the bias of the optimal halfspace. We complement our upper bound with a Statistical Query lower bound suggesting that the $d^{\Omega(\log(1/\alpha))}$ dependence is best possible. Conceptually, our results imply a strong computational separation between reliable agnostic learning and standard agnostic learning of halfspaces in the Gaussian setting.

[LG-41] Don't Be So Positive: Negative Step Sizes in Second-Order Methods

链接: https://arxiv.org/abs/2411.11224
作者: Betty Shea,Mark Schmidt
关键词-EN: curvature information, negative curvature information, negative step sizes, second-order methods lies, negative step
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The value of second-order methods lies in the use of curvature information. Yet, this information is costly to extract and once obtained, valuable negative curvature information is often discarded so that the method is globally convergent. This limits the effectiveness of second-order methods in modern machine learning. In this paper, we show that second-order and second-order-like methods are promising optimizers for neural networks provided that we add one ingredient: negative step sizes. We show that under very general conditions, methods that produce ascent directions are globally convergent when combined with a Wolfe line search that allows both positive and negative step sizes. We experimentally demonstrate that using negative step sizes is often more effective than common Hessian modification methods.
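As a concrete illustration of the core idea above, here is a hedged toy sketch (not the authors' implementation; the test function, evaluation point, and step-size grid are invented): at a point of negative curvature the Newton direction can be an ascent direction, yet a line search that also tries negative step sizes still finds a decrease.

```python
# Hypothetical sketch: on a non-convex 1-D function, the Newton direction at a
# negative-curvature point is an ascent direction, but allowing negative step
# sizes in the line search still decreases f.
import numpy as np

f = lambda x: x**4 - 3 * x**2          # non-convex test function
g = lambda x: 4 * x**3 - 6 * x         # gradient
h = lambda x: 12 * x**2 - 6            # second derivative (curvature)

x = 0.5                                 # here h(x) = -3 < 0 (negative curvature)
d = -g(x) / h(x)                        # Newton direction; g(x) * d > 0, i.e. ascent

# Try both positive and negative step sizes and keep the one with the best decrease.
candidates = [s * 0.5**k for k in range(8) for s in (+1.0, -1.0)]
best_alpha = min(candidates, key=lambda a: f(x + a * d))
print("f(x) =", f(x), "-> f(x + a*d) =", f(x + best_alpha * d), "with alpha =", best_alpha)
```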

[LG-42] Data Driven Automatic Electrical Machine Preliminary Design with Artificial Intelligence Expert Guidance

链接: https://arxiv.org/abs/2411.11221
作者: Yiwei Wang,Tao Yang,Hailin Huang,Tianjie Zou,Jincai Li,Nuo Chen,Zhuoran Zhang
关键词-EN: wound-rotor synchronous generator, preliminary EMD processes, synchronous generator, design, paper presents
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a data-driven electrical machine design (EMD) framework using wound-rotor synchronous generator (WRSG) as a design example. Unlike traditional preliminary EMD processes that heavily rely on expertise, this framework leverages an artificial-intelligence based expert database, to provide preliminary designs directly from user specifications. Initial data is generated using 2D finite element (FE) machine models by sweeping fundamental design variables including machine length and diameter, enabling scalable machine geometry with machine performance for each design is recorded. This data trains a Metamodel of Optimal Prognosis (MOP)-based surrogate model, which maps design variables to key performance indicators (KPIs). Once trained, guided by metaheuristic algorithms, the surrogate model can generate thousands of geometric scalable designs, covering a wide power range, forming an AI expert database to guide future preliminary design. The framework is validated with a 30kVA WRSG design case. A prebuilt WRSG database, covering power from 10 to 60kVA, is validated by FE simulation. Design No.1138 is selected from database and compared with conventional design. Results show No.1138 achieves a higher power density of 2.21 kVA/kg in just 5 seconds, compared to 2.02 kVA/kg obtained using traditional method, which take several days. The developed AI expert database also serves as a high-quality data source for further developing AI models for automatic electrical machine design.

[LG-43] Countering Backdoor Attacks in Image Recognition: A Survey and Evaluation of Mitigation Strategies

链接: https://arxiv.org/abs/2411.11200
作者: Kealan Dunnett,Reza Arablouei,Dimity Miller,Volkan Dedeoglu,Raja Jurdak
关键词-EN: introduced substantial challenges, deep learning, explainability and security, deep learning models, widespread adoption
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The widespread adoption of deep learning across various industries has introduced substantial challenges, particularly in terms of model explainability and security. The inherent complexity of deep learning models, while contributing to their effectiveness, also renders them susceptible to adversarial attacks. Among these, backdoor attacks are especially concerning, as they involve surreptitiously embedding specific triggers within training data, causing the model to exhibit aberrant behavior when presented with input containing the triggers. Such attacks often exploit vulnerabilities in outsourced processes, compromising model integrity without affecting performance on clean (trigger-free) input data. In this paper, we present a comprehensive review of existing mitigation strategies designed to counter backdoor attacks in image recognition. We provide an in-depth analysis of the theoretical foundations, practical efficacy, and limitations of these approaches. In addition, we conduct an extensive benchmarking of sixteen state-of-the-art approaches against eight distinct backdoor attacks, utilizing three datasets, four model architectures, and three poisoning ratios. Our results, derived from 122,236 individual experiments, indicate that while many approaches provide some level of protection, their performance can vary considerably. Furthermore, when compared to two seminal approaches, most newer approaches do not demonstrate substantial improvements in overall performance or consistency across diverse settings. Drawing from these findings, we propose potential directions for developing more effective and generalizable defensive mechanisms in the future.

[LG-44] Stealing Training Graphs from Graph Neural Networks KDD2025

链接: https://arxiv.org/abs/2411.11197
作者: Minhua Lin,Enyan Dai,Junjie Xu,Jinyuan Jia,Xiang Zhang,Suhang Wang
关键词-EN: shown promising results, trained GNN, Neural Networks, training, Graph Neural Networks
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: To be appeared in KDD 2025

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have shown promising results in modeling graphs in various tasks. The training of GNNs, especially on specialized tasks such as bioinformatics, demands extensive expert annotations, which are expensive and usually contain sensitive information of data providers. The trained GNN models are often shared for deployment in the real world. As neural networks can memorize the training samples, the model parameters of GNNs have a high risk of leaking private training data. Our theoretical analysis shows the strong connections between trained GNN parameters and the training graphs used, confirming the training graph leakage issue. However, explorations into training data leakage from trained GNNs are rather limited. Therefore, we investigate a novel problem of stealing graphs from trained GNNs. To obtain high-quality graphs that resemble the target training set, a graph diffusion model with diffusion noise optimization is deployed as a graph generator. Furthermore, we propose a selection method that effectively leverages GNN model parameters to identify training graphs from samples generated by the graph diffusion model. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed framework in stealing training graphs from the trained GNN.

[LG-45] AMAGO-2: Breaking the Multi-Task Barrier in Meta-Reinforcement Learning with Transformers NEURIPS2024

链接: https://arxiv.org/abs/2411.11188
作者: Jake Grigsby,Justin Sasek,Samyak Parajuli,Daniel Adebi,Amy Zhang,Yuke Zhu
关键词-EN: Language models trained, Language models, diverse datasets unlock, datasets unlock generalization, trained on diverse
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Language models trained on diverse datasets unlock generalization by in-context learning. Reinforcement Learning (RL) policies can achieve a similar effect by meta-learning within the memory of a sequence model. However, meta-RL research primarily focuses on adapting to minor variations of a single task. It is difficult to scale towards more general behavior without confronting challenges in multi-task optimization, and few solutions are compatible with meta-RL’s goal of learning from large training sets of unlabeled tasks. To address this challenge, we revisit the idea that multi-task RL is bottlenecked by imbalanced training losses created by uneven return scales across different tasks. We build upon recent advancements in Transformer-based (in-context) meta-RL and evaluate a simple yet scalable solution where both an agent’s actor and critic objectives are converted to classification terms that decouple optimization from the current scale of returns. Large-scale comparisons in Meta-World ML45, Multi-Game Procgen, Multi-Task POPGym, Multi-Game Atari, and BabyAI find that this design unlocks significant progress in online multi-task adaptation and memory problems without explicit task labels.

[LG-46] Mixing Neural Networks and Exponential Moving Averages for Predicting Wireless Links Behavior

链接: https://arxiv.org/abs/2411.11185
作者: Gabriele Formis,Stefano Scanzio,Lukasz Wisniewski,Gianluca Cena
关键词-EN: frame delivery ratio, delivery ratio, frame delivery, critical task, task for optimizing
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: preprint, 6 pages, 2024

点击查看摘要

Abstract:Predicting the behavior of a wireless link in terms of, e.g., the frame delivery ratio, is a critical task for optimizing the performance of wireless industrial communication systems. This is because industrial applications are typically characterized by stringent dependability and end-to-end latency requirements, which are adversely affected by channel quality degradation. In this work, we studied two neural network models for Wi-Fi link quality prediction in dense indoor environments. Experimental results show that their accuracy outperforms conventional methods based on exponential moving averages, due to their ability to capture complex patterns about communications, including the effects of shadowing and multipath propagation, which are particularly pronounced in industrial scenarios. This highlights the potential of neural networks for predicting spectrum behavior in challenging operating conditions, and suggests that they can be exploited to improve determinism and dependability of wireless communications, fostering their adoption in the industry.
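For context, a hedged sketch of the conventional exponential-moving-average baseline that the neural models are compared against; the synthetic frame-delivery-ratio trace and the smoothing factor below are illustrative assumptions, not data from the paper.

```python
# Minimal EMA one-step-ahead predictor of a frame delivery ratio (FDR) trace.
import numpy as np

rng = np.random.default_rng(0)
fdr = np.clip(0.9 + 0.05 * np.sin(np.arange(200) / 10) + 0.02 * rng.standard_normal(200), 0, 1)

def ema_forecast(series, alpha=0.2):
    """Predict series[t] from samples up to t-1 with an exponential moving average."""
    preds, ema = [], series[0]
    for x in series:
        preds.append(ema)            # forecast issued before observing x
        ema = alpha * x + (1 - alpha) * ema
    return np.array(preds)

pred = ema_forecast(fdr)
print("EMA baseline MAE:", np.mean(np.abs(pred[1:] - fdr[1:])))
```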

[LG-47] Robust Defense Against Extreme Grid Events Using Dual-Policy Reinforcement Learning Agents

链接: https://arxiv.org/abs/2411.11180
作者: Benjamin M. Peter,Mert Korkali
关键词-EN: Reinforcement learning, powerful tools, tools for managing, Reinforcement, managing power grids
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, submitted to the 2025 Texas Power and Energy Conference (TPEC)

点击查看摘要

Abstract:Reinforcement learning (RL) agents are powerful tools for managing power grids. They use large amounts of data to inform their actions and receive rewards or penalties as feedback to learn favorable responses for the system. Once trained, these agents can efficiently make decisions that would be too computationally complex for a human operator. This ability is especially valuable in decarbonizing power networks, where the demand for RL agents is increasing. These agents are well suited to control grid actions since the action space is constantly growing due to uncertainties in renewable generation, microgrid integration, and cybersecurity threats. To assess the efficacy of RL agents in response to an adverse grid event, we use the Grid2Op platform for agent training. We employ a proximal policy optimization (PPO) algorithm in conjunction with graph neural networks (GNNs). By simulating agents’ responses to grid events, we assess their performance in avoiding grid failure for as long as possible. The performance of an agent is expressed concisely through its reward function, which helps the agent learn the most optimal ways to reconfigure a grid’s topology amidst certain events. To model multi-actor scenarios that threaten modern power networks, particularly those resulting from cyberattacks, we integrate an opponent that acts iteratively against a given agent. This interplay between the RL agent and opponent is utilized in N-k contingency screening, providing a novel alternative to the traditional security assessment.

[LG-48] Infinite Width Limits of Self Supervised Neural Networks

链接: https://arxiv.org/abs/2411.11176
作者: Maximilian Fleissner,Gautham Govind Anil,Debarghya Ghoshdastidar
关键词-EN: supervised deep neural, wide neural networks, deep neural networks, neural networks, Barlow Twins
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The NTK is a widely used tool in the theoretical analysis of deep learning, allowing us to look at supervised deep neural networks through the lenses of kernel regression. Recently, several works have investigated kernel models for self-supervised learning, hypothesizing that these also shed light on the behaviour of wide neural networks by virtue of the NTK. However, it remains an open question to what extent this connection is mathematically sound – it is a commonly encountered misbelief that the kernel behaviour of wide neural networks emerges irrespective of the loss function it is trained on. In this paper, we bridge the gap between the NTK and self-supervised learning, focusing on two-layer neural networks trained under the Barlow Twins loss. We prove that the NTK of Barlow Twins indeed becomes constant as the width of the network approaches infinity. Our analysis technique is different from previous works on the NTK and may be of independent interest. Overall, our work provides a first rigorous justification for the use of classic kernel theory to understand self-supervised learning of wide neural networks. Building on this result, we derive generalization error bounds for kernelized Barlow Twins and connect them to neural networks of finite width.
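For readers unfamiliar with the objective being analyzed, here is a standard-form sketch of the Barlow Twins loss (the well-known formulation, not the paper's NTK derivation); the batch size, embedding width, and lambda value are illustrative.

```python
# Barlow Twins loss: decorrelate the embeddings of two augmented views by pushing
# the cross-correlation matrix toward the identity.
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)    # normalize each embedding dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                             # d x d cross-correlation matrix
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()  # diagonal entries pushed toward 1
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()  # off-diagonal toward 0
    return on_diag + lam * off_diag

z1, z2 = torch.randn(128, 32), torch.randn(128, 32)
print(barlow_twins_loss(z1, z2))
```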

[LG-49] Learning the Sherrington-Kirkpatrick Model Even at Low Temperature

链接: https://arxiv.org/abs/2411.11174
作者: Gautam Chandrasekaran,Adam Klivans
关键词-EN: Markov Random Field, Random Field, Markov Random, undirected graphical model, fundamental problem
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the fundamental problem of learning the parameters of an undirected graphical model or Markov Random Field (MRF) in the setting where the edge weights are chosen at random. For Ising models, we show that a multiplicative-weight update algorithm due to Klivans and Meka learns the parameters in polynomial time for any inverse temperature $\beta \leq \sqrt{\log n}$. This immediately yields an algorithm for learning the Sherrington-Kirkpatrick (SK) model beyond the high-temperature regime of $\beta < 1$. Prior work breaks down at $\beta = 1$ and requires heavy machinery from statistical physics or functional inequalities. In contrast, our analysis is relatively simple and uses only subgaussian concentration. Our results extend to MRFs of higher order (such as pure $p$-spin models), where even results in the high-temperature regime were not known.

[LG-50] Federated Learning for UAV-Based Spectrum Sensing: Enhancing Accuracy Through SNR-Weighted Model Aggregation

链接: https://arxiv.org/abs/2411.11159
作者: Kürşat Tekbıyık,Güneş Karabulut Kurt,Antoine Lesage-Landry
关键词-EN: wireless communications requires, backhaul links, spectrum, achieve wider bandwidth, inhibit merging bands
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing demand for data usage in wireless communications requires using wider bands in the spectrum, especially for backhaul links. Yet, allocations in the spectrum for non-communication systems inhibit merging bands to achieve wider bandwidth. To overcome this issue, spectrum-sharing or opportunistic spectrum utilization by secondary users stands out as a promising solution. However, both approaches must minimize interference to primary users. Therefore, spectrum sensing becomes vital for such opportunistic usage, ensuring the proper operation of the primary users. Although this problem has been investigated for 2D networks, unmanned aerial vehicle (UAV) networks need different points of view concerning 3D space, its challenges, and opportunities. For this purpose, we propose a federated learning (FL)-based method for spectrum sensing in UAV networks to account for their distributed nature and limited computational capacity. FL enables local training without sharing raw data while guaranteeing the privacy of local users, lowering communication overhead, and increasing data diversity. Furthermore, we develop a federated aggregation method, namely FedSNR, that considers the signal-to-noise ratio observed by UAVs to acquire a global model. The numerical results show that the proposed architecture and the aggregation method outperform traditional methods.
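A hedged sketch of what an SNR-weighted aggregation step could look like; the exact weighting used by FedSNR may differ, and the parameter names and dB-to-linear conversion below are assumptions made for illustration.

```python
# Toy SNR-weighted federated averaging: each UAV's local model is weighted by its
# observed signal-to-noise ratio when forming the global model.
import numpy as np

def aggregate_snr(local_models, snrs_db):
    """local_models: list of dicts {param_name: ndarray}; snrs_db: list of floats."""
    w = np.array([10 ** (s / 10) for s in snrs_db])   # convert dB to linear scale
    w = w / w.sum()                                    # normalize to aggregation weights
    global_model = {}
    for name in local_models[0]:
        global_model[name] = sum(wi * m[name] for wi, m in zip(w, local_models))
    return global_model

locals_ = [{"w": np.random.randn(4)} for _ in range(3)]
print(aggregate_snr(locals_, snrs_db=[5.0, 12.0, 20.0]))
```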

[LG-51] From Primes to Paths: Enabling Fast Multi-Relational Graph Analysis KDD2025 ECML

链接: https://arxiv.org/abs/2411.11149
作者: Konstantinos Bougiatiotis,Georgios Paliouras
关键词-EN: capture intricate relationships, networks capture intricate, Multi-relational networks capture, social sciences, capture intricate
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 35 pages: 28 main, 7 appendix; 6 figures. Submitted to ECML PKDD 2025 Journal Track for Data Mining and Knowledge Discovery. For the code accompanying the paper see this http URL . For a demo app on relation prediction on HetioNet using BoP representations see this http URL

点击查看摘要

Abstract:Multi-relational networks capture intricate relationships in data and have diverse applications across fields such as biomedical, financial, and social sciences. As networks derived from increasingly large datasets become more common, identifying efficient methods for representing and analyzing them becomes crucial. This work extends the Prime Adjacency Matrices (PAMs) framework, which employs prime numbers to represent distinct relations within a network uniquely. This enables a compact representation of a complete multi-relational graph using a single adjacency matrix, which, in turn, facilitates quick computation of multi-hop adjacency matrices. In this work, we enhance the framework by introducing a lossless algorithm for calculating the multi-hop matrices and propose the Bag of Paths (BoP) representation, a versatile feature extraction methodology for various graph analytics tasks, at the node, edge, and graph level. We demonstrate the efficiency of the framework across various tasks and datasets, showing that simple BoP-based models perform comparably to or better than commonly used neural models while offering improved speed and interpretability.
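To make the prime-based encoding concrete, a toy sketch follows; the example graph, relation names, and prime assignment are invented for illustration and do not come from the paper.

```python
# Prime Adjacency Matrix (PAM) idea in miniature: each relation type gets a distinct
# prime, so one matrix encodes a multi-relational graph, and powers of that matrix
# carry products of primes that identify the relation sequences along multi-hop paths.
import numpy as np

edges = [(0, 1, "treats"), (1, 2, "causes"), (0, 2, "interacts")]   # toy graph
relations = sorted({r for _, _, r in edges})
primes = [2, 3, 5, 7, 11]
prime_of = {r: primes[i] for i, r in enumerate(relations)}          # relation -> prime

n = 3
pam = np.zeros((n, n))
for u, v, r in edges:
    pam[u, v] = prime_of[r]

# Entry (i, j) of pam @ pam sums, over 2-hop paths i -> k -> j, the product of the
# two relation primes, so the factorization reveals which relation chains exist.
two_hop = pam @ pam
print(prime_of)
print(two_hop)
```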

[LG-52] Spectral Subspace Clustering for Attributed Graphs KDD2025

链接: https://arxiv.org/abs/2411.11074
作者: Xiaoyang Lin,Renchi Yang,Haoran Zheng,Xiangyu Ke
关键词-EN: Subspace clustering seeks, Subspace clustering, images and videos, seeks to identify, segment a set
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 15 pages. Full version of the paper accepted to KDD 2025

点击查看摘要

Abstract:Subspace clustering seeks to identify subspaces that segment a set of $n$ data points into $k$ ($k < n$) groups, which has emerged as a powerful tool for analyzing data from various domains, especially images and videos. Recently, several studies have demonstrated the great potential of subspace clustering models for partitioning vertices in attributed graphs, referred to as SCAG. However, these works either demand significant computational overhead for constructing the $n \times n$ self-expressive matrix, or fail to incorporate graph topology and attribute data into the subspace clustering framework effectively, and thus, compromise result quality. Motivated by this, this paper presents two effective and efficient algorithms, S2CAG and M-S2CAG, for SCAG computation. Particularly, S2CAG obtains superb performance through three major contributions. First, we formulate a new objective function for SCAG with a refined representation model for vertices and two non-trivial constraints. On top of that, an efficient linear-time optimization solver is developed based on our theoretically grounded problem transformation and well-thought-out adaptive strategy. We then conduct an in-depth analysis to disclose the theoretical connection of S2CAG to conductance minimization, which further inspires the design of M-S2CAG that maximizes the modularity. Our extensive experiments, comparing S2CAG and M-S2CAG against 17 competitors over 8 benchmark datasets, exhibit that our solutions outperform all baselines in terms of clustering quality measured against the ground truth while delivering high efficiency.

[LG-53] Generating medical screening questionnaires through analysis of social media data

链接: https://arxiv.org/abs/2411.11048
作者: Ortal Ashkenazi,Elad Yom-Tov,Liron Vardi David
关键词-EN: diagnostic aid, Screening questionnaires, generating screening questionnaires, social media, questionnaires
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Screening questionnaires are used in medicine as a diagnostic aid. Creating them is a long and expensive process, which could potentially be improved through analysis of social media posts related to symptoms and behaviors prior to diagnosis. Here we show a preliminary investigation into the feasibility of generating screening questionnaires for a given medical condition from social media postings. The method first identifies a cohort of relevant users through their posts in dedicated patient groups and a control group of users who reported similar symptoms but did not report being diagnosed with the condition of interest. Posts made prior to diagnosis are used to generate decision rules to differentiate between the different groups, by clustering symptoms mentioned by these users and training a decision tree to differentiate between the two groups. We validate the generated rules by correlating them with scores given by medical doctors to matching hypothetical cases. We demonstrate the proposed method by creating questionnaires for three conditions (endometriosis, lupus, and gout) using the data of several hundreds of users from Reddit. These questionnaires were then validated by medical doctors. The average Pearson’s correlations between the latter’s scores and the decision rules were 0.58 (endometriosis), 0.40 (lupus) and 0.27 (gout). Our results suggest that the process of questionnaire generation can be, at least partly, automated. These questionnaires are advantageous in that they are based on real-world experience but are currently lacking in their ability to capture the context, duration, and timing of symptoms.

[LG-54] Efficient Federated Unlearning with Adaptive Differential Privacy Preservation

链接: https://arxiv.org/abs/2411.11044
作者: Yu Jiang,Xindi Tong,Ziyao Liu,Huanyi Ye,Chee Wei Tan,Kwok-Yan Lam
关键词-EN: federated learning, offers a promising, promising solution, erase the impact, impact of specific
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated unlearning (FU) offers a promising solution to effectively address the need to erase the impact of specific clients’ data on the global model in federated learning (FL), thereby granting individuals the ``Right to be Forgotten". The most straightforward approach to achieve unlearning is to train the model from scratch, excluding clients who request data removal, but it is resource-intensive. Current state-of-the-art FU methods extend traditional FL frameworks by leveraging stored historical updates, enabling more efficient unlearning than training from scratch. However, the use of stored updates introduces significant privacy risks. Adversaries with access to these updates can potentially reconstruct clients’ local data, a well-known vulnerability in the privacy domain. While privacy-enhanced techniques exist, their applications to FU scenarios that balance unlearning efficiency with privacy protection remain underexplored. To address this gap, we propose FedADP, a method designed to achieve both efficiency and privacy preservation in FU. Our approach incorporates an adaptive differential privacy (DP) mechanism, carefully balancing privacy and unlearning performance through a novel budget allocation strategy tailored for FU. FedADP also employs a dual-layered selection process, focusing on global models with significant changes and client updates closely aligned with the global model, reducing storage and communication costs. Additionally, a novel calibration method is introduced to facilitate effective unlearning. Extensive experimental results demonstrate that FedADP effectively manages the trade-off between unlearning efficiency and privacy protection.

[LG-55] FedUHB: Accelerating Federated Unlearning via Polyak Heavy Ball Method

链接: https://arxiv.org/abs/2411.11039
作者: Yu Jiang,Chee Wei Tan,Kwok-Yan Lam
关键词-EN: enabling multiple participants, facilitates collaborative machine, collaborative machine learning, enabling multiple, learning facilitates collaborative
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated learning facilitates collaborative machine learning, enabling multiple participants to collectively develop a shared model while preserving the privacy of individual data. The growing importance of the “right to be forgotten” calls for effective mechanisms to facilitate data removal upon request. In response, federated unlearning (FU) has been developed to efficiently eliminate the influence of specific data from the model. Current FU methods primarily rely on approximate unlearning strategies, which seek to balance data removal efficacy with computational and communication costs, but often fail to completely erase data influence. To address these limitations, we propose FedUHB, a novel exact unlearning approach that leverages the Polyak heavy ball optimization technique, a first-order method, to achieve rapid retraining. In addition, we introduce a dynamic stopping mechanism to optimize the termination of the unlearning process. Our extensive experiments show that FedUHB not only enhances unlearning efficiency but also preserves robust model performance after unlearning. Furthermore, the dynamic stopping mechanism effectively reduces the number of unlearning iterations, conserving both computational and communication resources. FedUHB can be proved as an effective and efficient solution for exact data removal in federated learning settings.
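For reference, a minimal sketch of the Polyak heavy-ball update that FedUHB builds on; the federated and unlearning machinery is omitted, and the quadratic objective and hyperparameters below are illustrative.

```python
# Polyak heavy-ball (momentum) update:
#   v_{t+1} = beta * v_t - eta * grad f(x_t),   x_{t+1} = x_t + v_{t+1}
import numpy as np

def heavy_ball(grad, x0, eta=0.1, beta=0.9, steps=100):
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v - eta * grad(x)
        x = x + v
    return x

# Example: minimize the simple quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
print(heavy_ball(lambda x: x, x0=[5.0, -3.0]))
```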

[LG-56] EfQAT: An Efficient Framework for Quantization-Aware Training

链接: https://arxiv.org/abs/2411.11038
作者: Saleh Ashkboos,Bram Verhoef,Torsten Hoefler,Evangelos Eleftheriou,Martino Dazzi
关键词-EN: achieve near-full precision, Quantization-aware training, shown to achieve, achieve near-full, near-full precision accuracy
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Quantization-aware training (QAT) schemes have been shown to achieve near-full precision accuracy. They accomplish this by training a quantized model for multiple epochs. This is computationally expensive, mainly because of the full precision backward pass. On the other hand, post-training quantization (PTQ) schemes do not involve training and are therefore computationally cheap, but they usually result in a significant accuracy drop. We address these challenges by proposing EfQAT, which generalizes both schemes by optimizing only a subset of the parameters of a quantized model. EfQAT starts by applying a PTQ scheme to a pre-trained model and only updates the most critical network parameters while freezing the rest, accelerating the backward pass. We demonstrate the effectiveness of EfQAT on various CNNs and Transformer-based models using different GPUs. Specifically, we show that EfQAT is significantly more accurate than PTQ with little extra compute. Furthermore, EfQAT can accelerate the QAT backward pass between 1.44-1.64x while retaining most accuracy.
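A rough sketch of the freeze-most-parameters idea in PyTorch (not the authors' code); the fake-quantization step, the choice of the last layer as the "critical" subset, and all hyperparameters are assumptions made for illustration.

```python
# Start from a (crudely) post-training-quantized model, then fine-tune only a small
# subset of parameters while freezing the rest, so the backward pass is cheaper.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Step 1 (stand-in for PTQ): fake-quantize weights to an int8 grid once.
with torch.no_grad():
    for p in model.parameters():
        scale = p.abs().max() / 127 + 1e-12
        p.copy_(torch.round(p / scale) * scale)

# Step 2: freeze everything except the last layer (the assumed "critical" subset).
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-2)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()                          # gradients flow only to the unfrozen layer
opt.step()
print("trainable params:", sum(p.numel() for p in model.parameters() if p.requires_grad))
```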

[LG-57] Training a Label-Noise-Resistant GNN with Reduced Complexity

链接: https://arxiv.org/abs/2411.11020
作者: Rui Zhao,Bin Shi,Zhiming Liang,Jianfei Ruan,Bo Dong,Lu Lin
关键词-EN: Graph Neural Networks, Ensemble Graph Neural, Graph Neural, label, Neural Networks
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have been widely employed for semi-supervised node classification tasks on graphs. However, the performance of GNNs is significantly affected by label noise, that is, a small amount of incorrectly labeled nodes can substantially misguide model training. Mainstream solutions define node classification with label noise (NCLN) as a reliable labeling task, often introducing node similarity with quadratic computational complexity to more accurately assess label reliability. To this end, in this paper, we introduce the Label Ensemble Graph Neural Network (LEGNN), a lower complexity method for robust GNNs training against label noise. LEGNN reframes NCLN as a label ensemble task, gathering informative multiple labels instead of constructing a single reliable label, avoiding high-complexity computations for reliability assessment. Specifically, LEGNN conducts a two-step process: bootstrapping neighboring contexts and robust learning with gathered multiple labels. In the former step, we apply random neighbor masks for each node and gather the predicted labels as a high-probability label set. This mitigates the impact of inaccurately labeled neighbors and diversifies the label set. In the latter step, we utilize a partial label learning based strategy to aggregate the high-probability label information for model training. Additionally, we symmetrically gather a low-probability label set to counteract potential noise from the bootstrapped high-probability label set. Extensive experiments on six datasets demonstrate that LEGNN achieves outstanding performance while ensuring efficiency. Moreover, it exhibits good scalability on dataset with over one hundred thousand nodes and one million edges.

[LG-58] Beyond Normal: Learning Spatial Density Models of Node Mobility

链接: https://arxiv.org/abs/2411.10997
作者: Wanxin Gao,Ioanis Nikolaidis,Janelle Harms
关键词-EN: mobile nodes moving, spatial density functions, mobile node density, complex spatial density, density network models
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Learning models of complex spatial density functions, representing the steady-state density of mobile nodes moving on a two-dimensional terrain, can assist in network design and optimization problems, e.g., by accelerating the computation of the density function during a parameter sweep. We address the question of applicability for off-the-shelf mixture density network models for the description of mobile node density over a disk. We propose the use of Möbius distributions to retain symmetric spatial relations, yet be flexible enough to capture changes as one radially traverses the disk. The mixture models for Möbius versus Gaussian distributions are compared and the benefits of choosing Möbius distributions become evident, yet we also observe that learning mixtures of Möbius distributions is a fragile process, when using current tools, compared to learning mixtures of Gaussians.

[LG-59] owards a framework on tabular synthetic data generation: a minimalist approach: theory use cases and limitations

链接: https://arxiv.org/abs/2411.10982
作者: Agus Sudjianto,Yueyang Shen,Arun Prakash R,Anwesha Bhattacharyya,Maorong Rao,Yaqun Wang,Joel Vaughan,Nengfeng Zhou
关键词-EN: tabular data generation, synthetic tabular data, minimalist approach, approach towards synthetic, synthetic tabular
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose and study a minimalist approach towards synthetic tabular data generation. The model consists of a minimalistic unsupervised SparsePCA encoder (with contingent clustering step or log transformation to handle nonlinearity) and an XGBoost decoder, which is SOTA for structured data regression and classification tasks. We study and contrast the methodologies with (variational) autoencoders in several toy low dimensional scenarios to derive necessary intuitions. The framework is applied to high dimensional simulated credit scoring data which parallels real-life financial applications. We applied the method to robustness testing to demonstrate practical use cases. The case study result suggests that the method provides an alternative to raw and quantile perturbation for model robustness testing. We show that the method is simple, guarantees interpretability all the way through, does not require extra tuning, and provides unique benefits.
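A hedged sketch of such an encoder/decoder pipeline using scikit-learn's SparsePCA and XGBoost; the data, component count, and perturbation scale are placeholders rather than the paper's settings.

```python
# SparsePCA "encoder" produces low-dimensional codes; one XGBoost "decoder" per column
# reconstructs the original features from the codes, here used to resample new rows.
import numpy as np
from sklearn.decomposition import SparsePCA
from xgboost import XGBRegressor

X = np.random.rand(500, 6)                      # stand-in tabular data
enc = SparsePCA(n_components=3, random_state=0)
Z = enc.fit_transform(X)                        # unsupervised encoding

decoders = [XGBRegressor(n_estimators=50).fit(Z, X[:, j]) for j in range(X.shape[1])]

# Generate synthetic rows: perturb the codes, then decode column by column.
Z_synth = Z + 0.05 * np.random.randn(*Z.shape)
X_synth = np.column_stack([dec.predict(Z_synth) for dec in decoders])
print(X_synth.shape)
```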

[LG-60] Efficient Low-Regret Online Reinforcement Learning for Linear MDPs

链接: https://arxiv.org/abs/2411.10906
作者: Philips George John,Arnab Bhattacharyya,Silviu Maniu,Dimitrios Myrisiotis,Zhenan Wu
关键词-EN: Reinforcement learning algorithms, Reinforcement learning, polynomial-time reinforcement learning, theoretical guarantees, provided theoretical guarantees
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: 27 pages, 9 figures

点击查看摘要

Abstract:Reinforcement learning algorithms are usually stated without theoretical guarantees regarding their performance. Recently, Jin, Yang, Wang, and Jordan (COLT 2020) showed a polynomial-time reinforcement learning algorithm (namely, LSVI-UCB) for the setting of linear Markov decision processes, and provided theoretical guarantees regarding its running time and regret. In real-world scenarios, however, the space usage of this algorithm can be prohibitive due to a utilized linear regression step. We propose and analyze two modifications of LSVI-UCB, which alternate periods of learning and not-learning, to reduce space and time usage while maintaining sublinear regret. We show experimentally, on synthetic data and real-world benchmarks, that our algorithms achieve low space usage and running time, while not significantly sacrificing regret.

[LG-61] Watermarking Generative Categorical Data

链接: https://arxiv.org/abs/2411.10898
作者: Bochao Gu,Hengzhi He,Guang Cheng
关键词-EN: generative categorical data, watermarking generative categorical, data, data distribution, statistical framework
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel statistical framework for watermarking generative categorical data. Our method systematically embeds pre-agreed secret signals by splitting the data distribution into two components and modifying one distribution based on a deterministic relationship with the other, ensuring the watermark is embedded at the distribution-level. To verify the watermark, we introduce an insertion inverse algorithm and detect its presence by measuring the total variation distance between the inverse-decoded data and the original distribution. Unlike previous categorical watermarking methods, which primarily focus on embedding watermarks into a given dataset, our approach operates at the distribution-level, allowing for verification from a statistical distributional perspective. This makes it particularly well-suited for the modern paradigm of synthetic data generation, where the underlying data distribution, rather than specific data points, is of primary importance. The effectiveness of our method is demonstrated through both theoretical analysis and empirical validation.
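As a small illustration of the detection statistic mentioned above, the snippet below computes the total variation distance between two empirical categorical distributions; the watermark embedding and inverse-decoding steps are paper-specific and omitted.

```python
# Total variation distance between two categorical samples: 0.5 * sum |p(x) - q(x)|.
from collections import Counter

def total_variation(samples_a, samples_b):
    ca, cb = Counter(samples_a), Counter(samples_b)
    na, nb = len(samples_a), len(samples_b)
    support = set(ca) | set(cb)
    return 0.5 * sum(abs(ca[x] / na - cb[x] / nb) for x in support)

original = ["A", "B", "B", "C", "A", "B"] * 100   # reference distribution samples
decoded  = ["A", "B", "C", "C", "A", "B"] * 100   # inverse-decoded samples
print(total_variation(original, decoded))
```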

[LG-62] Neuc-MDS: Non-Euclidean Multidimensional Scaling Through Bilinear Forms NEURIPS2024

链接: https://arxiv.org/abs/2411.10889
作者: Chengyuan Deng,Jie Gao,Kevin Lu,Feng Luo,Hongbin Sun,Cheng Xin
关键词-EN: classical Multidimensional Scaling, Multidimensional Scaling, non-metric inputs, accommodates non-Euclidean, non-Euclidean and non-metric
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:We introduce Non-Euclidean-MDS (Neuc-MDS), an extension of classical Multidimensional Scaling (MDS) that accommodates non-Euclidean and non-metric inputs. The main idea is to generalize the standard inner product to symmetric bilinear forms to utilize the negative eigenvalues of dissimilarity Gram matrices. Neuc-MDS efficiently optimizes the choice of (both positive and negative) eigenvalues of the dissimilarity Gram matrix to reduce STRESS, the sum of squared pairwise error. We provide an in-depth error analysis and proofs of the optimality in minimizing lower bounds of STRESS. We demonstrate Neuc-MDS’s ability to address limitations of classical MDS raised by prior research, and test it on various synthetic and real-world datasets in comparison with both linear and non-linear dimension reduction methods.
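A hedged sketch of the underlying mechanics: eigendecompose the double-centered Gram matrix and keep the largest-magnitude eigenvalues of either sign; the paper's optimal eigenvalue selection and STRESS analysis are not reproduced here, and the toy dissimilarity matrix is invented.

```python
# Pseudo-Euclidean embedding from a dissimilarity matrix: unlike classical MDS,
# negative eigenvalues of the centered Gram matrix are allowed to contribute.
import numpy as np

def neuc_mds_sketch(D, dim=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J                 # double-centered Gram matrix (may be indefinite)
    vals, vecs = np.linalg.eigh(G)
    idx = np.argsort(-np.abs(vals))[:dim]       # largest-magnitude eigenvalues, either sign
    signs = np.sign(vals[idx])                  # signature of the bilinear form
    X = vecs[:, idx] * np.sqrt(np.abs(vals[idx]))
    return X, signs

D = np.array([[0, 2, 3], [2, 0, 1], [3, 1, 0]], dtype=float)   # toy dissimilarity matrix
X, signs = neuc_mds_sketch(D)
print(X, signs)
```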

[LG-63] A Data-Efficient Sequential Learning Framework for Melt Pool Defect Classification in Laser Powder Bed Fusion

链接: https://arxiv.org/abs/2411.10822
作者: Ahmed Shoyeb Raihan,Austin Harper,Israt Zarin Era,Omar Al-Shebeeb,Thorsten Wuest,Srinjoy Das,Imtiaz Ahmed
关键词-EN: Metal Additive Manufacturing, Laser Powder Bed, Powder Bed Fusion, Additive Manufacturing, Metal Additive
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Ensuring the quality and reliability of Metal Additive Manufacturing (MAM) components is crucial, especially in the Laser Powder Bed Fusion (L-PBF) process, where melt pool defects such as keyhole, balling, and lack of fusion can significantly compromise structural integrity. This study presents SL-RF+ (Sequentially Learned Random Forest with Enhanced Sampling), a novel Sequential Learning (SL) framework for melt pool defect classification designed to maximize data efficiency and model accuracy in data-scarce environments. SL-RF+ utilizes RF classifier combined with Least Confidence Sampling (LCS) and Sobol sequence-based synthetic sampling to iteratively select the most informative samples to learn from, thereby refining the model’s decision boundaries with minimal labeled data. Results show that SL-RF+ outperformed traditional machine learning models across key performance metrics, including accuracy, precision, recall, and F1 score, demonstrating significant robustness in identifying melt pool defects with limited data. This framework efficiently captures complex defect patterns by focusing on high-uncertainty regions in the process parameter space, ultimately achieving superior classification performance without the need for extensive labeled datasets. While this study utilizes pre-existing experimental data, SL-RF+ shows strong potential for real-world applications in pure sequential learning settings, where data is acquired and labeled incrementally, mitigating the high costs and time constraints of sample acquisition.
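An illustrative sketch of a sequential-learning loop with least-confidence sampling around a random forest; the Sobol-based synthetic sampling of SL-RF+ is omitted, and the dataset below is a synthetic stand-in rather than melt pool data.

```python
# Active learning loop: repeatedly fit a random forest on the labeled set and query
# the pool samples whose predicted class probability is least confident.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
labeled = list(range(30))                      # small initial labeled set
pool = [i for i in range(len(y)) if i not in labeled]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
for _ in range(10):                            # 10 acquisition rounds
    rf.fit(X[labeled], y[labeled])
    proba = rf.predict_proba(X[pool])
    confidence = proba.max(axis=1)             # least confident = most informative
    pick = [pool[i] for i in np.argsort(confidence)[:10]]
    labeled += pick                            # "query the label" of the picked samples
    pool = [i for i in pool if i not in pick]

print("accuracy on remaining pool:", rf.score(X[pool], y[pool]))
```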

[LG-64] GeomCLIP: Contrastive Geometry-Text Pre-training for Molecules

链接: https://arxiv.org/abs/2411.10821
作者: Teng Xiao,Chao Cui,Huaisheng Zhu,Vasant G. Honavar
关键词-EN: material discovery, crucial for drug, drug and material, biomedical texts, geometric structures
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: BIBM 2024

点击查看摘要

Abstract:Pretraining molecular representations is crucial for drug and material discovery. Recent methods focus on learning representations from geometric structures, effectively capturing 3D position information. Yet, they overlook the rich information in biomedical texts, which detail molecules’ properties and substructures. With this in mind, we set up a data collection effort for 200K pairs of ground-state geometric structures and biomedical texts, resulting in a PubChem3D dataset. Based on this dataset, we propose the GeomCLIP framework to enhance multi-modal representation learning from molecular structures and biomedical text. During pre-training, we design two types of tasks, i.e., multimodal representation alignment and unimodal denoising pretraining, to align the 3D geometric encoder with textual information and, at the same time, preserve its original representation power. Experimental results show the effectiveness of GeomCLIP in various tasks such as molecular property prediction, zero-shot text-molecule retrieval, and 3D molecule captioning. Our code and collected dataset are available at this https URL

[LG-65] An Oversampling-enhanced Multi-class Imbalanced Classification Framework for Patient Health Status Prediction Using Patient-reported Outcomes

链接: https://arxiv.org/abs/2411.10819
作者: Yang Yan,Zhong Chen,Cai Xu,Xinglei Shen,Jay Shiao,John Einck,Ronald C Chen,Hao Gao
关键词-EN: radiation therapy play, Patient-reported outcomes, treated with radiation, play a vital, vital role
类目: Machine Learning (cs.LG)
*备注: 10 pages, 12 figures, 4 tables

点击查看摘要

Abstract:Patient-reported outcomes (PROs) directly collected from cancer patients being treated with radiation therapy play a vital role in assisting clinicians in counseling patients regarding likely toxicities. Precise prediction and evaluation of symptoms or health status associated with PROs are fundamental to enhancing decision-making and planning for the required services and support as patients transition into survivorship. However, the raw PRO data collected from hospitals exhibits some intrinsic challenges such as incomplete item reports and imbalanced patient toxicities. To this end, in this study, we explore various machine learning techniques to predict patient outcomes related to health status such as pain levels and sleep discomfort using PRO datasets from a cancer photon/proton therapy center. Specifically, we deploy six advanced machine learning classifiers – Random Forest (RF), XGBoost, Gradient Boosting (GB), Support Vector Machine (SVM), Multi-Layer Perceptron with Bagging (MLP-Bagging), and Logistic Regression (LR) – to tackle a multi-class imbalance classification problem across three prevalent cancer types: head and neck, prostate, and breast cancers. To address the class imbalance issue, we employ an oversampling strategy, adjusting the training set sample sizes through interpolations of in-class neighboring samples, thereby augmenting minority classes without deviating from the original skewed class distribution. Our experimental findings across multiple PRO datasets indicate that the RF and XGB methods achieve robust generalization performance, evidenced by weighted AUC and detailed confusion matrices, in categorizing outcomes as mild, intermediate, and severe post-radiation therapy. These results underscore the models’ effectiveness and potential utility in clinical settings.
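The interpolation-based oversampling can be illustrated with a short SMOTE-style sketch; the neighbor count and data below are placeholders, and the paper's exact interpolation scheme may differ.

```python
# New minority samples are interpolations between a minority sample and one of its
# in-class nearest neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_minority(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(X_min))).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = rng.choice(idx[i][1:])             # a same-class neighbor (skip self)
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

X_min = np.random.rand(20, 4)                  # minority-class samples (e.g., "severe")
print(oversample_minority(X_min, n_new=30).shape)
```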

[LG-66] Conformation Generation using Transformer Flows

链接: https://arxiv.org/abs/2411.10817
作者: Sohil Atul Shah,Vladlen Koltun
关键词-EN: Estimating three-dimensional conformations, Estimating three-dimensional, chemical functions, graph allows insight, biological and chemical
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: Technical Report. Code available at this https URL

点击查看摘要

Abstract:Estimating three-dimensional conformations of a molecular graph allows insight into the molecule’s biological and chemical functions. Fast generation of valid conformations is thus central to molecular modeling. Recent advances in graph-based deep networks have accelerated conformation generation from hours to seconds. However, current network architectures do not scale well to large molecules. Here we present ConfFlow, a flow-based model for conformation generation based on transformer networks. In contrast with existing approaches, ConfFlow directly samples in the coordinate space without enforcing any explicit physical constraints. The generative procedure is highly interpretable and is akin to force field updates in molecular dynamics simulation. When applied to the generation of large molecule conformations, ConfFlow improves accuracy by up to 40% relative to state-of-the-art learning-based methods. The source code is made available at this https URL.

[LG-67] Stable Continual Reinforcement Learning via Diffusion-based Trajectory Replay ICLR2024

链接: https://arxiv.org/abs/2411.10809
作者: Feng Chen,Fuguang Han,Cong Guan,Lei Yuan,Zhilong Zhang,Yang Yu,Zongzhang Zhang
关键词-EN: continual Reinforcement Learning, inherent non-stationarity prevalent, sequentially presented decision-making, Reinforcement Learning, presented decision-making tasks
类目: Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 1 table, inclusion at ICLR 2024 Workshop on Generative Models for Decision Making

点击查看摘要

Abstract:Given the inherent non-stationarity prevalent in real-world applications, continual Reinforcement Learning (RL) aims to equip the agent with the capability to address a series of sequentially presented decision-making tasks. Within this problem setting, a pivotal challenge revolves around the catastrophic forgetting issue, wherein the agent is prone to effortlessly erode the decisional knowledge associated with past encountered tasks when learning the new one. In recent progress, the generative replay methods have showcased substantial potential by employing generative models to replay data distribution of past tasks. Compared to storing the data from past tasks directly, this category of methods circumvents the growing storage overhead and possible data privacy concerns. However, constrained by the expressive capacity of generative models, existing generative replay methods face challenges in faithfully reconstructing the data distribution of past tasks, particularly in scenarios with a myriad of tasks or high-dimensional data. Inspired by the success of diffusion models in various generative tasks, this paper introduces a novel continual RL algorithm DISTR (Diffusion-based Trajectory Replay) that employs a diffusion model to memorize the high-return trajectory distribution of each encountered task and wakes up these distributions during the policy learning on new tasks. Besides, considering the impracticality of replaying all past data each time, a prioritization mechanism is proposed to prioritize the trajectory replay of pivotal tasks in our method. Empirical experiments on the popular continual RL benchmark Continual World demonstrate that our proposed method obtains a favorable balance between stability and plasticity, surpassing various existing continual RL baselines in average success rate.

[LG-68] On Reductions and Representations of Learning Problems in Euclidean Spaces

链接: https://arxiv.org/abs/2411.10784
作者: Bogdan Chornomaz,Shay Moran,Tom Waknine
关键词-EN: real-valued surrogate loss, practical prediction algorithms, prediction algorithms represent, algorithms represent inputs, effectively reducing classification
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Many practical prediction algorithms represent inputs in Euclidean space and replace the discrete 0/1 classification loss with a real-valued surrogate loss, effectively reducing classification tasks to stochastic optimization. In this paper, we investigate the expressivity of such reductions in terms of key resources, including dimension and the role of randomness. We establish bounds on the minimum Euclidean dimension D needed to reduce a concept class with VC dimension d to a Stochastic Convex Optimization (SCO) problem in $\mathbb{R}^D$, formally addressing the intuitive interpretation of the VC dimension as the number of parameters needed to learn the class. To achieve this, we develop a generalization of the Borsuk-Ulam Theorem that combines the classical topological approach with convexity considerations. Perhaps surprisingly, we show that, in some cases, the number of parameters D must be exponentially larger than the VC dimension d, even if the reduction is only slightly non-trivial. We also present natural classification tasks that can be represented in much smaller dimensions by leveraging randomness, as seen in techniques like random initialization. This result resolves an open question posed by Kamath, Montasser, and Srebro (COLT 2020). Our findings introduce new variants of dimension complexity (also known as sign-rank), a well-studied parameter in learning and complexity theory. Specifically, we define an approximate version of sign-rank and another variant that captures the minimum dimension required for a reduction to SCO. We also propose several open questions and directions for future research.

[LG-69] Steam Turbine Anomaly Detection: An Unsupervised Learning Approach Using Enhanced Long Short-Term Memory Variational Autoencoder

链接: https://arxiv.org/abs/2411.10765
作者: Weiming Xu,Peng Zhang
关键词-EN: power generation equipment, core thermal power, thermal power generation, incur significant expenses, turbines incur significant
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:As core thermal power generation equipment, steam turbines incur significant expenses and adverse effects on operation when facing interruptions like downtime, maintenance, and damage. Accurate anomaly detection is the prerequisite for ensuring the safe and stable operation of steam turbines. However, challenges in steam turbine anomaly detection, including inherent anomalies, lack of temporal information analysis, and high-dimensional data complexity, limit the effectiveness of existing methods. To address these challenges, we proposed an Enhanced Long Short-Term Memory Variational Autoencoder using Deep Advanced Features and Gaussian Mixture Model (ELSTMVAE-DAF-GMM) for precise unsupervised anomaly detection in unlabeled datasets. Specifically, LSTMVAE, integrating LSTM with VAE, was used to project high-dimensional time-series data to a low-dimensional phase space. The Deep Autoencoder-Local Outlier Factor (DAE-LOF) sample selection mechanism was used to eliminate inherent anomalies during training, further improving the model’s precision and reliability. The novel deep advanced features (DAF) hybridize latent embeddings and reconstruction discrepancies from the LSTMVAE model and provide a more comprehensive data representation within a continuous and structured phase space, significantly enhancing anomaly detection by synergizing temporal dynamics with data pattern variations. These DAF were incorporated into GMM to ensure robust and effective unsupervised anomaly detection. We utilized real operating data from industry steam turbines and conducted both comparison and ablation experiments, demonstrating superior anomaly detection outcomes characterized by high accuracy and minimal false alarm rates compared with existing methods.

[LG-70] ML2Tuner: Efficient Code Tuning via Multi-Level Machine Learning Models NEURIPS2024

链接: https://arxiv.org/abs/2411.10764
作者: JooHyoung Cha,Munyoung Lee,Jinse Kwon,Jubin Lee,Jemin Lee,Yongin Kwon
关键词-EN: necessitates specialized hardware, models necessitates specialized, software optimizations, deep learning, increasing complexity
类目: Machine Learning (cs.LG)
*备注: Accepted in NeurIPS 2024 workshop on Machine Learning for Systems, 12 pages, 5 figures

点击查看摘要

Abstract:The increasing complexity of deep learning models necessitates specialized hardware and software optimizations, particularly for deep learning accelerators. Existing autotuning methods often suffer from prolonged tuning times due to profiling invalid configurations, which can cause runtime errors. We introduce ML2Tuner, a multi-level machine learning tuning technique that enhances autotuning efficiency by incorporating a validity prediction model to filter out invalid configurations and an advanced performance prediction model utilizing hidden features from the compilation process. Experimental results on an extended VTA accelerator demonstrate that ML2Tuner achieves equivalent performance improvements using only 12.3% of the samples required with a similar approach as TVM and reduces invalid profiling attempts by an average of 60.8%, highlighting its potential to enhance autotuning performance by filtering out invalid configurations.

[LG-71] On-device Anomaly Detection in Conveyor Belt Operations

链接: https://arxiv.org/abs/2411.10729
作者: Luciano S. Martinez-Rau,Yuxuan Zhang,Bengt Oelmann,Sebastian Bader
关键词-EN: technologies from Industry, enhancing efficiency, leverages advancements, advancements in automation, mining conveyor belt
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)
*备注: Preprint submitted to IEEE Transactions on Instrumentation and Measurement

点击查看摘要

Abstract:Mining 4.0 leverages advancements in automation, digitalization, and interconnected technologies from Industry 4.0 to address the unique challenges of the mining sector, enhancing efficiency, safety, and sustainability. Conveyor belts are crucial in mining operations by enabling the continuous and efficient movement of bulk materials over long distances, which directly impacts productivity. While detecting anomalies in specific conveyor belt components, such as idlers, pulleys, and belt surfaces, has been widely studied, identifying the root causes of these failures remains critical due to factors like changing production conditions and operator errors. Continuous monitoring of mining conveyor belt work cycles for anomaly detection is still at an early stage and requires robust solutions. This study proposes two distinctive pattern recognition approaches for real-time anomaly detection in the operational cycles of mining conveyor belts, combining feature extraction, threshold-based cycle detection, and tiny machine-learning classification. Both approaches outperformed a state-of-the-art technique on two datasets for duty cycle classification in terms of F1-scores. The first approach, with 97.3% and 80.2% for normal and abnormal cycles, respectively, reaches the highest performance on the first dataset, while the second approach excels on the second dataset, scoring 91.3% and 67.9%. Implemented on two low-power microcontrollers, the methods demonstrated efficient, real-time operation with energy consumption of 13.3 and 20.6 μJ during inference. These results offer valuable insights for detecting mechanical failure sources, supporting targeted preventive maintenance, and optimizing production cycles.
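
A toy sketch of the threshold-based cycle detection plus tiny-ML classification described above (our own simplification, not the authors' pipeline): segment the signal into duty cycles where it exceeds a threshold, extract a few summary features per cycle, and train a small decision tree of the kind that fits on a microcontroller.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
signal = np.maximum(np.sin(np.linspace(0, 20 * np.pi, 4000)), 0)   # stand-in sensor trace

active = signal > 0.2                                # threshold-based cycle detection
edges = np.flatnonzero(np.diff(active.astype(int)))
cycles = [signal[s:e] for s, e in zip(edges[::2], edges[1::2]) if e - s > 10]

feats = np.array([[c.mean(), c.std(), c.max(), len(c)] for c in cycles])
labels = rng.integers(0, 2, size=len(feats))         # placeholder normal/abnormal labels
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(feats, labels)  # "tiny ML"
```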

[LG-72] Multi Scale Graph Neural Network for Alzheimers Disease ML4H ALT

链接: https://arxiv.org/abs/2411.10720
作者: Anya Chauhan,Ayush Noori,Zhaozhi Li,Yingnan He,Michelle M Li,Marinka Zitnik,Sudeshna Das
关键词-EN:
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 9 pages

点击查看摘要

[LG-73] FlowScope: Enhancing Decision Making by Time Series Forecasting based on Prediction Optimization using HybridFlow Forecast Framework

链接: https://arxiv.org/abs/2411.10716
作者: Nitin Sagar Boyeena,Begari Susheel Kumar
关键词-EN: Time series forecasting, Time series, linear time series, seasonal time series, Integrated Moving Average
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)
*备注: 12 pages and 6 figures

点击查看摘要

Abstract:Time series forecasting is crucial in several sectors, such as meteorology, retail, healthcare, and finance. Accurately forecasting future trends and patterns is crucial for strategic planning and making well-informed decisions. In this case, it is crucial to include many forecasting methodologies. The strengths of Auto-regressive Integrated Moving Average (ARIMA) for linear time series, Seasonal ARIMA models (SARIMA) for seasonal time series, Exponential Smoothing State Space Models (ETS) for handling errors and trends, and Long Short-Term Memory (LSTM) Neural Network model for complex pattern recognition have been combined to create a comprehensive framework called FlowScope. SARIMA excels in capturing seasonal variations, whereas ARIMA ensures effective handling of linear time series. ETS models excel in capturing trends and correcting errors, whereas LSTM networks excel in reflecting intricate temporal connections. By combining these methods from both machine learning and deep learning, we propose a deep-hybrid learning approach FlowScope which offers a versatile and robust platform for predicting time series data. This empowers enterprises to make informed decisions and optimize long-term strategies for maximum performance. Keywords: Time Series Forecasting, HybridFlow Forecast Framework, Deep-Hybrid Learning, Informed Decisions.
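
A minimal sketch of the hybrid idea (not the FlowScope code itself): fit ARIMA and ETS on the same series and blend their forecasts; an LSTM component could be folded in the same way for nonlinear residual patterns. The orders and weights below are arbitrary placeholder choices.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(2)
y = np.cumsum(rng.normal(0.2, 1.0, 200)) + 10 * np.sin(np.arange(200) / 12)  # toy series

arima_fc = ARIMA(y, order=(2, 1, 2)).fit().forecast(steps=12)
ets_fc = ExponentialSmoothing(y, trend="add").fit().forecast(12)

# equal-weight blend; in a full hybrid the weights (and an LSTM term) would be
# chosen on a validation window
hybrid_fc = 0.5 * arima_fc + 0.5 * ets_fc
```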

[LG-74] Hybrid Attention Model Using Feature Decomposition and Knowledge Distillation for Glucose Forecasting

链接: https://arxiv.org/abs/2411.10703
作者: Ebrahim Farahmand,Shovito Barua Soumma,Nooshin Taheri Chatrudi,Hassan Ghasemzadeh
关键词-EN:
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

[LG-75] Wireless Resource Allocation with Collaborative Distributed and Centralized DRL under Control Channel Attacks

链接: https://arxiv.org/abs/2411.10702
作者: Ke Wang,Wanchun Liu,Teng Joon Lim
关键词-EN: resource allocation commands, carrying resource allocation, resource allocation, wireless resource allocation, cyber-physical system
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:In this paper, we consider a wireless resource allocation problem in a cyber-physical system (CPS) where the control channel, carrying resource allocation commands, is subjected to denial-of-service (DoS) attacks. We propose a novel concept of collaborative distributed and centralized (CDC) resource allocation to effectively mitigate the impact of these attacks. To optimize the CDC resource allocation policy, we develop a new CDC-deep reinforcement learning (DRL) algorithm, whereas existing DRL frameworks only formulate either centralized or distributed decision-making problems. Simulation results demonstrate that the CDC-DRL algorithm significantly outperforms state-of-the-art DRL benchmarks, showcasing its ability to address resource allocation problems in large-scale CPSs under control channel attacks.

[LG-76] How to Defend Against Large-scale Model Poisoning Attacks in Federated Learning: A Vertical Solution

链接: https://arxiv.org/abs/2411.10673
作者: Jinbo Wang,Ruijin Wang,Fengli Zhang
关键词-EN:
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

[LG-77] Patient-Specific Models of Treatment Effects Explain Heterogeneity in Tuberculosis ML4H ALT

链接: https://arxiv.org/abs/2411.10645
作者: Ethan Wu,Caleb Ellington,Ben Lengerich,Eric P. Xing
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 4 pages

点击查看摘要

[LG-78] Drift-Resilient TabPFN: In-Context Learning Temporal Distribution Shifts on Tabular Data NEURIPS2024

链接: https://arxiv.org/abs/2411.10634
作者: Kai Helli,David Schnurr,Noah Hollmann,Samuel Müller,Frank Hutter
关键词-EN:
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

[LG-79] KAT to KANs: A Review of Kolmogorov-Arnold Networks and the Neural Leap Forward

链接: https://arxiv.org/abs/2411.10622
作者: Divesh Basina,Joseph Raj Vishal,Aarya Choudhary,Bharatesh Chakravarthi
关键词-EN:
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-80] Electrical Load Forecasting in Smart Grid: A Personalized Federated Learning Approach

链接: https://arxiv.org/abs/2411.10619
作者: Ratun Rahman,Neeraj Kumar,Dinh C. Nguyen
关键词-EN: Electric load forecasting, Electric load, essential for power, power management, management and stability
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: This paper has been accepted by the IEEE Consumer Communications \ Networking Conference (CCNC), Jan. 2025

点击查看摘要

Abstract:Electric load forecasting is essential for power management and stability in smart grids. This is mainly achieved via advanced metering infrastructure, where smart meters (SMs) are used to record household energy consumption. Traditional machine learning (ML) methods are often employed for load forecasting but require data sharing, which raises data privacy concerns. Federated learning (FL) can address this issue by running distributed ML models at local SMs without data exchange. However, current FL-based approaches struggle to achieve efficient load forecasting due to imbalanced data distribution across heterogeneous SMs. This paper presents a novel personalized federated learning (PFL) method for load prediction under non-independent and identically distributed (non-IID) metering data settings. Specifically, we introduce meta-learning, where the learning rates are manipulated using the meta-learning idea to maximize the gradient for each client in each global round. Clients with varying processing capacities, data sizes, and batch sizes can participate in global model aggregation and improve their local load forecasting via personalized learning. Simulation results show that our approach outperforms state-of-the-art ML and FL methods in terms of load forecasting accuracy.
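
A hedged sketch of personalized federated averaging with per-client adaptive step sizes, which is our own simplification of the meta-learning idea above rather than the paper's algorithm: each smart meter takes a local step with its own learning rate, the server averages the updates, and every client also keeps a personalized copy.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
true_w = rng.random(d)

def make_client(n):
    """Toy non-IID client: n metering samples with a shared linear trend plus noise."""
    X = rng.random((n, d))
    return X, X @ true_w + rng.normal(0, 0.1, n)

clients = [make_client(n) for n in (30, 80, 200)]    # heterogeneous data sizes
global_w = np.zeros(d)
personal_w = [global_w.copy() for _ in clients]

for rnd in range(20):
    updates = []
    for i, (X, y) in enumerate(clients):
        w = global_w.copy()
        grad = 2 * X.T @ (X @ w - y) / len(X)
        lr = 0.1 / (1.0 + np.linalg.norm(grad))      # per-client adaptive step size
        w -= lr * grad
        personal_w[i] = 0.5 * personal_w[i] + 0.5 * w  # personalized local model
        updates.append(w)
    global_w = np.mean(updates, axis=0)              # server-side aggregation
```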

[LG-81] To Shuffle or not to Shuffle: Auditing DP-SGD with Shuffling

链接: https://arxiv.org/abs/2411.10614
作者: Meenatchi Sundaram Muthu Selva Annamalai,Borja Balle,Emiliano De Cristofaro,Jamie Hayes
关键词-EN:
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-82] Learning Quantitative Automata Modulo Theories

链接: https://arxiv.org/abs/2411.10601
作者: Eric Hsiung,Swarat Chaudhuri,Joydeep Biswas
关键词-EN: modeling probability distributions, including modeling probability, Markov chains, numerous applications, reward machines
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 30 pages, 13 figures, 1 table

点击查看摘要

Abstract:Quantitative automata are useful representations for numerous applications, including modeling probability distributions over sequences to Markov chains and reward machines. Actively learning such automata typically occurs using explicitly gathered input-output examples under adaptations of the L-star algorithm. However, obtaining explicit input-output pairs can be expensive, and there exist scenarios, including preference-based learning or learning from rankings, where providing constraints is a less exerting and a more natural way to concisely describe desired properties. Consequently, we propose the problem of learning deterministic quantitative automata from sets of constraints over the valuations of input sequences. We present QUINTIC, an active learning algorithm, wherein the learner infers a valid automaton through deductive reasoning, by applying a theory to a set of currently available constraints and an assumed preference model and quantitative automaton class. QUINTIC performs a complete search over the space of automata, and is guaranteed to be minimal and correctly terminate. Our evaluations utilize theory of rationals in order to learn summation, discounted summation, product, and classification quantitative automata, and indicate QUINTIC is effective at learning these types of automata.

[LG-83] BioNeMo Framework: a modular high-performance library for AI model development in drug discovery

链接: https://arxiv.org/abs/2411.10548
作者: Peter St. John,Dejun Lin,Polina Binder,Malcolm Greaves,Vega Shah,John St. John,Adrian Lange,Patrick Hsu,Rajesh Illango,Arvind Ramanathan,Anima Anandkumar,David H Brookes,Akosua Busia,Abhishaike Mahajan,Stephen Malina,Neha Prasad,Sam Sinai,Lindsay Edwards,Thomas Gaudelet,Cristian Regep,Martin Steinegger,Burkhard Rost,Alexander Brace,Kyle Hippe,Luca Naef,Keisuke Kamata,George Armstrong,Kevin Boyd,Zhonglin Cao,Han-Yi Chou,Simon Chu,Allan dos Santos Costa,Sajad Darabi,Eric Dawson,Kieran Didi,Cong Fu,Mario Geiger,Michelle Gill,Darren Hsu,Gagan Kaushik,Maria Korshunova,Steven Kothen-Hill,Youhan Lee,Meng Liu,Micha Livne,Zachary McClure,Jonathan Mitchell,Alireza Moradzadeh,Ohad Mosafi,Youssef Nashed,Saee Paliwal,Yuxing Peng,Sara Rabhi,Farhad Ramezanghorbani,Danny Reidenbach,Camir Ricketts,Brian Roland,Kushal Shah,Tyler Shimko,Hassan Sirelkhatim,Savitha Srinivasan,Abraham C Stern,Dorota Toczydlowska,Srimukh Prasad Veccham,Niccolò Alberto Elia Venanzi,Anton Vorontsov,Jared Wilber,Isabel Wilkinson,Wei Jing Wong,Eva Xue,Cory Ye,Xin Yu,Yang Zhang,Guoqing Zhou,Becca Zandstein,Christian Dallago,Bruno Trentini,Emine Kucukbenli,Saee Paliwal,Timur Rvachov,Eddie Calleja,Johnny Israeli,Harry Clifford,Risto Haukioja,Nicholas Haemel,Kyle Tretina,Neha Tadimeti,Anthony B Costa
关键词-EN:
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

[LG-84] Debias-CLR: A Contrastive Learning Based Debiasing Method for Algorithmic Fairness in Healthcare Applications ALT

链接: https://arxiv.org/abs/2411.10544
作者: Ankita Agarwal,Tanvi Banerjee,William Romine,Mia Cajita
关键词-EN:
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 9 pages, 1 figure, 4 tables. Manuscript accepted at 7th Special Session on HealthCare Data in IEEE Big Data 2024, Washington, D.C

点击查看摘要

[LG-85] On the Privacy Risk of In-context Learning

链接: https://arxiv.org/abs/2411.10512
作者: Haonan Duan,Adam Dziedzic,Mohammad Yaghini,Nicolas Papernot,Franziska Boenisch
关键词-EN: excellent few-shot learners, Large language models, Large language, few-shot learners, excellent few-shot
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are excellent few-shot learners. They can perform a wide variety of tasks purely based on natural language prompts provided to them. These prompts contain data of a specific downstream task – often the private dataset of a party, e.g., a company that wants to leverage the LLM for their purposes. We show that deploying prompted models presents a significant privacy risk for the data used within the prompt by instantiating a highly effective membership inference attack. We also observe that the privacy risk of prompted models exceeds fine-tuned models at the same utility levels. After identifying the model’s sensitivity to their prompts – in the form of a significantly higher prediction confidence on the prompted data – as a cause for the increased risk, we propose ensembling as a mitigation strategy. By aggregating over multiple different versions of a prompted model, membership inference risk can be decreased.
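
A toy illustration of the ensembling mitigation (our own example, not the paper's code): instead of exposing a single prompted model's confidence, average the predictive distributions of several prompted variants, which dampens the unusually high confidence on prompt members that membership inference exploits.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
# logits of 5 prompted variants (e.g., each built from a different demo subset)
variant_logits = rng.normal(size=(5, 100, 4))          # variants x examples x classes
ensemble_probs = softmax(variant_logits).mean(axis=0)  # average over the variants
prediction = ensemble_probs.argmax(axis=-1)
# an attacker thresholding per-example confidence now sees the averaged, less
# extreme confidence rather than any single prompted model's output
```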

[LG-86] SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

链接: https://arxiv.org/abs/2411.10510
作者: Joseph Liu,Joshua Geddes,Ziyu Guo,Haomiao Jiang,Mahesh Kumar Nandwana
关键词-EN:
类目: Machine Learning (cs.LG)
*备注: Code can be found at this https URL

点击查看摘要

[LG-87] Physics-Informed Neural Networks for Electrical Circuit Analysis: Applications in Dielectric Material Modeling

链接: https://arxiv.org/abs/2411.10483
作者: Reyhaneh Taj
关键词-EN:
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

[LG-88] Boolean-aware Boolean Circuit Classification: A Comprehensive Study on Graph Neural Network

链接: https://arxiv.org/abs/2411.10481
作者: Liwei Ni,Xinquan Li,Biwei Xie,Huawei Li
关键词-EN:
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

[LG-89] Challenges in the Differential Classification of Individual Diagnoses from Co-Occurring Autism and ADHD Using Survey Data

链接: https://arxiv.org/abs/2411.10479
作者: Aditi Jaiswal,Dennis P. Wall,Peter Washington
关键词-EN:
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-90] Constrained composite Bayesian optimization for rational synthesis of polymeric particles

链接: https://arxiv.org/abs/2411.10471
作者: Fanjin Wang,Maryam Parhizkar,Anthony Harker,Mohan Edirisinghe
关键词-EN:
类目: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

[LG-91] User-wise Perturbations for User Identity Protection in EEG-Based BCIs

链接: https://arxiv.org/abs/2411.10469
作者: Xiaoqing Chen,Siyang Li,Yunlu Tu,Ziwei Wang,Dongrui Wu
关键词-EN:
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-92] Coordinated Reply Attacks in Influence Operations: Characterization and Detection

链接: https://arxiv.org/abs/2410.19272
作者: Manita Pote,Tuğrulcan Elmas,Alessandro Flammini,Filippo Menczer
关键词-EN:
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

[LG-93] Pairwise Markov Chains for Volatility Forecasting

链接: https://arxiv.org/abs/2411.11838
作者: Elie Azeraf
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 14 pages, 9 figures

点击查看摘要

[LG-94] KAN/MultKAN with Physics-Informed Spline fitting (KAN-PISF) for ordinary/partial differential equation discovery of nonlinear dynamic systems

链接: https://arxiv.org/abs/2411.11801
作者: Ashish Pal,Satish Nagarajaiah
关键词-EN:
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-95] Parallelly Tempered Generative Adversarial Networks

链接: https://arxiv.org/abs/2411.11786
作者: Jinwon Sohn,Qifan Song
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-96] Debiased Regression for Root-N-Consistent Conditional Mean Estimation

链接: https://arxiv.org/abs/2411.11748
作者: Masahiro Kato
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-97] Learning Differentiable Surrogate Losses for Structured Prediction

链接: https://arxiv.org/abs/2411.11682
作者: Junjie Yang,Matthieu Labeau,Florence d’Alché-Buc
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-98] Analysis of Hardware Synthesis Strategies for Machine Learning in Collider Trigger and Data Acquisition

链接: https://arxiv.org/abs/2411.11678
作者: Haoyi Jia,Abhilasha Dave,Julia Gonski,Ryan Herbst
关键词-EN:
类目: Instrumentation and Detectors (physics.ins-det); Hardware Architecture (cs.AR); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注: 12 pages, 5 figures

点击查看摘要

[LG-99] On the physics of nested Markov models: a generalized probabilistic theory perspective

链接: https://arxiv.org/abs/2411.11614
作者: Xingjian Zhang,Yuhao Wang
关键词-EN:
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 21 pages, 5 figures, 5 tables; Comments are welcome!

点击查看摘要

[LG-100] Robust Causal Analysis of Linear Cyclic Systems With Hidden Confounders

链接: https://arxiv.org/abs/2411.11590
作者: Boris Lorbeer
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 18 pages, 2 figures

点击查看摘要

[LG-101] Data-driven model reconstruction for nonlinear wave dynamics

链接: https://arxiv.org/abs/2411.11556
作者: Ekaterina Smolina,Lev Smirnov,Daniel Leykam,Franco Nori,Daria Smirnova
关键词-EN:
类目: Optics (physics.optics); Machine Learning (cs.LG); Mathematical Physics (math-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 6 pages, 5 figures

点击查看摘要

[LG-102] A Modular Open Source Framework for Genomic Variant Calling

链接: https://arxiv.org/abs/2411.11513
作者: Ankita Vaishnobi Bisoi,Bharath Ramsundar
关键词-EN:
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-103] PALMS: Parallel Adaptive Lasso with Multi-directional Signals for Latent Networks Reconstruction

链接: https://arxiv.org/abs/2411.11464
作者: Zhaoyu Xing,Wei Zhong
关键词-EN:
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 48 pages

点击查看摘要

[LG-104] Modeling Multivariable High-resolution 3D Urban Microclimate Using Localized Fourier Neural Operator

链接: https://arxiv.org/abs/2411.11348
作者: Shaoxiang Qin,Dongxue Zhan,Dingyang Geng,Wenhui Peng,Geng Tian,Yurong Shi,Naiping Gao,Xue Liu,Liangzhu Leon Wang
关键词-EN:
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-105] Accelerating spherical K-means clustering for large-scale sparse document data

链接: https://arxiv.org/abs/2411.11300
作者: Kazuo Aoyama,Kazumi Saito
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 28 pages, 23 figures

点击查看摘要

[LG-106] Coupled Integral PINN for conservation law

链接: https://arxiv.org/abs/2411.11276
作者: Yeping Wang,Shihao Yang
关键词-EN:
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-107] ACE2: Accurately learning subseasonal to decadal atmospheric variability and forced responses

链接: https://arxiv.org/abs/2411.11268
作者: Oliver Watt-Meyer,Brian Henn,Jeremy McGibbon,Spencer K. Clark,Anna Kwa,W. Andre Perkins,Elynn Wu,Lucas Harris,Christopher S. Bretherton
关键词-EN:
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 31 pages, 23 figures

点击查看摘要

[LG-108] Accelerating Quantum Emitter Characterization with Latent Neural Ordinary Differential Equations

链接: https://arxiv.org/abs/2411.11191
作者: Andrew H. Proppe,Kin Long Kelvin Lee,Weiwei Sun,Chantalle J. Krajewska,Oliver Tye,Moungi G. Bawendi
关键词-EN:
类目: Quantum Physics (quant-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-109] Variational Bayesian Bow tie Neural Networks with Shrinkage

链接: https://arxiv.org/abs/2411.11132
作者: Alisa Sheinkman,Sara Wade
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-110] An Investigation of Offline Reinforcement Learning in Factorisable Action Spaces

链接: https://arxiv.org/abs/2411.11088
作者: Alex Beeson,David Ireland,Giovanni Montana
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Published in Transactions on Machine Learning Research (11/2024)

点击查看摘要

[LG-111] Program Evaluation with Remotely Sensed Outcomes

链接: https://arxiv.org/abs/2411.10959
作者: Ashesh Rambachan,Rahul Singh,Davide Viviano
关键词-EN:
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-112] Constructing accurate machine-learned potentials and performing highly efficient atomistic simulations to predict structural and thermal properties

链接: https://arxiv.org/abs/2411.10911
作者: Junlan Liu,Qian Yin,Mengshu He,Jun Zhou
关键词-EN:
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

[LG-113] Building Interpretable Climate Emulators for Economics

链接: https://arxiv.org/abs/2411.10768
作者: Aryan Eftekhari,Doris Folini,Aleksandra Friedl,Felix Kübler,Simon Scheidegger,Olaf Schenk
关键词-EN: Integrated Assessment Models, Integrated Assessment, advanced climate science, interpretable carbon-cycle emulators, custom-build CCEs accurately
类目: Econometrics (econ.EM); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a framework for developing efficient and interpretable carbon-cycle emulators (CCEs) as part of climate emulators in Integrated Assessment Models, enabling economists to custom-build CCEs accurately calibrated to advanced climate science. We propose a generalized multi-reservoir linear box-model CCE that preserves key physical quantities and can be tailored to specific use cases. Three CCEs are presented for illustration: the 3SR model (replicating DICE-2016), the 4PR model (including the land biosphere), and the 4PR-X model (accounting for dynamic land-use changes like deforestation that impact the reservoir’s storage capacity). Evaluation of these models within the DICE framework shows that land-use changes in the 4PR-X model significantly impact atmospheric carbon and temperatures – emphasizing the importance of using tailored climate emulators. By providing a transparent and flexible tool for policy analysis, our framework allows economists to assess the economic impacts of climate policies more accurately.
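
A hedged sketch of a multi-reservoir linear box model with made-up coefficients (not the paper's calibrated 3SR/4PR parameters): carbon stocks in the atmosphere, upper ocean/biosphere, and deep ocean evolve under a fixed transfer matrix, with emissions injected into the atmospheric box each year.

```python
import numpy as np

A = np.array([            # column j: where box j's carbon goes each year (fractions)
    [0.88, 0.04, 0.00],   # atmosphere
    [0.12, 0.94, 0.01],   # upper ocean / land biosphere
    [0.00, 0.02, 0.99],   # deep ocean
])
state = np.array([850.0, 770.0, 10000.0])   # GtC; illustrative initial stocks

emissions = np.full(100, 10.0)              # GtC per year, a made-up scenario
atmospheric_path = []
for e in emissions:
    state = A @ state                       # linear exchange between reservoirs
    state[0] += e                           # emissions enter the atmospheric box
    atmospheric_path.append(state[0])
```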

[LG-114] Series Expansion of Probability of Correct Selection for Improved Finite Budget Allocation in Ranking and Selection

链接: https://arxiv.org/abs/2411.10695
作者: Xinbo Shi,Yijie Peng,Bruno Tuffin
关键词-EN:
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

[LG-115] Deep Learning-Based Image Compression for Wireless Communications: Impacts on Reliability, Throughput and Latency

链接: https://arxiv.org/abs/2411.10650
作者: Mostafa Naseri,Pooya Ashtari,Mohamed Seif,Eli De Poorter,H. Vincent Poor,Adnan Shahid
关键词-EN:
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

[LG-116] Neural decoding from stereotactic EEG: accounting for electrode variability across subjects NEURIPS2024

链接: https://arxiv.org/abs/2411.10458
作者: Georgios Mentzelopoulos,Evangelos Chatzipantazis,Ashwin G. Ramayya,Michelle J. Hedlund,Vivek P. Buch,Kostas Daniilidis,Konrad P. Kording,Flavia Vitale
关键词-EN:
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: Accepted for publication at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

信息检索

[IR-0] Do Captioning Metrics Reflect Music Semantic Alignment?

链接: https://arxiv.org/abs/2411.11692
作者: Jinwoo Lee,Kyogu Lee
关键词-EN: language generation models, advanced language generation, promising task, generation models, advent of advanced
类目: ound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: International Society for Music Information Retrieval (ISMIR) 2024, Late Breaking Demo (LBD)

点击查看摘要

Abstract:Music captioning has emerged as a promising task, fueled by the advent of advanced language generation models. However, the evaluation of music captioning relies heavily on traditional metrics such as BLEU, METEOR, and ROUGE which were developed for other domains, without proper justification for their use in this new field. We present cases where traditional metrics are vulnerable to syntactic changes, and show they do not correlate well with human judgments. By addressing these issues, we aim to emphasize the need for a critical reevaluation of how music captions are assessed.
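
A small toy example (ours, not from the paper) of the fragility discussed above: a caption that preserves the meaning but reorders words can score lower under BLEU than a near-verbatim caption that changes the meaning.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "calm", "piano", "melody", "with", "soft", "strings"]]
paraphrase = ["soft", "strings", "under", "a", "calm", "piano", "melody"]   # same meaning
near_copy = ["a", "calm", "piano", "melody", "with", "loud", "drums"]       # wrong meaning

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # lower score
print(sentence_bleu(reference, near_copy, smoothing_function=smooth))   # higher score
```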

[IR-1] Collaborative Contrastive Network for Click-Through Rate Prediction

链接: https://arxiv.org/abs/2411.11508
作者: Chen Gao,Zixin Zhao,Sihao Hu,Lv Shao,Tong Liu
关键词-EN: E-commerce platforms provide, platforms provide entrances, E-commerce platforms, trigger item, platforms provide
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:E-commerce platforms provide entrances for customers to enter mini-apps to meet their specific shopping needs. At the entrance of a mini-app, a trigger item recommended based on customers’ historical preferences, is displayed to attract customers to enter the mini-app. Existing Click-Through Rate (CTR) prediction approaches have two significant weaknesses: (i) A portion of customer entries is driven by their interest in the mini-app itself rather than the trigger item. In such cases, approaches highly hinging on the trigger item tend to recommend similar items, thus misunderstanding the customers’ real intention; (ii) Approaches that consider customers’ intention toward mini-apps, require the regular existence of mini-apps for customers to cultivate routine shopping habits, making such approaches less robust for mini-apps that are available for only short periods (1 or 3 days) in Explosive Promotional Scenarios (EPS), such as the Black Friday and China’s Double 11 Shopping Carnival. To address the above-mentioned issues, we introduce a more general and robust CTR prediction approach, dubbed Collaborative Contrastive Network (CCN). Given a user, CCN learns to identify two item clusters that can represent the user’s interests and disinterests, via leveraging the collaborative relationship of co-click/co-non-click or the non-collaborative relationship of mono-click as the supervision signal for contrastive learning. This paradigm does not need to explicitly estimate user’s binary entry intention and avoids amplifying the impact of the trigger item. Online A/B testing on large-scale real-world data demonstrates that CCN sets a new state-of-the-art performance on Taobao, boosting CTR by 12.3% and order volume by 12.7%.
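
A hedged sketch of using co-click relations as the contrastive supervision signal (our own simplification, not CCN's architecture): items co-clicked with the anchor are pulled together in embedding space while exposed-but-unclicked items are pushed away.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb = torch.nn.Embedding(1000, 32)                  # item embedding table
anchor = torch.tensor([3, 17, 42])                  # clicked trigger items
pos = torch.tensor([5, 19, 40])                     # items co-clicked with the anchor
neg = torch.tensor([[7, 8], [21, 22], [50, 51]])    # exposed but not clicked

a, p, n = emb(anchor), emb(pos), emb(neg)
pos_sim = F.cosine_similarity(a, p).unsqueeze(1)            # (batch, 1)
neg_sim = F.cosine_similarity(a.unsqueeze(1), n, dim=-1)    # (batch, 2)
logits = torch.cat([pos_sim, neg_sim], dim=1) / 0.1         # temperature-scaled
loss = F.cross_entropy(logits, torch.zeros(len(anchor), dtype=torch.long))
loss.backward()                                     # pulls co-clicked pairs together
```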

[IR-2] All-domain Moveline Evolution Network for Click-Through Rate Prediction

链接: https://arxiv.org/abs/2411.11502
作者: Chen Gao,Zixin Zhao,Lv Shao,Tong Liu
关键词-EN: E-commerce app users, inherently logically consistent, E-commerce app, all-domain user moveline, user moveline
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:E-commerce app users exhibit behaviors that are inherently logically consistent. A series of multi-scenario user behaviors interconnect to form the scene-level all-domain user moveline, which ultimately reveals the user’s true intention. Traditional CTR prediction methods typically focus on the item-level interaction between the target item and the historically interacted items. However, the scene-level interaction between the target item and the user moveline remains underexplored. There are two challenges when modeling the interaction with preceding all-domain user moveline: (i) Heterogeneity between items and scenes: Unlike traditional user behavior sequences that utilize items as carriers, the user moveline utilizes scenes as carriers. The heterogeneity between items and scenes complicates the process of aligning interactions within a unified representation space. (ii) Temporal misalignment of linked scene-level and item-level behaviors: In the preceding user moveline with a fixed sampling length, certain critical scene-level behaviors are closely linked to subsequent item-level behaviors. However, it is impossible to establish a complete temporal alignment that clearly identifies which specific scene-level behaviors correspond to which item-level behaviors. To address these challenges and pioneer modeling user intent from the perspective of the all-domain moveline, we propose All-domain Moveline Evolution Network (AMEN). AMEN not only transfers interactions between items and scenes to homogeneous representation spaces, but also introduces a Temporal Sequential Pairwise (TSP) mechanism to understand the nuanced associations between scene-level and item-level behaviors, ensuring that the all-domain user moveline differentially influences CTR predictions for user’s favored and unfavored items. Online A/B testing demonstrates that our method achieves a +11.6% increase in CTCVR.

[IR-3] Controlling Diversity at Inference: Guiding Diffusion Recommender Models with Targeted Category Preferences KDD2025

链接: https://arxiv.org/abs/2411.11240
作者: Gwangseok Han,Wonbin Kweon,Minsoo Kim,Hwanjo Yu
关键词-EN: filter bubble problems, alleviate bias amplification, bubble problems, important task, task to alleviate
类目: Information Retrieval (cs.IR)
*备注: KDD 2025

点击查看摘要

Abstract:Diversity control is an important task to alleviate bias amplification and filter bubble problems. The desired degree of diversity may fluctuate based on users’ daily moods or business strategies. However, existing methods for controlling diversity often lack flexibility, as diversity is decided during training and cannot be easily modified during inference. We propose D3Rec (Disentangled Diffusion model for Diversified Recommendation), an end-to-end method that controls the accuracy-diversity trade-off at inference. D3Rec meets our three desiderata by (1) generating recommendations based on category preferences, (2) controlling category preferences during the inference phase, and (3) adapting to arbitrary targeted category preferences. In the forward process, D3Rec removes category preferences lurking in user interactions by adding noises. Then, in the reverse process, D3Rec generates recommendations through denoising steps while reflecting desired category preferences. Extensive experiments on real-world and synthetic datasets validate the effectiveness of D3Rec in controlling diversity at inference.

[IR-4] Online Item Cold-Start Recommendation with Popularity-Aware Meta-Learning KDD’25

链接: https://arxiv.org/abs/2411.11225
作者: Yunze Luo,Yuezihan Jiang,Yinjie Jiang,Gaode Chen,Jingchi Wang,Kaigui Bian,Peiyi Li,Qi Zhang
关键词-EN: capture users’ interests, online recommender systems, increasingly important role, short videos, rise of e-commerce
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 4 figures, to be published in KDD '25

点击查看摘要

Abstract:With the rise of e-commerce and short videos, online recommender systems that can capture users’ interests and update new items in real-time play an increasingly important role. In both online and offline recommendation, the cold-start problem due to interaction sparsity has been affecting the recommendation effect of cold-start items, which is also known as the long-tail problem of item distribution. Many cold-start schemes based on fine-tuning or knowledge transfer show excellent performance in offline recommendation. Yet, these schemes are infeasible for online recommendation on streaming data pipelines due to differences in training methods, computational overhead, and time constraints. Inspired by the above issues, we propose a model-agnostic recommendation algorithm called Popularity-Aware Meta-learning (PAM) to address the item cold-start problem under streaming data settings. PAM divides the incoming data into different meta-learning tasks by predefined item popularity thresholds. The model can distinguish and reweight behavior-related features and content-related features in each task based on their different roles at different popularity levels, thus adapting to recommendations for cold-start samples. This task-fixing design significantly reduces additional computation and storage costs compared to offline methods. Furthermore, PAM also introduces data augmentation and an additional self-supervised loss specifically designed for low-popularity tasks, leveraging insights from high-popularity samples. This approach effectively mitigates the issue of inadequate supervision due to the scarcity of cold-start samples. Experimental results across multiple public datasets demonstrate the superiority of our approach over other baseline methods in addressing cold-start challenges in online streaming data scenarios.
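
A minimal sketch of the task-splitting step (our own simplification of PAM, not the authors' implementation): incoming interactions are routed into meta-learning tasks by predefined popularity thresholds, so that low-popularity (cold-start) items get their own task in which content features can be weighted more heavily.

```python
import numpy as np

rng = np.random.default_rng(5)
item_popularity = rng.integers(0, 1000, size=500)   # interaction counts per item
thresholds = (10, 100)                              # predefined popularity cut-offs

tasks = {"cold": [], "warm": [], "hot": []}
for item, pop in enumerate(item_popularity):
    if pop < thresholds[0]:
        tasks["cold"].append(item)
    elif pop < thresholds[1]:
        tasks["warm"].append(item)
    else:
        tasks["hot"].append(item)
# each bucket becomes its own meta-task, e.g. with content features weighted
# more heavily in the "cold" task and behavior features in the "hot" task
```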

[IR-5] ForPKG-1.0: A Framework for Constructing Forestry Policy Knowledge Graph and Application Analysis

链接: https://arxiv.org/abs/2411.11090
作者: Jingyun Sun,Zhongze Luo
关键词-EN: policy knowledge graph, knowledge graph, policy knowledge, knowledge, large language models
类目: Information Retrieval (cs.IR)
*备注: 22 pages

点击查看摘要

Abstract:A policy knowledge graph can provide decision support for tasks such as project compliance, policy analysis, and intelligent question answering, and can also serve as an external knowledge base to assist the reasoning process of related large language models. Although there have been many related works on knowledge graphs, there is currently a lack of research on the construction methods of policy knowledge graphs. This paper, focusing on the forestry field, designs a complete policy knowledge graph construction framework, including: firstly, proposing a fine-grained forestry policy domain ontology; then, proposing an unsupervised policy information extraction method, and finally, constructing a complete forestry policy knowledge graph. The experimental results show that the proposed ontology has good expressiveness and extensibility, and the policy information extraction method proposed in this paper achieves better results than other unsupervised methods. Furthermore, by analyzing the application of the knowledge graph in the retrieval-augmented-generation task of the large language models, the practical application value of the knowledge graph in the era of large language models is confirmed. The knowledge graph resource will be released on an open-source platform and can serve as the basic knowledge base for forestry policy-related intelligent systems. It can also be used for academic research. In addition, this study can provide reference and guidance for the construction of policy knowledge graphs in other fields.

附件下载

点击下载今日全部论文列表