本篇博文主要展示 2024-11-15 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2024-11-15)

今日共更新368篇论文,其中:

  • 自然语言处理41篇(Computation and Language (cs.CL))
  • 人工智能106篇(Artificial Intelligence (cs.AI))
  • 计算机视觉85篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习105篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] A Bayesian Optimization Approach to Machine Translation Reranking

【速读】: 该论文试图解决机器翻译系统中重排序(reranking)过程带来的高计算成本问题。解决方案的关键在于将重排序问题建模为贝叶斯优化 (BayesOpt) 问题,通过策略性地选择候选者进行评分,平衡探索与利用,从而在仅对候选列表的一小部分进行评分的情况下找到高分的候选者。具体来说,论文提出了一种多保真度设置的贝叶斯优化方法,首先使用一个成本较低但噪声较大的代理评分模型对候选者进行初步评分,进一步优化了成本与性能的平衡,尤其是在使用较小但训练良好的蒸馏代理评分模型时。

链接: https://arxiv.org/abs/2411.09694
作者: Julius Cheng,Maike Züfle,Vilém Zouhar,Andreas Vlachos
关键词-EN: highest-scoring candidate remains, output quality, external scoring model, returning the highest-scoring, remains a simple
类目: Computation and Language (cs.CL)
备注: v1: Preprint version

点击查看摘要

Abstract:Reranking a list of candidates from a machine translation system with an external scoring model and returning the highest-scoring candidate remains a simple and effective method for improving the overall output quality. Translation scoring models continue to grow in size, with the best models being comparable to generation models. Thus, reranking can add substantial computational cost to the translation pipeline. In this work, we pose reranking as a Bayesian optimization (BayesOpt) problem. By strategically selecting candidates to score based on a balance of exploration and exploitation, we show that it is possible to find top-scoring candidates when scoring only a fraction of the candidate list. For instance, our method achieves the same CometKiwi score using only 70 scoring evaluations compared to a baseline system using 180. We present a multi-fidelity setting for BayesOpt, where the candidates are first scored with a cheaper but noisier proxy scoring model, which further improves the cost-performance tradeoff when using smaller but well-trained distilled proxy scorers.
摘要:通过使用外部评分模型对机器翻译系统生成的候选列表进行重新排序,并返回得分最高的候选,仍然是提高整体输出质量的一种简单而有效的方法。随着翻译评分模型规模的不断扩大,最佳模型在性能上已可与生成模型相媲美。因此,重新排序可能会显著增加翻译流程的计算成本。在本研究中,我们将重新排序问题视为贝叶斯优化 (BayesOpt) 问题。通过策略性地选择候选进行评分,平衡探索与利用,我们证明了在仅对候选列表的一小部分进行评分的情况下,仍能找到得分最高的候选。例如,我们的方法在使用仅70次评分评估的情况下,达到了与使用180次评分的基线系统相同的CometKiwi评分。我们提出了一种多保真度设置的贝叶斯优化方法,其中候选首先通过一个成本较低但噪声较大的代理评分模型进行评分,这进一步改善了使用较小但训练良好的蒸馏代理评分器时的成本-性能权衡。
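
下面用一段极简的 Python 代码示意上述多保真度重排序思路:先用廉价的代理评分器粗排,再仅对入围的少量候选调用昂贵的精确评分器。这只是对核心思想的草图,并非论文的 BayesOpt 实现;其中的候选与评分函数均为本文虚构的玩具示例。

```python
def rerank_multifidelity(candidates, proxy_score, exact_score, budget):
    """多保真度重排序示意: 先用低成本代理评分粗排,
    再仅对前 budget 个候选调用昂贵的精确评分器。"""
    ranked = sorted(candidates, key=proxy_score, reverse=True)  # 低保真度粗排
    shortlist = ranked[:budget]                                 # 入围名单
    return max(shortlist, key=exact_score)                      # 高保真度精排

# 玩具示例: 代理评分 = 精确评分 + 确定性"噪声"
exact = lambda c: -abs(c - 73)                   # 假设候选 73 质量最高
proxy = lambda c: exact(c) + (4 * c) % 11 - 5    # 便宜但有偏差的代理评分
best = rerank_multifidelity(range(100), proxy, exact, budget=20)
print(best)  # 只精确评分了 20 个候选, 仍找到最佳候选 73
```

论文中的 BayesOpt 还会在评分过程中不断更新对未评分候选的信念,以平衡探索与利用;这里为简洁省略了该环节。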

[NLP-1] LLM Hallucination Reasoning with Zero-shot Knowledge Test

【速读】: 该论文试图解决大语言模型(LLM)在生成文本时偶尔产生的幻觉(hallucination)问题,特别是现有检测方法未能区分不同类型的幻觉,从而影响检测性能。解决方案的关键在于引入了一个新的任务——幻觉推理(Hallucination Reasoning),该任务将LLM生成的文本分类为对齐(aligned)、错位(misaligned)和虚构(fabricated)三种类型。论文提出了一种新颖的零样本方法,通过评估LLM对给定提示和文本的知识掌握程度,来判断其生成的文本是否存在幻觉。实验结果表明,这种方法在幻觉推理任务中表现出色,并强调了其在提升检测性能方面的重要性。

链接: https://arxiv.org/abs/2411.09689
作者: Seongmin Lee,Hsiang Hsu,Chun-Fu Chen
关键词-EN: poses significant challenges, occasionally generate unfaithful, LLMs occasionally generate, poses significant, practical applications
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:LLM hallucination, where LLMs occasionally generate unfaithful text, poses significant challenges for their practical applications. Most existing detection methods rely on external knowledge, LLM fine-tuning, or hallucination-labeled datasets, and they do not distinguish between different types of hallucinations, which are crucial for improving detection performance. We introduce a new task, Hallucination Reasoning, which classifies LLM-generated text into one of three categories: aligned, misaligned, and fabricated. Our novel zero-shot method assesses whether LLM has enough knowledge about a given prompt and text. Our experiments conducted on new datasets demonstrate the effectiveness of our method in hallucination reasoning and underscore its importance for enhancing detection performance.
摘要:大语言模型(LLM)在生成文本时偶尔会出现不忠实的情况,即所谓的“幻觉”,这对其在实际应用中构成了重大挑战。现有的检测方法大多依赖于外部知识、LLM 微调或带有幻觉标签的数据集,但这些方法未能区分不同类型的幻觉,而这种区分对于提升检测性能至关重要。我们提出了一项新的任务——幻觉推理,该任务将 LLM 生成的文本分类为三种类别之一:对齐、错位和捏造。我们创新的零样本方法评估了 LLM 对给定提示和文本的知识的充分性。我们在新数据集上进行的实验证明了我们的方法在幻觉推理中的有效性,并强调了其在提升检测性能方面的重要性。

[NLP-2] Squeezed Attention: Accelerating Long Context Length LLM Inference

【速读】: 该论文试图解决在大语言模型 (LLM) 应用中,由于输入提示的长度增加而导致推理效率低下的问题。解决方案的关键在于提出了一种名为“压缩注意力 (Squeezed Attention)”的机制,通过离线优化来加速处理用户输入。具体来说,论文首先利用K-means聚类对固定上下文的键进行语义相似性分组,并用单一质心值表示每个聚类。在推理过程中,通过将用户输入的查询标记与这些质心进行比较,预测哪些固定上下文中的键在语义上是相关的,从而在推理时仅加载这些重要的键。这种方法不仅减少了带宽和计算成本,还将注意力计算的复杂度从线性降低到对数级别。此外,论文还实现了优化的Triton内核用于质心比较和稀疏FlashAttention,从而在长上下文推理的预填充和生成阶段实现了超过4倍的加速。

链接: https://arxiv.org/abs/2411.09688
作者: Coleman Hooper,Sehoon Kim,Hiva Mohammadzadeh,Monishwaran Maheswaran,June Paik,Michael W. Mahoney,Kurt Keutzer,Amir Gholami
关键词-EN: Emerging Large Language, Large Language Model, complex downstream tasks, perform complex downstream, require long input
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Emerging Large Language Model (LLM) applications require long input prompts to perform complex downstream tasks like document analysis and code generation. For these long context length applications, the length of the input prompt poses a significant challenge in terms of inference efficiency since the inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, thereby providing the opportunity to perform offline optimizations to process user inputs quickly, as they are received. In this work, we propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. We first leverage K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During inference, we compare query tokens from the user input with the centroids to predict which of the keys from the fixed context are semantically relevant and need to be loaded during inference. We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs. We also extend our method to use a hierarchical centroid lookup to identify important keys, which can reduce the complexity of attention from linear to logarithmic with respect to the context length. We implement optimized Triton kernels for centroid comparison and sparse FlashAttention with important keys, achieving more than 4x speedups during both the prefill and generation phases for long-context inference. Furthermore, we have extensively evaluated our method on various long-context benchmarks including LongBench, where it achieves a 3x reduction in KV cache budget without accuracy loss and up to an 8x reduction with 0.5 point accuracy gap for various models.
摘要:新兴的大语言模型(LLM)应用需要较长的输入提示来执行复杂的下游任务,如文档分析和代码生成。对于这些长上下文长度的应用,输入提示的长度在推理效率方面构成了显著挑战,因为推理成本随序列长度线性增加。然而,对于许多此类应用,提示中的大部分上下文在不同用户输入之间是固定的,从而提供了在接收到用户输入时通过离线优化快速处理输入的机会。在本研究中,我们提出了压缩注意力(Squeezed Attention)机制,以加速那些输入提示中大部分内容固定的LLM应用。我们首先离线利用K-means聚类,基于语义相似性对固定上下文的键进行分组,并用单一质心值表示每个聚类。在推理过程中,我们将用户输入的查询Token与质心进行比较,预测固定上下文中哪些键在语义上相关、需要在推理时加载。然后,我们仅使用固定上下文中这些重要的键计算精确的注意力,从而减少带宽和计算成本。我们还扩展了该方法,使用分层质心查找来识别重要键,这可以将注意力相对于上下文长度的复杂度从线性降低到对数级。我们为质心比较和基于重要键的稀疏FlashAttention实现了优化的Triton内核,在长上下文推理的预填充和生成阶段实现了超过4倍的加速。此外,我们在包括LongBench在内的各种长上下文基准上广泛评估了我们的方法:在不损失准确性的情况下,KV缓存预算减少了3倍;在准确性差距仅为0.5分的情况下,对各种模型最多可减少8倍。
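
压缩注意力的核心流程(离线对固定上下文的键做 K-means 聚类、推理时查询只与质心比较、再仅对少量重要键做精确注意力)可以用 NumPy 粗略示意如下。代码是本文的简化草图(例如采用确定性的质心初始化),并非论文的 Triton 内核实现:

```python
import numpy as np

def kmeans(X, k, iters=10):
    """离线阶段: 对固定上下文的键做 K-means 聚类 (确定性初始化, 仅为演示)。"""
    cent = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        assign = np.argmax(X @ cent.T, axis=1)       # 按内积相似度归簇
        for j in range(k):
            if np.any(assign == j):
                cent[j] = X[assign == j].mean(axis=0)
    return cent, assign

def squeezed_attention(q, K, V, cent, assign, top_c=1):
    """推理阶段: 查询只与 k 个质心比较 (而非全部 n 个键),
    仅加载被预测为相关的簇中的键, 再对它们计算精确注意力。"""
    keep = np.argsort(cent @ q)[-top_c:]             # 选出最相关的簇
    idx = np.flatnonzero(np.isin(assign, keep))      # 这些簇里的"重要键"
    scores = K[idx] @ q / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx], idx
```

论文进一步用分层质心查找把复杂度降到对数级,并配合稀疏 FlashAttention 内核加速,此处从略。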

[NLP-3] Adaptive Decoding via Latent Preference Optimization

【速读】: 该论文试图解决在语言模型解码过程中,使用单一固定温度采样策略无法适应不同任务需求的问题。解决方案的关键在于引入自适应解码 (Adaptive Decoding),这是一种在推理时动态选择采样温度的机制,可以在令牌或示例级别上进行调整,以优化模型性能。为了训练这一机制的参数,论文提出了潜在偏好优化 (Latent Preference Optimization, LPO),这是一种用于训练离散潜在变量的通用方法,如温度选择。实验结果表明,该方法在需要不同温度的任务(如UltraFeedback、创意故事写作和GSM8K)中均优于所有固定温度解码策略。

链接: https://arxiv.org/abs/2411.09661
作者: Shehzaad Dhuliawala,Ilia Kulikov,Ping Yu,Asli Celikyilmaz,Jason Weston,Sainbayar Sukhbaatar,Jack Lanchantin
关键词-EN: factually accurate, language model decoding, Creative Story Writing, creative responses, higher temperature sampling
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:During language model decoding, it is known that using higher temperature sampling gives more creative responses, while lower temperatures are more factually accurate. However, such models are commonly applied to general instruction following, which involves both creative and fact seeking tasks, using a single fixed temperature across all examples and tokens. In this work, we introduce Adaptive Decoding, a layer added to the model to select the sampling temperature dynamically at inference time, at either the token or example level, in order to optimize performance. To learn its parameters we introduce Latent Preference Optimization (LPO) a general approach to train discrete latent variables such as choices of temperature. Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures, including UltraFeedback, Creative Story Writing, and GSM8K.
摘要:在语言模型解码过程中,已知使用较高温度采样会生成更具创意的响应,而较低温度则更偏向于事实准确性。然而,这些模型通常应用于一般指令跟随任务,这些任务既涉及创意生成也涉及事实查找,且在所有示例和Token上使用单一固定的温度。在本研究中,我们提出了自适应解码(Adaptive Decoding),这是一种添加到模型中的层,能够在推理时动态选择采样温度,无论是基于Token级别还是示例级别,以优化性能。为了学习其参数,我们引入了潜在偏好优化(Latent Preference Optimization, LPO),这是一种用于训练离散潜在变量(如温度选择)的通用方法。我们的方法在需要不同温度的多种任务中表现优于所有固定解码温度,包括UltraFeedback、创意故事写作和GSM8K。
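
自适应解码在每个解码步骤由一个小的"温度头"根据隐状态选择离散温度,再用该温度采样。下面是前向流程的示意(温度头权重随机初始化;论文中这些参数由 LPO 训练得到,类名与温度候选值均为本文假设):

```python
import numpy as np

def softmax(x, temp=1.0):
    z = x / temp
    z = z - z.max()                 # 数值稳定
    e = np.exp(z)
    return e / e.sum()

class AdaptiveDecoder:
    """在 Token 级别动态选择采样温度的示意层。"""
    def __init__(self, hidden_dim, temps=(0.3, 0.7, 1.0, 1.5), seed=0):
        self.temps = temps
        r = np.random.default_rng(seed)
        self.W = r.normal(size=(hidden_dim, len(temps)))  # 温度头 (应由 LPO 训练)

    def step(self, hidden, logits, rng):
        t_idx = int(np.argmax(hidden @ self.W))  # 推理时取得分最高的温度选项
        temp = self.temps[t_idx]
        token = rng.choice(len(logits), p=softmax(logits, temp))
        return token, temp

dec = AdaptiveDecoder(hidden_dim=4)
token, temp = dec.step(np.ones(4), np.array([2.0, 1.0, 0.5]),
                       np.random.default_rng(1))
```

LPO 的作用正是为这种离散潜在选择(选哪个温度)提供训练信号,使温度头不再是随机权重。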

[NLP-4] On the Limits of Language Generation: Trade-Offs Between Hallucination and Mode Collapse

【速读】: 该论文试图解决的问题是:在统计语言生成模型中,是否能够同时实现生成的一致性(consistency)和广度(breadth),即模型能否在训练数据增加时,生成出未见过的有效字符串,并充分捕捉目标语言的丰富性。解决方案的关键在于揭示了对于大多数语言模型(包括基于下一个词预测的模型),在大多数候选语言集合中,同时实现一致性和广度是不可能的。这与Kleinberg和Mullainathan [KM24]的结果形成对比,后者表明在任何可数语言集合中,可以实现一致性但不一定有广度。论文通过建立近似紧的样本数量边界,展示了生成一致性和广度的难度,并提出当负样本(即不属于目标语言的字符串)与正样本一起提供时,可以实现一致性和广度,这表明后训练反馈机制在减少幻觉(hallucination)和限制模式崩溃(mode collapse)方面可能至关重要。

链接: https://arxiv.org/abs/2411.09642
作者: Alkis Kalavasis,Anay Mehrotra,Grigoris Velegkas
关键词-EN:
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
备注: Abstract shortened to fit arXiv limit

点击查看摘要

Abstract:Specifying all desirable properties of a language model is challenging, but certain requirements seem essential. Given samples from an unknown language, the trained model should produce valid strings not seen in training and be expressive enough to capture the language’s full richness. Otherwise, outputting invalid strings constitutes “hallucination,” and failing to capture the full range leads to “mode collapse.” We ask if a language model can meet both requirements. We investigate this within a statistical language generation setting building on Gold and Angluin. Here, the model receives random samples from a distribution over an unknown language K, which belongs to a possibly infinite collection of languages. The goal is to generate unseen strings from K. We say the model generates from K with consistency and breadth if, as training size increases, its output converges to all unseen strings in K. Kleinberg and Mullainathan [KM24] asked if consistency and breadth in language generation are possible. We answer this negatively: for a large class of language models, including next-token prediction models, this is impossible for most collections of candidate languages. This contrasts with [KM24]'s result, showing consistent generation without breadth is possible for any countable collection of languages. Our finding highlights that generation with breadth fundamentally differs from generation without breadth. As a byproduct, we establish near-tight bounds on the number of samples needed for generation with or without breadth. Finally, our results offer hope: consistent generation with breadth is achievable for any countable collection of languages when negative examples (strings outside K) are available alongside positive ones. This suggests that post-training feedback, which encodes negative examples, can be crucial in reducing hallucinations while limiting mode collapse. 
摘要:指定语言模型的所有理想属性是具有挑战性的,但某些要求似乎是必不可少的。给定未知语言的样本,训练后的模型应能生成训练中未见过的有效字符串,并具备足够的表达能力以捕捉该语言的全部丰富性。否则,输出无效字符串构成“幻觉”,而未能捕捉全部范围则导致“模式崩溃”。我们探讨了一个语言模型是否能同时满足这两个要求。我们在基于 Gold 和 Angluin 的统计语言生成环境中进行研究。在此环境中,模型从属于可能无限多种语言集合的未知语言 K 的分布中接收随机样本。目标是生成 K 中的未见字符串。我们称模型以一致性和广度生成 K 中的字符串,如果随着训练规模的增加,其输出收敛到 K 中的所有未见字符串。Kleinberg 和 Mullainathan [KM24] 曾提出语言生成中的一致性和广度是否可能的问题。我们对此给出了否定答案:对于包括下一个 Token 预测模型在内的大类语言模型,对于大多数候选语言集合,这是不可能的。这与 [KM24] 的结果形成对比,后者表明在任何可数语言集合中,无需广度的一致生成是可能的。我们的发现强调了广度生成与无广度生成在根本上存在差异。作为副产品,我们建立了在有无广度生成情况下所需样本数量的近似紧界。最后,我们的结果带来了希望:当正样本(K 中的字符串)与负样本(K 外的字符串)同时可用时,任何可数语言集合中的一致广度生成是可实现的。这表明,训练后的反馈(编码负样本)在减少幻觉的同时限制模式崩溃方面可能至关重要。
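
论文讨论的"一致性"与"广度"两个要求可以粗略形式化如下(记号为本文示意,并非论文原文的精确定义):

```latex
% K: 目标语言;  S_n: 前 n 个训练样本;  G_n: 模型见到 S_n 后可能输出的字符串集合
% 一致性 (避免幻觉): 从某个 N 起, 模型只生成 K 中的有效字符串
\exists N \;\; \forall n \ge N: \quad G_n \subseteq K
% 广度 (避免模式崩溃): 随训练样本增加, 输出最终覆盖 K 中所有未见过的字符串
\bigcup_{n \ge N} G_n \;\supseteq\; K \setminus \bigcup_{n} S_n
```

直观上,一致性约束输出"不出界",广度约束输出"铺满全集";论文证明仅凭正样本,二者对大多数候选语言集合不可兼得。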


[NLP-5] PTR: Precision-Driven Tool Recommendation for Large Language Models

【速读】: 该论文试图解决为大型语言模型(LLMs)推荐精确工具集的问题。解决方案的关键在于提出了一个名为“精度驱动工具推荐”(Precision-driven Tool Recommendation, PTR)的新方法。PTR通过利用历史工具包的使用情况,捕捉初始的简洁工具集,并通过工具匹配动态调整工具集,最终基于多视角进行工具添加。这种方法旨在为LLMs提供最适合特定任务的工具集,避免因固定数量的工具选择导致的效率低下问题,如冗余或不合适的工具。论文还引入了新的数据集RecTools和评估指标TRACC,以验证PTR方法的有效性,并通过实验证明了其在多个基准数据集上的良好表现。

链接: https://arxiv.org/abs/2411.09613
作者: Hang Gao,Yongfeng Zhang
关键词-EN: Large Language Models, augmenting Large Language, Language Models, Large Language, augmenting Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:By augmenting Large Language Models (LLMs) with external tools, their capacity to solve complex problems has been significantly enhanced. However, despite ongoing advancements in the parsing capabilities of LLMs, incorporating all available tools simultaneously in the prompt remains impractical due to the vast number of external tools. Consequently, it is essential to provide LLMs with a precise set of tools tailored to the specific task, considering both quantity and quality. Current tool retrieval methods primarily focus on refining the ranking list of tools and directly packaging a fixed number of top-ranked tools as the tool set. However, these approaches often fail to equip LLMs with the optimal set of tools prior to execution, since the optimal number of tools for different tasks could be different, resulting in inefficiencies such as redundant or unsuitable tools, which impede immediate access to the most relevant tools. This paper addresses the challenge of recommending precise toolsets for LLMs. We introduce the problem of tool recommendation, define its scope, and propose a novel Precision-driven Tool Recommendation (PTR) approach. PTR captures an initial, concise set of tools by leveraging historical tool bundle usage and dynamically adjusts the tool set by performing tool matching, culminating in a multi-view-based tool addition. Additionally, we present a new dataset, RecTools, and a metric, TRACC, designed to evaluate the effectiveness of tool recommendation for LLMs. We further validate our design choices through comprehensive experiments, demonstrating promising accuracy across two open benchmarks and our RecTools dataset.
摘要:通过将大语言模型 (LLM) 与外部工具相结合,其解决复杂问题的能力得到了显著提升。然而,尽管大语言模型在解析能力方面不断进步,但由于外部工具数量庞大,在提示中同时整合所有可用工具仍然不切实际。因此,为大语言模型提供一套针对特定任务量身定制的工具集显得尤为重要,这需要同时考虑工具的数量和质量。当前的工具检索方法主要集中在优化工具排名列表,并将固定数量的顶级工具直接打包为工具集。然而,这些方法往往无法在执行前为大语言模型配备最优的工具集,因为不同任务所需工具的数量可能不同,导致效率低下,如工具冗余或不适用,从而阻碍了对最相关工具的即时访问。本文针对为大语言模型推荐精确工具集的挑战进行了探讨。我们提出了工具推荐问题,界定了其范围,并提出了一种新颖的精准驱动工具推荐 (PTR) 方法。PTR 通过利用历史工具包使用情况,捕捉初始的简洁工具集,并通过执行工具匹配动态调整工具集,最终通过多视角工具添加实现优化。此外,我们引入了一个新的数据集 RecTools 和一个评估工具推荐效果的指标 TRACC。通过全面的实验,我们验证了设计选择的有效性,展示了在两个公开基准和我们自有的 RecTools 数据集上的良好准确性。
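
PTR 的"历史工具包 → 初始工具集 → 匹配补充"流程可以用如下玩具代码示意。相似度函数、阈值与两阶段划分都是本文为说明而做的大幅简化,并非论文的实际算法:

```python
def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def recommend_tools(query, history, tool_descs, add_thresh=0.3):
    """history: {历史查询: 其使用过的工具包};  tool_descs: {工具名: 描述}"""
    # 阶段 1: 借最相似历史查询的工具包, 得到初始的简洁工具集
    past_q = max(history, key=lambda q: jaccard(query, q))
    tools = set(history[past_q])
    # 阶段 2: 按描述与查询的匹配度, 动态补充遗漏的工具
    for name, desc in tool_descs.items():
        if name not in tools and jaccard(query, desc) >= add_thresh:
            tools.add(name)
    return tools

history = {"convert currency to usd": ["currency_api"],
           "plot stock price chart": ["stock_api", "plotter"]}
tool_descs = {"currency_api": "convert currency rates",
              "stock_api": "fetch stock price data",
              "plotter": "plot chart from data",
              "price_api": "fetch bitcoin price",
              "weather_api": "get weather forecast"}
print(recommend_tools("plot bitcoin price chart", history, tool_descs))
# 初始集来自最相似的历史查询, 随后按描述匹配补入 price_api
```

与"固定取 Top-K 工具"不同,这种两阶段流程得到的工具集大小随任务而变,这正是 PTR 想要的性质。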

[NLP-6] The Moral Foundations Weibo Corpus

【速读】: 该论文试图解决中文自然语言处理中道德情感测量的准确性问题,特别是现有语料库在语言表达上的局限性。解决方案的关键在于引入了一个名为“Moral Foundation Weibo Corpus”的新语料库,该语料库包含25,671条微博评论,涵盖六个不同的话题领域,并根据基于扎根理论的十种道德类别进行手动标注。通过系统培训的标注者进行至少三次标注,并使用kappa测试评估标注一致性,确保了标注的可靠性。此外,论文还应用了最新的语言模型来补充手动标注,通过实验比较其性能,为道德情感分类提供了基准结果。

链接: https://arxiv.org/abs/2411.09612
作者: Renjie Cao,Miaoyan Hu,Jiahan Wei,Baha Ihnaini
关键词-EN: including social media, social media selfpresentation, shaping behavioral styles, language significantly influence, natural language significantly
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Moral sentiments expressed in natural language significantly influence both online and offline environments, shaping behavioral styles and interaction patterns, including social media self-presentation, cyberbullying, adherence to social norms, and ethical decision-making. To effectively measure moral sentiments in natural language processing texts, it is crucial to utilize large, annotated datasets that provide nuanced understanding for accurate analysis and model training. However, existing corpora, while valuable, often face linguistic limitations. To address this gap in the Chinese language domain, we introduce the Moral Foundation Weibo Corpus. This corpus consists of 25,671 Chinese comments on Weibo, encompassing six diverse topic areas. Each comment is manually annotated by at least three systematically trained annotators based on ten moral categories derived from a grounded theory of morality. To assess annotator reliability, we present the kappa test results, a gold standard for measuring consistency. Additionally, we apply several of the latest large language models to supplement the manual annotations, conducting analytical experiments to compare their performance and report baseline results for moral sentiment classification.
摘要:自然语言中表达的道德情感显著影响线上和线下环境,塑造行为风格和互动模式,包括社交媒体自我呈现、网络霸凌、遵守社会规范和道德决策等。为了有效测量自然语言处理文本中的道德情感,利用大规模、标注精细的数据集至关重要,这些数据集能够提供细致的理解,以进行准确的分析和模型训练。然而,现有的语料库虽然有价值,但往往面临语言限制。为了填补中文语言领域的这一空白,我们引入了道德基础微博语料库。该语料库包含25,671条微博评论,涵盖六个不同的主题领域。每条评论均由至少三名经过系统培训的标注员根据基于道德基础理论的十个道德类别进行手动标注。为了评估标注员的可靠性,我们提供了kappa测试结果,这是衡量一致性的黄金标准。此外,我们还应用了多种最新的大语言模型来补充手动标注,进行了分析实验以比较其性能,并报告了道德情感分类的基线结果。
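
论文用 kappa 检验评估标注一致性。两名标注员情形下的 Cohen's kappa 可以这样计算(多人情形常用 Fleiss' kappa,这里仅作最简示意):

```python
def cohens_kappa(labels_a, labels_b):
    """两名标注员在同一批样本上的 Cohen's kappa 一致性系数。"""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    cats = set(labels_a) | set(labels_b)
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n   # 观察一致率
    p_e = sum(labels_a.count(c) / n * (labels_b.count(c) / n)
              for c in cats)                                    # 随机期望一致率
    return (p_o - p_e) / (1 - p_e)  # p_e = 1 (标注完全同质) 时无定义, 此处从略

print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1]))  # 0.5
```

kappa 相比原始一致率的好处是扣除了"碰巧一致"的期望成分,因此更适合作为标注可靠性的报告指标。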

[NLP-7] Initial Nugget Evaluation Results for the TREC 2024 RAG Track with the AutoNuggetizer Framework

【速读】: 该论文试图解决检索增强生成 (Retrieval-Augmented Generation, RAG) 系统的评估难题,特别是如何自动化评估这些系统生成的答案的质量。解决方案的关键在于引入AutoNuggetizer框架,该框架利用大型语言模型 (Large Language Models) 自动生成“nuggets”(信息片段)并将其分配给系统生成的答案,从而实现对RAG系统的自动化评估。通过与人工评估的对比,研究显示自动评估与人工评估之间存在强相关性,表明该自动化评估方法可以有效指导未来RAG系统的迭代改进。

链接: https://arxiv.org/abs/2411.09607
作者: Ronak Pradeep,Nandan Thakur,Shivani Upadhyay,Daniel Campos,Nick Craswell,Jimmy Lin
关键词-EN: Retrieval-Augmented Generation, Question Answering Track, TREC Question Answering, RAG systems, RAG
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This report provides an initial look at partial results from the TREC 2024 Retrieval-Augmented Generation (RAG) Track. We have identified RAG evaluation as a barrier to continued progress in information access (and more broadly, natural language processing and artificial intelligence), and it is our hope that we can contribute to tackling the many challenges in this space. The central hypothesis we explore in this work is that the nugget evaluation methodology, originally developed for the TREC Question Answering Track in 2003, provides a solid foundation for evaluating RAG systems. As such, our efforts have focused on “refactoring” this methodology, specifically applying large language models to both automatically create nuggets and to automatically assign nuggets to system answers. We call this the AutoNuggetizer framework. Within the TREC setup, we are able to calibrate our fully automatic process against a manual process whereby nuggets are created by human assessors semi-manually and then assigned manually to system answers. Based on initial results across 21 topics from 45 runs, we observe a strong correlation between scores derived from a fully automatic nugget evaluation and a (mostly) manual nugget evaluation by human assessors. This suggests that our fully automatic evaluation process can be used to guide future iterations of RAG systems.
摘要:本报告初步展示了 TREC 2024 检索增强生成 (Retrieval-Augmented Generation, RAG) 赛道部分结果。我们认识到 RAG 评估是信息获取(以及更广泛的自然语言处理和人工智能领域)持续进步的障碍,我们希望能够在解决这一领域的诸多挑战中做出贡献。我们在这项工作中探索的核心假设是,最初为 2003 年 TREC 问答赛道开发的“片段评估方法”为评估 RAG 系统提供了坚实的基础。因此,我们的工作重点在于“重构”这一方法,特别是应用大语言模型来自动创建片段,并自动将片段分配给系统答案。我们称之为 AutoNuggetizer 框架。在 TREC 的设置中,我们能够将我们的全自动流程与人工流程进行校准,其中片段由人工评估者半自动创建,然后手动分配给系统答案。基于来自 45 次运行的 21 个主题的初步结果,我们观察到全自动片段评估得分与人工评估者(大部分)手动片段评估得分之间存在强相关性。这表明我们的全自动评估流程可以用于指导未来 RAG 系统的迭代。
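
在 nugget 评估中,系统答案的得分通常取决于它覆盖了多少(加权的)信息片段。下面是一个示意性的打分函数;权重与公式为本文假设,并非 TREC 官方定义,而"片段是否被答案覆盖"在 AutoNuggetizer 中由大语言模型自动判定:

```python
def nugget_score(nuggets, supported, vital_w=1.0, okay_w=0.5):
    """nuggets: (片段文本, 是否关键) 列表; supported: 答案是否覆盖各片段 (0/1)。"""
    total = got = 0.0
    for (text, vital), hit in zip(nuggets, supported):
        w = vital_w if vital else okay_w    # 关键片段权重更高
        total += w
        got += w * hit
    return got / total if total else 0.0

nuggets = [("BayesOpt 可减少评分次数", True),
           ("实验基于 CometKiwi", False),
           ("提出了多保真度设置", True)]
print(nugget_score(nuggets, [1, 0, 1]))  # 0.8
```

报告中提到的"全自动与人工评估强相关",即是在这类得分上比较两种判定来源得到的结论。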

[NLP-8] LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

【速读】: 该论文试图解决将预训练的大型语言模型(LLMs)扩展到生成3D网格(3D meshes)的问题,关键在于如何将3D网格数据有效地转化为LLMs能够处理的离散符号(discrete tokens)。解决方案的核心是引入LLaMA-Mesh,一种将3D网格的顶点坐标和面定义表示为纯文本的新方法,从而无需扩展词汇表即可直接与LLMs集成。通过构建监督微调(SFT)数据集,LLaMA-Mesh使预训练的LLMs能够从文本提示生成3D网格、生成交错的文本和3D网格输出,并理解和解释3D网格,从而在文本和3D模态之间实现统一。

链接: https://arxiv.org/abs/2411.09595
作者: Zhengyi Wang,Jonathan Lorraine,Yikai Wang,Hang Su,Jun Zhu,Sanja Fidler,Xiaohui Zeng
关键词-EN: large language models, capabilities of large, large language, work explores expanding, LLMs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: See the project website at this https URL

点击查看摘要

Abstract:This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.
摘要:本研究探讨了如何扩展大语言模型 (LLMs) 在文本预训练的基础上,生成统一的 3D 网格模型。这一方法具有两大关键优势:(1) 利用 LLMs 中已嵌入的空间知识,这些知识源自 3D 教程等文本资源;(2) 实现对话式的 3D 生成和网格理解。主要挑战在于如何将 3D 网格数据有效地 Token 化,使其成为 LLMs 能够无缝处理的离散 Token。为此,我们提出了 LLaMA-Mesh,这是一种新颖的方法,它将 3D 网格的顶点坐标和面定义表示为纯文本,从而可以直接与 LLMs 集成,而无需扩展词汇表。我们构建了一个监督微调 (SFT) 数据集,使预训练的 LLMs 能够:(1) 根据文本提示生成 3D 网格;(2) 按需生成交错的文本和 3D 网格输出;(3) 理解和解释 3D 网格。我们的工作首次证明了 LLMs 可以通过文本格式进行微调,以获取复杂的 3D 网格生成所需的空间知识,从而有效地统一 3D 和文本模态。LLaMA-Mesh 在网格生成质量上与从头开始训练的模型相当,同时保持了强大的文本生成性能。
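
"网格即文本"的思路可以用 OBJ 风格的序列化来直观理解:顶点与面都写成普通文本行,LLM 便能像处理普通 Token 序列一样处理网格。以下仅为示意,LLaMA-Mesh 实际采用的坐标量化与格式细节以论文为准:

```python
def mesh_to_text(vertices, faces, decimals=2):
    """把三角网格序列化为 OBJ 风格纯文本 (面索引从 1 开始)。"""
    lines = [f"v {x:.{decimals}f} {y:.{decimals}f} {z:.{decimals}f}"
             for x, y, z in vertices]
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]
    return "\n".join(lines)

# 单个三角形的文本表示
print(mesh_to_text([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [(0, 1, 2)]))
```

由于输出就是普通文本,这种表示无需扩展词汇表即可直接用于监督微调,这正是论文强调的集成优势。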

[NLP-9] BabyLM Challenge: Exploring the Effect of Variation Sets on Language Model Training Efficiency CONLL2024

【速读】: 该论文试图解决当前大型语言模型在训练数据效率方面的问题,特别是探讨儿童导向语言(Child-Directed Speech, CDS)中的变异集(Variation Sets, VSs)如何影响模型的训练效率。解决方案的关键在于通过在CDS数据中引入不同比例的人工VSs,并使用这些数据集训练GPT-2模型,评估VSs对模型性能的影响。研究发现,VSs的存在在某些评估基准(如BLiMP和GLUE)上对模型有益,但在其他基准(如EWOK)上则不然,且结果受训练轮数和语句呈现顺序等多重因素影响。这些发现表明VSs对语言模型有潜在的积极影响,但仍需进一步研究。

链接: https://arxiv.org/abs/2411.09587
作者: Akari Haga,Akiyo Fukatsu,Miyu Oba,Arianna Bisazza,Yohei Oseki
关键词-EN: current large language, data efficiency remains, training data efficiency, large language models, data efficiency
类目: Computation and Language (cs.CL)
备注: This paper accepted BabyLM challenge 2024 at CONLL 2024

点击查看摘要

Abstract:While current large language models have achieved a remarkable success, their data efficiency remains a challenge to overcome. Recently it has been suggested that child-directed speech (CDS) can improve training data efficiency of modern language models based on Transformer neural networks. However, it is not yet understood which specific properties of CDS are effective for training these models. In the context of the BabyLM Challenge, we focus on Variation Sets (VSs), sets of consecutive utterances expressing a similar intent with slightly different words and structures, which are ubiquitous in CDS. To assess the impact of VSs on training data efficiency, we augment CDS data with different proportions of artificial VSs and use these datasets to train an auto-regressive model, GPT-2. We find that the best proportion of VSs depends on the evaluation benchmark: BLiMP and GLUE scores benefit from the presence of VSs, but EWOK scores do not. Additionally, the results vary depending on multiple factors such as the number of epochs and the order of utterance presentation. Taken together, these findings suggest that VSs can have a beneficial influence on language models, while leaving room for further investigation.
摘要:尽管当前的大语言模型取得了显著的成功,但其数据效率仍然是一个需要克服的挑战。最近有研究表明,面向儿童的语音(Child-Directed Speech, CDS)可以提高基于 Transformer 神经网络的现代语言模型的训练数据效率。然而,目前尚不清楚 CDS 的哪些具体属性对训练这些模型有效。在 BabyLM 挑战的背景下,我们重点关注变异集(Variation Sets, VSs),即一组表达相似意图但用词和结构略有不同的连续话语,这些在 CDS 中普遍存在。为了评估 VSs 对训练数据效率的影响,我们通过不同比例的人工 VSs 来增强 CDS 数据,并使用这些数据集来训练一个自回归模型 GPT-2。我们发现,VSs 的最佳比例取决于评估基准:BLiMP 和 GLUE 分数受益于 VSs 的存在,但 EWOK 分数则不然。此外,结果还受到多个因素的影响,如训练轮数和话语呈现顺序。综合来看,这些发现表明 VSs 对语言模型可能具有有益的影响,同时为未来的进一步研究留下了空间。
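
论文的核心操作是把不同比例的人工变异集混入 CDS 训练数据。其数据混合步骤可以极简地示意如下(替换位置与 VS 的构造方式为本文的简化假设):

```python
def mix_variation_sets(corpus, variation_sets, proportion):
    """用人工变异集话语替换语料中给定比例的话语;
    同一变异集内的话语保持连续出现 (模仿 CDS 中的 VS 现象)。"""
    n_vs = int(len(corpus) * proportion)              # 要替换的话语条数
    flat = [u for vs in variation_sets for u in vs]   # 展平但保持 VS 内部连续
    return corpus[: len(corpus) - n_vs] + flat[:n_vs]

corpus = [f"utterance_{i}" for i in range(10)]
vss = [["look at the doggy!", "see the doggy?"],
       ["give me the ball.", "can you give me the ball?"]]
mixed = mix_variation_sets(corpus, vss, proportion=0.4)
```

论文正是用不同的 proportion 取值构造多个训练集,再分别训练 GPT-2 并在 BLiMP、GLUE、EWOK 上比较。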

[NLP-10] Piecing It All Together: Verifying Multi-Hop Multimodal Claims

【速读】: 该论文试图解决现有声明验证数据集通常不需要系统进行复杂推理或有效解释多模态证据的问题。解决方案的关键在于引入了一个新的任务:多跳多模态声明验证 (multi-hop multimodal claim verification)。这一任务要求模型在多种来源(包括文本、图像和表格)的多条证据上进行推理,并确定这些多模态证据是否支持或反驳给定的声明。为此,研究团队构建了一个名为MMCV的大规模数据集,包含16k多跳声明与多模态证据的配对,这些数据通过大型语言模型生成并经过人工反馈的进一步完善。研究表明,即使是最先进的跨模态大型语言模型,在处理多跳推理时也面临挑战。此外,研究还建立了在MMCV子集上的人类表现基准。这一数据集及其评估任务旨在推动未来在多模态多跳声明验证领域的研究。

链接: https://arxiv.org/abs/2411.09547
作者: Haoran Wang,Aman Rangapur,Xiongxiao Xu,Yueqing Liang,Haroon Gharwi,Carl Yang,Kai Shu
关键词-EN: Existing claim verification, Existing claim, effectively interpret multimodal, perform complex reasoning, require systems
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing claim verification datasets often do not require systems to perform complex reasoning or effectively interpret multimodal evidence. To address this, we introduce a new task: multi-hop multimodal claim verification. This task challenges models to reason over multiple pieces of evidence from diverse sources, including text, images, and tables, and determine whether the combined multimodal evidence supports or refutes a given claim. To study this task, we construct MMCV, a large-scale dataset comprising 16k multi-hop claims paired with multimodal evidence, generated and refined using large language models, with additional input from human feedback. We show that MMCV is challenging even for the latest state-of-the-art multimodal large language models, especially as the number of reasoning hops increases. Additionally, we establish a human performance benchmark on a subset of MMCV. We hope this dataset and its evaluation task will encourage future research in multimodal multi-hop claim verification.
摘要:现有的声明验证数据集通常不需要系统进行复杂的推理或有效解释多模态证据。为了解决这一问题,我们引入了一项新的任务:多跳多模态声明验证。该任务要求模型对来自不同来源(包括文本、图像和表格)的多条证据进行推理,并判断综合的多模态证据是支持还是反驳给定的声明。为了研究这一任务,我们构建了 MMCV,这是一个包含 16k 多跳声明与多模态证据配对的大规模数据集,这些数据通过大语言模型生成和精炼,并结合了人类反馈的额外输入。我们发现,即使是最新的最先进的多模态大语言模型,在处理这一任务时也面临挑战,尤其是随着推理跳数的增加。此外,我们在 MMCV 的一个子集上建立了人类性能基准。我们希望这个数据集及其评估任务将促进未来在多模态多跳声明验证领域的研究。

[NLP-11] A Practical Guide to Fine-tuning Language Models with Limited Data

【速读】: 该论文试图解决在数据稀缺的情况下,如何有效利用预训练的大型语言模型(LLMs)进行自然语言处理(NLP)任务的问题。解决方案的关键在于采用迁移学习方法,具体包括:1) 通过初始和持续的预训练策略,更好地利用先验知识以适应未见过的领域和语言;2) 在微调和少样本学习过程中最大化有限数据的效用;3) 针对不同程度的数据稀缺性,采用任务特定的模型和方法。论文旨在为实践者提供克服数据限制的实用指南,并指出未来研究的有前景方向。

链接: https://arxiv.org/abs/2411.09539
作者: Márton Szép,Daniel Rueckert,Rüdiger von Eisenhart-Rothe,Florian Hinterwimmer
关键词-EN: Employing pre-trained Large, Natural Language Processing, pre-trained Large Language, Large Language Models, Employing pre-trained
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Employing pre-trained Large Language Models (LLMs) has become the de facto standard in Natural Language Processing (NLP) despite their extensive data requirements. Motivated by the recent surge in research focused on training LLMs with limited data, particularly in low-resource domains and languages, this paper surveys recent transfer learning approaches to optimize model performance in downstream tasks where data is scarce. We first address initial and continued pre-training strategies to better leverage prior knowledge in unseen domains and languages. We then examine how to maximize the utility of limited data during fine-tuning and few-shot learning. The final section takes a task-specific perspective, reviewing models and methods suited for different levels of data scarcity. Our goal is to provide practitioners with practical guidelines for overcoming the challenges posed by constrained data while also highlighting promising directions for future research.
摘要:尽管预训练大语言模型 (Large Language Models, LLMs) 需要大量数据,但它们已成为自然语言处理 (Natural Language Processing, NLP) 领域的实际标准。近期,针对低资源领域和语言、在有限数据下训练 LLMs 的研究迅速增多;受此启发,本文综述了在数据稀缺的下游任务中优化模型性能的最新迁移学习方法。我们首先探讨了初始和持续预训练策略,以更好地利用未见领域和语言中的先验知识。接着,我们研究了如何在微调和少样本学习过程中最大化有限数据的效用。最后一部分从任务特定的角度出发,回顾了适用于不同数据稀缺程度的模型和方法。我们的目标是向从业者提供实用的指导方针,以克服数据受限带来的挑战,同时指出未来研究的有前景方向。

[NLP-12] Communication Compression for Tensor Parallel LLM Inference

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在多硬件加速器上部署时的推理延迟问题。解决方案的关键在于采用张量并行(Tensor Parallel)策略,并通过细粒度量化技术压缩加速器间的通信数据,从而显著减少首次生成时间(Time-to-First-Token, TTFT),同时保持模型性能的微小损失。具体来说,论文提出了一种方法,通过将选定的激活值压缩3.5到4.5倍,实现了最高2倍的TTFT减少。

链接: https://arxiv.org/abs/2411.09510
作者: Jan Hansen-Palmus,Michael Truong-Le,Oliver Hausdörfer,Alok Verma
关键词-EN: Large Language Models, Large Language, Language Models, Model Parallelism strategies, parameters and operations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have pushed the frontier of artificial intelligence but are comprised of hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators through various Model Parallelism strategies. Our paper looks into the details on one such strategy - Tensor Parallel - and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine grained quantization techniques to compress selected activations by 3.5 - 4.5x. Our proposed method leads up to 2x reduction of time-to-first-token (TTFT) with negligible model performance degradation.
摘要:大语言模型 (Large Language Models, LLMs) 推动了人工智能的前沿,但其包含数千亿个参数和操作。为了减少推理延迟,LLMs 通过多种模型并行策略部署在多个硬件加速器上。本文深入探讨了一种此类策略——张量并行 (Tensor Parallel),并提出通过压缩加速器间通信来减少延迟。我们利用细粒度量化技术将选定的激活值压缩 3.5 至 4.5 倍。我们提出的方法在模型性能几乎无损的情况下,将首次生成 Token 的时间 (TTFT) 减少了高达 2 倍。
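按组细粒度量化的基本思路可以用如下纯 Python 草图说明:每组激活值共享一个缩放因子,将 32 位浮点映射为 8 位整数(约 4 倍压缩)。这只是基于常见对称量化做法的极简示意,分组大小与比特数均为假设,并非论文的实际实现:

```python
def quantize_group(values, num_bits=8):
    """对一组激活值做对称量化:返回整数表示和缩放因子。"""
    qmax = 2 ** (num_bits - 1) - 1          # int8 时为 127
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize_group(q, scale):
    """反量化:恢复为近似的浮点激活值。"""
    return [x * scale for x in q]

# 示例:一组 fp32 激活(32 bit/值)压缩为 int8(8 bit/值)
acts = [0.12, -0.5, 0.33, 0.9, -0.77, 0.05, 0.61, -0.28]
q, s = quantize_group(acts)
restored = dequantize_group(q, s)
max_err = max(abs(a - b) for a, b in zip(acts, restored))
```

反量化后的逐元素误差不超过半个量化步长(scale/2),这有助于直观理解为什么压缩加速器间通信后模型性能仍可几乎无损。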

[NLP-13] The Use of Readability Metrics in Legal Text: A Systematic Literature Review

【速读】: 该论文试图解决法律文档因其复杂的结构和专业术语而导致的理解困难问题。解决方案的关键在于系统性地评估和应用现有的语言复杂性和可读性指标(readability metrics),以提高法律文本的易读性。通过系统综述方法,论文识别了16种不同的可读性指标,其中Flesch-Kincaid Grade Level最为常用。研究还发现,尽管在“知情同意书”领域有较多相关研究,但并非所有法律领域都得到了充分的可读性评估,因此需要进一步达成共识,确定适用于法律文档的可读性指标。

链接: https://arxiv.org/abs/2411.09497
作者: Yu Han,Aaron Ceross,Jeroen H.M. Bergmann
关键词-EN: domain-specific jargon, challenging due, complex structure, inclusion of domain-specific, Understanding
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding the text in legal documents can be challenging due to their complex structure and the inclusion of domain-specific jargon. Laws and regulations are often crafted in such a manner that engagement with them requires formal training, potentially leading to vastly different interpretations of the same texts. Linguistic complexity is an important contributor to the difficulties experienced by readers. Simplifying texts could enhance comprehension across a broader audience, not just among trained professionals. Various metrics have been developed to measure document readability. Therefore, we adopted a systematic review approach to examine the linguistic and readability metrics currently employed for legal and regulatory texts. A total of 3566 initial papers were screened, with 34 relevant studies found and further assessed. Our primary objective was to identify which current metrics were applied for evaluating readability within the legal field. Sixteen different metrics were identified, with the Flesch-Kincaid Grade Level being the most frequently used method. The majority of studies (73.5%) were found in the domain of “informed consent forms”. From the analysis, it is clear that not all legal domains are well represented in terms of readability metrics and that there is a further need to develop more consensus on which metrics should be applied for legal documents.
摘要:理解法律文件中的文本可能具有挑战性,这是由于其复杂的结构以及包含的领域特定术语。法律和法规通常以需要专业培训的方式编写,这可能导致对同一文本产生截然不同的解释。语言复杂性是读者面临困难的重要因素。简化文本可以提高更广泛受众的理解能力,而不仅仅是受过训练的专业人员。已经开发了多种指标来衡量文档的可读性。因此,我们采用了系统性综述方法来考察目前用于法律和法规文本的语言和可读性指标。总共筛选了3566篇初始论文,发现了34篇相关研究并进行了进一步评估。我们的主要目标是确定当前用于评估法律领域内可读性的指标。共识别出16种不同的指标,其中Flesch-Kincaid Grade Level是最常用的方法。大多数研究(73.5%)发现于“知情同意书”领域。从分析中可以清楚地看出,并非所有法律领域在可读性指标方面都得到了充分代表,并且还需要进一步达成共识,以确定应为法律文件应用哪些指标。
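文中最常用的 Flesch-Kincaid Grade Level 有公开的计算公式:FKGL = 0.39 × (词数/句数) + 11.8 × (音节数/词数) − 15.59。下面给出一个极简实现作为参考,其中音节数用连续元音组近似(这是常见的粗略启发式,并非论文所用的工具):

```python
import re

def count_syllables(word):
    """用连续元音组近似英文单词的音节数(粗略启发式)。"""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """FKGL = 0.39*(词数/句数) + 11.8*(音节数/词数) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)
```

分数越高表示理解文本所需的阅读年级越高;对法律文本这类长句、多音节术语密集的文字,该值通常明显偏高。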

[NLP-14] MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs

【速读】: 该论文试图解决大语言模型(LLMs)在低资源语言如蒙古语中表现不佳的问题。解决方案的关键在于系统地评估和提升模型在语言能力(语法和语义)和认知能力(知识和推理)方面的表现。为此,研究团队开发了MM-Eval数据集,该数据集基于现代蒙古语教材并结合WebQSP和MGSM数据集,包含569项语法任务、677项语义任务、344项知识任务和250项推理任务。通过在多个模型上的初步实验,研究揭示了模型在语法任务上表现优于语义任务,以及在低资源语言环境中知识迁移的潜力,从而为提升NLP和LLMs在低资源语言中的应用提供了宝贵的见解和数据支持。

链接: https://arxiv.org/abs/2411.09492
作者: Mengyuan Zhang,Ruihui Wang,Bo Xia,Yuan Sun,Xiaobing Zhao
关键词-EN: Large language models, face notable challenges, Large language, Modern Mongolian Language, Mongolian Language Textbook
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel in high-resource languages but face notable challenges in low-resource languages like Mongolian. This paper addresses these challenges by categorizing capabilities into language abilities (syntax and semantics) and cognitive abilities (knowledge and reasoning). To systematically evaluate these areas, we developed MM-Eval, a specialized dataset based on Modern Mongolian Language Textbook I and enriched with WebQSP and MGSM datasets. Preliminary experiments on models including Qwen2-7B-Instruct, GLM4-9b-chat, Llama3.1-8B-Instruct, GPT-4, and DeepseekV2.5 revealed that: 1) all models performed better on syntactic tasks than semantic tasks, highlighting a gap in deeper language understanding; and 2) knowledge tasks showed a moderate decline, suggesting that models can transfer general knowledge from high-resource to low-resource contexts. The release of MM-Eval, comprising 569 syntax, 677 semantics, 344 knowledge, and 250 reasoning tasks, offers valuable insights for advancing NLP and LLMs in low-resource languages like Mongolian. The dataset is available at this https URL.
摘要:大语言模型(LLMs)在资源丰富的语言中表现出色,但在如蒙古语这样的低资源语言中面临显著挑战。本文通过将能力分为语言能力(句法和语义)和认知能力(知识和推理)来应对这些挑战。为了系统地评估这些领域,我们开发了MM-Eval,这是一个基于《现代蒙古语教材I》的专用数据集,并结合了WebQSP和MGSM数据集进行了丰富。在包括Qwen2-7B-Instruct、GLM4-9b-chat、Llama3.1-8B-Instruct、GPT-4和DeepseekV2.5在内的模型上的初步实验显示:1) 所有模型在句法任务上的表现均优于语义任务,突显了在更深层次语言理解上的差距;2) 知识任务显示出中等程度的下降,表明模型能够将高资源语言中的通用知识转移到低资源语言环境中。MM-Eval的发布,包含569个句法任务、677个语义任务、344个知识任务和250个推理任务,为推动如蒙古语这样的低资源语言的自然语言处理(NLP)和大语言模型的发展提供了宝贵的见解。该数据集可通过此https URL获取。

[NLP-15] Robot Tasks with Fuzzy Time Requirements from Natural Language Instructions

【速读】: 该论文试图解决自然语言指令中模糊时间要求(如“在几分钟内开始”)对传统机器人系统的挑战。解决方案的关键在于引入模糊技能(fuzzy skills),这些技能通过满意度函数(satisfaction functions)来表示模糊的执行时间要求。满意度函数表达了用户对技能执行开始时间的满意度,为机器人提供了执行的时间容差窗口,从而实现基于满意度的最优调度。研究通过用户实验发现,梯形函数(trapezoidal functions)最能近似用户的满意度,并且用户对未来执行时间的容忍度更高。

链接: https://arxiv.org/abs/2411.09436
作者: Sascha Sucker,Michael Neubauer,Dominik Henrich
关键词-EN: Natural language, natural language poses, Natural, satisfaction, execution
类目: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 9 pages, 8 figures, to be published in 2024 IEEE International Conference on Robotic Computing (IRC)

点击查看摘要

Abstract:Natural language allows robot programming to be accessible to everyone. However, the inherent fuzziness in natural language poses challenges for inflexible, traditional robot systems. We focus on instructions with fuzzy time requirements (e.g., “start in a few minutes”). Building on previous robotics research, we introduce fuzzy skills. These define an execution by the robot with so-called satisfaction functions representing vague execution time requirements. Such functions express a user’s satisfaction over potential starting times for skill execution. When the robot handles multiple fuzzy skills, the satisfaction function provides a temporal tolerance window for execution, thus, enabling optimal scheduling based on satisfaction. We generalized such functions based on individual user expectations with a user study. The participants rated their satisfaction with an instruction’s execution at various times. Our investigations reveal that trapezoidal functions best approximate the users’ satisfaction. Additionally, the results suggest that users are more lenient if the execution is specified further into the future.
摘要:自然语言使得机器人编程对每个人都变得触手可及。然而,自然语言固有的模糊性对传统、不灵活的机器人系统构成了挑战。我们专注于带有模糊时间要求的指令(例如,“在几分钟内开始”)。基于先前的机器人研究,我们引入了模糊技能。这些技能通过所谓的满意度函数来定义机器人的执行,这些函数表示模糊的执行时间要求。这些函数表达了用户对技能执行可能开始时间的满意度。当机器人处理多个模糊技能时,满意度函数提供了一个执行的时间容差窗口,从而能够基于满意度进行最佳调度。我们通过用户研究,根据个体用户的期望对这些函数进行了泛化。参与者对在不同时间执行指令的满意度进行了评分。我们的研究表明,梯形函数最能近似用户的满意度。此外,结果表明,如果执行时间指定得更远,用户会更加宽容。
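研究得出的梯形满意度函数可以写成一个分段线性函数:在核心区间内满意度为 1,两侧线性过渡,区间外为 0(参数 a ≤ b ≤ c ≤ d 的具体取值为示意性假设):

```python
def trapezoid_satisfaction(t, a, b, c, d):
    """梯形满意度函数:在 [b, c] 内满意度为 1,
    在 [a, b] 和 [c, d] 上线性过渡,区间外为 0。"""
    if t <= a or t >= d:
        return 0.0
    if b <= t <= c:
        return 1.0
    if t < b:
        return (t - a) / (b - a)
    return (d - t) / (d - c)
```

当机器人需要调度多个模糊技能时,可在候选开始时间上对各技能的满意度求和,选择总满意度最大的时间点,这正是文中"基于满意度的最优调度"的直观含义。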

[NLP-16] Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data ECML-PKDD 2024

【速读】: 该论文试图解决当前最先进的自动语音识别系统(如Whisper)在不同性别群体中表现出的预测偏差问题。解决方案的关键在于通过分析荷兰语语音数据(来自Common Voice数据集和荷兰国家公共广播组织)的词错误率(WER)、字符错误率和基于BERT的语义相似性,来识别Whisper模型在性别群体间的性能差异。研究还采用了Weerts等(2022)的道德框架来评估服务质量损害和公平性,并深入讨论这些偏差对自动字幕生成的具体影响。

链接: https://arxiv.org/abs/2411.09431
作者: Rik Raes,Saskia Lensink,Mykola Pechenizkiy
关键词-EN: Automatic Speech Recognition, Recent research, Dutch National Public, National Public Broadcasting, Speech Recognition
类目: Computation and Language (cs.CL)
备注: Accepted at ECML PKDD 2024, 4th Workshop on Bias and Fairness in AI (BIAS)

点击查看摘要

Abstract:Recent research has shown that state-of-the-art (SotA) Automatic Speech Recognition (ASR) systems, such as Whisper, often exhibit predictive biases that disproportionately affect various demographic groups. This study focuses on identifying the performance disparities of Whisper models on Dutch speech data from the Common Voice dataset and the Dutch National Public Broadcasting organisation. We analyzed the word error rate, character error rate and a BERT-based semantic similarity across gender groups. We used the moral framework of Weerts et al. (2022) to assess quality of service harms and fairness, and to provide a nuanced discussion on the implications of these biases, particularly for automatic subtitling. Our findings reveal substantial disparities in word error rate (WER) among gender groups across all model sizes, with bias identified through statistical testing.
摘要:最近的研究表明,最先进的自动语音识别(Automatic Speech Recognition, ASR)系统,如Whisper,往往表现出预测偏差,这些偏差不成比例地影响着不同的群体。本研究专注于识别Whisper模型在来自Common Voice数据集和荷兰国家公共广播组织的荷兰语语音数据上的性能差异。我们分析了性别群体之间的词错误率(Word Error Rate, WER)、字符错误率(Character Error Rate, CER)以及基于BERT的语义相似性。我们采用了Weerts等人(2022)的道德框架来评估服务质量损害和公平性,并提供了关于这些偏差影响的细致讨论,特别是对于自动字幕生成的影响。我们的研究结果揭示了在所有模型规模中,性别群体之间的词错误率存在显著差异,并通过统计测试识别出偏差。
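文中比较性别群体差异所用的词错误率(WER)是标准指标:参考文本与识别结果之间的词级编辑距离除以参考词数。一个自包含的动态规划实现如下(仅为标准定义的示意,与论文所用的工具链无关):

```python
def word_error_rate(reference, hypothesis):
    """WER = (替换 + 插入 + 删除的最小次数) / 参考文本词数。"""
    ref, hyp = reference.split(), hypothesis.split()
    # 动态规划计算词级 Levenshtein 距离
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # 删除
                          d[i][j - 1] + 1,          # 插入
                          d[i - 1][j - 1] + cost)   # 替换/匹配
    return d[len(ref)][len(hyp)] / len(ref)
```

字符错误率(CER)的定义完全相同,只是把词换成字符作为基本单位。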

[NLP-17] Less is More: Unseen Domain Fake News Detection via Causal Propagation Substructures

【速读】: 该论文试图解决的是在面对新兴或之前未见过的领域(即out-of-distribution (OOD)数据)时,现有基于文本和图的假新闻检测模型性能受限的问题。解决方案的关键在于提出了因果子图导向的领域自适应假新闻检测模型(Causal Subgraph-oriented Domain Adaptive Fake News Detection, CSDA),通过从已知分布数据中提取因果子结构,并将其泛化应用于OOD数据,从而增强零样本假新闻检测能力。CSDA模型利用图神经网络(Graph Neural Network)生成掩码的过程来识别传播图中的主导节点和边,并基于这些子结构进行假新闻检测。此外,在少样本场景下,通过对比学习进一步提升了模型的性能。实验结果表明,CSDA在处理OOD假新闻检测方面显著优于其他最先进的模型,准确率提高了7%到16%。

链接: https://arxiv.org/abs/2411.09389
作者: Shuzhi Gong,Richard O. Sinnott,Jianzhong Qi,Cecile Paris
关键词-EN: poses significant threats, fake news detection, media poses significant, fake, individuals and society
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 9 pages, 2 figures, 5 tables

点击查看摘要

Abstract:The spread of fake news on social media poses significant threats to individuals and society. Text-based and graph-based models have been employed for fake news detection by analysing news content and propagation networks, showing promising results in specific scenarios. However, these data-driven models heavily rely on pre-existing in-distribution data for training, limiting their performance when confronted with fake news from emerging or previously unseen domains, known as out-of-distribution (OOD) data. Tackling OOD fake news is a challenging yet critical task. In this paper, we introduce the Causal Subgraph-oriented Domain Adaptive Fake News Detection (CSDA) model, designed to enhance zero-shot fake news detection by extracting causal substructures from propagation graphs using in-distribution data and generalising this approach to OOD data. The model employs a graph neural network based mask generation process to identify dominant nodes and edges within the propagation graph, using these substructures for fake news detection. Additionally, the performance of CSDA is further improved through contrastive learning in few-shot scenarios, where a limited amount of OOD data is available for training. Extensive experiments on public social media datasets demonstrate that CSDA effectively handles OOD fake news detection, achieving a 7 to 16 percent accuracy improvement over other state-of-the-art models.
摘要:社交媒体上虚假新闻的传播对个人和社会构成了重大威胁。基于文本和图的模型通过分析新闻内容和传播网络来进行虚假新闻检测,在特定场景下展示了良好的效果。然而,这些数据驱动的模型严重依赖于预先存在的分布内数据进行训练,当面对来自新兴或之前未见领域的虚假新闻(即分布外数据,Out-of-Distribution, OOD)时,其性能受到限制。解决分布外虚假新闻检测是一个具有挑战性但至关重要的任务。本文中,我们提出了因果子图导向的领域自适应虚假新闻检测模型(Causal Subgraph-oriented Domain Adaptive Fake News Detection, CSDA),旨在通过从分布内数据中提取因果子结构并将其推广到分布外数据,从而增强零样本虚假新闻检测。该模型采用基于图神经网络的掩码生成过程来识别传播图中的主导节点和边,并利用这些子结构进行虚假新闻检测。此外,在少样本场景下,通过对比学习进一步提升了CSDA的性能,其中有限数量的分布外数据可用于训练。在公开的社交媒体数据集上的广泛实验表明,CSDA有效地处理了分布外虚假新闻检测,相比其他最先进的模型,准确率提高了7%至16%。
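CSDA 中"保留传播图主导子结构"这一步可以用如下草图直观理解:给定每个节点的重要性得分(论文中由图神经网络掩码生成器产生,这里作为输入假设),保留得分最高的一部分节点及其诱导出的边:

```python
def extract_dominant_subgraph(edges, node_scores, keep_ratio=0.5):
    """按得分保留比例最高的节点,并返回其诱导子图的边。
    节点得分在论文中由 GNN 掩码生成器给出,此处作为输入假设。"""
    k = max(1, int(len(node_scores) * keep_ratio))
    kept = set(sorted(node_scores, key=node_scores.get, reverse=True)[:k])
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sub_edges
```

随后的虚假新闻分类只在这个子结构上进行,其动机是:主导的因果子结构比整张传播图更容易跨领域泛化。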

[NLP-18] Re-Parameterization of Lightweight Transformer for On-Device Speech Emotion Recognition

【速读】: 该论文试图解决在资源受限的物联网(IoT)设备上部署复杂Transformer模型的问题。解决方案的关键在于提出了一种名为Transformer重参数化(Transformer Re-parameterization)的新方法,该方法包括两个主要过程:训练阶段的高秩分解(High-Rank Factorization, HRF)过程和推理阶段的去高秩分解(deHigh-Rank Factorization, deHRF)过程。在训练阶段,通过在轻量级Transformer的Feed-Forward Network (FFN)前插入额外的线性层(HRF层)来增强模型的学习能力;在推理阶段,将辅助的HRF层与后续的FFN层合并为一个线性层,从而恢复轻量级模型的原始结构。这种方法在多个Transformer变体(如ConvTransformer、Conformer和SpeechFormer)的语音情感识别任务中进行了验证,实验结果表明,该方法显著提升了轻量级Transformer的性能,使其在资源受限的IoT设备上部署成为可能。

链接: https://arxiv.org/abs/2411.09339
作者: Zixing Zhang,Zhongren Dong,Weixiang Xu,Jing Han
关键词-EN: devices remains challenging, IoT devices remains, resource-constrained IoT devices, IoT devices, Transformer models
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:With the increasing implementation of machine learning models on edge or Internet-of-Things (IoT) devices, deploying advanced models on resource-constrained IoT devices remains challenging. Transformer models, a currently dominant neural architecture, have achieved great success in broad domains but their complexity hinders its deployment on IoT devices with limited computation capability and storage size. Although many model compression approaches have been explored, they often suffer from notorious performance degradation. To address this issue, we introduce a new method, namely Transformer Re-parameterization, to boost the performance of lightweight Transformer models. It consists of two processes: the High-Rank Factorization (HRF) process in the training stage and the deHigh-Rank Factorization (deHRF) process in the inference stage. In the former process, we insert an additional linear layer before the Feed-Forward Network (FFN) of the lightweight Transformer. It is supposed that the inserted HRF layers can enhance the model learning capability. In the latter process, the auxiliary HRF layer will be merged together with the following FFN layer into one linear layer and thus recover the original structure of the lightweight model. To examine the effectiveness of the proposed method, we evaluate it on three widely used Transformer variants, i.e., ConvTransformer, Conformer, and SpeechFormer networks, in the application of speech emotion recognition on the IEMOCAP, M3ED and DAIC-WOZ datasets. Experimental results show that our proposed method consistently improves the performance of lightweight Transformers, even making them comparable to large models. The proposed re-parameterization approach enables advanced Transformer models to be deployed on resource-constrained IoT devices.
摘要:随着机器学习模型在边缘设备或物联网(Internet-of-Things, IoT)设备上的应用日益增多,在资源受限的IoT设备上部署高级模型仍然面临挑战。Transformer模型作为当前主导的神经网络架构,在多个领域取得了巨大成功,但其复杂性阻碍了其在计算能力和存储空间有限的IoT设备上的部署。尽管已有多种模型压缩方法被探索,但它们往往伴随着显著的性能下降。为解决这一问题,我们提出了一种新的方法,即Transformer重参数化(Transformer Re-parameterization),以提升轻量级Transformer模型的性能。该方法包括两个过程:训练阶段的高秩分解(High-Rank Factorization, HRF)过程和推理阶段的去高秩分解(deHigh-Rank Factorization, deHRF)过程。在前一过程中,我们在轻量级Transformer的Feed-Forward Network (FFN)之前插入一个额外的线性层。假设插入的HRF层能够增强模型的学习能力。在后一过程中,辅助的HRF层将与后续的FFN层合并为一个线性层,从而恢复轻量级模型的原始结构。为验证所提出方法的有效性,我们在三个广泛使用的Transformer变体(即ConvTransformer、Conformer和SpeechFormer网络)上进行了实验,这些变体应用于IEMOCAP、M3ED和DAIC-WOZ数据集上的语音情感识别任务。实验结果表明,我们提出的方法持续提升了轻量级Transformer的性能,甚至使其与大型模型相媲美。该重参数化方法使得高级Transformer模型能够在资源受限的IoT设备上部署。
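deHRF 过程之所以能无损恢复原始结构,依据是线性层(忽略偏置、且两层之间无非线性激活时)满足矩阵乘法结合律:W2(W1 x) = (W2 W1) x。下面用纯 Python 的小矩阵验证这一原理(仅为示意,非论文代码):

```python
def matmul(A, B):
    """朴素矩阵乘法(示意用)。"""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# 训练期:输入先过插入的 HRF 层 W1,再过 FFN 的线性层 W2
W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[0.5, 0.0], [1.0, 1.0]]
x = [[3.0], [4.0]]

two_step = matmul(W2, matmul(W1, x))

# 推理期:提前把两层折叠为单个线性层 W_merged = W2 @ W1
W_merged = matmul(W2, W1)
one_step = matmul(W_merged, x)
```

两种计算路径得到完全相同的输出,因此推理时合并后的轻量级模型与训练期插入 HRF 层的模型在功能上等价;合并成立的前提是被合并的两层之间没有非线性激活。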

[NLP-19] DriveThru: a Document Extraction Platform and Benchmark Datasets for Indonesian Local Language Archives

【速读】: 该论文试图解决印度尼西亚语言在自然语言处理(NLP)研究和技术中的代表性不足问题,特别是由于大多数现有资源是手动创建的,难以扩展到更多语言。解决方案的关键在于提出了一种通过数字化现有印刷文档来创建数据集的替代方法。具体来说,论文介绍了DriveThru平台,该平台利用光学字符识别(OCR)技术从文档中提取内容,从而减少手动工作和成本。此外,论文还研究了当前最先进的语言模型(LLM)在OCR后校正中的应用,以提高字符准确率(CAR)和词准确率(WAR)。

链接: https://arxiv.org/abs/2411.09318
作者: MohammadRifqi Farhansyah,Muhammad Zuhdi Fikri Johari,Afinzaki Amiral,Ayu Purwarianti,Kumara Ari Yuana,Derry Tanti Wijaya
关键词-EN: diverse countries linguistically, Natural Language Processing, Indonesian languages, countries linguistically, diverse countries
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Indonesia is one of the most diverse countries linguistically. However, despite this linguistic diversity, Indonesian languages remain underrepresented in Natural Language Processing (NLP) research and technologies. In the past two years, several efforts have been conducted to construct NLP resources for Indonesian languages. However, most of these efforts have been focused on creating manual resources thus difficult to scale to more languages. Although many Indonesian languages do not have a web presence, locally there are resources that document these languages well in printed forms such as books, magazines, and newspapers. Digitizing these existing resources will enable scaling of Indonesian language resource construction to many more languages. In this paper, we propose an alternative method of creating datasets by digitizing documents, which have not previously been used to build digital language resources in Indonesia. DriveThru is a platform for extracting document content utilizing Optical Character Recognition (OCR) techniques in its system to provide language resource building with less manual effort and cost. This paper also studies the utility of current state-of-the-art LLM for post-OCR correction to show the capability of increasing the character accuracy rate (CAR) and word accuracy rate (WAR) compared to off-the-shelf OCR.
摘要:印度尼西亚是语言多样性最为丰富的国家之一。然而,尽管拥有如此丰富的语言多样性,印度尼西亚语言在自然语言处理 (NLP) 研究和技术的应用中仍然处于边缘地位。过去两年中,虽然已经进行了多项努力来构建印度尼西亚语言的 NLP 资源,但大多数这些努力都集中在创建手工资源上,因此难以扩展到更多语言。尽管许多印度尼西亚语言在网络上没有存在感,但在当地,这些语言在书籍、杂志和报纸等印刷形式中得到了很好的记录。数字化这些现有资源将有助于将印度尼西亚语言资源的构建扩展到更多语言。本文提出了一种通过数字化文档来创建数据集的替代方法,这些文档之前并未用于在印度尼西亚构建数字语言资源。DriveThru 是一个利用光学字符识别 (OCR) 技术提取文档内容的平台,旨在以更少的劳动力和成本进行语言资源构建。本文还研究了当前最先进的大语言模型 (LLM) 在 OCR 后校正中的效用,以展示其在提高字符准确率 (CAR) 和词准确率 (WAR) 方面的能力,相比于现成的 OCR 技术。
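论文评估 OCR 后校正所用的字符准确率(CAR)可定义为 1 − 字符级编辑距离/参考字符数;词准确率(WAR)同理,只是以词为基本单位。一个滚动数组的纯 Python 示意如下(这是该指标的常见写法,并非论文的原始实现):

```python
def char_accuracy_rate(reference, ocr_output):
    """CAR = 1 - 字符级编辑距离 / 参考文本字符数。"""
    # 滚动数组版 Levenshtein 距离:prev 保存上一行的距离
    prev = list(range(len(ocr_output) + 1))
    for i, rc in enumerate(reference, 1):
        cur = [i]
        for j, oc in enumerate(ocr_output, 1):
            cur.append(min(prev[j] + 1,              # 删除
                           cur[j - 1] + 1,           # 插入
                           prev[j - 1] + (rc != oc)))  # 替换/匹配
        prev = cur
    return 1 - prev[-1] / len(reference)
```

LLM 后校正的效果即体现为校正后文本相对 OCR 原始输出在 CAR 与 WAR 上的提升。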

[NLP-20] DTELS: Towards Dynamic Granularity of Timeline Summarization

【速读】: 该论文试图解决在线新闻快速增长背景下,传统时间线摘要(Timeline Summarization)缺乏灵活性以满足多样粒度需求的问题。解决方案的关键在于引入动态粒度时间线摘要(Dynamic-granularity TimELine Summarization, DTELS),通过用户指令或需求构建适应性时间线。论文提出了一个全面的基准,包括基于新闻标准的评估框架、大规模多源数据集以及基于大语言模型(Large Language Models, LLMs)和现有最先进方法的实验分析,以验证LLM解决方案的有效性,同时揭示了DTELS任务的挑战性。

链接: https://arxiv.org/abs/2411.09297
作者: Chenlong Zhang,Tong Zhou,Pengfei Cao,Zhuoran Jin,Yubo Chen,Kang Liu,Jun Zhao
关键词-EN: posed significant challenges, Dynamic-granularity TimELine Summarization, rapid proliferation, proliferation of online, posed significant
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:The rapid proliferation of online news has posed significant challenges in tracking the continuous development of news topics. Traditional timeline summarization constructs a chronological summary of the events but often lacks the flexibility to meet the diverse granularity needs. To overcome this limitation, we introduce a new paradigm, Dynamic-granularity TimELine Summarization (DTELS), which aims to construct adaptive timelines based on user instructions or requirements. This paper establishes a comprehensive benchmark for DTELS that includes: (1) an evaluation framework grounded in journalistic standards to assess the timeline quality across four dimensions: Informativeness, Granular Consistency, Factuality, and Coherence; (2) a large-scale, multi-source dataset with multiple granularity timeline annotations based on a consensus process to facilitate authority; (3) extensive experiments and analysis with two proposed solutions based on Large Language Models (LLMs) and existing state-of-the-art TLS methods. The experimental results demonstrate the effectiveness of LLM-based solutions. However, even the most advanced LLMs struggle to consistently generate timelines that are both informative and granularly consistent, highlighting the challenges of the DTELS task.
摘要:在线新闻的快速扩散给追踪新闻话题的持续发展带来了重大挑战。传统的时序摘要构建了事件的时间顺序总结,但往往缺乏满足多样化粒度需求的灵活性。为了克服这一局限,我们提出了一种新的范式——动态粒度时序摘要 (Dynamic-granularity TimELine Summarization, DTELS),旨在根据用户指令或需求构建适应性时序。本文为 DTELS 建立了一个全面的基准,包括:(1) 基于新闻标准的评估框架,用于评估时序质量的四个维度:信息量、粒度一致性、事实性和连贯性;(2) 一个大规模、多源的数据集,包含基于共识过程的多粒度时序标注,以促进权威性;(3) 基于大语言模型 (LLMs) 和现有最先进的时序摘要 (TLS) 方法的两种解决方案的广泛实验和分析。实验结果表明,基于 LLM 的解决方案具有有效性。然而,即使是最先进的 LLM 也难以持续生成既信息丰富又粒度一致的时序,突显了 DTELS 任务的挑战性。

[NLP-21] StreamAdapter: Efficient Test Time Adaptation from Contextual Streams

【速读】: 该论文试图解决大语言模型(LLMs)在上下文学习(ICL)过程中,随着上下文窗口的扩展导致推理成本增加而性能提升有限的问题。解决方案的关键是提出了StreamAdapter,这是一种新颖的方法,能够在测试时直接从上下文中更新模型参数,从而消除对显式上下文示例的依赖。StreamAdapter通过上下文映射和权重吸收机制,将ICL示例动态转换为参数更新,仅需极少的额外参数。这种方法显著降低了推理成本,并实现了与示例数量无关的常数时间复杂度推理。实验结果表明,StreamAdapter在多种任务和模型架构上均表现出与ICL相当或更优的适应能力,同时所需的示例数量显著减少,为LLMs在测试时的上下文适应提供了更高效和成本效益更高的解决方案。

链接: https://arxiv.org/abs/2411.09289
作者: Dilxat Muhtar,Yelong Shen,Yaming Yang,Xiaodong Liu,Yadong Lu,Jianfeng Liu,Yuefeng Zhan,Hao Sun,Weiwei Deng,Feng Sun,Xueliang Zhang,Jianfeng Gao,Weizhu Chen,Qi Zhang
关键词-EN: In-context learning, ICL, demonstrations, context, StreamAdapter
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 Pages, 9 Figures

点击查看摘要

Abstract:In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks directly from the given demonstrations without requiring gradient updates. While recent advances have expanded context windows to accommodate more demonstrations, this approach increases inference costs without necessarily improving performance. To mitigate these issues, we propose StreamAdapter, a novel approach that directly updates model parameters from context at test time, eliminating the need for explicit in-context demonstrations. StreamAdapter employs context mapping and weight absorption mechanisms to dynamically transform ICL demonstrations into parameter updates with minimal additional parameters. By reducing reliance on numerous in-context examples, StreamAdapter significantly reduces inference costs and allows for efficient inference with constant time complexity, regardless of demonstration count. Extensive experiments across diverse tasks and model architectures demonstrate that StreamAdapter achieves comparable or superior adaptation capability to ICL while requiring significantly fewer demonstrations. The superior task adaptation and context encoding capabilities of StreamAdapter on both language understanding and generation tasks provide a new perspective for adapting LLMs at test time using context, allowing for more efficient adaptation across scenarios and more cost-effective inference.
摘要:上下文学习(In-context Learning, ICL)使得大语言模型(LLMs)能够直接从给定的示例中适应新任务,而无需进行梯度更新。尽管最近的进展扩大了上下文窗口以容纳更多示例,但这种方法增加了推理成本,并不一定能提升性能。为了缓解这些问题,我们提出了StreamAdapter,这是一种新颖的方法,它能够在测试时直接从上下文中更新模型参数,从而消除了对显式上下文示例的需求。StreamAdapter采用上下文映射和权重吸收机制,以最小的额外参数将ICL示例动态转化为参数更新。通过减少对大量上下文示例的依赖,StreamAdapter显著降低了推理成本,并允许在恒定时间复杂度下进行高效推理,无论示例数量多少。在多种任务和模型架构上的广泛实验表明,StreamAdapter在需要显著更少示例的情况下,实现了与ICL相当甚至更优的适应能力。StreamAdapter在语言理解和生成任务上的卓越任务适应和上下文编码能力,为在测试时利用上下文适应LLMs提供了新的视角,使得在各种场景下实现更高效的适应和更具成本效益的推理成为可能。

[NLP-22] Cross-Modal Consistency in Multimodal Large Language Models

【速读】: 该论文试图解决的问题是现有研究在评估视觉大语言模型(Vision Large Language Models, VLLMs)时,往往忽略了不同模态(如文本、视觉)之间的跨模态交互,导致无法全面理解这些模型在处理多模态任务时的表现。解决方案的关键在于引入了一个新的概念——跨模态一致性(cross-modal consistency),并基于此概念提出了一个定量评估框架。通过实验,论文揭示了GPT-4V在视觉和语言模态之间存在显著的不一致性,尽管它被描述为一个统一的多模态模型。这一发现为模型的合理使用提供了见解,并指出了未来改进模型设计的潜在方向。

链接: https://arxiv.org/abs/2411.09273
作者: Xiang Zhang,Senyu Li,Ning Shi,Bradley Hauer,Zijun Wu,Grzegorz Kondrak,Muhammad Abdul-Mageed,Laks V.S. Lakshmanan
关键词-EN: diverse data types, processing diverse data, Recent developments, encompassing text, data types
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent developments in multimodal methodologies have marked the beginning of an exciting era for models adept at processing diverse data types, encompassing text, audio, and visual content. Models like GPT-4V, which merge computer vision with advanced language processing, exhibit extraordinary proficiency in handling intricate tasks that require a simultaneous understanding of both textual and visual information. Prior research efforts have meticulously evaluated the efficacy of these Vision Large Language Models (VLLMs) in various domains, including object detection, image captioning, and other related fields. However, existing analyses have often suffered from limitations, primarily centering on the isolated evaluation of each modality’s performance while neglecting to explore their intricate cross-modal interactions. Specifically, the question of whether these models achieve the same level of accuracy when confronted with identical task instances across different modalities remains unanswered. In this study, we take the initiative to delve into the interaction and comparison among these modalities of interest by introducing a novel concept termed cross-modal consistency. Furthermore, we propose a quantitative evaluation framework founded on this concept. Our experimental findings, drawn from a curated collection of parallel vision-language datasets developed by us, unveil a pronounced inconsistency between the vision and language modalities within GPT-4V, despite its portrayal as a unified multimodal model. Our research yields insights into the appropriate utilization of such models and hints at potential avenues for enhancing their design.
摘要:近年来,多模态方法的进展标志着模型处理多样化数据类型(包括文本、音频和视觉内容)的激动人心的时代的开始。像 GPT-4V 这样的模型,结合了计算机视觉与高级语言处理,在处理需要同时理解文本和视觉信息的复杂任务时表现出非凡的能力。先前的研究已经细致地评估了这些视觉大语言模型 (VLLMs) 在多个领域(如物体检测、图像描述生成等)中的有效性。然而,现有的分析往往存在局限性,主要集中在孤立地评估每种模态的性能,而忽视了它们之间复杂的跨模态交互。具体来说,这些模型在面对不同模态的相同任务实例时是否能达到相同的准确度,这一问题仍未得到解答。在本研究中,我们率先通过引入一个称为跨模态一致性的新概念,深入探讨了这些感兴趣模态之间的交互与比较。此外,我们基于这一概念提出了一种定量评估框架。我们从我们精心策划的平行视觉-语言数据集中得出的实验结果显示,尽管 GPT-4V 被描述为一个统一的多模态模型,但其视觉和语言模态之间存在显著的不一致性。我们的研究为这些模型的适当应用提供了见解,并指出了其设计改进的潜在方向。
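跨模态一致性的一个直观量化方式是:同一批任务实例分别以文本和视觉形式呈现时,模型给出相同答案的比例(论文中的正式定义以原文为准,这里仅作示意):

```python
def cross_modal_consistency(text_answers, vision_answers):
    """同一批任务实例在文本/视觉两种模态下答案一致的比例。"""
    assert len(text_answers) == len(vision_answers)
    agree = sum(t == v for t, v in zip(text_answers, vision_answers))
    return agree / len(text_answers)
```

注意该比例衡量的是两种模态之间的一致性而非各自的正确率:即使模型在两种模态下的准确率相同,逐实例的答案也可能大量不一致,这正是论文在 GPT-4V 上观察到的现象。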

[NLP-23] Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey

【速读】: 该论文试图解决多模态生成模型(multimodal generative models)在面对越狱攻击(jailbreak attacks)时的安全问题。解决方案的关键在于系统性地探索和分类攻击方法及其对应的防御策略,涵盖输入、编码器、生成器和输出四个层次。论文通过详细分析多模态生成模型的生命周期,提出了针对多模态生成模型的攻击方法、防御机制和评估框架的分类体系,并涵盖了多种输入-输出配置,如Any-to-Text、Any-to-Vision和Any-to-Any。此外,论文还指出了当前研究中的挑战,并提出了未来研究的可能方向。

链接: https://arxiv.org/abs/2411.09259
作者: Xuannan Liu,Xing Cui,Peipei Li,Zekun Li,Huaibo Huang,Shuhan Xia,Miaoxuan Zhang,Yueying Zou,Ran He
关键词-EN: multimodal generative models, multimodal foundation models, rapid evolution, led to significant, significant advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: ongoing work

点击查看摘要

Abstract:The rapid evolution of multimodal foundation models has led to significant advancements in cross-modal understanding and generation across diverse modalities, including text, images, audio, and video. However, these models remain susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the production of potentially harmful content. Consequently, understanding the methods of jailbreak attacks and existing defense mechanisms is essential to ensure the safe deployment of multimodal generative models in real-world scenarios, particularly in security-sensitive applications. To provide comprehensive insight into this topic, this survey reviews jailbreak and defense in multimodal generative models. First, given the generalized lifecycle of multimodal jailbreak, we systematically explore attacks and corresponding defense strategies across four levels: input, encoder, generator, and output. Based on this analysis, we present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models. Additionally, we cover a wide range of input-output configurations, including modalities such as Any-to-Text, Any-to-Vision, and Any-to-Any within generative systems. Finally, we highlight current research challenges and propose potential directions for future research. The open-source repository corresponding to this work can be found at this https URL.
摘要:多模态基础模型的快速发展显著推动了跨模态理解和生成的进步,涵盖了文本、图像、音频和视频等多种模态。然而,这些模型仍然容易受到越狱攻击,这种攻击可以绕过内置的安全机制,导致可能产生有害内容。因此,理解越狱攻击的方法和现有的防御机制对于确保多模态生成模型在现实场景中的安全部署至关重要,特别是在安全性敏感的应用中。为了全面深入地探讨这一主题,本综述回顾了多模态生成模型中的越狱攻击与防御。首先,鉴于多模态越狱的通用生命周期,我们系统地探讨了在输入、编码器、生成器和输出四个层级上的攻击及其相应的防御策略。基于这一分析,我们详细阐述了针对多模态生成模型的攻击方法、防御机制和评估框架的分类。此外,我们还涵盖了广泛的输入-输出配置,包括生成系统中的Any-to-Text、Any-to-Vision和Any-to-Any等多种模态。最后,我们指出了当前研究中的挑战,并提出了未来可能的研究方向。与本工作相关的开源仓库可以在以下链接找到。

[NLP-24] DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine EMNLP2024

【速读】: 该论文试图解决长文本生成中,特别是在生物医学领域内,大型语言模型(LLMs)产生的幻觉问题。解决方案的关键在于引入DAHL基准数据集和自动化评估系统,该系统通过将模型生成的文本分解为原子单位(atomic units),每个单位代表一个独立的信息片段,并对其准确性进行评估,从而计算出DAHL评分。这一评分方法相较于依赖多项选择任务的传统评估方法,提供了更深入的幻觉评估。此外,论文还发现,虽然较大规模的模型幻觉较少,但当模型规模超过70亿到80亿参数后,进一步扩大模型规模对事实准确性的提升效果有限。DAHL评分系统具有扩展到其他专业领域的潜力,并且可以作为人工标注偏好标签的高效替代方案。

链接: https://arxiv.org/abs/2411.09255
作者: Jean Seo,Jongwon Lim,Dongjun Jang,Hyopil Shin
关键词-EN: long-form text generation, automated evaluation system, evaluation system designed, text generation, benchmark dataset
类目: Computation and Language (cs.CL)
备注: EMNLP2024/FEVER

点击查看摘要

Abstract:We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information. The accuracy of these responses is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucinations compared to previous methods that rely on multiple-choice tasks. We conduct experiments with 8 different models, finding that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy. The DAHL Score holds potential as an efficient alternative to human-annotated preference labels, being able to be expanded to other specialized domains. We release the dataset and code in public.
摘要:我们引入了 DAHL,这是一个用于评估长篇文本生成中幻觉现象的基准数据集和自动化评估系统,特别针对生物医学领域。我们的基准数据集精心筛选自生物医学研究论文,包含 8,573 个问题,涵盖 29 个类别。DAHL 通过将大语言模型 (LLM) 的响应分解为原子单位,每个单位代表一条独立的信息,来评估事实冲突的幻觉现象。这些响应的准确性平均值构成了 DAHL 评分,相较于依赖多项选择任务的先前方法,提供了更深入的幻觉评估。我们使用 8 种不同的模型进行了实验,发现较大的模型幻觉现象较少;然而,当模型规模超过 70 亿到 80 亿参数后,进一步扩展并不会显著提高事实准确性。DAHL 评分具有作为人类标注偏好标签的高效替代品的潜力,并且可以扩展到其他专业领域。我们公开发布了数据集和代码。
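
DAHL 分数的聚合方式可以用几行代码直观说明:对每条回复,先对其分解出的原子单位逐一判定事实正确性并取平均,再对所有回复取平均。以下是纯 Python 的概念性示意(原子单位的抽取与判定由上游模型完成,数据与函数名均为本文演示所设,并非官方实现):

```python
def dahl_score(judged_responses):
    """judged_responses: 每个元素是一条回复的原子单位判定列表(True 表示事实正确)。"""
    per_response = [sum(units) / len(units) for units in judged_responses if units]
    return sum(per_response) / len(per_response)

judged = [
    [True, True, False],        # 回复 1:3 个原子单位,2 个正确
    [True, True, True, True],   # 回复 2:全部正确
]
print(round(dahl_score(judged), 3))  # 0.833
```

与多项选择题式的评估相比,这种按原子单位计分的方式能够定位回复中具体出错的信息片段。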

[NLP-25] Enhancing Financial Domain Adaptation of Language Models via Model Augmentation

【速读】: 该论文试图解决大型语言模型(LLMs)在金融领域的适应性问题,解决方案的关键在于引入Composition to Augment Language Models (CALM)。CALM通过在两个具有不同功能的LLM之间引入交叉注意力机制,扩展了现有模型的能力。具体来说,CALM利用一个具有强大响应能力的LLM和一个专门针对金融领域的LLM,通过训练使其适应不同的金融数据集。实验结果表明,CALM在量化和质化的金融基准测试中均表现优异,显著提升了模型在金融领域的响应质量,尤其是在连接模型中间层时效果最佳。这一方法证实了CALM在适应LLMs到金融领域的实际应用价值。

链接: https://arxiv.org/abs/2411.09249
作者: Kota Tanabe,Masanori Hirano,Kazuki Matoya,Kentaro Imajo,Hiroki Sakaji,Itsuki Noda
关键词-EN: including large language, Augment Language Models, large language models, language models, CALM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The domain adaptation of language models, including large language models (LLMs), has become increasingly important as the use of such models continues to expand. This study demonstrates the effectiveness of Composition to Augment Language Models (CALM) in adapting to the financial domain. CALM is a model to extend the capabilities of existing models by introducing cross-attention between two LLMs with different functions. In our experiments, we developed a CALM to enhance the financial performance of an LLM with strong response capabilities by leveraging a financial-specialized LLM. Notably, the CALM was trained using a financial dataset different from the one used to train the financial-specialized LLM, confirming CALM’s ability to adapt to various datasets. The models were evaluated through quantitative Japanese financial benchmarks and qualitative response comparisons, demonstrating that CALM enables superior responses with higher scores than the original models and baselines. Additionally, comparative experiments on connection points revealed that connecting the middle layers of the models is most effective in facilitating adaptation to the financial domain. These findings confirm that CALM is a practical approach for adapting LLMs to the financial domain.
摘要:随着语言模型(包括大语言模型 (LLM))的应用不断扩展,其领域适应性变得越来越重要。本研究展示了组合增强语言模型 (CALM) 在适应金融领域方面的有效性。CALM 是一种通过引入两个具有不同功能的 LLM 之间的交叉注意力来扩展现有模型能力的模型。在我们的实验中,我们开发了一个 CALM,通过利用一个专注于金融领域的 LLM,来增强具有强大响应能力的 LLM 的金融表现。值得注意的是,CALM 使用了一个与训练金融专用 LLM 不同的金融数据集进行训练,这证实了 CALM 能够适应各种数据集。模型通过定量的日本金融基准测试和定性的响应比较进行评估,结果表明 CALM 能够提供比原始模型和基线更高的分数的优越响应。此外,关于连接点的比较实验表明,连接模型的中间层在促进适应金融领域方面最为有效。这些发现证实了 CALM 是适应 LLM 到金融领域的实用方法。
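
CALM 在两个 LLM 的中间层之间引入交叉注意力:以锚模型的隐状态为查询 (query),以金融专用模型对应层的隐状态为键/值 (key/value)。下面用纯 Python 给出单头、无投影矩阵的极简示意(维度与数值均为演示假设,并非论文实现):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """queries 来自锚模型的某中间层;keys/values 来自增强模型的对应层。"""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

q = [[1.0, 0.0]]              # 锚模型 1 个 Token 的隐状态
k = [[1.0, 0.0], [0.0, 1.0]]  # 增强模型 2 个 Token 的隐状态
v = [[1.0, 1.0], [0.0, 0.0]]
print(cross_attention(q, k, v))  # 输出偏向与 q 更相似的第一个 Token 的值
```

论文的对比实验表明连接模型中间层最有效,上述机制即作用于该连接点。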

[NLP-26] HateGPT: Unleashing GPT-3.5 Turbo to Combat Hate Speech on X

【速读】: 该论文试图解决社交媒体平台上仇恨言论和冒犯性内容的自动检测问题。解决方案的关键在于利用先进的生成式 AI 模型(如 GPT-3.5 Turbo)通过提示(prompting)对英文推文进行分类,将其分为“仇恨和冒犯性”与“非仇恨和冒犯性”两类。研究通过评估模型的 Macro-F1 分数来衡量其性能,结果显示模型在多次运行中表现出高度的稳定性和一致性,Macro-F1 分数分别为 0.756、0.751 和 0.754,表明模型在精确度和召回率方面均表现出色。

链接: https://arxiv.org/abs/2411.09214
作者: Aniket Deroy,Subhankar Maity
关键词-EN: Twitter and Facebook, social media platforms, Facebook has enabled, thoughts and experiences, social media
类目: Computation and Language (cs.CL)
备注: Accepted at FIRE 2024 (Track: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages (HASOC)). arXiv admin note: text overlap with arXiv:2411.05039 , arXiv:2411.06946

点击查看摘要

Abstract:The widespread use of social media platforms like Twitter and Facebook has enabled people of all ages to share their thoughts and experiences, leading to an immense accumulation of user-generated content. However, alongside the benefits, these platforms also face the challenge of managing hate speech and offensive content, which can undermine rational discourse and threaten democratic values. As a result, there is a growing need for automated methods to detect and mitigate such content, especially given the complexity of conversations that may require contextual analysis across multiple languages, including code-mixed languages like Hinglish, German-English, and Bangla. We participated in the English task where we have to classify English tweets into two categories namely Hate and Offensive and Non Hate-Offensive. In this work, we experiment with state-of-the-art large language models like GPT-3.5 Turbo via prompting to classify tweets into Hate and Offensive or Non Hate-Offensive. In this study, we evaluate the performance of a classification model using Macro-F1 scores across three distinct runs. The Macro-F1 score, which balances precision and recall across all classes, is used as the primary metric for model evaluation. The scores obtained are 0.756 for run 1, 0.751 for run 2, and 0.754 for run 3, indicating a high level of performance with minimal variance among the runs. The results suggest that the model consistently performs well in terms of precision and recall, with run 1 showing the highest performance. These findings highlight the robustness and reliability of the model across different runs.
摘要:社交媒体平台如 Twitter 和 Facebook 的广泛使用,使得各个年龄段的人们能够分享他们的想法和经历,从而积累了大量的用户生成内容。然而,这些平台在带来好处的同时,也面临着管理仇恨言论和冒犯性内容的问题,这些内容可能破坏理性讨论并威胁民主价值观。因此,对于自动检测和缓解此类内容的需求日益增长,尤其是在需要跨多种语言进行上下文分析的复杂对话中,包括像 Hinglish、德英混合和孟加拉语等代码混合语言。我们参与了英语任务,需要将英语推文分类为仇恨和冒犯性内容与非仇恨和冒犯性内容。在这项工作中,我们通过提示方式,使用如 GPT-3.5 Turbo 等先进的大语言模型来对推文进行分类。在本研究中,我们使用 Macro-F1 分数评估分类模型的性能,进行了三次独立的运行。Macro-F1 分数作为模型评估的主要指标,平衡了所有类别的精确率和召回率。获得的分数分别为:第一次运行 0.756,第二次运行 0.751,第三次运行 0.754,显示出高水平的性能且运行间的差异极小。结果表明,该模型在精确率和召回率方面表现一致良好,其中第一次运行显示出最高的性能。这些发现突显了模型在不同运行中的稳健性和可靠性。
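
文中作为主要指标的 Macro-F1 先按类别分别计算 F1,再取未加权平均,因此对少数类同样敏感。以下为纯 Python 的计算示意(示例标签为演示假设):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """逐类别计算 F1 后取未加权平均。"""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["HOF", "HOF", "NOT", "NOT", "NOT"]  # HOF=仇恨/冒犯,NOT=非仇恨/冒犯
y_pred = ["HOF", "NOT", "NOT", "NOT", "HOF"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.583
```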

[NLP-27] Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering

【速读】: 该论文试图解决在医疗领域中,现有检索增强生成(Retrieval-augmented Generation, RAG)基准测试在评估知识密集型任务(如医疗问答)时,未能充分考虑实际应用场景中对系统准确性和可靠性的关键需求的问题。解决方案的关键在于提出了一个全面的评估框架——医疗检索增强生成基准(Medical Retrieval-Augmented Generation Benchmark, MedRGB),该框架通过引入多种补充元素,针对四个医疗问答数据集,测试大语言模型(Large Language Models, LLMs)在处理特定场景(如充分性、集成性和鲁棒性)的能力。通过MedRGB,论文对当前最先进的商业LLMs和开源模型进行了广泛的评估,揭示了这些模型在处理检索文档中的噪声和错误信息方面的局限性,并提供了对LLMs推理过程的深入分析,为未来在医疗领域开发RAG系统提供了宝贵的见解和方向。

链接: https://arxiv.org/abs/2411.09213
作者: Nghia Trung Ngo,Chien Van Nguyen,Franck Dernoncourt,Thien Huu Nguyen
关键词-EN: large language models, promising approach, approach to enhance, enhance the performance, performance of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those from the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets for testing LLMs’ ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models’ limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs’ reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.
摘要:检索增强生成 (Retrieval-augmented generation, RAG) 作为一种有前景的方法,旨在提升大语言模型 (Large Language Models, LLMs) 在医疗领域等知识密集型任务中的表现。然而,医疗领域的敏感性要求系统必须完全准确且可信。尽管现有的 RAG 基准主要集中在标准的检索-回答设置上,但它们忽略了评估可靠医疗系统关键方面的许多实际场景。本文通过提供一个全面的评估框架,填补了这一空白,该框架针对这些情况下的医疗问答 (Question-Answering, QA) 系统,包括充分性、集成性和鲁棒性。我们引入了医疗检索增强生成基准 (Medical Retrieval-Augmented Generation Benchmark, MedRGB),该基准为四个医疗 QA 数据集提供了多种补充元素,用于测试 LLMs 处理这些特定场景的能力。利用 MedRGB,我们对多个检索条件下的最先进商业 LLMs 和开源模型进行了广泛评估。我们的实验结果揭示了当前模型在处理检索文档中的噪声和错误信息方面的有限能力。我们进一步分析了 LLMs 的推理过程,为在该关键医疗领域开发 RAG 系统提供了宝贵的见解和未来方向。

[NLP-28] Unstructured Text Enhanced Open-domain Dialogue System: A Systematic Survey

【速读】: 该论文试图解决将外部知识整合到开放域对话系统中以提升其性能的问题,特别是通过使用非结构化文本作为外部知识源的开放域对话系统(Unstructured Text Enhanced Dialogue System, UTEDS)。解决方案的关键在于分析UTEDS与传统数据驱动对话系统之间的区别,并从模型组件的角度对UTEDS进行分类和介绍。具体来说,UTEDS被分为检索模型和生成模型两大类,检索模型包括融合、匹配和排序模块,而生成模型则包括对话和知识编码、知识选择以及响应生成模块。论文还总结了UTEDS的评估方法,并分析了当前模型的性能,最后讨论了UTEDS的未来发展趋势。

链接: https://arxiv.org/abs/2411.09166
作者: Longxuan Ma,Mingda Li,Weinan Zhang,Jiapeng Li,Ting Liu
关键词-EN: Incorporating external knowledge, open-domain Dialogue System, controlling conversation topics, Incorporating external, textbf
类目: Computation and Language (cs.CL)
备注: 45 pages, 3 Figures, 11 Tables

点击查看摘要

Abstract:Incorporating external knowledge into dialogue generation has been proven to benefit the performance of an open-domain Dialogue System (DS), such as generating informative or stylized responses, controlling conversation topics. In this article, we study the open-domain DS that uses unstructured text as external knowledge sources (Unstructured Text Enhanced Dialogue System, UTEDS). The existence of unstructured text entails distinctions between UTEDS and traditional data-driven DS and we aim to analyze these differences. We first give the definition of the UTEDS related concepts, then summarize the recently released datasets and models. We categorize UTEDS into Retrieval and Generative models and introduce them from the perspective of model components. The retrieval models consist of Fusion, Matching, and Ranking modules, while the generative models comprise Dialogue and Knowledge Encoding, Knowledge Selection, and Response Generation modules. We further summarize the evaluation methods utilized in UTEDS and analyze the current models’ performance. At last, we discuss the future development trends of UTEDS, hoping to inspire new research in this field.
摘要:将外部知识融入对话生成已被证明有利于开放领域对话系统 (Dialogue System, DS) 的性能,例如生成信息丰富或风格化的回复,控制对话话题等。本文研究了使用非结构化文本作为外部知识源的开放领域对话系统(非结构化文本增强对话系统,Unstructured Text Enhanced Dialogue System, UTEDS)。非结构化文本的存在使得 UTEDS 与传统数据驱动的 DS 之间存在显著差异,我们的目标是分析这些差异。首先,我们给出了与 UTEDS 相关概念的定义,然后总结了最近发布的数据集和模型。我们将 UTEDS 分为检索模型和生成模型,并从模型组件的角度进行介绍。检索模型包括融合、匹配和排序模块,而生成模型则包括对话与知识编码、知识选择和回复生成模块。我们进一步总结了 UTEDS 中使用的评估方法,并分析了当前模型的性能。最后,我们讨论了 UTEDS 的未来发展趋势,希望能激发该领域的新研究。

[NLP-29] DROJ: A Prompt-Driven Attack against Large Language Models

【速读】: 该论文试图解决大语言模型 (Large Language Models, LLMs) 在面对对抗性攻击时容易生成有害内容的问题。解决方案的关键在于提出了一种名为定向表示优化越狱 (Directed Representation Optimization Jailbreak, DROJ) 的新方法,该方法通过在嵌入层优化越狱提示,将有害查询的隐藏表示向更可能引发模型肯定响应的方向调整。实验结果表明,DROJ 在 LLaMA-2-7b-chat 模型上实现了 100% 的关键词攻击成功率 (Attack Success Rate, ASR),有效避免了模型的直接拒绝响应,但偶尔会产生重复和非信息性的回答。为解决这一问题,论文还引入了一个有助于提升模型响应实用性的系统提示。

链接: https://arxiv.org/abs/2411.09125
作者: Leyang Hu,Boran Wang
关键词-EN: Large Language Models, language processing tasks, natural language processing, Large Language, demonstrated exceptional capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Due to their training on internet-sourced datasets, LLMs can sometimes generate objectionable content, necessitating extensive alignment with human feedback to avoid such outputs. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks, which usually are manipulated prompts designed to circumvent safety mechanisms and elicit harmful responses. Here, we introduce a novel approach, Directed Representation Optimization Jailbreak (DROJ), which optimizes jailbreak prompts at the embedding level to shift the hidden representations of harmful queries towards directions that are more likely to elicit affirmative responses from the model. Our evaluations on LLaMA-2-7b-chat model show that DROJ achieves a 100% keyword-based Attack Success Rate (ASR), effectively preventing direct refusals. However, the model occasionally produces repetitive and non-informative responses. To mitigate this, we introduce a helpfulness system prompt that enhances the utility of the model’s responses. Our code is available at this https URL.
摘要:大语言模型 (LLM) 在各种自然语言处理任务中展现了卓越的能力。由于其训练数据来源于互联网,LLM 有时会生成不当内容,因此需要与人类反馈进行广泛的校准以避免此类输出。尽管进行了大量的校准工作,LLM 仍然容易受到对抗性越狱攻击,这些攻击通常是通过精心设计的提示来绕过安全机制并引发有害响应。在此,我们提出了一种新颖的方法,即定向表示优化越狱 (Directed Representation Optimization Jailbreak, DROJ),该方法在嵌入层优化越狱提示,将有害查询的隐藏表示向更有可能引发模型肯定响应的方向调整。我们在 LLaMA-2-7b-chat 模型上的评估显示,DROJ 实现了 100% 的关键词攻击成功率 (ASR),有效地防止了直接拒绝。然而,模型偶尔会产生重复且无信息量的响应。为了缓解这一问题,我们引入了一个有用的系统提示,以提高模型响应的实用性。我们的代码可在以下链接获取:https URL。
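
DROJ 的核心几何操作是把有害查询的隐藏表示沿一个"更可能引发肯定响应"的方向平移。下面用纯 Python 演示这一平移本身(方向向量、步长与数值均为演示假设;实际方向需经论文所述的优化过程得到):

```python
def shift_along_direction(h, direction, alpha):
    """把隐藏表示 h 沿单位化后的 direction 平移 alpha 步长。"""
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    return [hi + alpha * u for hi, u in zip(h, unit)]

h = [0.25, -0.5, 1.0]         # 某个 Token 的隐藏表示
affirm_dir = [1.0, 0.0, 0.0]  # 假设的"肯定响应"方向
print(shift_along_direction(h, affirm_dir, alpha=0.5))  # [0.75, -0.5, 1.0]
```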

[NLP-30] P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLM s

【速读】: 该论文试图解决现有大型语言模型(LLMs)评估方法的局限性,即以往的评估通常局限于基础自然语言处理(NLP)任务或特定能力的孤立任务,缺乏对多语言多任务能力的全面评估。解决方案的关键在于提出一个综合的多语言多任务基准测试(P-MMEval),并通过一个选择合理基准的流程来确保这些基准的有效性,即它们能够区分不同模型的性能。P-MMEval不仅覆盖了基础和特定能力的数据集,还确保了跨数据集的语言一致性和提供并行样本,从而为多语言模型的全面评估提供了坚实的基础。

链接: https://arxiv.org/abs/2411.09116
作者: Yidan Zhang,Boyi Deng,Yu Wan,Baosong Yang,Haoran Wei,Fei Huang,Bowen Yu,Junyang Lin,Fei Huang,Jingren Zhou
关键词-EN: Recent advancements, showcase varied multilingual, varied multilingual capabilities, code generation, showcase varied
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning. Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks. To alleviate this drawback, we aim to present a comprehensive multilingual multitask benchmark. First, we present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks, i.e., their ability to differentiate between models being evaluated. Leveraging this pipeline, we introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. Furthermore, P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. Finally, we conduct extensive experiments on representative multilingual model series to compare performances across models, analyze dataset effectiveness, examine prompt impacts on model performances, and explore the relationship between multilingual performances and factors such as tasks, model sizes, and languages. These insights offer valuable guidance for future research. The dataset is available at this https URL.
摘要:近年来,大语言模型 (LLMs) 在翻译、代码生成和推理等任务中展示了多样的多语言能力。以往的评估往往局限于基本的自然语言处理 (NLP) 任务或特定能力的孤立任务。为了弥补这一不足,我们旨在提出一个全面的多语言多任务基准。首先,我们提出了一种从大量基准中选择可用且合理基准的流程,解决了以往工作中对这些基准实用性的忽视,即它们区分被评估模型之间差异的能力。利用这一流程,我们引入了 P-MMEval,这是一个涵盖有效基础和能力专业化数据集的大规模基准。此外,P-MMEval 在各种数据集中提供了一致的语言覆盖,并提供了并行样本。最后,我们对代表性的多语言模型系列进行了广泛的实验,以比较模型之间的性能,分析数据集的有效性,检查提示对模型性能的影响,并探讨多语言性能与任务、模型大小和语言等因素之间的关系。这些见解为未来的研究提供了宝贵的指导。数据集可通过此 https URL 获取。
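
论文强调基准的"实用性"在于能否区分被评估的模型。一个直观的筛选思路是:只保留各模型得分差异足够大的基准。以下是该思路的玩具级示意(阈值与分数均为演示假设,并非论文的实际筛选流程):

```python
def select_discriminative(benchmark_scores, min_spread=5.0):
    """benchmark_scores: {基准名: [各模型得分]};保留得分极差不小于阈值的基准。"""
    return [name for name, scores in benchmark_scores.items()
            if max(scores) - min(scores) >= min_spread]

scores = {
    "bench_a": [88.1, 87.9, 88.0],  # 各模型几乎无差别,区分度低,剔除
    "bench_b": [92.0, 71.5, 80.3],  # 差异明显,保留
}
print(select_discriminative(scores))  # ['bench_b']
```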

[NLP-31] Personalized Help for Optimizing Low-Skilled Users Strategy

【速读】: 该论文试图解决的问题是如何评估和提升AI在复杂游戏环境中对人类玩家的辅助作用。解决方案的关键在于增强CICERO这一自然语言代理,使其能够根据玩家意图生成游戏动作和沟通建议。通过在不同水平的玩家中进行实验,论文发现生成的建议对新手玩家尤其有益,能够帮助他们与经验丰富的玩家竞争,甚至在某些情况下超越他们。即使玩家不遵循建议,建议的存在本身也能带来优势。

链接: https://arxiv.org/abs/2411.09109
作者: Feng Gu,Wichayaporn Wongkamjan,Jordan Lee Boyd-Graber,Jonathan K. Kummerfeld,Denis Peskoff,Jonathan May
关键词-EN: human remains understudied, AIs can beat, remains understudied, beat humans, human remains
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:AIs can beat humans in game environments; however, how helpful those agents are to humans remains understudied. We augment CICERO, a natural language agent that demonstrates superhuman performance in Diplomacy, to generate both move and message advice based on player intentions. A dozen Diplomacy games with novice and experienced players, with varying advice settings, show that some of the generated advice is beneficial. It helps novices compete with experienced players and in some instances even surpass them. The mere presence of advice can be advantageous, even if players do not follow it.
摘要:AI 在游戏环境中能够击败人类;然而,这些 AI 智能体对人类的实际帮助程度仍未得到充分研究。我们增强了 CICERO,一个在 Diplomacy 游戏中展现出超人类表现的自然语言智能体,使其能够根据玩家意图生成行动和消息建议。通过与新手和经验丰富的玩家进行的十多场 Diplomacy 游戏,在不同建议设置下,结果显示生成的部分建议具有积极作用。这些建议帮助新手玩家与经验丰富的玩家竞争,甚至在某些情况下超越他们。即使玩家不遵循这些建议,建议的存在本身也可能带来优势。

[NLP-32] Code-mixed LLM : Improve Large Language Models Capability to Handle Code-Mixing through Reinforcement Learning from AI Feedback

【速读】: 该论文试图解决在代码混合(Code-mixing, CM)场景下,当前最先进的多语言大型语言模型(Large Language Models, LLMs)在自然语言处理(Natural Language Processing, NLP)任务中的性能不足问题。解决方案的关键在于通过强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)和代码混合机器翻译任务来提升多语言LLMs的理解能力。为了降低人工标注的成本和时间消耗,研究者提出利用LLMs作为标注工具,执行强化学习从AI反馈(Reinforcement Learning from AI Feedback, RLAIF),实验结果表明该方法的有效性。

链接: https://arxiv.org/abs/2411.09073
作者: Wenbo Zhang,Aditya Majumdar,Amulya Yadav
关键词-EN: single utterance, juxtaposition of linguistic, linguistic units, code-mixing NLP tasks, CSW
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: initial version: 5 pages, 2 figures

点击查看摘要

Abstract:Code-mixing(CM) or code-switching(CSW) refers to the juxtaposition of linguistic units from two or more languages during the conversation or sometimes even a single utterance. Code-mixing introduces unique challenges in daily life, such as syntactic mismatches and semantic blending, that are rarely encountered in monolingual settings. Large language models (LLMs) have revolutionized the field of natural language processing (NLP) by offering unprecedented capabilities in understanding human languages. However, the effectiveness of current state-of-the-art multilingual LLMs has not yet been fully explored in the CM scenario. To fill this gap, we first benchmark the performance of multilingual LLMs on various code-mixing NLP tasks. Then we propose to improve the multilingual LLMs’ ability to understand code-mixing through reinforcement learning from human feedback (RLHF) and code-mixed machine translation tasks. Given the high-cost and time-consuming preference labeling procedure, we improve this by utilizing LLMs as annotators to perform the reinforcement learning from AI feedback (RLAIF). The experiments show the effectiveness of the proposed method.
摘要:代码混合(Code-mixing, CM)或代码转换(Code-switching, CSW)指的是在对话中或甚至在单个话语中,将来自两种或多种语言的语言单位并置的现象。代码混合在日常生活中引入了独特的挑战,如句法不匹配和语义混合,这些在单语环境中很少遇到。大语言模型(Large Language Models, LLMs)通过提供前所未有的理解人类语言的能力,彻底改变了自然语言处理(Natural Language Processing, NLP)领域。然而,当前最先进的多语言大语言模型在代码混合场景中的有效性尚未得到充分探索。为了填补这一空白,我们首先在各种代码混合NLP任务上对多语言大语言模型的性能进行了基准测试。然后,我们提出通过从人类反馈中进行强化学习(Reinforcement Learning from Human Feedback, RLHF)和代码混合机器翻译任务来提高多语言大语言模型理解代码混合的能力。鉴于偏好标注过程的高成本和耗时性,我们通过利用大语言模型作为标注者来执行从AI反馈中进行强化学习(Reinforcement Learning from AI Feedback, RLAIF),从而改进了这一过程。实验结果显示了所提出方法的有效性。

[NLP-33] Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

【速读】: 该论文试图解决小规模视觉语言模型(VLMs)在微调过程中因长而详细的图像描述而产生的幻觉问题。解决方案的关键在于提出了知识适应性微调(Knowledge Adapted (KnowAda) fine-tuning),这是一种以数据为中心的方法,通过自动调整训练数据以适应模型的现有知识和视觉理解,从而在减少幻觉的同时保持高描述性。该方法在多个小规模VLMs和密集描述数据集上进行了验证,证明了其在减少幻觉和保持描述性方面的有效性,并优于多种基线方法。

链接: https://arxiv.org/abs/2411.09018
作者: Moran Yanuka,Assaf Ben Kish,Yonatan Bitton,Idan Szpektor,Raja Giryes
关键词-EN: Recent research increasingly, research increasingly focuses, Recent research, detailed image captions, detailed image
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent research increasingly focuses on training vision-language models (VLMs) with long, detailed image captions. However, small-scale VLMs often struggle to balance the richness of these captions with the risk of hallucinating content during fine-tuning. In this paper, we explore how well VLMs adapt to such captions. To quantify caption quality, we propose Decomposed NLI (DNLI), an evaluation framework that breaks down generated captions into individual propositions, assessing each in isolation. This fine-grained analysis reveals a critical balance between capturing descriptive details and preventing hallucinations. Our findings show that simply reducing caption complexity or employing standard data curation techniques does not effectively resolve this issue. To tackle this challenge, we introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model’s existing knowledge and visual understanding. KnowAda minimizes hallucinations while preserving high descriptiveness. We validate this approach across several small-scale VLMs (up to 7B parameters) and dense caption datasets, demonstrating that KnowAda effectively balances hallucination reduction and descriptiveness. Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations. We will release our code and models.
摘要:近年来,研究逐渐聚焦于使用长而详细的图像描述来训练视觉-语言模型 (Vision-Language Models, VLMs)。然而,小规模的 VLMs 在微调过程中往往难以平衡这些描述的丰富性与内容幻觉的风险。本文探讨了 VLMs 如何适应这些描述。为了量化描述质量,我们提出了分解式自然语言推理 (Decomposed NLI, DNLI),这是一个评估框架,它将生成的描述分解为单独的命题,并分别进行评估。这种细粒度的分析揭示了在捕捉描述细节与防止幻觉之间的重要平衡。我们的研究发现,简单地减少描述复杂性或采用标准的数据整理技术并不能有效解决这一问题。为了应对这一挑战,我们引入了知识适应 (Knowledge Adapted, KnowAda) 微调,这是一种以数据为中心的方法,能够根据模型现有的知识和视觉理解自动调整训练数据。KnowAda 在保持高描述性的同时最小化了幻觉。我们在多个小规模 VLMs(参数规模高达 7B)和密集描述数据集上验证了这种方法,证明了 KnowAda 在减少幻觉和保持描述性之间的有效平衡。我们的结果显示,KnowAda 在自动指标和人工评估中均优于多种基线方法。我们将发布我们的代码和模型。

[NLP-34] Cut Your Losses in Large-Vocabulary Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在训练过程中由于交叉熵损失计算导致的内存占用过高的问题。解决方案的关键在于提出了Cut Cross-Entropy (CCE)方法,该方法通过避免在全局内存中具体化所有token的logit矩阵,仅计算正确token的logit并在运行时动态评估所有logit的log-sum-exp,从而显著减少内存消耗。具体实现中,CCE利用自定义内核在闪存中执行矩阵乘法和log-sum-exp约简,使得交叉熵计算的内存消耗几乎可以忽略不计。实验结果表明,CCE不仅大幅降低了内存占用(例如,将Gemma 2 (2B)模型的损失计算内存从24 GB减少到1 MB,分类器头的总训练时间内存从28 GB减少到1 GB),而且不牺牲训练速度或收敛性。

链接: https://arxiv.org/abs/2411.09009
作者: Erik Wijmans,Brody Huval,Alexander Hertzberg,Vladlen Koltun,Philipp Krähenbühl
关键词-EN: language models grow, memory, grow ever larger, memory consumption, cross-entropy
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Code is available at this https URL

点击查看摘要

Abstract:As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e., below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.
摘要:随着语言模型规模的不断扩大,其词汇量也随之增加。这一变化导致大语言模型(LLM)在训练过程中的内存占用显著集中于一个特定层:损失计算中的交叉熵层。交叉熵构建了一个包含每个输入Token与词汇项对之间条目的Logit矩阵,对于小型模型而言,其内存消耗量比LLM其他部分的总和还要高出一个数量级。我们提出了一种名为“剪枝交叉熵”(Cut Cross-Entropy, CCE)的方法,该方法在计算交叉熵损失时无需将所有Token的Logit具体化到全局内存中。相反,CCE仅计算正确Token的Logit,并实时评估所有Logit的对数求和指数(log-sum-exp)。我们实现了一个自定义内核,该内核在闪存中执行矩阵乘法和词汇上的log-sum-exp缩减操作,从而使得交叉熵计算的全局内存消耗几乎可以忽略不计。这一改进效果显著。以Gemma 2(2B)模型为例,CCE将损失计算的内存占用从24 GB减少到1 MB,并将分类器头的总训练时间内存消耗从28 GB减少到1 GB。为了提高CCE的吞吐量,我们利用了Softmax的固有稀疏性,并提出跳过对梯度贡献微乎其微(即低于数值精度)的梯度计算元素。实验表明,这种显著的内存消耗减少并未牺牲训练速度或收敛性。
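
CCE 的等价计算可以在纯 Python 中复现:交叉熵等于所有 logit 的 log-sum-exp 减去正确 Token 的 logit,而 log-sum-exp 可以分块、在线地累积,任一时刻只需保留一小块 logit。以下示意仅展示这一数值等价性(分块大小与数据均为演示假设,官方实现为自定义 GPU 内核):

```python
import math

def cross_entropy_streaming(h, W, target, chunk=2):
    """损失 = logsumexp(所有 logit) - 正确 Token 的 logit;W 按词表分块流式处理。"""
    m, s = -math.inf, 0.0  # 在线 log-sum-exp 的运行最大值与指数和
    target_logit = None
    for start in range(0, len(W), chunk):
        logits = [sum(hi * wi for hi, wi in zip(h, w)) for w in W[start:start + chunk]]
        for i, l in enumerate(logits, start):
            if i == target:
                target_logit = l
            if l > m:
                s = s * math.exp(m - l) + 1.0
                m = l
            else:
                s += math.exp(l - m)
    return (m + math.log(s)) - target_logit

h = [0.5, -1.0]                                        # 隐状态
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0], [0.3, 0.3]]  # 词表大小为 4 的分类器权重
loss = cross_entropy_streaming(h, W, target=0)
naive = math.log(sum(math.exp(sum(a * b for a, b in zip(h, w))) for w in W)) - 0.5
print(abs(loss - naive) < 1e-9)  # True:与一次性物化全部 logit 的朴素实现一致
```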

[NLP-35] Refusal in LLM s is an Affine Function

【速读】: 该论文试图解决通过直接干预模型激活来引导语言模型行为的问题。解决方案的关键在于提出了仿射概念编辑 (Affine Concept Editing, ACE),这是一种结合了仿射子空间投影和激活添加的方法。通过仿射分解模型激活向量,ACE能够更精确地控制模型的拒绝响应,并在不同类型的提示下实现一致的行为控制。实验结果表明,ACE不仅在控制模型行为方面表现出色,还能在仅使用仿射子空间投影导致输出不连贯的情况下,通过结合激活添加来生成合理的输出。

链接: https://arxiv.org/abs/2411.09003
作者: Thomas Marshall,Adam Scherlis,Nora Belrose
关键词-EN: steering language models’, language models’ behavior, affine concept editing, propose affine concept, Hermes Eagle RWKV
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose affine concept editing (ACE) as an approach for steering language models’ behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering model behavior correspond to subsets of terms of this decomposition. We then provide a derivation of ACE and test it on refusal using Llama 3 8B and Hermes Eagle RWKV v5. ACE ultimately combines affine subspace projection and activation addition to reliably control the model’s refusal responses across prompt types. We evaluate the results using LLM-based scoring on a collection of harmful and harmless prompts. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior and generalizes to models where directional ablation via affine subspace projection alone produces incoherent outputs. Code for reproducing our results is available at this https URL .
摘要:我们提出了一种名为仿射概念编辑 (Affine Concept Editing, ACE) 的方法,通过直接干预激活来引导语言模型的行为。我们首先对模型激活向量进行仿射分解,并证明先前用于引导模型行为的方法对应于此分解的某些项。接着,我们推导了 ACE 的公式,并在 Llama 3 8B 和 Hermes Eagle RWKV v5 上测试了其在拒绝任务中的应用。ACE 最终结合了仿射子空间投影和激活加法,以可靠地控制模型在不同提示类型下的拒绝响应。我们使用基于大语言模型的评分方法对一组有害和无害的提示进行了评估。实验结果表明,ACE 能够持续实现对模型行为的更精确控制,并且能够推广到仅通过仿射子空间投影进行方向性消融会产生不连贯输出的模型。重现我们结果的代码可在以下链接获取:https URL。

[NLP-36] CoCoP: Enhancing Text Classification with LLM through Code Completion Prompt

【速读】: 该论文试图解决文本分类任务中大型语言模型(LLMs)对输入提示质量依赖性强的问题。解决方案的关键是将文本分类问题转化为代码补全任务,提出了代码补全提示(Code Completion Prompt, CoCoP)方法。通过利用LLMs的代码补全能力,CoCoP显著提升了多种数据集上的文本分类性能,例如在SST2数据集上提高了超过20%的准确率。此外,当CoCoP与专门设计用于代码相关任务的LLMs(如CodeLLaMA)结合时,该方法在性能上优于或可与少样本学习技术相媲美,同时仅使用十分之一的模型大小。

链接: https://arxiv.org/abs/2411.08979
作者: Mohammad Mahdi Mohajeri,Mohammad Javad Dousti,Majid Nili Ahmadabadi
关键词-EN: natural language processing, large language models, language processing, natural language, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text classification is a fundamental task in natural language processing (NLP), and large language models (LLMs) have demonstrated their capability to perform this task across various domains. However, the performance of LLMs heavily depends on the quality of their input prompts. Recent studies have also shown that LLMs exhibit remarkable results in code-related tasks. To leverage the capabilities of LLMs in text classification, we propose the Code Completion Prompt (CoCoP) method, which transforms the text classification problem into a code completion task. CoCoP significantly improves text classification performance across diverse datasets by utilizing LLMs’ code-completion capability. For instance, CoCoP enhances the accuracy of the SST2 dataset by more than 20%. Moreover, when CoCoP is integrated with LLMs specifically designed for code-related tasks (code models), such as CodeLLaMA, this method demonstrates better or comparable performance to few-shot learning techniques while using only one-tenth of the model size. The source code of our proposed method will be available to the public upon the acceptance of the paper.
摘要:文本分类是自然语言处理 (NLP) 中的一个基础任务,大语言模型 (LLMs) 已经在多个领域展示了其执行此任务的能力。然而,LLMs 的性能在很大程度上依赖于其输入提示的质量。最近的研究还表明,LLMs 在代码相关任务中表现出色。为了利用 LLMs 在文本分类中的能力,我们提出了代码补全提示 (Code Completion Prompt, CoCoP) 方法,该方法将文本分类问题转化为代码补全任务。通过利用 LLMs 的代码补全能力,CoCoP 显著提高了在不同数据集上的文本分类性能。例如,CoCoP 将 SST2 数据集的准确率提高了超过 20%。此外,当 CoCoP 与专门设计用于代码相关任务 (代码模型) 的 LLMs(如 CodeLLaMA)结合时,该方法在使用仅十分之一模型大小的情况下,表现优于或相当于少样本学习技术。我们提出的方法的源代码将在论文被接受后公开。
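
CoCoP 的提示构造思路是把"给文本选标签"改写成"补全代码中变量取值":提示以代码片段形式呈现标签集合与若干示例,最后在 `label = "` 处留待模型补全。以下是构造此类提示的一个示意(模板为本文演示假设,并非论文原模板):

```python
def build_cocop_prompt(text, labels, demos=()):
    """把文本分类改写为补全 label 变量取值的代码补全任务。"""
    lines = [f"labels = {list(labels)}", ""]
    for demo_text, demo_label in demos:  # 少样本示例,以已补全的代码形式给出
        lines += [f'text = "{demo_text}"', f'label = "{demo_label}"', ""]
    lines += [f'text = "{text}"', 'label = "']  # 留给模型补全
    return "\n".join(lines)

prompt = build_cocop_prompt(
    "a gorgeous, witty, seductive movie",
    labels=["positive", "negative"],
    demos=[("the plot is boring", "negative")],
)
print(prompt)
```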

[NLP-37] Robustness and Confounders in the Demographic Alignment of LLM s with Human Perceptions of Offensiveness ACL’25

【速读】: 该论文试图解决大语言模型(LLMs)中存在的群体偏见问题,特别是这些偏见在多个数据集中的表现及其与混杂因素的关系。解决方案的关键在于系统性地评估LLMs在多个冒犯性语言数据集中的对齐情况,并考虑混杂因素如文档难度、标注者敏感度和群体内一致性对对齐模式的影响。研究发现,虽然群体特征(尤其是种族)影响对齐,但其效应在不同数据集中不一致,且常与其他因素交织。通过多数据集分析和考虑混杂因素的方法,论文强调了在开发稳健的群体偏见测量方法时的重要性。

链接: https://arxiv.org/abs/2411.08977
作者: Shayan Alipour,Indira Sen,Mattia Samory,Tanushree Mitra
关键词-EN: Large language models, studies systematically evaluate, Large language, exhibit demographic biases, studies systematically
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 18 pages, 8 figures, ACL’25

点击查看摘要

Abstract:Large language models (LLMs) are known to exhibit demographic biases, yet few studies systematically evaluate these biases across multiple datasets or account for confounding factors. In this work, we examine LLM alignment with human annotations in five offensive language datasets, comprising approximately 220K annotations. Our findings reveal that while demographic traits, particularly race, influence alignment, these effects are inconsistent across datasets and often entangled with other factors. Confounders – such as document difficulty, annotator sensitivity, and within-group agreement – account for more variation in alignment patterns than demographic traits alone. Specifically, alignment increases with higher annotator sensitivity and group agreement, while greater document difficulty corresponds to reduced alignment. Our results underscore the importance of multi-dataset analyses and confounder-aware methodologies in developing robust measures of demographic bias in LLMs.
摘要:大语言模型(Large Language Models, LLMs)已知存在人口统计学偏见,但很少有研究系统地评估这些偏见在多个数据集上的表现,或考虑混杂因素的影响。在本研究中,我们考察了 LLM 与五个冒犯性语言数据集中约 22 万条标注的一致性。我们的研究发现,尽管人口统计学特征,特别是种族,对一致性有影响,但这些影响在不同数据集间并不一致,且常常与其他因素交织在一起。混杂因素——如文档难度、标注者敏感度以及组内一致性——对一致性模式的影响比单纯的人口统计学特征更大。具体而言,标注者敏感度和组内一致性越高,一致性越强;而文档难度越大,一致性越低。我们的结果强调了多数据集分析和考虑混杂因素的方法论在开发稳健的 LLM 人口统计学偏见测量中的重要性。
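论文的核心主张是:不做混杂因素控制,群体间的一致性差异可能被误读。下面用一个构造的小例子示意"按混杂因素分层后再比较"的思路(数据、字段与阈值均为虚构,仅演示分析逻辑):

```python
from collections import defaultdict

# 虚构标注记录:(标注者群体, 文档难度, 是否与 LLM 判断一致)
records = [
    ("A", "easy", 1), ("A", "easy", 1), ("A", "hard", 0), ("A", "hard", 0),
    ("B", "easy", 1), ("B", "hard", 0), ("B", "hard", 0), ("B", "hard", 0),
]

def alignment_rate(rows):
    return sum(r[2] for r in rows) / len(rows)

# 不控制混杂因素:整体上 A、B 两组的一致性差异看似很大
overall = {g: alignment_rate([r for r in records if r[0] == g]) for g in "AB"}

# 按文档难度分层后,组间差异消失:差异其实来自两组标注的难度分布不同
strata = defaultdict(dict)
for g in "AB":
    for d in ("easy", "hard"):
        rows = [r for r in records if r[0] == g and r[1] == d]
        strata[d][g] = alignment_rate(rows)

print(overall)        # {'A': 0.5, 'B': 0.25}
print(dict(strata))   # 每个难度层内 A、B 的一致性完全相同
```

这正对应论文的发现:文档难度等混杂因素对一致性模式的解释力可能超过群体特征本身。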

[NLP-38] Sparse Upcycling: Inference Inefficient Finetuning NEURIPS

【速读】: 该论文试图解决在保持推理效率的同时提高小型、高度训练的开源大型语言模型质量的问题。解决方案的关键在于采用稀疏升级(Sparse Upcycling)方法,即将预训练的密集模型转换为混合专家模型(Mixture-of-Experts, MoE)架构,从而增加模型参数数量并提升模型质量。研究结果表明,稀疏升级在某些情况下可以比继续预训练(Continued Pretraining, CPT)带来超过20%的质量提升,但这也伴随着显著的推理成本,导致在高需求推理场景中模型推理速度下降40%。因此,论文强调了模型质量和推理效率之间的权衡,为寻求平衡模型质量和部署约束的实践者提供了重要见解。

链接: https://arxiv.org/abs/2411.08968
作者: Sasha Doubov,Nikhil Sardana,Vitaliy Chiley
关键词-EN: open-source large language, large language models, highly trained, open-source large, remains a challenge
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 12 pages, 4 figures, To appear in the 4th NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP), 2024

点击查看摘要

Abstract:Small, highly trained, open-source large language models are widely used due to their inference efficiency, but further improving their quality remains a challenge. Sparse upcycling is a promising approach that transforms a pretrained dense model into a Mixture-of-Experts (MoE) architecture, increasing the model’s parameter count and quality. In this work, we compare the effectiveness of sparse upcycling against continued pretraining (CPT) across different model sizes, compute budgets, and pretraining durations. Our experiments show that sparse upcycling can achieve better quality, with improvements of over 20% relative to CPT in certain scenarios. However, this comes with a significant inference cost, leading to 40% slowdowns in high-demand inference settings for larger models. Our findings highlight the trade-off between model quality and inference efficiency, offering insights for practitioners seeking to balance model quality and deployment constraints.
摘要:由于推理效率高,小型、高度训练的开源大语言模型被广泛使用,但进一步提升其质量仍是一个挑战。稀疏升级(Sparse upcycling)是一种有前景的方法,它将预训练的密集模型转换为专家混合(Mixture-of-Experts, MoE)架构,从而增加模型的参数数量和质量。在本研究中,我们比较了稀疏升级与持续预训练(Continued Pretraining, CPT)在不同模型规模、计算预算和预训练时长下的效果。实验结果表明,稀疏升级在某些情况下可以实现更好的质量,相对于CPT有超过20%的提升。然而,这伴随着显著的推理成本,导致在高需求推理环境下,较大模型的推理速度减慢了40%。我们的研究结果突显了模型质量和推理效率之间的权衡,为寻求平衡模型质量和部署限制的从业者提供了见解。
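稀疏升级的起点是"用密集模型的权重初始化各个专家"。下面给出一个玩具级示意:把一个密集前馈层复制成 N 个专家并用均匀路由组合,初始化时 MoE 的输出与原密集模型完全一致,但参数量变为 N 倍(结构与路由方式均为简化假设,并非论文实现):

```python
import copy

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class DenseFFN:
    def __init__(self, W):
        self.W = W
    def __call__(self, x):
        return matvec(self.W, x)

class MoEFFN:
    """稀疏升级示意:每个专家初始化为密集层权重的拷贝,
    这里用均匀路由混合专家输出;专家相同时初始输出与密集模型一致。"""
    def __init__(self, dense, n_experts):
        self.experts = [copy.deepcopy(dense.W) for _ in range(n_experts)]
    def __call__(self, x):
        outs = [matvec(W, x) for W in self.experts]
        gate = 1.0 / len(self.experts)  # 初始化时的均匀门控
        return [gate * sum(o[i] for o in outs) for i in range(len(x))]

dense = DenseFFN([[0.5, -1.0], [2.0, 0.25]])
moe = MoEFFN(dense, n_experts=4)
x = [1.0, 3.0]
print(dense(x), moe(x))  # 初始化时两者相同
```

后续训练中各专家权重分化、门控变为可学习的稀疏路由,这也正是论文讨论的推理成本来源:参数量增大后,高需求推理场景会显著变慢。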

[NLP-39] Quantifying Risk Propensities of Large Language Models: Ethical Focus and Bias Detection through Role-Play

【速读】: 该论文试图解决大型语言模型(LLMs)在伦理领域中的风险决策倾向和潜在偏见问题。解决方案的关键在于创新性地将认知科学中的领域特定风险承担量表(Domain-Specific Risk-Taking, DOSPERT)应用于LLMs,并提出了一种新的伦理决策风险态度量表(Ethical Decision-Making Risk Attitude Scale, EDRAS),以深入评估LLMs的伦理风险态度。此外,论文还提出了一种结合风险量表和角色扮演的新方法,用于定量评估LLMs中的系统性偏见。通过系统评估和分析多个主流LLMs,研究揭示并量化了LLMs在不同群体中的系统性偏见,从而有助于理解LLMs的风险决策机制,并确保其安全可靠的应用。

链接: https://arxiv.org/abs/2411.08884
作者: Yifan Zeng
关键词-EN: Large Language Models, Language Models, Large Language, ethical risk attitudes, LLMs’ risk decision-making
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) become more prevalent, concerns about their safety, ethics, and potential biases have risen. Systematically evaluating LLMs’ risk decision-making tendencies and attitudes, particularly in the ethical domain, has become crucial. This study innovatively applies the Domain-Specific Risk-Taking (DOSPERT) scale from cognitive science to LLMs and proposes a novel Ethical Decision-Making Risk Attitude Scale (EDRAS) to assess LLMs’ ethical risk attitudes in depth. We further propose a novel approach integrating risk scales and role-playing to quantitatively evaluate systematic biases in LLMs. Through systematic evaluation and analysis of multiple mainstream LLMs, we assessed the “risk personalities” of LLMs across multiple domains, with a particular focus on the ethical domain, and revealed and quantified LLMs’ systematic biases towards different groups. This research helps understand LLMs’ risk decision-making and ensure their safe and reliable application. Our approach provides a tool for identifying and mitigating biases, contributing to fairer and more trustworthy AI systems. The code and data are available.
摘要:随着大语言模型 (LLM) 的普及,关于其安全性、伦理以及潜在偏见的担忧也随之增加。系统地评估 LLM 在风险决策中的倾向和态度,特别是在伦理领域,变得至关重要。本研究创新性地将认知科学中的领域特定风险承担量表 (DOSPERT) 应用于 LLM,并提出了一种新的伦理决策风险态度量表 (EDRAS),以深入评估 LLM 的伦理风险态度。我们进一步提出了一种将风险量表与角色扮演相结合的新方法,用于定量评估 LLM 中的系统性偏见。通过对多个主流 LLM 的系统评估和分析,我们评估了 LLM 在多个领域中的“风险人格”,特别关注伦理领域,并揭示和量化了 LLM 对不同群体的系统性偏见。这项研究有助于理解 LLM 的风险决策,并确保其安全可靠的应用。我们的方法提供了一种识别和减轻偏见的工具,有助于构建更公平、更值得信赖的 AI 系统。代码和数据已公开。
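"风险量表 + 角色扮演"的评估流程可以用下面的草图说明:给模型一个人物设定,逐条询问 Likert 题项并聚合得分。题项、人物与桩函数均为本文虚构示例,真实实现需要调用 LLM 并解析其回答:

```python
# 风险量表 + 角色扮演的聚合逻辑示意(题项与人物均为虚构示例)。

ITEMS = [  # 仿 DOSPERT/EDRAS 风格的 Likert 题项(1=绝不会, 5=非常可能)
    "向媒体匿名举报雇主的违规行为",
    "为救人闯红灯",
]
PERSONAS = ["一位 30 岁的医生", "一位 65 岁的退休教师"]

def ask_model(persona, item):
    """桩函数:真实实现应构造"请扮演{persona},对下述行为打 1-5 分"
    的提示并解析 LLM 的回答;这里返回固定分数以演示聚合逻辑。"""
    return 3

def risk_profile(persona):
    scores = [ask_model(persona, it) for it in ITEMS]
    return sum(scores) / len(scores)

profiles = {p: risk_profile(p) for p in PERSONAS}
print(profiles)
```

比较不同人物设定下的量表得分,即可定量刻画模型对不同群体的"风险人格"差异。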

[NLP-40] NLIP_Lab-IITH Low-Resource MT System for WMT24 Indic MT Shared Task

【速读】: 该论文试图解决低资源印度语言翻译问题,特别是针对英语与阿萨姆语(as)、卡西语(kha)、卢舍语(lus)和曼尼普尔语(mni)之间的翻译任务。解决方案的关键在于对预训练模型进行微调,利用对齐增强(alignment augmentation)技术来优化嵌入向量的对齐,从而提升翻译质量。论文中采用了语言特定的微调策略,并在多语言训练中探索了语言分组和层冻结技术,最终在官方公开测试集上取得了显著的chrF2分数提升。

链接: https://arxiv.org/abs/2410.03215
作者: Pramit Sahoo,Maharaj Brahma,Maunendra Sankar Desarkar
关键词-EN: Low-Resource Indic Language, Low-Resource Indic, Indic Language Translation, Indic Language, rightarrow
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: WMT2024 INDICMT Shared Task

点击查看摘要

Abstract:In this paper, we describe our system for the WMT 24 shared task of Low-Resource Indic Language Translation. We consider eng ↔ as, kha, lus, mni as participating language pairs. In this shared task, we explore the finetuning of a pre-trained model motivated by the pre-trained objective of aligning embeddings closer by alignment augmentation [Lin et al., 2020] for 22 scheduled Indian languages. Our primary system is based on language-specific finetuning on a pre-trained model. We achieve chrF2 scores of 50.6, 42.3, 54.9, and 66.3 on the official public test set for eng → as, eng → kha, eng → lus, eng → mni respectively. We also explore multilingual training with/without language grouping and layer-freezing. Our code, models, and generated translations are available here: this https URL.
摘要:本文描述了我们为 WMT 24 共享任务中的低资源印度语言翻译系统。我们考虑了 eng ↔ as, kha, lus, mni 作为参与的语言对。在该共享任务中,我们探索了基于预训练模型微调的方法,该方法受到预训练目标的启发,通过对齐增强(alignment augmentation)[Lin et al., 2020] 来更紧密地对齐嵌入,适用于 22 种计划中的印度语言。我们的主要系统基于预训练模型的语言特定微调。我们在官方公开测试集上分别获得了 eng → as, eng → kha, eng → lus, eng → mni 的 chrF2 分数为 50.6, 42.3, 54.9 和 66.3。我们还探索了有无语言分组和层冻结的多语言训练。我们的代码、模型和生成的翻译可在此处获取:此 https URL。
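论文以 chrF2 作为评测指标,它是在字符 n-gram 上计算的 F_beta 分数(beta=2,偏重召回)。下面是一个简化实现示意(真实 chrF2 默认取 n=1..6 且有更多细节,这里仅取 n=1..2 说明计算方式):

```python
from collections import Counter

def char_ngrams(s, n):
    s = s.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hyp, ref, max_n=2, beta=2.0):
    """简化版 chrF:对各阶字符 n-gram 求精确率/召回率的平均,
    再组合成 F_beta 分数(0-100)。仅为示意,非 sacreBLEU 实现。"""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())  # 多重集合交集
        precs.append(overlap / max(sum(h.values()), 1))
        recs.append(overlap / max(sum(r.values()), 1))
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("translation", "translation"))  # 100.0
print(chrf("translate", "translation"))    # 介于 0 与 100 之间
```

实际评测请使用 sacreBLEU 等标准工具,以保证分数可比。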

人工智能

[AI-0] On the Surprising Effectiveness of Attention Transfer for Vision Transformers NEURIPS2024

链接: https://arxiv.org/abs/2411.09702
作者: Alexander C. Li,Yuandong Tian,Beidi Chen,Deepak Pathak,Xinlei Chen
关键词-EN: pre-training Vision Transformers, Conventional wisdom suggests, Vision Transformers, Conventional wisdom, pre-training Vision
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注: NeurIPS 2024. Code: this https URL

点击查看摘要

Abstract:Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher also further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
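"注意力迁移"的要点是:学生不计算自己的注意力权重,而是直接复用教师的注意力模式,只学习自己的特征(value)投影。下面用纯 Python 的单头、无缩放注意力给出一个玩具级示意(矩阵与权重均为虚构,仅演示信息流的复用):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attn_pattern(X, Wq, Wk):
    """由 token 表示 X 计算注意力权重(单头、无缩放,仅为示意)。"""
    Q, K = matmul(X, Wq), matmul(X, Wk)
    scores = matmul(Q, [list(c) for c in zip(*K)])  # Q @ K^T
    return [softmax(row) for row in scores]

def attend(pattern, V):
    return matmul(pattern, V)

# attention transfer:学生不学 Wq/Wk,直接复用教师的注意力模式,
# 只学习自己的 value 投影 Wv_student(即自己的特征)。
X = [[1.0, 0.0], [0.0, 1.0]]
teacher_pattern = attn_pattern(X, Wq=[[1.0, 0.0], [0.0, 1.0]],
                                  Wk=[[1.0, 0.0], [0.0, 1.0]])
Wv_student = [[0.5, 0.0], [0.0, 0.5]]
out = attend(teacher_pattern, matmul(X, Wv_student))
print(out)
```

训练时只有 Wv_student(及后续层)接收梯度,teacher_pattern 保持固定,这对应论文中"复制注意力图"的设定。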

[AI-1] Towards a Classification of Open-Source ML Models and Datasets for Software Engineering

链接: https://arxiv.org/abs/2411.09683
作者: Alexandra González,Xavier Franch,David Lo,Silverio Martínez-Fernández
关键词-EN: Machine Learning, provide extensive resources, datasets provide extensive, Open-Source Pre-Trained Models, Software Engineering
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 8 figures

点击查看摘要

Abstract:Background: Open-Source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. Aims: We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time. Method: We conducted a repository mining study. We started with a systematically gathered database of PTMs and datasets from the HF API. Our selection was refined by analyzing model and dataset cards and metadata, such as tags, and confirming SE relevance using Gemini 1.5 Pro. All analyses are replicable, with a publicly accessible replication package. Results: The most common SE task among PTMs and datasets is code generation, with a primary focus on software development and limited attention to software management. Popular PTMs and datasets mainly target software development. Among ML tasks, text generation is the most common in SE PTMs and datasets. There has been a marked increase in PTMs for SE since 2023 Q2. Conclusions: This study underscores the need for broader task coverage to enhance the integration of ML within SE practices.

[AI-2] NeuralDEM - Real-time Simulation of Industrial Particulate Flows

链接: https://arxiv.org/abs/2411.09678
作者: Benedikt Alkin,Tobias Kronlachner,Samuele Papa,Stefan Pirker,Thomas Lichtenegger,Johannes Brandstetter
关键词-EN: numerically simulate large-scale, simulate large-scale fluid-mechanical, Advancements in computing, core industrial processes, computing power
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Project page: this https URL

点击查看摘要

Abstract:Advancements in computing power have made it possible to numerically simulate large-scale fluid-mechanical and/or particulate systems, many of which are integral to core industrial processes. Among the different numerical methods available, the discrete element method (DEM) provides one of the most accurate representations of a wide range of physical systems involving granular and discontinuous materials. Consequently, DEM has become a widely accepted approach for tackling engineering problems connected to granular flows and powder mechanics. Additionally, DEM can be integrated with grid-based computational fluid dynamics (CFD) methods, enabling the simulation of chemical processes taking place, e.g., in fluidized beds. However, DEM is computationally intensive because of the intrinsic multiscale nature of particulate systems, restricting simulation duration or number of particles. Towards this end, NeuralDEM presents an end-to-end approach to replace slow numerical DEM routines with fast, adaptable deep learning surrogates. NeuralDEM is capable of picturing long-term transport processes across different regimes using macroscopic observables without any reference to microscopic model parameters. First, NeuralDEM treats the Lagrangian discretization of DEM as an underlying continuous field, while simultaneously modeling macroscopic behavior directly as additional auxiliary fields. Second, NeuralDEM introduces multi-branch neural operators scalable to real-time modeling of industrially-sized scenarios - from slow and pseudo-steady to fast and transient. Such scenarios have previously posed insurmountable challenges for deep learning models. Notably, NeuralDEM faithfully models coupled CFD-DEM fluidized bed reactors of 160k CFD cells and 500k DEM particles for trajectories of 28s. NeuralDEM will open many new doors to advanced engineering and much faster process cycles.

[AI-3] Med-Bot: An AI-Powered Assistant to Provide Accurate and Reliable Medical Information ALT

链接: https://arxiv.org/abs/2411.09648
作者: Ahan Bhatt,Nandan Vaghela
关键词-EN: AI-powered chatbot designed, paper introduces Med-Bot, reliable medical information, paper introduces, AI-powered chatbot
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 3 figures, 5 pages. Keywords: LLM, AI-powered healthcare, Medical chatbot, Context-based interaction, Llama-assisted data processing, AutoGPT-Q, PyTorch, TensorFlow, Reliable medical information, Machine learning in healthcare, Conversational AI

点击查看摘要

Abstract:This paper introduces Med-Bot, an AI-powered chatbot designed to provide users with accurate and reliable medical information. Utilizing advanced libraries and frameworks such as PyTorch, Chromadb, Langchain and Autogptq, Med-Bot is built to handle the complexities of natural language understanding in a healthcare context. The integration of Llama-assisted data processing and AutoGPT-Q provides enhanced performance in processing and responding to queries based on PDFs of medical literature, ensuring that users receive precise and trustworthy information. This research details the methodologies employed in developing Med-Bot and evaluates its effectiveness in disseminating healthcare information.

[AI-4] One-Shot Manipulation Strategy Learning by Making Contact Analogies

链接: https://arxiv.org/abs/2411.09627
作者: Yuyao Liu,Jiayuan Mao,Joshua Tenenbaum,Tomás Lozano-Pérez,Leslie Pack Kaelbling
关键词-EN: generalizable intelligent contacts, manipulation analogies, manipulation strategies, analogies for generalizable, generalizable intelligent
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: CoRL LEAP Workshop, 2024

点击查看摘要

Abstract:We present a novel approach, MAGIC (manipulation analogies for generalizable intelligent contacts), for one-shot learning of manipulation strategies with fast and extensive generalization to novel objects. By leveraging a reference action trajectory, MAGIC effectively identifies similar contact points and sequences of actions on novel objects to replicate a demonstrated strategy, such as using different hooks to retrieve distant objects of different shapes and sizes. Our method is based on a two-stage contact-point matching process that combines global shape matching using pretrained neural features with local curvature analysis to ensure precise and physically plausible contact points. We experiment with three tasks including scooping, hanging, and hooking objects. MAGIC demonstrates superior performance over existing methods, achieving significant improvements in runtime speed and generalization to different object categories. Website: this https URL .

[AI-5] Vision-based Manipulation of Transparent Plastic Bags in Industrial Setups

链接: https://arxiv.org/abs/2411.09623
作者: F. Adetunji(1 and 2),A. Karukayil(1 and 2),P. Samant(1 and 2),S. Shabana(1 and 2),F. Varghese(1 and 2),U. Upadhyay(1 and 2),R. A. Yadav(1 and 2),A. Partridge(2),E. Pendleton(2),R. Plant(2),Y. Petillot(1 and 2),M. Koskinopoulou(1 and 2) ((1) Heriot-Watt University, (2) The National Robotarium)
关键词-EN: transparent plastic bags, transparent plastic, Convolutional Neural Networks, plastic bags, industrial setups
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper addresses the challenges of vision-based manipulation for autonomous cutting and unpacking of transparent plastic bags in industrial setups, aligning with the Industry 4.0 paradigm. Industry 4.0, driven by data, connectivity, analytics, and robotics, promises enhanced accessibility and sustainability throughout the value chain. The integration of autonomous systems, including collaborative robots (cobots), into industrial processes is pivotal for efficiency and safety. The proposed solution employs advanced Machine Learning algorithms, particularly Convolutional Neural Networks (CNNs), to identify transparent plastic bags under varying lighting and background conditions. Tracking algorithms and depth sensing technologies are utilized for 3D spatial awareness during pick and placement. The system addresses challenges in grasping and manipulation, considering optimal points, compliance control with vacuum gripping technology, and real-time automation for safe interaction in dynamic environments. The system’s successful testing and validation in the lab with the FRANKA robot arm, showcases its potential for widespread industrial applications, while demonstrating effectiveness in automating the unpacking and cutting of transparent plastic bags for an 8-stack bulk-loader based on specific requirements and rigorous testing.

[AI-6] Local-Global Attention: An Adaptive Mechanism for Multi-Scale Feature Integration

链接: https://arxiv.org/abs/2411.09604
作者: Yifan Shao
关键词-EN: recent years, attention, Local-Global Attention, key feature information, attention mechanisms
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, attention mechanisms have significantly enhanced the performance of object detection by focusing on key feature information. However, prevalent methods still encounter difficulties in effectively balancing local and global features. This imbalance hampers their ability to capture both fine-grained details and broader contextual information, two critical elements for achieving accurate object detection. To address these challenges, we propose a novel attention mechanism, termed Local-Global Attention, which is designed to better integrate both local and global contextual features. Specifically, our approach combines multi-scale convolutions with positional encoding, enabling the model to focus on local details while concurrently considering the broader global context. Additionally, we introduce a learnable parameter, which allows the model to dynamically adjust the relative importance of local and global attention, depending on the specific requirements of the task, thereby optimizing feature representations across multiple scales. We have thoroughly evaluated the Local-Global Attention mechanism on several widely used object detection and classification datasets. Our experimental results demonstrate that this approach significantly enhances the detection of objects at various scales, with particularly strong performance on multi-class and small object detection tasks. In comparison to existing attention mechanisms, Local-Global Attention consistently outperforms them across several key metrics, all while maintaining computational efficiency.
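"用可学习参数动态加权局部与全局特征"这一点,可以用下面的一维玩具示例说明:局部特征取滑动窗口平均,全局特征取整体平均,可学习标量经 sigmoid 后决定二者的混合比例(具体结构为本文假设的简化,并非论文的卷积实现):

```python
import math

def local_avg(xs, k=1):
    """窗口半径 k 的局部平均(边界处截断窗口),模拟局部细节特征。"""
    out = []
    for i in range(len(xs)):
        win = xs[max(0, i - k): i + k + 1]
        out.append(sum(win) / len(win))
    return out

def local_global_mix(xs, alpha_param, k=1):
    """用可学习标量 alpha_param(经 sigmoid 映射到 (0,1))
    动态加权局部特征与全局上下文。"""
    g = sum(xs) / len(xs)                 # 全局上下文:整体平均
    loc = local_avg(xs, k)                # 局部细节
    a = 1 / (1 + math.exp(-alpha_param))  # 可学习混合权重
    return [a * l + (1 - a) * g for l in loc]

xs = [1.0, 2.0, 3.0, 10.0]
print(local_global_mix(xs, alpha_param=0.0))   # 局部与全局各占一半
print(local_global_mix(xs, alpha_param=10.0))  # 几乎只保留局部特征
```

训练中 alpha_param 随任务需求被梯度更新,这对应论文中"动态调整局部与全局注意力相对重要性"的机制。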

[AI-7] Accelerating Knowledge Graph and Ontology Engineering with Large Language Models

链接: https://arxiv.org/abs/2411.09601
作者: Cogan Shimizu,Pascal Hitzler
关键词-EN: Large Language Models, Language Models bear, including ontology modeling, key Knowledge Graph, Ontology Engineering tasks
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models bear the promise of significant acceleration of key Knowledge Graph and Ontology Engineering tasks, including ontology modeling, extension, modification, population, alignment, as well as entity disambiguation. We lay out LLM-based Knowledge Graph and Ontology Engineering as a new and coming area of research, and argue that modular approaches to ontologies will be of central importance.

[AI-8] Adopting RAG for LLM -Aided Future Vehicle Design

链接: https://arxiv.org/abs/2411.09590
作者: Vahid Zolfaghari,Nenad Petrovic,Fengjunjie Pan,Krzysztof Lebioda,Alois Knoll
关键词-EN: Large Language Models, Language Models, Large Language, Retrieval-Augmented Generation, enhance automated design
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: Conference paper accepted in IEEE FLLM 2024

点击查看摘要

Abstract:In this paper, we explore the integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to enhance automated design and software development in the automotive industry. We present two case studies: a standardization compliance chatbot and a design copilot, both utilizing RAG to provide accurate, context-aware responses. We evaluate four LLMs (GPT-4o, LLAMA3, Mistral, and Mixtral), comparing their answering accuracy and execution time. Our results demonstrate that while GPT-4 offers superior performance, LLAMA3 and Mistral also show promising capabilities for local deployment, addressing data privacy concerns in automotive applications. This study highlights the potential of RAG-augmented LLMs in improving design workflows and compliance in automotive engineering.
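RAG 的检索环节可以抽象为"按相关度选文档、拼进提示"。下面是一个极简草图:用词重叠度代替向量检索做排序(文档内容与提示模板均为本文虚构,真实系统会使用嵌入检索与 LLM 生成):

```python
# RAG 检索与提示拼装的极简示意(文档与模板均为虚构示例)。

DOCS = {
    "iso26262": "ISO 26262 defines functional safety requirements for road vehicles",
    "autosar": "AUTOSAR standardizes the software architecture interfaces of automotive ECUs",
}

def retrieve(query, docs, top_k=1):
    """按查询词与文档词的重叠数排序,返回最相关的文档名。"""
    q = set(query.lower().split())
    def score(name):
        return len(q & set(docs[name].lower().split()))
    return sorted(docs, key=score, reverse=True)[:top_k]

def build_prompt(query, docs):
    """把检索到的上下文与问题拼成送入 LLM 的提示。"""
    ctx = "\n".join(docs[name] for name in retrieve(query, docs))
    return f"Context:\n{ctx}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("what does iso 26262 define for functional safety", DOCS))
```

论文比较的四个模型,差异正体现在对这类"上下文 + 问题"提示的回答准确率与耗时上。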

[AI-9] Software Performance Engineering for Foundation Model-Powered Software (FMware)

链接: https://arxiv.org/abs/2411.09580
作者: Haoxiang Zhang,Shi Chang,Arthur Leung,Kishanthan Thangarajah,Boyuan Chen,Hanan Lutfiyya,Ahmed E. Hassan
关键词-EN: Large Language Models, Foundation Models, Language Models, Large Language, rise of Foundation
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rise of Foundation Models (FMs) like Large Language Models (LLMs) is revolutionizing software development. Despite the impressive prototypes, transforming FMware into production-ready products demands complex engineering across various domains. A critical but overlooked aspect is performance engineering, which aims at ensuring FMware meets performance goals such as throughput and latency to avoid user dissatisfaction and financial loss. Often, performance considerations are an afterthought, leading to costly optimization efforts post-deployment. FMware’s high computational resource demands highlight the need for efficient hardware use. Continuous performance engineering is essential to prevent degradation. This paper highlights the significance of Software Performance Engineering (SPE) in FMware, identifying four key challenges: cognitive architecture design, communication protocols, tuning and optimization, and deployment. These challenges are based on literature surveys and experiences from developing an in-house FMware system. We discuss problems, current practices, and innovative paths for the software engineering community.

[AI-10] Automating Reformulation of Essence Specifications via Graph Rewriting

链接: https://arxiv.org/abs/2411.09576
作者: Ian Miguel,András Z. Salamon,Christopher Stone
关键词-EN: Formulating an effective, parameterised problem class, effective constraint model, subsequently be solved, class is crucial
类目: Artificial Intelligence (cs.AI)
*备注: Presented at the PTHG 2024 workshop

点击查看摘要

Abstract:Formulating an effective constraint model of a parameterised problem class is crucial to the efficiency with which instances of the class can subsequently be solved. It is difficult to know beforehand which of a set of candidate models will perform best in practice. This paper presents a system that employs graph rewriting to reformulate an input model for improved performance automatically. By situating our work in the Essence abstract constraint specification language, we can use the structure in its high level variable types to trigger rewrites directly. We implement our system via rewrite rules expressed in the Graph Programs 2 language, applied to the abstract syntax tree of an input specification. We show how to automatically translate the solution of the reformulated problem into a solution of the original problem for verification and presentation. We demonstrate the efficacy of our system with a detailed case study.
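图/树改写的基本模式是:规则在语法树上匹配一个模式并替换,反复应用直到不动点。下面用一条虚构的"双重否定消去"规则演示这一机制(节点结构与规则均为本文假设,并非 Essence 或 Graph Programs 2 的真实语法):

```python
# 语法树改写示意:自底向上应用规则,直到不再有可匹配的模式。

def rewrite(node, rule):
    """先改写子树,再在当前节点应用规则;发生改写则继续直到不动点。"""
    if isinstance(node, tuple):
        node = tuple(rewrite(child, rule) for child in node)
    new = rule(node)
    return rewrite(new, rule) if new != node else node

def double_negation(node):
    # 虚构规则:not(not(x)) -> x
    if isinstance(node, tuple) and node[0] == "not":
        inner = node[1]
        if isinstance(inner, tuple) and inner[0] == "not":
            return inner[1]
    return node

expr = ("and", ("not", ("not", "b")), ("not", ("not", ("not", "c"))))
print(rewrite(expr, double_negation))  # ('and', 'b', ('not', 'c'))
```

论文系统的规则作用于 Essence 规范的抽象语法树,并由高层变量类型直接触发;原理与上面的模式匹配加替换一致。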

[AI-11] OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling

链接: https://arxiv.org/abs/2411.09543
作者: Xiaoling Yi,Ryan Antonio,Joren Dumoulin,Jiacong Sun,Josse Van Delm,Guilherme Paim,Marian Verhelst
关键词-EN: Deep neural networks, face significant challenges, resource-constrained extreme edge, extreme edge devices, edge devices due
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system level efficiency and low utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58x to 16.40x speedup on normalized throughput across a wide variety of GeMM workloads, while achieving 4.68 TOPS/W system efficiency.
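摘要中"输出缓冲"的含义可以用软件版的分块 GeMM 说明:每个输出 tile 先在本地缓冲中累加,算完后一次性写回,避免对结果矩阵的反复读写。下面是一个纯 Python 示意(tile 大小与数据均为示例,与 OpenGeMM 的硬件实现无直接对应):

```python
# 分块 GeMM 与输出缓冲的软件示意:C = A @ B,按 tile 计算。

def tiled_gemm(A, B, tile=2):
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            # 本地输出缓冲:该 tile 的部分和全部在 acc 中累加
            acc = [[0.0] * tile for _ in range(tile)]
            for p in range(k):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc[i - i0][j - j0] += A[i][p] * B[p][j]
            # 累加完成后一次性写回 C
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, m)):
                    C[i][j] = acc[i - i0][j - j0]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(tiled_gemm(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

硬件上,这对应把部分和保存在加速器本地的 scratchpad 中,只有完整的输出 tile 才占用访存带宽。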

[AI-12] Prompting the Unseen: Detecting Hidden Backdoors in Black-Box Models

链接: https://arxiv.org/abs/2411.09540
作者: Zi-Xuan Huang,Jia-Wei Chen,Zhi-Peng Zhang,Chia-Mu Yu
关键词-EN: adapts well-trained frozen, domain tasks, BProm, Visual prompting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Visual prompting (VP) is a new technique that adapts well-trained frozen models for source domain tasks to target domain tasks. This study examines VP’s benefits for black-box model-level backdoor detection. The visual prompt in VP maps class subspaces between source and target domains. We identify a misalignment, termed class subspace inconsistency, between clean and poisoned datasets. Based on this, we introduce BProm, a black-box model-level detection method to identify backdoors in suspicious models, if any. BProm leverages the low classification accuracy of prompted models when backdoors are present. Extensive experiments confirm BProm’s effectiveness.

[AI-13] Navigating the Risks: A Survey of Security Privacy and Ethics Threats in LLM -Based Agents

链接: https://arxiv.org/abs/2411.09523
作者: Yuyou Gan,Yong Yang,Zhe Ma,Ping He,Rui Zeng,Yiming Wang,Qingming Li,Chunyi Zhou,Songze Li,Ting Wang,Yunjun Gao,Yingcai Wu,Shouling Ji
关键词-EN: large language models, natural language processing, made groundbreaking advances, numerous natural language, language models
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the continuous development of large language models (LLMs), transformer-based models have made groundbreaking advances in numerous natural language processing (NLP) tasks, leading to the emergence of a series of agents that use LLMs as their control hub. While LLMs have achieved success in various tasks, they face numerous security and privacy threats, which become even more severe in the agent scenarios. To enhance the reliability of LLM-based applications, a range of research has emerged to assess and mitigate these risks from different perspectives. To help researchers gain a comprehensive understanding of various risks, this survey collects and analyzes the different threats faced by these agents. To address the challenges posed by previous taxonomies in handling cross-module and cross-stage threats, we propose a novel taxonomy framework based on the sources and impacts. Additionally, we identify six key features of LLM-based agents, based on which we summarize the current research progress and analyze their limitations. Subsequently, we select four representative agents as case studies to analyze the risks they may face in practical use. Finally, based on the aforementioned analyses, we propose future research directions from the perspectives of data, methodology, and policy, respectively.

[AI-14] oward a Cohesive AI and Simulation Software Ecosystem for Scientific Innovation

链接: https://arxiv.org/abs/2411.09507
作者: Michael A. Heroux,Sameer Shende,Lois Curfman McInnes,Todd Gamblin,James M. Willenbring
关键词-EN: unites artificial intelligence, advance scientific discovery, integrated software stack, artificial intelligence, modeling and simulation
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 5 pages

点击查看摘要

Abstract:In this paper, we discuss the need for an integrated software stack that unites artificial intelligence (AI) and modeling and simulation (ModSim) tools to advance scientific discovery. The authors advocate for a unified AI/ModSim software ecosystem that ensures compatibility across a wide range of software on diverse high-performance computing systems, promoting ease of deployment, version management, and binary distribution. Key challenges highlighted include balancing the distinct needs of AI and ModSim, especially in terms of software build practices, dependency management, and compatibility. The document underscores the importance of continuous integration, community-driven stewardship, and collaboration with the Department of Energy (DOE) to develop a portable and cohesive scientific software ecosystem. Recommendations focus on supporting standardized environments through initiatives like the Extreme-scale Scientific Software Stack (E4S) and Spack to foster interdisciplinary innovation and facilitate new scientific advancements.

[AI-15] ResidualDroppath: Enhancing Feature Reuse over Residual Connections

链接: https://arxiv.org/abs/2411.09475
作者: Sejik Park
关键词-EN: vanishing gradient problem, neural network architectures, Residual connections, feature reuse, deeper network training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Residual connections are one of the most important components in neural network architectures for mitigating the vanishing gradient problem and facilitating the training of much deeper networks. One possible explanation for how residual connections aid deeper network training is by promoting feature reuse. However, we identify and analyze the limitations of feature reuse with vanilla residual connections. To address these limitations, we propose modifications in training methods. Specifically, we provide an additional opportunity for the model to learn feature reuse with residual connections through two types of iterations during training. The first type of iteration involves using droppath, which enforces feature reuse by randomly dropping a subset of layers. The second type of iteration focuses on training the dropped parts of the model while freezing the undropped parts. As a result, the dropped parts learn in a way that encourages feature reuse, as the model relies on the undropped parts with feature reuse in mind. Overall, we demonstrated performance improvements in models with residual connections for image classification in certain cases.
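论文描述的两类训练迭代可以用一个标量玩具模型示意:第一类迭代按随机掩码丢弃部分残差块(droppath),第二类迭代只训练被丢弃的块、冻结其余部分。下面的前向传播与掩码逻辑是本文的简化草图,并非论文的实际网络:

```python
import random

# ResidualDroppath 训练思路示意:残差前向 + droppath 掩码(玩具标量版)。

def forward(x, layers, drop_mask):
    """依次通过残差块;被丢弃的块整体跳过,仅保留恒等路径。"""
    for f, dropped in zip(layers, drop_mask):
        if not dropped:
            x = x + f(x)  # 残差连接:x + F(x)
    return x

layers = [lambda x: 0.1 * x, lambda x: 0.2 * x]

full = forward(1.0, layers, [False, False])       # 全部保留
drop_first = forward(1.0, layers, [True, False])  # 丢弃第 1 个块

random.seed(1)
mask = [random.random() < 0.5 for _ in layers]    # 第一类迭代:随机 droppath
trainable = [i for i, d in enumerate(mask) if d]  # 第二类迭代:只训练被丢弃的块
print(full, drop_first, mask, trainable)
```

被丢弃的块在第二类迭代中学习时,模型的其余部分已按"特征复用"的方式工作,这正是论文希望强化的行为。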

[AI-16] Renal Cell Carcinoma subtyping: learning from multi-resolution localization

链接: https://arxiv.org/abs/2411.09471
作者: Mohamad Mohamad,Francesco Ponzio,Santa Di Cataldo,Damien Ambrosetti,Xavier Descombes
关键词-EN: Renal Cell Carcinoma, Cell Carcinoma, Cell Carcinoma high, Cell Carcinoma diagnosis, Renal Cell
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Renal Cell Carcinoma is typically asymptomatic at the early stages for many patients. This leads to a late diagnosis of the tumor, when the likelihood of curability is lower, and makes the mortality rate of Renal Cell Carcinoma high relative to its incidence rate. To increase the survival chance, a fast and correct categorization of the tumor subtype is paramount. Nowadays, computerized methods based on artificial intelligence represent an interesting opportunity to improve the productivity and the objectivity of microscopy-based Renal Cell Carcinoma diagnosis. Nonetheless, much of their exploitation is hampered by the paucity of annotated datasets, essential for proficient training of supervised machine learning technologies. This study sets out to investigate a novel self-supervised training strategy for machine learning diagnostic tools, based on the multi-resolution nature of the histological samples. We aim at reducing the need for annotated datasets without significantly reducing the accuracy of the tool. We demonstrate the classification capability of our tool on a whole slide imaging dataset for Renal Cancer subtyping, and we compare our solution with several state-of-the-art classification counterparts.

[AI-17] DiffRoad: Realistic and Diverse Road Scenario Generation for Autonomous Vehicle Testing

链接: https://arxiv.org/abs/2411.09451
作者: Junjie Zhou,Lin Wang,Qiang Meng,Xiaofan Wang
关键词-EN: Generating realistic, road, Generating, scenarios, autonomous vehicle testing
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 14 pages, 9 figures

点击查看摘要

Abstract:Generating realistic and diverse road scenarios is essential for autonomous vehicle testing and validation. Nevertheless, owing to the complexity and variability of real-world road environments, creating authentic and varied scenarios for intelligent driving testing is challenging. In this paper, we propose DiffRoad, a novel diffusion model designed to produce controllable and high-fidelity 3D road scenarios. DiffRoad leverages the generative capabilities of diffusion models to synthesize road layouts from white noise through an inverse denoising process, preserving real-world spatial features. To enhance the quality of generated scenarios, we design the Road-UNet architecture, optimizing the balance between backbone and skip connections for high-realism scenario generation. Furthermore, we introduce a road scenario evaluation module that screens adequate and reasonable scenarios for intelligent driving testing using two critical metrics: road continuity and road reasonableness. Experimental results on multiple real-world datasets demonstrate DiffRoad’s ability to generate realistic and smooth road structures while maintaining the original distribution. Additionally, the generated scenarios can be fully automated into the OpenDRIVE format, facilitating generalized autonomous vehicle simulation testing. DiffRoad provides a rich and diverse scenario library for large-scale autonomous vehicle testing and offers valuable insights for future infrastructure designs that are better suited for autonomous vehicles.
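The diffusion idea the paper builds on — synthesizing road layouts from white noise by inverting a gradual noising process — can be sketched for the forward (noising) direction only. The learned reverse denoiser is omitted, and `ZeroNoise` is a hypothetical stand-in that makes the signal decay visible in isolation:

```python
import math

class ZeroNoise:
    # Deterministic stand-in for the Gaussian noise source, so the
    # signal-decay term of the forward process can be seen on its own.
    def gauss(self, mu, sigma):
        return 0.0

def forward_diffuse(x0, betas, rng):
    # Variance-preserving forward (noising) process:
    #   x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps_t
    # Generation runs this in reverse with a learned denoiser (not shown).
    x = x0
    for b in betas:
        x = math.sqrt(1.0 - b) * x + math.sqrt(b) * rng.gauss(0.0, 1.0)
    return x

# With noise zeroed out, only the sqrt(prod(1 - beta)) decay remains:
# 2.0 * sqrt(0.5) * sqrt(0.5) is (numerically) 1.0.
print(forward_diffuse(2.0, [0.5, 0.5], ZeroNoise()))
```

After enough steps the signal term vanishes and only white noise remains, which is the starting point DiffRoad's inverse denoising process works back from.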

[AI-18] An Adaptive Open-Source Dataset Generation Framework for Machine Learning Tasks in Logic Synthesis

链接: https://arxiv.org/abs/2411.09422
作者: Liwei Ni,Rui Wang,Miao Liu,Xingyu Meng,Xiaoze Lin,Junfeng Liu,Guojie Luo,Zhufei Chu,Weikang Qian,Xiaoyan Yang,Biwei Xie,Xingquan Li,Huawei Li
关键词-EN: enhance machine learning, machine learning applications, machine learning, logic synthesis process, generation framework designed
类目: Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:This paper introduces an adaptive logic synthesis dataset generation framework designed to enhance machine learning applications within the logic synthesis process. Unlike previous dataset generation flows that were tailored for specific tasks or lacked integrated machine learning capabilities, the proposed framework supports a comprehensive range of machine learning tasks by encapsulating the three fundamental steps of logic synthesis: Boolean representation, logic optimization, and technology mapping. It preserves the original information in the intermediate files that can be stored in both Verilog and GraphML format. Verilog files enable semi-customizability, allowing researchers to add steps and incrementally refine the generated dataset. The framework also includes an adaptive circuit engine to facilitate the loading of GraphML files for final dataset packaging and sub-dataset extraction. The generated OpenLS-D dataset comprises 46 combinational designs from established benchmarks, totaling over 966,000 Boolean circuits, with each design containing 21,000 circuits generated from 1000 synthesis recipes, including 7000 Boolean networks, 7000 ASIC netlists, and 7000 FPGA netlists. Furthermore, OpenLS-D supports integrating newly desired data features, making it more versatile for new challenges. The utility of OpenLS-D is demonstrated through four distinct downstream tasks: circuit classification, circuit ranking, quality of results (QoR) prediction, and probability prediction. Each task highlights different internal steps of logic synthesis, with the datasets extracted and relabeled from the OpenLS-D dataset using the circuit engine. The experimental results confirm the dataset’s diversity and extensive applicability. The source code and datasets are available at this https URL.

[AI-19] SAG-ViT: A Scale-Aware High-Fidelity Patching Approach with Graph Attention for Vision Transformers

链接: https://arxiv.org/abs/2411.09420
作者: Shravan Venkatraman,Jaskaran Singh Walia,Joe Dhanith P R
关键词-EN: computer vision task, Attention Vision Transformer, Graph Attention Vision, specific label, Graph Attention Network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Image classification is a computer vision task where a model analyzes an image to categorize it into a specific label. Vision Transformers (ViT) improve this task by leveraging self-attention to capture complex patterns and long-range relationships between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which is inherent in CNNs through their hierarchical structure. In this paper, we introduce the Scale-Aware Graph Attention Vision Transformer (SAG-ViT), a novel framework that addresses this challenge by integrating multi-scale features. Using EfficientNet as a backbone, the model extracts multi-scale feature maps, which are divided into patches to preserve semantic information. These patches are organized into a graph based on spatial and feature similarities, with a Graph Attention Network (GAT) refining the node embeddings. Finally, a Transformer encoder captures long-range dependencies and complex interactions. SAG-ViT is evaluated on benchmark datasets, demonstrating its effectiveness in enhancing image classification performance.
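The patch-to-graph step can be sketched as below. The similarity and distance thresholds are hypothetical, and a single simplified attention pass stands in for the full GAT; the real SAG-ViT operates on EfficientNet feature-map patches:

```python
import numpy as np

def build_patch_graph(embeddings, coords, sim_thresh=0.8, dist_thresh=1.5):
    # Connect patches that are both spatially close and feature-similar,
    # mirroring the "spatial and feature similarities" criterion.
    n = len(embeddings)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            close = np.linalg.norm(coords[i] - coords[j]) <= dist_thresh
            similar = normed[i] @ normed[j] >= sim_thresh
            adj[i, j] = close and similar
    return adj

def attention_aggregate(embeddings, adj):
    # One simplified attention pass: softmax over neighbor dot-product
    # scores, standing in for a multi-head GAT layer.
    out = embeddings.copy()
    for i in range(len(embeddings)):
        nbrs = np.where(adj[i])[0]
        if nbrs.size == 0:
            continue
        scores = embeddings[nbrs] @ embeddings[i]
        w = np.exp(scores - scores.max())
        w = w / w.sum()
        out[i] = w @ embeddings[nbrs]
    return out

# Two nearby, similar patches get connected; a distant one stays isolated.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pos = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
adj = build_patch_graph(emb, pos)
```

Isolated nodes keep their own embedding, while connected nodes mix in their neighbors' features, which is the refinement role the GAT plays before the Transformer encoder.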

[AI-20] Script-centric behavior understanding for assisted autism spectrum disorder diagnosis ICASSP2025

链接: https://arxiv.org/abs/2411.09413
作者: Wenxing Liu,Yueran Pan,Ming Li
关键词-EN: Autism Spectrum Disorders, Spectrum Disorders, Autism Spectrum, Observing and analyzing, diagnosis of Autism
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures, submitted to ICASSP 2025

点击查看摘要

Abstract:Observing and analyzing children’s social behaviors is crucial for the early diagnosis of Autism Spectrum Disorders (ASD). This work focuses on automatically detecting ASD using computer vision techniques and large language models (LLMs). Existing methods typically rely on supervised learning. However, the scarcity of ASD diagnostic datasets and the lack of interpretability in diagnostic results significantly limit their clinical application. To address these challenges, we introduce a novel unsupervised approach based on script-centric behavior understanding. Our pipeline converts video content into scripts that describe the behavior of characters, leveraging the generalizability of large language models to detect ASD in a zero-shot or few-shot manner. Specifically, we propose a scripts transcription module for multimodal behavior data textualization and a domain prompts module to bridge LLMs. Our method achieves an accuracy of 92.00% in diagnosing ASD in children with an average age of 24 months, surpassing the performance of supervised learning methods by 3.58% absolutely. Extensive experiments confirm the effectiveness of our approach and suggest its potential for advancing ASD research through LLMs.

[AI-21] Imagined Speech and Visual Imagery as Intuitive Paradigms for Brain-Computer Interfaces

链接: https://arxiv.org/abs/2411.09400
作者: Seo-Hyun Lee,Ji-Ha Park,Deok-Seon Kim
关键词-EN: Recent advancements, imagined speech, visual imagery, brain-computer interface, technology have emphasized
类目: Artificial Intelligence (cs.AI)
*备注: 4 pages

点击查看摘要

Abstract:Recent advancements in brain-computer interface (BCI) technology have emphasized the promise of imagined speech and visual imagery as effective paradigms for intuitive communication. This study investigates the classification performance and brain connectivity patterns associated with these paradigms, focusing on decoding accuracy across selected word classes. Sixteen participants engaged in tasks involving thirteen imagined speech and visual imagery classes, revealing above-chance classification accuracy for both paradigms. Variability in classification accuracy across individual classes highlights the influence of sensory and motor associations in imagined speech and vivid visual associations in visual imagery. Connectivity analysis further demonstrated increased functional connectivity in language-related and sensory regions for imagined speech, whereas visual imagery activated spatial and visual processing networks. These findings suggest the potential of imagined speech and visual imagery as an intuitive and scalable paradigm for BCI communication when selecting optimal word classes. Further exploration of the decoding outcomes for these two paradigms could provide insights for practical BCI communication.

[AI-22] LTLf+ and PPLTL+: Extending LTLf and PPLTL to Infinite Traces

链接: https://arxiv.org/abs/2411.09366
作者: Benjamin Aminof,Giuseppe De Giacomo,Sasha Rubin,Moshe Y. Vardi
关键词-EN: linear-time temporal logics, PPLTL, Pnueli LTL safety-progress, Manna and Pnueli, express properties
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
*备注:

点击查看摘要

Abstract:We introduce LTLf+ and PPLTL+, two logics to express properties of infinite traces, that are based on the linear-time temporal logics LTLf and PPLTL on finite traces. LTLf+/PPLTL+ use levels of Manna and Pnueli’s LTL safety-progress hierarchy, and thus have the same expressive power as LTL. However, they also retain a crucial characteristic of the reactive synthesis problem for the base logics: the game arena for strategy extraction can be derived from deterministic finite automata (DFA). Consequently, these logics circumvent the notorious difficulties associated with determinizing infinite trace automata, typical of LTL reactive synthesis. We present DFA-based synthesis techniques for LTLf+/PPLTL+, and show that synthesis is 2EXPTIME-complete for LTLf+ (matching LTLf) and EXPTIME-complete for PPLTL+ (matching PPLTL). Notably, while PPLTL+ retains the full expressive power of LTL, reactive synthesis is EXPTIME-complete instead of 2EXPTIME-complete. The techniques are also adapted to optimally solve satisfiability, validity, and model-checking, to get EXPSPACE-complete for LTLf+ (extending a recent result for the guarantee level using LTLf), and PSPACE-complete for PPLTL+.

[AI-23] Your Fixed Watermark is Fragile: Towards Semantic-Aware Watermark for EaaS Copyright Protection

链接: https://arxiv.org/abs/2411.09359
作者: Zekun Fei,Biao Yi,Jianing Geng,Ruiqi He,Lihai Nie,Zheli Liu
关键词-EN: including API misuse, successful business pattern, faces significant challenges, significant challenges related, including API
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Embedding-as-a-Service (EaaS) has emerged as a successful business pattern but faces significant challenges related to various forms of copyright infringement, including API misuse and different attacks. Various studies have proposed backdoor-based watermarking schemes to protect the copyright of EaaS services. In this paper, we reveal that previous watermarking schemes possess semantic-independent characteristics and propose the Semantic Perturbation Attack (SPA). Our theoretical and experimental analyses demonstrate that this semantic-independent nature makes current watermarking schemes vulnerable to adaptive attacks that exploit semantic perturbations to bypass watermark verification. To address this vulnerability, we propose the Semantic Aware Watermarking (SAW) scheme, a robust defense mechanism designed to resist SPA, by injecting a watermark that adapts to the text semantics. Extensive experimental results across multiple datasets demonstrate that the True Positive Rate (TPR) for detecting watermarked samples under SPA can reach more than 95%, rendering previous watermarks ineffective. Meanwhile, our watermarking scheme can resist such attacks while ensuring the watermark verification capability. Our code is available at this https URL.

[AI-24] Multi-scale Generative Modeling for Fast Sampling

链接: https://arxiv.org/abs/2411.09356
作者: Xiongye Xiao,Shixuan Li,Luzhe Huang,Gengshuo Liu,Trung-Kien Nguyen,Yi Huang,Di Chang,Mykel J. Kochenderfer,Paul Bogdan
关键词-EN: wavelet domain offers, ill-conditioned scores caused, wavelet domain, power-law decay, recent advances
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While working within the spatial domain can pose problems associated with ill-conditioned scores caused by power-law decay, recent advances in diffusion-based generative models have shown that transitioning to the wavelet domain offers a promising alternative. However, within the wavelet domain, we encounter unique challenges, especially the sparse representation of high-frequency coefficients, which deviates significantly from the Gaussian assumptions in the diffusion process. To this end, we propose a multi-scale generative modeling scheme in the wavelet domain that employs distinct strategies for handling low- and high-frequency bands. In the wavelet domain, we apply score-based generative modeling with well-conditioned scores for low-frequency bands, while utilizing multi-scale generative adversarial learning for high-frequency bands. As supported by theoretical analysis and experimental results, our model significantly improves performance and reduces the number of trainable parameters, sampling steps, and time.
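The low/high-frequency band split at the heart of the method can be illustrated with a one-level Haar transform. This is a minimal sketch of the wavelet decomposition only; the paper's actual wavelet choice and the score-based/adversarial models applied to each band are not shown:

```python
import numpy as np

def haar_step(signal):
    # One level of the Haar transform: the low band holds pairwise
    # averages (smooth content), the high band pairwise differences
    # (detail), scaled by 1/sqrt(2) so the transform is orthonormal.
    x = np.asarray(signal, dtype=float).reshape(-1, 2)
    low = (x[:, 0] + x[:, 1]) / np.sqrt(2)
    high = (x[:, 0] - x[:, 1]) / np.sqrt(2)
    return low, high

def haar_inverse(low, high):
    # Exact reconstruction from the two bands.
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x

low, high = haar_step([1.0, 2.0, 3.0, 4.0])
```

A constant signal puts all its energy in the low band and leaves the high band exactly zero, which is the sparsity of high-frequency coefficients the abstract points to as the reason diffusion's Gaussian assumptions break there.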

[AI-25] EEG-Based Speech Decoding: A Novel Approach Using Multi-Kernel Ensemble Diffusion Models

链接: https://arxiv.org/abs/2411.09302
作者: Soowon Kim,Ha-Na Jo,Eunyeong Ko
关键词-EN: leveraging denoising diffusion, denoising diffusion probabilistic, varying convolutional kernel, convolutional kernel sizes, ensemble learning framework
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:In this study, we propose an ensemble learning framework for electroencephalogram-based overt speech classification, leveraging denoising diffusion probabilistic models with varying convolutional kernel sizes. The ensemble comprises three models with kernel sizes of 51, 101, and 201, effectively capturing multi-scale temporal features inherent in signals. This approach improves the robustness and accuracy of speech decoding by accommodating the rich temporal complexity of neural signals. The ensemble models work in conjunction with conditional autoencoders that refine the reconstructed signals and maximize the useful information for downstream classification tasks. The results indicate that the proposed ensemble-based approach significantly outperforms individual models and existing state-of-the-art techniques. These findings demonstrate the potential of ensemble methods in advancing brain signal decoding, offering new possibilities for non-verbal communication applications, particularly in brain-computer interface systems aimed at aiding individuals with speech impairments.
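The multi-scale idea — kernels of size 51, 101, and 201 each responding to a different temporal scale — can be illustrated with simple moving averages standing in for the diffusion models' convolutions. This is only a sketch of the multi-kernel principle, not the ensemble architecture itself:

```python
import numpy as np

def smooth(signal, kernel_size):
    # Moving average standing in for a convolution with the given kernel.
    kernel = np.ones(kernel_size) / kernel_size
    return np.convolve(signal, kernel, mode="same")

def multi_kernel_features(signal, kernel_sizes=(51, 101, 201)):
    # One feature track per kernel size; the paper instead trains one
    # diffusion model per size and ensembles their outputs.
    return np.stack([smooth(signal, k) for k in kernel_sizes])

t = np.arange(300, dtype=float)
feats = multi_kernel_features(np.sin(t / 10.0))
print(feats.shape)  # (3, 300)
```

Short kernels preserve fast fluctuations while long kernels keep only slow trends, which is the multi-scale temporal coverage the three-model ensemble is built to capture.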

[AI-26] Learning Hand State Estimation for a Light Exoskeleton

链接: https://arxiv.org/abs/2411.09294
作者: Gabriele Abbate,Alessandro Giusti,Luca Randazzo,Antonio Paolillo
关键词-EN: machine learning-based estimator, propose a machine, machine learning-based, learning-based estimator, hand state
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose a machine learning-based estimator of the hand state for rehabilitation purposes, using light exoskeletons. These devices are easy to use and useful for delivering domestic and frequent therapies. We build a supervised approach using information from the muscular activity of the forearm and the motion of the exoskeleton to reconstruct the hand’s opening degree and compliance level. Such information can be used to evaluate the therapy progress and develop adaptive control behaviors. Our approach is validated with a real light exoskeleton. The experiments demonstrate good predictive performance of our approach when trained on data coming from a single user and tested on the same user, even across different sessions. This generalization capability makes our system promising for practical use in real rehabilitation.

[AI-27] Harnessing multiple LLMs for Information Retrieval: A case study on Deep Learning methodologies in Biodiversity publications

链接: https://arxiv.org/abs/2411.09269
作者: Vamsi Krishna Kommineni,Birgitta König-Ries,Sheeba Samuel
关键词-EN: Deep Learning, complex research questions, techniques are increasingly, Large Language Models, address complex research
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep Learning (DL) techniques are increasingly applied in scientific studies across various domains to address complex research questions. However, the methodological details of these DL models are often hidden in the unstructured text. As a result, critical information about how these models are designed, trained, and evaluated is challenging to access and comprehend. To address this issue, in this work, we use five different open-source Large Language Models (LLMs): Llama-3 70B, Llama-3.1 70B, Mixtral-8x22B-Instruct-v0.1, Mixtral 8x7B, and Gemma 2 9B in combination with Retrieval-Augmented Generation (RAG) approach to extract and process DL methodological details from scientific publications automatically. We built a voting classifier from the outputs of five LLMs to accurately report DL methodological information. We tested our approach using biodiversity publications, building upon our previous research. To validate our pipeline, we employed two datasets of DL-related biodiversity publications: a curated set of 100 publications from our prior work and a set of 364 publications from the Ecological Informatics journal. Our results demonstrate that the multi-LLM, RAG-assisted pipeline enhances the retrieval of DL methodological information, achieving an accuracy of 69.5% (417 out of 600 comparisons) based solely on textual content from publications. This performance was assessed against human annotators who had access to code, figures, tables, and other supplementary information. Although demonstrated in biodiversity, our methodology is not limited to this field; it can be applied across other scientific domains where detailed methodological reporting is essential for advancing knowledge and ensuring reproducibility. This study presents a scalable and reliable approach for automating information extraction, facilitating better reproducibility and knowledge transfer across studies.
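The voting classifier built from the five LLMs' outputs can be sketched as a simple majority vote. The per-model answers below are hypothetical examples, not values from the paper:

```python
from collections import Counter

def majority_vote(answers):
    # Keep the answer most models agree on; ties break by first seen,
    # since Counter preserves insertion order.
    return Counter(answers).most_common(1)[0][0]

# Hypothetical per-LLM extractions of a "model architecture" field:
votes = ["ResNet-50", "ResNet-50", "VGG-16", "ResNet-50", "VGG-16"]
print(majority_vote(votes))  # ResNet-50
```

With five voters, a single model's hallucinated extraction is outvoted as long as most models agree, which is the robustness the multi-LLM pipeline relies on.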

[AI-28] How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT AI Models and Human Perception

链接: https://arxiv.org/abs/2411.09266
作者: Sahibzada Adil Shahzad,Ammarah Hashmi,Yan-Tsung Peng,Yu Tsao,Hsin-Min Wang
关键词-EN: unimodal deep learningbased, deep learningbased forgery, involving audiovisual manipulations, Multimodal deepfakes involving, deepfakes involving audiovisual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Multimodal deepfakes involving audiovisual manipulations are a growing threat because they are difficult to detect with the naked eye or using unimodal deep learning-based forgery detection methods. Audiovisual forensic models, while more capable than unimodal models, require large training datasets and are computationally expensive for training and inference. Furthermore, these models lack interpretability and often do not generalize well to unseen manipulations. In this study, we examine the detection capabilities of a large language model (LLM) (i.e., ChatGPT) to identify and account for any possible visual and auditory artifacts and manipulations in audiovisual deepfake content. Extensive experiments are conducted on videos from a benchmark multimodal deepfake dataset to evaluate the detection performance of ChatGPT and compare it with the detection capabilities of state-of-the-art multimodal forensic models and humans. Experimental results demonstrate the importance of domain knowledge and prompt engineering for video forgery detection tasks using LLMs. Unlike approaches based on end-to-end learning, ChatGPT can account for spatial and spatiotemporal artifacts and inconsistencies that may exist within or across modalities. Additionally, we discuss the limitations of ChatGPT for multimedia forensic tasks.

[AI-29] Automating Autograding: Large Language Models as Test Suite Generators for Introductory Programming

链接: https://arxiv.org/abs/2411.09261
作者: Umar Alkafaween,Ibrahim Albluwi,Paul Denny
关键词-EN: assignments provide instant, manual grading time, test suites, graded programming assignments, programming assignments provide
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Submitted to Journal of Computer Assisted Learning

点击查看摘要

Abstract:Automatically graded programming assignments provide instant feedback to students and significantly reduce manual grading time for instructors. However, creating comprehensive suites of test cases for programming problems within automatic graders can be time-consuming and complex. The effort needed to define test suites may deter some instructors from creating additional problems or lead to inadequate test coverage, potentially resulting in misleading feedback on student solutions. Such limitations may reduce student access to the well-documented benefits of timely feedback when learning programming. In this work, we evaluate the effectiveness of using Large Language Models (LLMs), as part of a larger workflow, to automatically generate test suites for CS1-level programming problems. Each problem’s statement and reference solution are provided to GPT-4 to produce a test suite that can be used by an autograder. We evaluate our proposed approach using a sample of 26 problems, and more than 25,000 attempted solutions to those problems, submitted by students in an introductory programming course. We compare the performance of the LLM-generated test suites against the instructor-created test suites for each problem. Our findings reveal that LLM-generated test suites can correctly identify most valid solutions, and for most problems are at least as comprehensive as the instructor test suites. Additionally, the LLM-generated test suites exposed ambiguities in some problem statements, underscoring their potential to improve both autograding and instructional design.
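The workflow's core step — running a test suite against a student solution — might look like this minimal autograder sketch. The CS1 problem, suite, and student function are hypothetical examples, not taken from the paper:

```python
def run_test_suite(solution_fn, test_suite):
    # Apply each (args, expected) case to the student's function and
    # report pass count plus failures, as a minimal autograder would.
    passed = 0
    failures = []
    for args, expected in test_suite:
        try:
            result = solution_fn(*args)
        except Exception as exc:
            failures.append((args, repr(exc)))
            continue
        if result == expected:
            passed += 1
        else:
            failures.append((args, result))
    return passed, failures

# Hypothetical CS1 problem: sum of the even numbers in a list.
suite = [(([1, 2, 3, 4],), 6), (([],), 0), (([5],), 0)]
student = lambda xs: sum(x for x in xs if x % 2 == 0)
print(run_test_suite(student, suite))  # (3, [])
```

In the paper's setting the `suite` would come from GPT-4, given the problem statement and reference solution, and the harness would compare its verdicts against the instructor-written suite.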

[AI-30] Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting

链接: https://arxiv.org/abs/2411.09251
作者: Weilin Ruan,Wenzhuo Wang,Siru Zhong,Wei Chen,Li Liu,Yuxuan Liang
关键词-EN: Predicting spatio-temporal traffic, traffic flow presents, flow presents significant, presents significant challenges, significant challenges due
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Predicting spatio-temporal traffic flow presents significant challenges due to complex interactions between spatial and temporal factors. Existing approaches often address these dimensions in isolation, neglecting their critical interdependencies. In this paper, we introduce the Spatio-Temporal Unitized Model (STUM), a unified framework designed to capture both spatial and temporal dependencies while addressing spatio-temporal heterogeneity through techniques such as distribution alignment and feature fusion. It also ensures both predictive accuracy and computational efficiency. Central to STUM is the Adaptive Spatio-temporal Unitized Cell (ASTUC), which utilizes low-rank matrices to seamlessly store, update, and interact with space, time, as well as their correlations. Our framework is also modular, allowing it to integrate with various spatio-temporal graph neural networks through components such as backbone models, feature extractors, residual fusion blocks, and predictive modules to collectively enhance forecasting outcomes. Experimental results across multiple real-world datasets demonstrate that STUM consistently improves prediction performance with minimal computational cost. These findings are further supported by hyperparameter optimization, pre-training analysis, and result visualization. We provide our source code for reproducibility at this https URL.
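ASTUC's use of low-rank matrices to store space-time correlations compactly can be illustrated as a factorization `U @ V.T`. The dimensions and rank below are assumptions for illustration, not the module's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_correlation(n_space, n_time, rank):
    # Store an n_space x n_time space-time interaction map as U @ V.T:
    # (n_space + n_time) * rank parameters instead of n_space * n_time.
    U = rng.standard_normal((n_space, rank))
    V = rng.standard_normal((n_time, rank))
    return U, V

U, V = low_rank_correlation(8, 12, rank=2)
C = U @ V.T
print(C.shape)  # (8, 12)
# The full map would need 8 * 12 = 96 entries; the factors need
# (8 + 12) * 2 = 40, and updating U or V updates the whole map.
```

Keeping the factors rather than the full matrix is what lets the cell "seamlessly store, update, and interact with" space, time, and their correlations at low cost.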

[AI-31] Towards Unified Neural Decoding of Perceived Spoken and Imagined Speech from EEG Signals

链接: https://arxiv.org/abs/2411.09243
作者: Jung-Sun Lee,Ha-Na Jo,Seo-Hyun Lee
关键词-EN: understanding human intentions, Brain signals accompany, mental imagery, making them crucial, accompany various information
类目: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Brain signals carry various information relevant to human actions and mental imagery, making them crucial to interpreting and understanding human intentions. Brain-computer interface technology leverages this brain activity to generate external commands for controlling the environment, offering critical advantages to individuals with paralysis or locked-in syndrome. Within the brain-computer interface domain, brain-to-speech research has gained attention, focusing on the direct synthesis of audible speech from brain signals. Most current studies decode speech from brain activity using invasive techniques and emphasize spoken speech data. However, humans express various speech states, and distinguishing these states through non-invasive approaches remains a significant yet challenging task. This research investigated the effectiveness of deep learning models for non-invasive neural signal decoding, with an emphasis on distinguishing between different speech paradigms, including perceived, overt, whispered, and imagined speech, across multiple frequency bands. The model utilizing the spatial convolutional neural network module demonstrated superior performance compared to other models, especially in the gamma band. Additionally, imagined speech in the theta frequency band, where deep learning also showed strong effects, exhibited statistically significant differences compared to the other speech paradigms.

[AI-32] Programming with AI: Evaluating ChatGPT Gemini AlphaCode and GitHub Copilot for Programmers

链接: https://arxiv.org/abs/2411.09224
作者: Md Kamrul Siam,Huanying Gu,Jerry Q. Cheng
关键词-EN: powered large language, large language models, artificial intelligence, everyday lives, lives now heavily
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 8 pages

点击查看摘要

Abstract:Our everyday lives now heavily rely on artificial intelligence (AI) powered large language models (LLMs). Like regular users, programmers are also benefiting from the newest large language models. In response to the critical role that AI models play in modern software development, this study presents a thorough evaluation of leading programming assistants, including ChatGPT, Gemini (Bard AI), AlphaCode, and GitHub Copilot. The evaluation is based on tasks such as natural language processing and code generation accuracy in programming languages like Java, Python, and C++. Based on the results, the study highlights their strengths and weaknesses and the need for further refinement to improve the reliability and accuracy of today's popular models. Although these AI assistants demonstrate substantial progress in language understanding and code generation, they also raise questions of ethics and responsible usage that warrant discussion. Over time, developing more refined AI technology will be essential for achieving advanced solutions in various fields, especially as the intricacies of these models and their implications become better understood. This study offers a comparison of different LLMs and provides essential feedback on the rapidly changing area of AI models. It also emphasizes the need for ethical development practices to realize AI models' full potential.

[AI-33] Dynamic Neural Communication: Convergence of Computer Vision and Brain-Computer Interface

链接: https://arxiv.org/abs/2411.09211
作者: Ji-Ha Park,Seo-Hyun Lee,Soowon Kim,Seong-Whan Lee
关键词-EN: Interpreting human neural, innovative communication tool, Interpreting human, neural signals, human neural signals
类目: Artificial Intelligence (cs.AI)
*备注: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer Interface

点击查看摘要

Abstract:Interpreting human neural signals to decode static speech intentions such as text or images and dynamic speech intentions such as audio or video is showing great potential as an innovative communication tool. Human communication accompanies various features, such as articulatory movements, facial expressions, and internal speech, all of which are reflected in neural signals. However, most studies only generate short or fragmented outputs, while providing informative communication by leveraging various features from neural signals remains challenging. In this study, we introduce a dynamic neural communication method that leverages current computer vision and brain-computer interface technologies. Our approach captures the user’s intentions from neural signals and decodes visemes in short time steps to produce dynamic visual outputs. The results demonstrate the potential to rapidly capture and reconstruct lip movements during natural speech attempts from human neural signals, enabling dynamic neural communication through the convergence of computer vision and brain–computer interface.

[AI-34] Improvement and Implementation of a Speech Emotion Recognition Model Based on Dual-Layer LSTM

链接: https://arxiv.org/abs/2411.09189
作者: Xiaoran Yang,Shuhan Yu,Wenxi Xu
关键词-EN: additional LSTM layer, existing speech emotion, speech emotion recognition, paper builds, adding an additional
类目: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper builds upon an existing speech emotion recognition model by adding an additional LSTM layer to improve the accuracy and processing efficiency of emotion recognition from audio data. By capturing the long-term dependencies within audio sequences through a dual-layer LSTM network, the model can recognize and classify complex emotional patterns more accurately. Experiments conducted on the RAVDESS dataset validated this approach, showing that the modified dual-layer LSTM model improves accuracy by 2% compared to the single-layer LSTM while significantly reducing recognition latency, thereby enhancing real-time performance. These results indicate that the dual-layer LSTM architecture is highly suitable for handling emotional features with long-term dependencies, providing a viable optimization for speech emotion recognition systems. This research provides a reference for practical applications in fields like intelligent customer service, sentiment analysis, and human-computer interaction.

[AI-35] Dynamic technology impact analysis: A multi-task learning approach to patent citation prediction

链接: https://arxiv.org/abs/2411.09184
作者: Youngjin Seol,Jaewoong Choi,Seunghyun Lee,Janghyeok Yoon
关键词-EN: technology impact, Machine learning, technology, tools for analyzing, patent citation information
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Machine learning (ML) models are valuable tools for analyzing the impact of technology using patent citation information. However, existing ML-based methods often struggle to account for the dynamic nature of the technology impact over time and the interdependencies of these impacts across different periods. This study proposes a multi-task learning (MTL) approach to enhance the prediction of technology impact across various time frames by leveraging knowledge sharing and simultaneously monitoring the evolution of technology impact. First, we quantify the technology impacts and identify patterns through citation analysis over distinct time periods. Next, we develop MTL models to predict citation counts using multiple patent indicators over time. Finally, we examine the changes in key input indicators and their patterns over different periods using the SHapley Additive exPlanation method. We also offer guidelines for validating and interpreting the results by employing statistical methods and natural language processing techniques. A case study on battery technologies demonstrates that our approach not only deepens the understanding of technology impact, but also improves prediction accuracy, yielding valuable insights for both academia and industry.

[AI-36] DeBaTeR: Denoising Bipartite Temporal Graph for Recommendation

链接: https://arxiv.org/abs/2411.09181
作者: Xinyu He,Jose Sepulveda,Mostafa Rahmani,Alyssa Woo,Fei Wang,Hanghang Tong
关键词-EN: acquiring large-scale explicit, explicit user feedback, large-scale explicit user, user-item interactions, noisy interactions
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to the difficulty of acquiring large-scale explicit user feedback, implicit feedback (e.g., clicks or other interactions) is widely applied as an alternative source of data, where user-item interactions can be modeled as a bipartite graph. Due to the noisy and biased nature of implicit real-world user-item interactions, identifying and rectifying noisy interactions are vital to enhance model performance and robustness. Previous works on purifying user-item interactions in collaborative filtering mainly focus on mining the correlation between user/item embeddings and noisy interactions, neglecting the benefit of temporal patterns in determining noisy interactions. Time information, while enhancing the model utility, also bears its natural advantage in helping to determine noisy edges, e.g., if someone usually watches horror movies at night and talk shows in the morning, a record of watching a horror movie in the morning is more likely to be a noisy interaction. Armed with this observation, we introduce a simple yet effective mechanism for generating time-aware user/item embeddings and propose two strategies for denoising bipartite temporal graph in recommender systems (DeBaTeR): the first is through reweighting the adjacency matrix (DeBaTeR-A), where a reliability score is defined to reweight the edges through both soft assignment and hard assignment; the second is through reweighting the loss function (DeBaTeR-L), where weights are generated to reweight user-item samples in the losses. Extensive experiments have been conducted to demonstrate the efficacy of our methods and illustrate how time information indeed helps identify noisy edges.
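The edge-reweighting idea (DeBaTeR-A) can be sketched as follows. This is a minimal, hypothetical illustration of the general mechanism, not the paper's implementation: a user embedding is shifted by sinusoidal time features, and each user-item edge is reweighted by a sigmoid-squashed similarity acting as a reliability score. All embedding values and function names here are assumptions for illustration.

```python
# Hypothetical sketch of time-aware edge reweighting in the spirit of DeBaTeR-A.
import math

def time_aware_embedding(base, hour):
    # Fold the interaction hour into the user embedding via sinusoidal features.
    t = [math.sin(2 * math.pi * hour / 24), math.cos(2 * math.pi * hour / 24)]
    return [b + 0.5 * t[i % 2] for i, b in enumerate(base)]

def reliability(user_emb, item_emb):
    # Dot-product affinity squashed to (0, 1) with a sigmoid: the edge weight.
    dot = sum(u * v for u, v in zip(user_emb, item_emb))
    return 1.0 / (1.0 + math.exp(-dot))

def reweight_edges(edges, user_embs, item_embs):
    # edges: list of (user, item, hour); returns {(user, item): reliability}.
    return {
        (u, i): reliability(time_aware_embedding(user_embs[u], h), item_embs[i])
        for u, i, h in edges
    }

users = {"u1": [0.9, 0.1, 0.2, 0.0]}
items = {"horror": [1.0, 0.0, 0.3, 0.1], "talkshow": [-1.0, 0.2, 0.0, 0.4]}
edges = [("u1", "horror", 23), ("u1", "talkshow", 9)]
weights = reweight_edges(edges, users, items)
```

A real system would learn the embeddings and use the resulting weights either in the adjacency matrix (soft/hard assignment) or in the loss, as the abstract describes.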

[AI-37] LEAP:D - A Novel Prompt-based Approach for Domain-Generalized Aerial Object Detection ICIP2024

链接: https://arxiv.org/abs/2411.09180
作者: Chanyeong Park,Heegwang Kim,Joonki Paik
关键词-EN: Drone-captured images present, varying shooting conditions, images present significant, Drone-captured images, present significant challenges
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: ICIP 2024 Workshop accepted paper

点击查看摘要

Abstract:Drone-captured images present significant challenges in object detection due to varying shooting conditions, which can alter object appearance and shape. Factors such as drone altitude, angle, and weather cause these variations, influencing the performance of object detection algorithms. To tackle these challenges, we introduce an innovative vision-language approach using learnable prompts. This shift from conventional manual prompts aims to reduce domain-specific knowledge interference, ultimately improving object detection capabilities. Furthermore, we streamline the training process with a one-step approach, updating the learnable prompt concurrently with model training, enhancing efficiency without compromising performance. Our study contributes to domain-generalized object detection by leveraging learnable prompts and optimizing training processes. This enhances model robustness and adaptability across diverse environments, leading to more effective aerial object detection.

[AI-38] Gazing at Rewards: Eye Movements as a Lens into Human and AI Decision-Making in Hybrid Visual Foraging

链接: https://arxiv.org/abs/2411.09176
作者: Bo Wang,Dingwei Tan,Yen-Ling Kuo,Zhaowei Sun,Jeremy M. Wolfe,Tat-Jen Cham,Mengmi Zhang
关键词-EN: Imagine searching, multiple target types, hybrid foraging task, multiple instances, searching a collection
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Imagine searching a collection of coins for quarters ($0.25), dimes ($0.10), nickels ($0.05), and pennies ($0.01) - a hybrid foraging task where observers look for multiple instances of multiple target types. In such tasks, how do target values and their prevalence influence foraging and eye movement behaviors (e.g., should you prioritize rare quarters or common nickels)? To explore this, we conducted human psychophysics experiments, revealing that humans are proficient reward foragers. Their eye fixations are drawn to regions with higher average rewards, fixation durations are longer on more valuable targets, and their cumulative rewards exceed chance, approaching the upper bound of optimal foragers. To probe these decision-making processes of humans, we developed a transformer-based Visual Forager (VF) model trained via reinforcement learning. Our VF model takes a series of targets, their corresponding values, and the search image as inputs, processes the images using foveated vision, and produces a sequence of eye movements along with decisions on whether to collect each fixated item. Our model outperforms all baselines, achieves cumulative rewards comparable to those of humans, and approximates human foraging behavior in eye movements and foraging biases within time-limited environments. Furthermore, stress tests on out-of-distribution tasks with novel targets, unseen values, and varying set sizes demonstrate the VF model’s effective generalization. Our work offers valuable insights into the relationship between eye movements and decision-making, with our model serving as a powerful tool for further exploration of this connection. All data, code, and models will be made publicly available.
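The value-and-prevalence trade-off in the coin example can be made concrete with a toy prioritization rule (purely illustrative; the paper's model is a learned transformer, not this heuristic): rank target types by expected reward per encounter, i.e. value times prevalence. The prevalence numbers below are assumptions for the sake of the example.

```python
# Toy forager heuristic: prioritize targets by expected reward per encounter.
coins = {"quarter": 0.25, "dime": 0.10, "nickel": 0.05, "penny": 0.01}
prevalence = {"quarter": 0.05, "dime": 0.15, "nickel": 0.30, "penny": 0.50}

def priority(targets, values, prev):
    # Rank target types by value * prevalence, highest expected reward first.
    return sorted(targets, key=lambda t: values[t] * prev[t], reverse=True)

order = priority(list(coins), coins, prevalence)
```

With these numbers, common dimes and nickels outrank the rare quarter, matching the intuition the abstract raises about prioritizing common moderate-value targets.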

[AI-39] Advancing Diffusion Models: Alias-Free Resampling and Enhanced Rotational Equivariance

链接: https://arxiv.org/abs/2411.09174
作者: Md Fahim Anjum
关键词-EN: Recent advances, diffusion models, alias-free resampling, led to impressive, impressive improvements
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:Recent advances in image generation, particularly via diffusion models, have led to impressive improvements in image synthesis quality. Despite this, diffusion models are still challenged by model-induced artifacts and limited stability in image fidelity. In this work, we hypothesize that the primary cause of this issue is the improper resampling operation that introduces aliasing in the diffusion model and a careful alias-free resampling dictated by image processing theory can improve the model’s performance in image synthesis. We propose the integration of alias-free resampling layers into the UNet architecture of diffusion models without adding extra trainable parameters, thereby maintaining computational efficiency. We then assess whether these theory-driven modifications enhance image quality and rotational equivariance. Our experimental results on benchmark datasets, including CIFAR-10, MNIST, and MNIST-M, reveal consistent gains in image quality, particularly in terms of FID and KID scores. Furthermore, we propose a modified diffusion process that enables user-controlled rotation of generated images without requiring additional training. Our findings highlight the potential of theory-driven enhancements such as alias-free resampling in generative models to improve image quality while maintaining model efficiency and pioneer future research directions to incorporate them into video-generating diffusion models, enabling deeper exploration of the applications of alias-free resampling in generative modeling.
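The aliasing argument can be demonstrated in one dimension (an illustrative sketch only, not the paper's UNet modification): naive stride-2 decimation of a near-Nyquist sine folds the high frequency down at full amplitude, while even a two-tap low-pass (moving-average) filter applied before decimation suppresses it. The signal and filter here are assumptions chosen to make the effect visible.

```python
# 1-D illustration of why resampling needs a low-pass filter first.
import math

def lowpass(x):
    # Two-tap moving average: its frequency response has a zero at Nyquist.
    return [(x[i] + x[i + 1]) / 2 for i in range(len(x) - 1)]

def decimate(x, stride=2):
    return x[::stride]

n = 64
signal = [math.sin(math.pi * 0.95 * i) for i in range(n)]  # just below Nyquist

aliased = decimate(signal)               # high frequency folds down at ~full amplitude
antialiased = decimate(lowpass(signal))  # filtered first: amplitude strongly attenuated
```

Replacing naive up/down-sampling with such filtered (alias-free) resampling layers, as the abstract proposes, carries the same principle into the 2-D feature maps of a diffusion model's UNet.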

[AI-40] Towards Scalable Handwriting Communication via EEG Decoding and Latent Embedding Integration

链接: https://arxiv.org/abs/2411.09170
作者: Jun-Young Kim,Deok-Seon Kim,Seo-Hyun Lee
关键词-EN: including gesture recognition, utilizing electroencephalogram, recent years, brain-computer interfaces, interfaces have made
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer Interface

点击查看摘要

Abstract:In recent years, brain-computer interfaces have made advances in decoding various motor-related tasks, including gesture recognition and movement classification, utilizing electroencephalogram (EEG) data. These developments are fundamental in exploring how neural signals can be interpreted to recognize specific physical actions. This study centers on a written alphabet classification task, where we aim to decode EEG signals associated with handwriting. To achieve this, we incorporate hand kinematics to guide the extraction of the consistent embeddings from high-dimensional neural recordings using auxiliary variables (CEBRA). These CEBRA embeddings, along with the EEG, are processed by a parallel convolutional neural network model that extracts features from both data sources simultaneously. The model classifies nine different handwritten characters, including symbols such as exclamation marks and commas, within the alphabet. We evaluate the model using a quantitative five-fold cross-validation approach and explore the structure of the embedding space through visualizations. Our approach achieves a classification accuracy of 91% for the nine-class task, demonstrating the feasibility of fine-grained handwriting decoding from EEG.

[AI-41] Artificial Theory of Mind and Self-Guided Social Organisation

链接: https://arxiv.org/abs/2411.09169
作者: Michael S. Harré,Jaime Ruiz-Serra,Catherine Drysdale
关键词-EN: behaviour to achieve, achieve goals, challenges artificial intelligence, network, article by Ozmen
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Adaptation and Self-Organizing Systems (nlin.AO)
*备注: 4 pages

点击查看摘要

Abstract:One of the challenges artificial intelligence (AI) faces is how a collection of agents coordinate their behaviour to achieve goals that are not reachable by any single agent. In a recent article by Ozmen et al., this was framed as one of six grand challenges: That AI needs to respect human cognitive processes at the human-AI interaction frontier. We suggest that this extends to the AI-AI frontier and that it should also reflect human psychology, as it is the only successful framework we have from which to build out. In this extended abstract we first make the case for collective intelligence in a general setting, drawing on recent work from single neuron complexity in neural networks and ant network adaptability in ant colonies. From there we introduce how species relate to one another in an ecological network via niche selection, niche choice, and niche conformity with the aim of forming an analogy with human social network development as new agents join together and coordinate. From there we show how our social structures are influenced by our neuro-physiology, our psychology, and our language. This emphasises how individual people within a social network influence the structure and performance of that network in complex tasks, and that cognitive faculties such as Theory of Mind play a central role. We finish by discussing the current state of the art in AI and where there is potential for further development of a socially embodied collective artificial intelligence that is capable of guiding its own social structures.

[AI-42] Theory of Mind Enhances Collective Intelligence

链接: https://arxiv.org/abs/2411.09168
作者: Michael S. Harré,Catherine Drysdale,Jaime Ruiz-Serra
关键词-EN: Collective Intelligence, Collective Intelligence plays, Collective, Intelligence, variety of fields
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Adaptation and Self-Organizing Systems (nlin.AO)
*备注: 20 pages, 2 figures, 1 table

点击查看摘要

Abstract:Collective Intelligence plays a central role in a large variety of fields, from economics and evolutionary theory to neural networks and eusocial insects, and it is also core to much of the work on emergence and self-organisation in complex systems theory. However, in human collective intelligence there is still much more to be understood in the relationship between specific psychological processes at the individual level and the emergence of self-organised structures at the social level. Previously psychological factors have played a relatively minor role in the study of collective intelligence as the principles are often quite general and applicable to humans just as readily as insects or other agents without sophisticated psychologies. In this article we emphasise, with examples from other complex adaptive systems, the broad applicability of collective intelligence principles while the mechanisms and time-scales differ significantly between examples. We contend that flexible collective intelligence in human social settings is improved by our use of a specific cognitive tool: our Theory of Mind. We identify several key characteristics of psychologically mediated collective intelligence and show that the development of a Theory of Mind is a crucial factor distinguishing social collective intelligence from general collective intelligence. We then place these capabilities in the context of the next steps in artificial intelligence embedded in a future that includes an effective human-AI hybrid social ecology.

[AI-43] Rationality based Innate-Values-driven Reinforcement Learning

链接: https://arxiv.org/abs/2411.09160
作者: Qin Yang
关键词-EN: develop diverse skills, diverse skills satisfying, reflect their inherent, inherent interests, interests and preferences
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: arXiv admin note: substantial text overlap with arXiv:2401.05572

点击查看摘要

Abstract:Innate values describe agents’ intrinsic motivations, which reflect their inherent interests and preferences to pursue goals and drive them to develop diverse skills satisfying their various needs. The essence of reinforcement learning (RL) is learning from interaction based on reward-driven behaviors, much like natural agents. It is an excellent model to describe the innate-values-driven (IV) behaviors of AI agents. In particular, developing the awareness of the AI agent through balancing internal and external utilities based on its needs in different tasks is a crucial problem for supporting AI agents integrating into human society with safety and harmony in the long term. This paper proposes a hierarchical compound intrinsic value reinforcement learning model, termed innate-values-driven reinforcement learning (IVRL), to describe the complex behaviors of AI agents’ interaction. We formulated the IVRL model and proposed two IVRL variants based on DQN and A2C. By comparing them with benchmark algorithms such as DQN, DDQN, A2C, and PPO in the Role-Playing Game (RPG) reinforcement learning test platform VIZDoom, we demonstrated that rationally organizing various individual needs can effectively achieve better performance.

[AI-44] The Optimist: Towards Fully Automated Graph Theory Research

链接: https://arxiv.org/abs/2411.09158
作者: Randy Davila
关键词-EN: Optimist, autonomous system developed, paper introduces, developed to advance
类目: Artificial Intelligence (cs.AI); Combinatorics (math.CO)
*备注:

点击查看摘要

Abstract:This paper introduces the Optimist, an autonomous system developed to advance automated conjecture generation in graph theory. Leveraging mixed-integer programming (MIP) and heuristic methods, the Optimist generates conjectures that both rediscover established theorems and propose novel inequalities. Through a combination of memory-based computation and agent-like adaptability, the Optimist iteratively refines its conjectures by integrating new data, enabling a feedback process with minimal human (or machine) intervention. Initial experiments reveal the Optimist’s potential to uncover foundational results in graph theory, as well as to produce conjectures of interest for future exploration. This work also outlines the Optimist’s evolving integration with a counterpart agent, the Pessimist (a human or machine agent), to establish a dueling system that will drive fully automated graph theory research.
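Data-driven conjecture generation of this flavor can be illustrated with a toy (this is not the Optimist's MIP machinery): given hand-entered invariants for a few small graphs, find the tightest constant c such that mu(G) <= c * alpha(G) holds on every example, then state that bound as a conjecture. The invariant table below (independence number alpha, matching number mu) is an assumption for illustration.

```python
# Toy conjecture search: tightest ratio bound between two graph invariants.
from fractions import Fraction

# Small-graph invariants: independence number alpha, matching number mu.
graphs = {
    "P4":   {"alpha": 2, "mu": 2},  # path on 4 vertices
    "C5":   {"alpha": 2, "mu": 2},  # 5-cycle
    "K1,3": {"alpha": 3, "mu": 1},  # star with 3 leaves
    "K4":   {"alpha": 1, "mu": 2},  # complete graph on 4 vertices
}

def conjecture_ratio_bound(data, upper, lower):
    # Smallest constant c with upper(G) <= c * lower(G) on all examples.
    return max(Fraction(g[upper], g[lower]) for g in data.values())

c = conjecture_ratio_bound(graphs, "mu", "alpha")
# Conjecture (on this data): mu(G) <= 2 * alpha(G).
```

A system like the one the abstract describes would generate many candidate inequalities this way, then refine or discard them as new graph data (or a skeptical "Pessimist" agent) supplies counterexamples.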

[AI-45] ABCI 3.0: Evolution of the leading AI infrastructure in Japan

链接: https://arxiv.org/abs/2411.09134
作者: Ryousei Takano,Shinichiro Takizawa,Yusuke Tanimura,Hidemoto Nakada,Hirotaka Ogawa
关键词-EN: operating since August, operational in January, infrastructure that AIST, ABCI, latest version
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: 4 pages, 2 figures

点击查看摘要

Abstract:ABCI 3.0 is the latest version of ABCI, a large-scale open AI infrastructure that AIST has been operating since August 2018; it will be fully operational in January 2025. ABCI 3.0 consists of computing servers equipped with 6128 NVIDIA H200 GPUs and an all-flash storage system. Its peak performance is 6.22 exaflops in half precision and 3.0 exaflops in single precision, which is 7 to 13 times faster than the previous system, ABCI 2.0. It also more than doubles both storage capacity and theoretical read/write performance. ABCI 3.0 is expected to accelerate research and development, evaluation, and workforce development of cutting-edge AI technologies, with a particular focus on generative AI.

[AI-46] VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition

链接: https://arxiv.org/abs/2411.09105
作者: Chenglin Li,Qianglong Chen,Zhi Li,Feng Tao,Yin Zhang
关键词-EN: Large Video-Language Models, Recent advancements, advancements in Large, Large Video-Language, abstract concepts
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in Large Video-Language Models (LVLMs) have driven the development of benchmarks designed to assess cognitive abilities in video-based tasks. However, most existing benchmarks heavily rely on web-collected videos paired with human annotations or model-generated questions, which limit control over the video content and fall short in evaluating advanced cognitive abilities involving symbolic elements and abstract concepts. To address these limitations, we introduce VCBench, a controllable benchmark to assess LVLMs’ cognitive abilities, involving symbolic and abstract concepts at varying difficulty levels. By generating video data with the Python-based engine, VCBench allows for precise control over the video content, creating dynamic, task-oriented videos that feature complex scenes and abstract concepts. Each task pairs with tailored question templates that target specific cognitive challenges, providing a rigorous evaluation test. Our evaluation reveals that even state-of-the-art (SOTA) models, such as Qwen2-VL-72B, struggle with simple video cognition tasks involving abstract concepts, with performance sharply dropping by 19% as video complexity rises. These findings reveal the current limitations of LVLMs in advanced cognitive tasks and highlight the critical role of VCBench in driving research toward more robust LVLMs for complex video cognition challenges.

[AI-47] Provocation: Who benefits from “inclusion” in Generative AI? NEURIPS2024

链接: https://arxiv.org/abs/2411.09102
作者: Nari Johnson,Siobhan Mackenzie Hall,Samantha Dalal
关键词-EN: participatory evaluation structures, increased demand, accurate and representative, representative generative, generative AI systems
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 3 pages, 1 figure. Published as a Short Paper in the NeurIPS 2024 Workshop on Evaluating Evaluations: Examining Best Practices for Measuring Broader Impacts of Generative AI

点击查看摘要

Abstract:The demands for accurate and representative generative AI systems mean there is an increased demand for participatory evaluation structures. While these participatory structures are paramount to ensuring that non-dominant values, knowledge and material culture are also reflected in AI models and the media they generate, we argue that dominant structures of community participation in AI development and evaluation are not explicit enough about the benefits and harms that members of socially marginalized groups may experience as a result of their participation. Without explicit interrogation of these benefits by AI developers, as a community we may remain blind to the immensity of systemic change that is needed as well. To support this provocation, we present a speculative case study, developed from our own collective experiences as AI researchers. We use this speculative context to itemize the barriers that need to be overcome in order for the proposed benefits to marginalized communities to be realized, and harms mitigated.

[AI-48] Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery

链接: https://arxiv.org/abs/2411.09101
作者: Ashim Dahal,Saydul Akbar Murad,Nick Rahimi
关键词-EN: Vision Transformers, computer vision, recently brought, field of computer, Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Vision Transformers (ViT) have recently brought a new wave of research in the field of computer vision. These models have done particularly well in the field of image classification and segmentation. Research on semantic and instance segmentation has accelerated since the inception of the new architecture, with over 80% of the top 20 benchmarks for the iSAID dataset being either based on the ViT architecture or the attention mechanism behind its success. This paper focuses on the heuristic comparison of three key factors of using (or not using) ViT for semantic segmentation of remote sensing aerial images on the iSAID. The experimental results observed during the course of the research were under the scrutinization of the following objectives: 1. Use of weighted fused loss function for the maximum mean Intersection over Union (mIoU) score, Dice score, and minimization or conservation of entropy or class representation, 2. Comparison of transfer learning on Meta’s MaskFormer, a ViT-based semantic segmentation model, against generic UNet Convolutional Neural Networks (CNNs) judged over mIoU, Dice scores, training efficiency, and inference time, and 3. What do we lose for what we gain? i.e., the comparison of the two models against current state-of-art segmentation models. We show the use of the novel combined weighted loss function significantly boosts the CNN model’s performance capacities as compared to transfer learning the ViT. The code for this implementation can be found at this https URL.
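A weighted fused segmentation loss of the kind the abstract mentions typically combines cross-entropy with a Dice term. The sketch below shows the general shape over flat per-pixel probabilities; the weights and function names are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a weighted fused loss: cross-entropy + Dice, per-pixel.
import math

def cross_entropy(pred, target, eps=1e-7):
    # pred/target: flat lists of foreground probabilities / binary labels.
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

def dice_loss(pred, target, eps=1e-7):
    # 1 - Dice coefficient; rewards overlap between prediction and mask.
    inter = sum(p * t for p, t in zip(pred, target))
    return 1 - (2 * inter + eps) / (sum(pred) + sum(target) + eps)

def fused_loss(pred, target, w_ce=0.5, w_dice=0.5):
    return w_ce * cross_entropy(pred, target) + w_dice * dice_loss(pred, target)

target = [1, 1, 0, 0]
loss_good = fused_loss([0.9, 0.8, 0.2, 0.1], target)  # confident, correct
loss_bad = fused_loss([0.1, 0.2, 0.8, 0.9], target)   # confidently wrong
```

In practice the per-class weights inside both terms are often tuned to counter class imbalance, which is where the "weighted" in weighted fused loss usually comes in.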

[AI-49] Set-Based Retrograde Analysis: Precomputing the Solution to 24-card Bridge Double Dummy Deals

链接: https://arxiv.org/abs/2411.09089
作者: Isaac Stone,Nathan R. Sturtevant,Jonathan Schaeffer
关键词-EN: working backwards, game-playing programs, states, algorithm, game
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrograde analysis is used in game-playing programs to solve states at the end of a game, working backwards toward the start of the game. The algorithm iterates through and computes the perfect-play value for as many states as resources allow. We introduce setrograde analysis which achieves the same results by operating on sets of states that have the same game value. The algorithm is demonstrated by computing exact solutions for Bridge double dummy card-play. For deals with 24 cards remaining to be played ( 10^27 states, which can be reduced to 10^15 states using preexisting techniques), we strongly solve all deals. The setrograde algorithm performs a factor of 10^3 fewer search operations than a standard retrograde algorithm, producing a database with a factor of 10^4 fewer entries. For applicable domains, this allows retrograde searching to reach unprecedented search depths.
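The backward-induction idea behind retrograde analysis can be shown on a toy game (a minimal sketch only; the paper's contribution is doing this over sets of equal-valued states at Bridge scale): in a subtraction game where players remove 1 or 2 stones and taking the last stone wins, we solve every state working backwards from the terminal position.

```python
# Retrograde analysis on a tiny subtraction game (take 1 or 2 stones; the
# player who takes the last stone wins), solved backwards from the end.
def retrograde_solve(max_stones):
    # value[n] is True if the player to move with n stones can force a win.
    value = {0: False}  # no stones left: the player to move has already lost
    for n in range(1, max_stones + 1):
        # A state wins if some move reaches an already-solved losing state.
        value[n] = any(not value[n - k] for k in (1, 2) if n - k >= 0)
    return value

solution = retrograde_solve(10)
# Known pattern for this game: multiples of 3 are losses for the mover.
```

The set-based ("setrograde") variant the paper introduces applies the same iteration to whole sets of states sharing a game value, which is what cuts the operation count and database size by the reported orders of magnitude.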

[AI-50] Drone Detection using Deep Neural Networks Trained on Pure Synthetic Data

链接: https://arxiv.org/abs/2411.09077
作者: Mariusz Wisniewski,Zeeshan A. Rana,Ivan Petrunin,Alan Holt,Stephen Harman
关键词-EN: deep neural networks, data, neural networks, benefited from improvements, improvements in deep
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Drone detection has benefited from improvements in deep neural networks, but like many other applications, suffers from the availability of accurate data for training. Synthetic data provides a potential for low-cost data generation and has been shown to improve data availability and quality. However, models trained on synthetic datasets need to prove their ability to perform on real-world data, known as the problem of sim-to-real transferability. Here, we present a drone detection Faster-RCNN model trained on a purely synthetic dataset that transfers to real-world data. We found that it achieves an AP_50 of 97.0% when evaluated on the MAV-Vid - a real dataset of flying drones - compared with 97.8% for an equivalent model trained on real-world data. Our results show that using synthetic data for drone detection has the potential to reduce data collection costs and improve labelling quality. These findings could be a starting point for more elaborate synthetic drone datasets. For example, realistic recreations of specific scenarios could de-risk the dataset generation of safety-critical applications such as the detection of drones at airports. Further, synthetic data may enable reliable drone detection systems, which could benefit other areas, such as unmanned traffic management systems. The code is available at this https URL, alongside the datasets at this https URL.

[AI-51] Liner Shipping Network Design with Reinforcement Learning

链接: https://arxiv.org/abs/2411.09068
作者: Utsav Dutta,Yifan Lin,Zhaoyang Larry Jin
关键词-EN: Liner Shipping Network, maritime shipping routes, cost-efficient maritime shipping, Shipping Network Design, challenging combinatorial optimization
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper proposes a novel reinforcement learning framework to address the Liner Shipping Network Design Problem (LSNDP), a challenging combinatorial optimization problem focused on designing cost-efficient maritime shipping routes. Traditional methods for solving the LSNDP typically involve decomposing the problem into sub-problems, such as network design and multi-commodity flow, which are then tackled using approximate heuristics or large neighborhood search (LNS) techniques. In contrast, our approach employs a model-free reinforcement learning algorithm on the network design, integrated with a heuristic-based multi-commodity flow solver, to produce competitive results on the publicly available LINERLIB benchmark. Additionally, our method also demonstrates generalization capabilities by producing competitive solutions on the benchmark instances after training on perturbed instances.

[AI-52] Language-Model Prior Overcomes Cold-Start Items

链接: https://arxiv.org/abs/2411.09065
作者: Shiyu Wang,Hao Ding,Yupeng Gu,Sergul Aydore,Kousha Kalantari,Branislav Kveton
关键词-EN: video streaming, driven by digitization, e-commerce and video, item similarities, personalized content
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This paper is dedicated to cold-start item recommendation using language-model priors

点击查看摘要

Abstract:The growth of recommender systems (RecSys) is driven by digitization and the need for personalized content in areas such as e-commerce and video streaming. The content in these systems often changes rapidly and therefore they constantly face the ongoing cold-start problem, where new items lack interaction data and are hard to value. Existing solutions for the cold-start problem, such as content-based recommenders and hybrid methods, leverage item metadata to determine item similarities. The main challenge with these methods is their reliance on structured and informative metadata to capture detailed item similarities, which may not always be available. This paper introduces a novel approach for cold-start item recommendation that utilizes the language model (LM) to estimate item similarities, which are further integrated as a Bayesian prior with classic recommender systems. This approach is generic and able to boost the performance of various recommenders. Specifically, our experiments integrate it with both sequential and collaborative filtering-based recommenders and evaluate it on two real-world datasets, demonstrating the enhanced performance of the proposed approach.
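One way to picture an LM-similarity prior for cold items (a hedged sketch of the general idea; the names, numbers, and blending rule below are illustrative assumptions, not the paper's method): a cold item's score is a similarity-weighted average of warm items' collaborative-filtering scores, and as interactions accumulate the score shrinks toward the CF estimate.

```python
# Sketch: blend CF scores with an LM-similarity prior for cold-start items.
def lm_prior_score(item, cf_scores, lm_similarity):
    # Prior: similarity-weighted average of warm items' CF scores.
    sims = [(other, lm_similarity[item][other]) for other in cf_scores]
    total = sum(s for _, s in sims)
    return sum(cf_scores[o] * s for o, s in sims) / total if total else 0.0

def blended_score(item, cf_scores, lm_similarity, n_interactions, prior_weight=5.0):
    # Bayesian-style shrinkage: trust the prior when interaction data is thin.
    prior = lm_prior_score(item, cf_scores, lm_similarity)
    cf = cf_scores.get(item, 0.0)
    w = n_interactions / (n_interactions + prior_weight)
    return w * cf + (1 - w) * prior

cf_scores = {"thriller_a": 0.8, "comedy_b": 0.2}  # warm items with CF scores
lm_similarity = {"new_thriller": {"thriller_a": 0.9, "comedy_b": 0.1}}
cold = blended_score("new_thriller", cf_scores, lm_similarity, n_interactions=0)
```

With zero interactions the cold item inherits a score close to its textually similar warm neighbor, which is exactly the behavior a cold-start prior is meant to provide.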

[AI-53] Multimodal Object Detection using Depth and Image Data for Manufacturing Parts

链接: https://arxiv.org/abs/2411.09062
作者: Nazanin Mahjourian,Vinh Nguyen
关键词-EN: parts and components, object detection, picking and handling, handling of diverse, diverse types
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Manufacturing requires reliable object detection methods for precise picking and handling of diverse types of manufacturing parts and components. Traditional object detection methods utilize either only 2D images from cameras or 3D data from lidars or similar 3D sensors. However, each of these sensors has weaknesses and limitations. Cameras do not have depth perception and 3D sensors typically do not carry color information. These weaknesses can undermine the reliability and robustness of industrial manufacturing systems. To address these challenges, this work proposes a multi-sensor system combining a red-green-blue (RGB) camera and a 3D point cloud sensor. The two sensors are calibrated for precise alignment of the multimodal data captured from the two hardware devices. A novel multimodal object detection method is developed to process both RGB and depth data. This object detector is based on the Faster R-CNN baseline that was originally designed to process only camera images. The results show that the multimodal model significantly outperforms the depth-only and RGB-only baselines on established object detection metrics. More specifically, the multimodal model improves mAP by 13% and raises Mean Precision by 11.8% in comparison to the RGB-only baseline. Compared to the depth-only baseline, it improves mAP by 78% and raises Mean Precision by 57%. Hence, this method facilitates more reliable and robust object detection in service to smart manufacturing applications.
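The simplest fusion strategy consistent with this description is early fusion: append the calibrated depth map as a fourth input channel before the detection backbone. The sketch below is illustrative only; the paper's actual fusion architecture may differ:

```python
import numpy as np

def fuse_rgb_depth(rgb, depth):
    """Early-fusion sketch: stack calibrated depth as a fourth input channel.

    rgb:   (H, W, 3) uint8 image.
    depth: (H, W) float array (e.g. metres), already aligned to the RGB frame.
    Returns a (H, W, 4) float32 tensor suitable as backbone input.
    """
    rgb_f = rgb.astype(np.float32) / 255.0
    # Min-max normalise depth so its scale matches the RGB channels.
    d = depth.astype(np.float32)
    d = (d - d.min()) / max(d.max() - d.min(), 1e-9)
    return np.concatenate([rgb_f, d[..., None]], axis=-1)
```

A first convolution accepting 4 input channels (instead of 3) is then enough to feed such tensors into a Faster R-CNN-style detector.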

[AI-54] SAFELOC: Overcoming Data Poisoning Attacks in Heterogeneous Federated Machine Learning for Indoor Localization

链接: https://arxiv.org/abs/2411.09055
作者: Akhil Singampalli,Danish Gufran,Sudeep Pasricha
关键词-EN: Machine learning, emerging applications, compromised by hardware, software variations, solutions are critical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Machine learning (ML) based indoor localization solutions are critical for many emerging applications, yet their efficacy is often compromised by hardware/software variations across mobile devices (i.e., device heterogeneity) and the threat of ML data poisoning attacks. Conventional methods aimed at countering these challenges show limited resilience to the uncertainties created by these phenomena. In response, in this paper, we introduce SAFELOC, a novel framework that not only minimizes localization errors under these challenging conditions but also ensures model compactness for efficient mobile device deployment. Our framework targets a distributed and co-operative learning environment that uses federated learning (FL) to preserve user data privacy and assumes heterogeneous mobile devices carried by users (just like in most real-world scenarios). Within this heterogeneous FL context, SAFELOC introduces a novel fused neural network architecture that performs data poisoning detection and localization, with a low model footprint. Additionally, a dynamic saliency map-based aggregation strategy is designed to adapt based on the severity of the detected data poisoning scenario. Experimental evaluations demonstrate that SAFELOC achieves improvements of up to 5.9x in mean localization error, 7.8x in worst-case localization error, and a 2.1x reduction in model inference latency compared to state-of-the-art indoor localization frameworks, across diverse building floorplans, mobile devices, and ML data poisoning attack scenarios.
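The aggregation idea can be illustrated with a toy weighted average that down-weights clients in proportion to a detected poisoning score. This is a simplification of SAFELOC's saliency-map-based strategy, with hypothetical names:

```python
import numpy as np

def severity_weighted_aggregate(client_updates, poison_scores):
    """Aggregate FL model updates, down-weighting clients flagged as poisoned.

    client_updates: (n_clients, n_params) flattened model updates.
    poison_scores:  per-client score in [0, 1]; 1 means certainly poisoned.
    """
    w = 1.0 - np.asarray(poison_scores, dtype=np.float64)
    w = w / w.sum()  # renormalise surviving weight mass
    return w @ np.asarray(client_updates, dtype=np.float64)
```

A client with score 1.0 contributes nothing to the global model; intermediate scores shade its influence continuously rather than hard-dropping it.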

[AI-55] The Systems Engineering Approach in Times of Large Language Models

链接: https://arxiv.org/abs/2411.09050
作者: Christian Cabrera,Viviana Bastidas,Jennifer Schooling,Neil D. Lawrence
关键词-EN: Large Language Models, Language Models, Large Language, address critical societal, critical societal problems
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
*备注: This paper has been accepted for the upcoming 58th Hawaii International Conference on System Sciences (HICSS-58)

点击查看摘要

Abstract:Using Large Language Models (LLMs) to address critical societal problems requires adopting this novel technology into socio-technical systems. However, the complexity of such systems and the nature of LLMs challenge such a vision. It is unlikely that the solution to such challenges will come from the Artificial Intelligence (AI) community itself. Instead, the Systems Engineering approach is better equipped to facilitate the adoption of LLMs by prioritising the problems and their context before any other aspects. This paper introduces the challenges LLMs generate and surveys systems research efforts for engineering AI-based systems. We reveal how the systems engineering principles have supported addressing similar issues to the ones LLMs pose and discuss our findings to provide future directions for adopting LLMs.

[AI-56] Virtual teaching assistant for undergraduate students using natural language processing and deep learning

链接: https://arxiv.org/abs/2411.09001
作者: Sadman Jashim Sakib,Baktiar Kabir Joy,Zahin Rydha,Md. Nuruzzaman,Annajiat Alim Rasel
关键词-EN: Online education popularity, Online education, continuously increasing, Online, education popularity
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Online education’s popularity has been continuously increasing over the past few years. Many universities were forced to switch to online education as a result of COVID-19. In many cases, even after more than two years of online instruction, colleges were unable to resume their traditional classroom programs. A growing number of institutions are considering blended learning, with some parts in-person and the rest of the learning taking place online. Nevertheless, many online education systems are inefficient, and this results in a poor rate of student retention. In this paper, we are offering a primary dataset, the initial implementation of a virtual teaching assistant named VTA-bot, and its system architecture. Our primary implementation of the suggested system consists of a chatbot that can be queried about the content and topics of the fundamental Python programming language course. Students in their first year of university will benefit from this strategy, which aims to increase student participation and involvement in online education.

[AI-57] Reliability, Resilience and Human Factors Engineering for Trustworthy AI Systems

链接: https://arxiv.org/abs/2411.08981
作者: Saurabh Mishra,Anand Rao,Ramayya Krishnan,Bilal Ayyub,Amin Aria,Enrico Zio
关键词-EN: industries and services, safety is essential, integral to critical, critical operations, operations across industries
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:As AI systems become integral to critical operations across industries and services, ensuring their reliability and safety is essential. We offer a framework that integrates established reliability and resilience engineering principles into AI systems. By applying traditional metrics such as failure rate and Mean Time Between Failures (MTBF) along with resilience engineering and human reliability analysis, we propose an integrated framework to manage AI system performance, and prevent or efficiently recover from failures. Our work adapts classical engineering methods to AI systems and outlines a research agenda for future technical studies. We apply our framework to a real-world AI system, using system status data from platforms such as OpenAI, to demonstrate its practical applicability. This framework aligns with emerging global standards and regulatory frameworks, providing a methodology to enhance the trustworthiness of AI systems. Our aim is to guide policy, regulation, and the development of reliable, safe, and adaptable AI technologies capable of consistent performance in real-world environments.
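The classical metrics the framework reuses are straightforward to compute; a textbook sketch (not the paper's code):

```python
def mtbf(total_operating_hours, n_failures):
    """Mean Time Between Failures: observed operating time per failure."""
    return total_operating_hours / n_failures

def failure_rate(total_operating_hours, n_failures):
    """Failure rate (lambda), the reciprocal of MTBF."""
    return n_failures / total_operating_hours

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up,
    given the Mean Time To Repair (MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

For an AI service logging 1000 operating hours with 4 failures, MTBF is 250 h; with a 10 h mean repair time, availability is about 96.2%.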

[AI-58] Inconsistencies In Consistency Models: Better ODE Solving Does Not Imply Better Samples NEURIPS2024

链接: https://arxiv.org/abs/2411.08954
作者: Noël Vouitsis,Rasa Hosseinzadeh,Brendan Leigh Ross,Valentin Villecroze,Satya Krishna Gorti,Jesse C. Cresswell,Gabriel Loaiza-Ganem
关键词-EN: generate remarkably high-quality, iterative sampling procedure, expensive iterative sampling, remarkably high-quality samples, generate remarkably
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024 ATTRIB Workshop

点击查看摘要

Abstract:Although diffusion models can generate remarkably high-quality samples, they are intrinsically bottlenecked by their expensive iterative sampling procedure. Consistency models (CMs) have recently emerged as a promising diffusion model distillation method, reducing the cost of sampling by generating high-fidelity samples in just a few iterations. Consistency model distillation aims to solve the probability flow ordinary differential equation (ODE) defined by an existing diffusion model. CMs are not directly trained to minimize error against an ODE solver, rather they use a more computationally tractable objective. As a way to study how effectively CMs solve the probability flow ODE, and the effect that any induced error has on the quality of generated samples, we introduce Direct CMs, which \textitdirectly minimize this error. Intriguingly, we find that Direct CMs reduce the ODE solving error compared to CMs but also result in significantly worse sample quality, calling into question why exactly CMs work well in the first place. Full code is available at: this https URL.
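The core notion of ODE solving error can be illustrated with a fixed-step Euler solver on a toy ODE with a known solution: finer steps shrink exactly the kind of error that Direct CMs minimize. This toy stands in for the probability-flow ODE, which has no closed form in practice:

```python
import math

def euler_solve(f, x0, t0, t1, n_steps):
    """Fixed-step Euler solver (a stand-in for probability-flow ODE solving)."""
    x, t = x0, t0
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + h * f(x, t)
        t += h
    return x

# Toy ODE dx/dt = -x with known solution x(1) = x0 * e^{-1}.
exact = math.exp(-1.0)
coarse_err = abs(euler_solve(lambda x, t: -x, 1.0, 0.0, 1.0, 10) - exact)
fine_err = abs(euler_solve(lambda x, t: -x, 1.0, 0.0, 1.0, 1000) - exact)
```

The paper's surprising finding is that driving down this solver error (as Direct CMs do) does not translate into better samples.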

[AI-59] Confidence-aware Denoised Fine-tuning of Off-the-shelf Models for Certified Robustness

链接: https://arxiv.org/abs/2411.08933
作者: Suhyeok Jang,Seojin Kim,Jinwoo Shin,Jongheon Jeong
关键词-EN: large pre-trained models, large pre-trained, remarkable advances, advances in deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 26 pages; TMLR 2024; Code is available at this https URL

点击查看摘要

Abstract:The remarkable advances in deep learning have led to the emergence of many off-the-shelf classifiers, e.g., large pre-trained models. However, since they are typically trained on clean data, they remain vulnerable to adversarial attacks. Despite this vulnerability, their superior performance and transferability make off-the-shelf classifiers still valuable in practice, demanding further work to provide adversarial robustness for them in a post-hoc manner. A recently proposed method, denoised smoothing, leverages a denoiser model in front of the classifier to obtain provable robustness without additional training. However, the denoiser often creates hallucination, i.e., images that have lost the semantics of their originally assigned class, leading to a drop in robustness. Furthermore, its noise-and-denoise procedure introduces a significant distribution shift from the original distribution, causing the denoised smoothing framework to achieve sub-optimal robustness. In this paper, we introduce Fine-Tuning with Confidence-Aware Denoised Image Selection (FT-CADIS), a novel fine-tuning scheme to enhance the certified robustness of off-the-shelf classifiers. FT-CADIS is inspired by the observation that the confidence of off-the-shelf classifiers can effectively identify hallucinated images during denoised smoothing. Based on this, we develop a confidence-aware training objective to handle such hallucinated images and improve the stability of fine-tuning from denoised images. In this way, the classifier can be fine-tuned using only images that are beneficial for adversarial robustness. We also find that such a fine-tuning can be done by updating a small fraction of parameters of the classifier. Extensive experiments demonstrate that FT-CADIS has established the state-of-the-art certified robustness among denoised smoothing methods across all \ell_2 -adversary radius in various benchmarks.
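The confidence-based selection at the heart of FT-CADIS can be sketched as a softmax-confidence filter over denoised images. This is a simplification (the name and fixed threshold are assumptions; the actual training objective is more involved):

```python
import numpy as np

def select_confident(logits, labels, threshold=0.8):
    """Keep denoised images whose classifier confidence on the assigned label
    exceeds a threshold, i.e. images that likely kept their semantics.

    logits: (n, n_classes) off-the-shelf classifier outputs on denoised images.
    labels: (n,) originally assigned class per image.
    Returns a boolean mask over the n images.
    """
    # Numerically stable softmax per row.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    conf = p[np.arange(len(labels)), labels]
    return conf >= threshold
```

Images failing the filter are the likely "hallucinated" ones; fine-tuning then proceeds only on the confident subset.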

[AI-60] PyGen: A Collaborative Human-AI Approach to Python Package Creation

链接: https://arxiv.org/abs/2411.08932
作者: Saikat Barua,Mostafizur Rahman,Md Jafor Sadek,Rafiul Islam,Shehnaz Khaled,Md. Shohrab Hossain
关键词-EN: science and technology, serve as foundational, foundational elements, elements for advancement, advancement in contemporary
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 33 pages, 13 figures

点击查看摘要

Abstract:The principles of automation and innovation serve as foundational elements for advancement in contemporary science and technology. Here, we introduce Pygen, an automation platform designed to empower researchers, technologists, and hobbyists to bring abstract ideas to life as core, usable software tools written in Python. Pygen leverages the immense power of autoregressive large language models to augment human creativity during the ideation, iteration, and innovation process. By combining state-of-the-art language models with open-source code generation technologies, Pygen has significantly reduced the manual overhead of tool development. From a user prompt, Pygen automatically generates Python packages for a complete workflow from concept to package generation and documentation. The findings of our work show that Pygen considerably enhances the researcher’s productivity by enabling the creation of resilient, modular, and well-documented packages for various specialized purposes. We employ a prompt enhancement approach to distill the user’s package description into increasingly specific and actionable prompts. While being inherently an open-ended task, we have evaluated the generated packages and the documentation using Human Evaluation, LLM-based evaluation, and CodeBLEU, with detailed results in the results section. Furthermore, we documented our results, analyzed the limitations, and suggested strategies to alleviate them. Pygen is our vision of ethical automation, a framework that promotes inclusivity, accessibility, and collaborative development. This project marks the beginning of a large-scale effort towards creating tools where intelligent agents collaborate with humans to improve scientific and technological development substantially.
Our code and generated examples are open-sourced at [this https URL].

[AI-61] Retrieval of sun-induced plant fluorescence in the O_2-A absorption band from DESIS imagery ECCV

链接: https://arxiv.org/abs/2411.08925
作者: Jim Buffat,Miguel Pato,Kevin Alonso,Stefan Auer,Emiliano Carmona,Stefan Maier,Rupert Müller,Patrick Rademske,Uwe Rascher,Hanno Scharr
关键词-EN: retrieve spaceborne SIF, spaceborne SIF maps, spaceborne SIF, SIF, Spaceborne SIF products
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
*备注: submitted to ECCV CVPPA 2024, 14 pages, 8 figures

点击查看摘要

Abstract:We provide the first method for retrieving spaceborne SIF maps at 30 m ground resolution with a strong correlation (r^2 = 0.6) to high-quality airborne estimates of sun-induced fluorescence (SIF). SIF estimates can provide explanatory information for many tasks related to agricultural management and physiological studies. While SIF products from airborne platforms are accurate and spatially well resolved, the data acquisition of such products remains science-oriented and limited to temporally constrained campaigns. Spaceborne SIF products, on the other hand, are available globally with often sufficient revisit times. However, the spatial resolution of spaceborne SIF products is too coarse for agricultural applications. In view of ESA’s upcoming FLEX mission, we develop a method for SIF retrieval in the O_2-A band of hyperspectral DESIS imagery to provide first insights for spaceborne SIF retrieval at high spatial resolution. To this end, we train a simulation-based self-supervised network with a novel perturbation-based regularizer and test performance improvements under additional supervised regularization of atmospheric variable prediction. In a validation study with corresponding HyPlant-derived SIF estimates at 740 nm, we find that our model reaches a mean absolute difference of 0.78 mW/nm/sr/m^2.

[AI-62] Wireless Federated Learning over UAV-enabled Integrated Sensing and Communication

链接: https://arxiv.org/abs/2411.08918
作者: Shaba Shaon,Tien Nguyen,Lina Mohjazi,Aryan Kaushik,Dinh C. Nguyen
关键词-EN: enabled federated learning, unmanned aerial vehicles, aerial vehicles, enabled federated, federated learning
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: Accepted to IEEE Conference on Standards for Communications and Networking (CSCN), 6 pages

点击查看摘要

Abstract:This paper studies a new latency optimization problem in unmanned aerial vehicles (UAVs)-enabled federated learning (FL) with integrated sensing and communication. In this setup, distributed UAVs participate in model training using sensed data and collaborate with a base station (BS) serving as FL aggregator to build a global model. The objective is to minimize the FL system latency over UAV networks by jointly optimizing UAVs’ trajectory and resource allocation of both UAVs and the BS. The formulated optimization problem is difficult to solve due to its non-convexity. Hence, we develop a simple yet efficient iterative algorithm to find a high-quality approximate solution, by leveraging block coordinate descent and successive convex approximation techniques. Simulation results demonstrate the effectiveness of our proposed joint optimization strategy under practical parameter settings, reducing system latency by up to 68.54% compared to benchmark schemes.
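Block coordinate descent, one of the two techniques the algorithm leverages, alternates closed-form minimization over blocks of variables while holding the others fixed. A toy instance on a two-block convex objective (unrelated to the paper's latency objective) shows the mechanic:

```python
def block_coordinate_descent(n_iters=50):
    """Toy BCD: alternately minimise f(x, y) = (x-1)^2 + (y+2)^2 + 0.5*x*y
    over each block in closed form (set the partial derivative to zero)."""
    x, y = 0.0, 0.0
    for _ in range(n_iters):
        x = 1.0 - 0.25 * y   # argmin over x with y fixed: 2(x-1) + 0.5y = 0
        y = -2.0 - 0.25 * x  # argmin over y with x fixed: 2(y+2) + 0.5x = 0
    return x, y
```

The iterates converge to the joint minimizer (x, y) = (1.6, -2.4), where both partial derivatives vanish. In the paper, each "block" is a group of trajectory or resource variables, and the non-convex subproblems are first convexified via successive convex approximation.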

[AI-63] Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses

链接: https://arxiv.org/abs/2411.08910
作者: Sami Baral,Eamon Worden,Wen-Chiang Lim,Zhuang Luo,Christopher Santorelli,Ashish Gurung,Neil Heffernan
关键词-EN: Educational Data Mining, Data Mining, Educational Data, documented within Educational, feedback
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages including references, 4 figures, 9 tables

点击查看摘要

Abstract:The effectiveness of feedback in enhancing learning outcomes is well documented within Educational Data Mining (EDM). Various prior research has explored methodologies to enhance the effectiveness of feedback. Recent developments in Large Language Models (LLMs) have extended their utility in enhancing automated feedback systems. This study aims to explore the potential of LLMs in facilitating automated feedback in math education. We examine the effectiveness of LLMs in evaluating student responses by comparing 3 different models: Llama, SBERT-Canberra, and GPT4 model. The evaluation requires the model to provide both a quantitative score and qualitative feedback on the student’s responses to open-ended math problems. We employ Mistral, a version of Llama catered to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems. A similar approach was taken for training the SBERT model as well, while the GPT4 model used a zero-shot learning approach. We evaluate the model’s performance in scoring accuracy and the quality of feedback by utilizing judgments from 2 teachers. The teachers utilized a shared rubric in assessing the accuracy and relevance of the generated feedback. We conduct both quantitative and qualitative analyses of the model performance. By offering a detailed comparison of these methods, this study aims to further the ongoing development of automated feedback systems and outlines potential future directions for leveraging generative LLMs to create more personalized learning experiences.
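Scoring accuracy in such comparisons is typically summarized with simple agreement statistics between model-assigned and teacher-assigned scores; a minimal sketch (the study's exact metrics and rubric may differ):

```python
import numpy as np

def scoring_agreement(model_scores, teacher_scores):
    """Compare automated scores with teacher judgments (illustrative only):
    mean absolute error and exact-match rate over the same responses."""
    m = np.asarray(model_scores, dtype=float)
    t = np.asarray(teacher_scores, dtype=float)
    return {"mae": float(np.mean(np.abs(m - t))),
            "exact_match": float(np.mean(m == t))}
```

Lower MAE and higher exact-match indicate closer alignment with the teachers' shared rubric.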

[AI-64] Assessing the Auditability of AI-integrating Systems: A Framework and Learning Analytics Case Study

链接: https://arxiv.org/abs/2411.08906
作者: Linda Fernsel,Yannick Kalff,Katharina Simbeck
关键词-EN: integrate Artificial Intelligence, Learning Analytics, Artificial Intelligence, trustworthiness of Learning, integrate Artificial
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Audits contribute to the trustworthiness of Learning Analytics (LA) systems that integrate Artificial Intelligence (AI) and may be legally required in the future. We argue that the efficacy of an audit depends on the auditability of the audited system. Therefore, systems need to be designed with auditability in mind. We present a framework for assessing the auditability of AI-integrating systems that consists of three parts: (1) Verifiable claims about the validity, utility and ethics of the system, (2) Evidence on subjects (data, models or the system) in different types (documentation, raw sources and logs) to back or refute claims, (3) Evidence must be accessible to auditors via technical means (APIs, monitoring tools, explainable AI, etc.). We apply the framework to assess the auditability of Moodle’s dropout prediction system and a prototype AI-based LA. We find that Moodle’s auditability is limited by incomplete documentation, insufficient monitoring capabilities and a lack of available test data. The framework supports assessing the auditability of AI-based LA systems in use and improves the design of auditable systems and thus of audits.

[AI-65] Comment on Is Complexity an Illusion?

链接: https://arxiv.org/abs/2411.08897
作者: Gabriel Simmons
关键词-EN: Complexity an Illusion, Illusion, Abstract, Complexity, paper
类目: Artificial Intelligence (cs.AI)
*备注: Comment on arXiv:2404.07227

点击查看摘要

Abstract:The paper “Is Complexity an Illusion?” (Bennett, 2024) provides a formalism for complexity, learning, inference, and generalization, and introduces a formal definition for a “policy”. This reply shows that correct policies do not exist for a simple task of supervised multi-class classification, via mathematical proof and exhaustive search. Implications of this result are discussed, as well as possible responses and amendments to the theory.

[AI-66] Temporal Patterns of Multiple Long-Term Conditions in Welsh Individuals with Intellectual Disabilities: An Unsupervised Clustering Approach to Disease Trajectories

链接: https://arxiv.org/abs/2411.08894
作者: Rania Kousovista,Georgina Cosma,Emeka Abakasanga,Ashley Akbari,Francesco Zaccardi,Gyuchan Thomas Jun,Reza Kiani,Satheesh Gangadharan
关键词-EN: Identifying and understanding, multiple long-term conditions, effective healthcare management, intellectual disabilities, multiple long-term
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Identifying and understanding the co-occurrence of multiple long-term conditions (MLTC) in individuals with intellectual disabilities (ID) is vital for effective healthcare management. These individuals often face earlier onset and higher prevalence of MLTCs, yet specific co-occurrence patterns remain unexplored. This study applies an unsupervised approach to characterise MLTC clusters based on shared disease trajectories using electronic health records (EHRs) from 13069 individuals with ID in Wales (2000-2021). The population consisted of 52.3% males and 47.7% females, with an average of 4.5 conditions per patient. Disease associations and temporal directionality were assessed, followed by spectral clustering to group shared trajectories. Males under 45 formed a single cluster dominated by neurological conditions (32.4%), while males above 45 had three clusters, the largest featuring circulatory conditions (51.8%). Females under 45 formed one cluster with digestive conditions (24.6%) as most prevalent, while those aged 45 and older showed two clusters: one dominated by circulatory conditions (34.1%), and the other by digestive (25.9%) and musculoskeletal (21.9%) issues. Mental illness, epilepsy, and reflux were common across groups. Individuals above 45 had higher rates of circulatory and musculoskeletal issues. These clusters offer insights into disease progression in individuals with ID, informing targeted interventions and personalised healthcare strategies.
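Spectral clustering on a trajectory-similarity graph can be illustrated in pure numpy with a two-way partition via the Fiedler vector; this is a simplified stand-in for the multi-cluster spectral method used in the study:

```python
import numpy as np

def spectral_bipartition(A):
    """Two-way spectral clustering: the sign pattern of the Fiedler vector
    (eigenvector of the 2nd-smallest eigenvalue of L = D - A) splits the graph."""
    A = np.asarray(A, dtype=np.float64)
    L = np.diag(A.sum(axis=1)) - A       # unnormalised graph Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return (vecs[:, 1] > 0).astype(int)  # Fiedler-vector sign -> cluster label

# Two tightly connected condition groups joined by one weak edge (weight 0.1).
A = np.array([[0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
              [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]])
labels = spectral_bipartition(A)
```

In the study, edge weights would encode shared disease-trajectory strength between condition pairs, and k clusters would use the first k eigenvectors followed by k-means.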

[AI-67] Auto-assessment of assessment: A conceptual framework towards fulfilling the policy gaps in academic assessment practices

链接: https://arxiv.org/abs/2411.08892
作者: Wasiq Khan,Luke K. Topham,Peter Atherton,Raghad Al-Shabandar,Hoshang Kolivand,Iftikhar Khan,Abir Hussain
关键词-EN: Generative Artificial Intelligence, emerging Generative Artificial, Artificial Intelligence, including emerging Generative, Generative Artificial
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 20 Pages, 5 Figures, submitted for journal peer-review

点击查看摘要

Abstract:Education is being transformed by rapid advances in Artificial Intelligence (AI), including emerging Generative Artificial Intelligence (GAI). Such technology can significantly support academics and students by automating monotonous tasks and making personalised suggestions. However, despite the potential of the technology, there are significant concerns regarding AI misuse, particularly by students in assessments. There are two schools of thought: one advocates for a complete ban on it, while the other views it as a valuable educational tool, provided it is governed by a robust usage policy. This contradiction clearly indicates a major policy gap in academic practices, and new policies are required to uphold academic standards while enabling staff and students to benefit from technological advancements. We surveyed 117 academics from three countries (UK, UAE, and Iraq), and identified that most academics retain positive opinions regarding AI in education. For example, the majority of experienced academics do not favour complete bans, and they see the potential benefits of AI for students, teaching staff, and academic institutions. Importantly, academics specifically identified the particular benefits of AI for autonomous assessment (71.79% of respondents agreed). Therefore, for the first time, we propose a novel AI framework for autonomously evaluating students’ work (e.g., reports, coursework, etc.) and automatically assigning grades based on their knowledge and in-depth understanding of the submitted content. The survey results further highlight a significant lack of awareness of modern AI-based tools (e.g., ChatGPT) among experienced academics, a gap that must be addressed to uphold educational standards.

[AI-68] Calibrated Decision-Making through LLM -Assisted Retrieval

链接: https://arxiv.org/abs/2411.08891
作者: Chaeyun Jang,Hyungi Lee,Seanie Lee,Juho Lee
关键词-EN: large language models, large language, language models, decision-making tasks, Retrieval Augmented Generation
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recently, large language models (LLMs) have been increasingly used to support various decision-making tasks, assisting humans in making informed decisions. However, when LLMs confidently provide incorrect information, it can lead humans to make suboptimal decisions. To prevent LLMs from generating incorrect information on topics they are unsure of and to improve the accuracy of generated content, prior works have proposed Retrieval Augmented Generation (RAG), where external documents are referenced to generate responses. However, traditional RAG methods focus only on retrieving documents most relevant to the input query, without specifically aiming to ensure that the human user’s decisions are well-calibrated. To address this limitation, we propose a novel retrieval method called Calibrated Retrieval-Augmented Generation (CalibRAG), which ensures that decisions informed by the retrieved documents are well-calibrated. Then we empirically validate that CalibRAG improves calibration performance as well as accuracy, compared to other baselines across various datasets.
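For context, the plain dense-retrieval baseline that RAG systems build on is top-k cosine similarity over document embeddings; CalibRAG's contribution is to go beyond this by ensuring the downstream decision is well-calibrated. A sketch with hypothetical names:

```python
import numpy as np

def retrieve_topk(query_emb, doc_embs, k=3):
    """Plain dense retrieval: indices of the k documents most cosine-similar
    to the query (the relevance-only baseline CalibRAG improves upon)."""
    q = query_emb / np.linalg.norm(query_emb)
    D = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = D @ q
    return np.argsort(-sims)[:k]
```

A calibrated variant would additionally score how well each retrieved document supports a confident, correct decision, not just how similar it is to the query.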

[AI-69] Spotlight Session on Autonomous Weapons Systems at ICRC 34th International Conference

链接: https://arxiv.org/abs/2411.08890
作者: Susannah Kate Conroy
关键词-EN: Training weapons decision, humans make decisions, weapons systems, Autonomous weapons systems, AWS
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 8 pages, 2415 words, 1 figure. Panelist notes for the Spotlight Session on Autonomous Weapons Systems at the ICRC 34th International Conference 28-31 Oct 2024

点击查看摘要

Abstract:Autonomous weapons systems (AWS) change the way humans make decisions, the effect of those decisions and who is accountable for decisions made. We must remain vigilant, informed and human-centred as we tackle our deliberations on developing norms regarding their development, use and justification. Ways to enhance compliance in international humanitarian law (IHL) include: Training weapons decision makers in IHL; developing best practice in weapons reviews including requirements for industry to ensure that any new weapon, means or method of warfare is capable of being used lawfully; develop human-centred test and evaluation methods; invest in digital infrastructure to increase knowledge of the civilian environment in a conflict and its dynamics; invest in research on the real effects and consequences of civilian harms to the achievement of military and political objectives; improve secure communications between stakeholders in a conflict; and finally to upskill governments and NGOs in what is technically achievable with emerging technologies so that they can contribute to system requirements, test and evaluation protocols and operational rules of use and engagement. Governments are responsible for setting requirements for weapons systems. They are responsible for driving ethicality as well as lethality. Governments can require systems to be made and used to better protect civilians and protected objects. The UN can advocate for compliance with IHL, human rights, human-centred use of weapons systems and improved mechanisms to monitor and trace military decision making including those decisions affected by autonomous functionality.

[AI-70] Multilingual Standalone Trustworthy Voice-Based Social Network for Disaster Situations

链接: https://arxiv.org/abs/2411.08889
作者: Majid Behravan,Elham Mohammadrezaei,Mohamed Azab,Denis Gracanin
关键词-EN: accurate information dissemination, complicating response efforts, information dissemination, exacerbating vulnerabilities, barriers often hinder
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted for publication in IEEE UEMCON 2024, to appear in December 2024. 7 pages, 3 figures

点击查看摘要

Abstract:In disaster scenarios, effective communication is crucial, yet language barriers often hinder timely and accurate information dissemination, exacerbating vulnerabilities and complicating response efforts. This paper presents a novel, multilingual, voice-based social network specifically designed to address these challenges. The proposed system integrates advanced artificial intelligence (AI) with blockchain technology to enable secure, asynchronous voice communication across multiple languages. The application operates independently of external servers, ensuring reliability even in compromised environments by functioning offline through local networks. Key features include AI-driven real-time translation of voice messages, ensuring seamless cross-linguistic communication, and blockchain-enabled storage for secure, immutable records of all interactions, safeguarding message integrity. Designed for cross-platform use, the system offers consistent performance across devices, from mobile phones to desktops, making it highly adaptable in diverse disaster situations. Evaluation metrics demonstrate high accuracy in speech recognition and translation, low latency, and user satisfaction, validating the system’s effectiveness in enhancing communication during crises. This solution represents a significant advancement in disaster communication, bridging language gaps to support more inclusive and efficient emergency response.

[AI-71] Exploring Capabilities of Time Series Foundation Models in Building Analytics

Link: https://arxiv.org/abs/2411.08888
Authors: Xiachong Lin,Arian Prabowo,Imran Razzak,Hao Xue,Matthew Amos,Sam Behrens,Flora D. Salim
Keywords: Internet of Things, infrastructure with Internet, building energy consumption, networks has transformed, growing integration
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*Comments: 7 pages, 1 figure, and 4 tables

Click to view abstract

Abstract:The growing integration of digitized infrastructure with Internet of Things (IoT) networks has transformed the management and optimization of building energy consumption. By leveraging IoT-based monitoring systems, stakeholders such as building managers, energy suppliers, and policymakers can make data-driven decisions to improve energy efficiency. However, accurate energy forecasting and analytics face persistent challenges, primarily due to the inherent physical constraints of buildings and the diverse, heterogeneous nature of IoT-generated data. In this study, we conduct a comprehensive benchmarking of two publicly available IoT datasets, evaluating the performance of time series foundation models in the context of building energy analytics. Our analysis shows that single-modal models demonstrate significant promise in overcoming the complexities of data variability and physical limitations in buildings, with future work focusing on optimizing multi-modal models for sustainable energy management.

[AI-72] Enhancing Lie Detection Accuracy: A Comparative Study of Classic ML, CNN, and GCN Models using Audio-Visual Features

Link: https://arxiv.org/abs/2411.08885
Authors: Abdelrahman Abdelwahab,Abdelrahman Abdelwahab,Ayaan Vaswani,Advait Bharathulwar,Arnav Kommaraju
Keywords: Inaccuracies in polygraph, false information, wrongful convictions, political systems, polygraph tests
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*Comments: 11 pages, 18 figures

Click to view abstract

Abstract:Inaccuracies in polygraph tests often lead to wrongful convictions, false information, and bias, all of which have significant consequences for both legal and political systems. Recently, analyzing facial micro-expressions has emerged as a method for detecting deception; however, current models have not reached high accuracy and generalizability. The purpose of this study is to aid in remedying these problems. The unique multimodal transformer architecture used in this study improves upon previous approaches by using auditory inputs, visual facial micro-expressions, and manually transcribed gesture annotations, moving closer to a reliable non-invasive lie detection model. Visual and auditory features were extracted using the Vision Transformer and OpenSmile models respectively, which were then concatenated with the transcriptions of participants' micro-expressions and gestures. Various models were trained for the classification of lies and truths using these processed and concatenated features. The CNN Conv1D multimodal model achieved an average accuracy of 95.4%. However, further research is still required to create higher-quality datasets and even more generalized models for more diverse applications.

[AI-73] KisanQRS: A Deep Learning-based Automated Query-Response System for Agricultural Decision-Making

Link: https://arxiv.org/abs/2411.08883
Authors: Mohammad Zia Ur Rehman,Devraj Raghuvanshi,Nagendra Kumar
Keywords: Delivering prompt information, Delivering prompt, agricultural decision-making, prompt information, information and guidance
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Delivering prompt information and guidance to farmers is critical in agricultural decision-making. Farmers' helpline centres are heavily reliant on the expertise and availability of call centre agents, leading to inconsistent quality and delayed responses. To this end, this article presents Kisan Query Response System (KisanQRS), a Deep Learning-based robust query-response framework for the agriculture sector. KisanQRS integrates semantic and lexical similarities of farmers' queries and employs a rapid threshold-based clustering method. The clustering algorithm is based on a linear search technique to iterate through all queries and organize them into clusters according to their similarity. For query mapping, LSTM is found to be the optimal method. Our proposed answer retrieval method clusters candidate answers for a crop, ranks these answer clusters based on the number of answers in a cluster, and selects the leader of each cluster. The dataset used in our analysis consists of a subset of 34 million call logs from the Kisan Call Centre (KCC), operated under the Government of India. We evaluated the performance of the query mapping module on the data of five major states of India with 300,000 samples and the quantifiable outcomes demonstrate that KisanQRS significantly outperforms traditional techniques by achieving 96.58% top F1-score for a state. The answer retrieval module is evaluated on 10,000 samples and it achieves a competitive NDCG score of 96.20%. KisanQRS is useful in enabling farmers to make informed decisions about their farming practices by providing quick and pertinent responses to their queries.
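The rapid threshold-based clustering with a linear search can be sketched as below. The paper combines semantic and lexical similarity; this toy version uses only a lexical ratio from `difflib` as a stand-in, and the threshold value and sample queries are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Lexical similarity stand-in; KisanQRS combines lexical and
    # semantic (embedding-based) similarity, which we approximate here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def threshold_cluster(queries, threshold=0.6):
    """Linear-search clustering: each query joins the first existing
    cluster whose representative is similar enough, else starts a new one."""
    clusters = []  # list of lists; clusters[i][0] is the representative
    for q in queries:
        for cluster in clusters:
            if similarity(q, cluster[0]) >= threshold:
                cluster.append(q)
                break
        else:
            clusters.append([q])
    return clusters


queries = [
    "how to control pests in cotton",
    "how to control pest in cotton crop",
    "best fertilizer for wheat",
    "which fertilizer is best for wheat",
]
clusters = threshold_cluster(queries)
assert len(clusters) == 2
```

A single pass over the queries with a per-cluster representative keeps the linear-search cost at O(n · k) for n queries and k clusters, which is what makes the approach "rapid" at call-centre scale.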

[AI-74] A Novel Multimodal System to Predict Agitation in People with Dementia Within Clinical Settings: A Proof of Concept

Link: https://arxiv.org/abs/2411.08882
Authors: Abeer Badawi,Somayya Elmoghazy,Samira Choudhury,Sara Elgazzar,Khalid Elgazzar,Amer Burhan
Keywords: neurodegenerative condition, condition that combines, combines several diseases, diseases and impacts, impacts millions
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Dementia is a neurodegenerative condition that combines several diseases and impacts millions around the world and those around them. Although cognitive impairment is profoundly disabling, it is the noncognitive features of dementia, referred to as Neuropsychiatric Symptoms (NPS), that are most closely associated with a diminished quality of life. Agitation and aggression (AA) in people living with dementia (PwD) contribute to distress and increased healthcare demands. Current assessment methods rely on caregiver intervention and reporting of incidents, introducing subjectivity and bias. Artificial Intelligence (AI) and predictive algorithms offer a potential solution for detecting AA episodes in PwD when utilized in real-time. We present a system, developed as part of a 5-year study, that integrates a multimodal approach, utilizing the EmbracePlus wristband and a video detection system to predict AA in severe dementia patients. We conducted a pilot study with three participants at the Ontario Shores Mental Health Institute to validate the functionality of the system. The system collects and processes raw and digital biomarkers from the EmbracePlus wristband to accurately predict AA. The system also detected pre-agitation patterns at least six minutes before the AA event, which was not previously discovered from the EmbracePlus wristband. Furthermore, the privacy-preserving video system uses a masking tool to hide the features of the people in frames and employs a deep learning model for AA detection. The video system also helps identify the actual start and end time of the agitation events for labeling. The promising results of the preliminary data analysis underscore the ability of the system to predict AA events. The ability of the proposed system to run autonomously in real-time and identify AA and pre-agitation symptoms without external assistance represents a significant milestone in this research field.

[AI-75] Can We Trust AI Agents? An Experimental Study Towards Trustworthy LLM-Based Multi-Agent Systems for AI Ethics

Link: https://arxiv.org/abs/2411.08881
Authors: José Antonio Siqueira de Cerqueira,Mamia Agbese,Rebekah Rousi,Nannan Xi,Juho Hamari,Pekka Abrahamsson
Keywords: Large Language Models, including Large Language, Language Models, Large Language, supporting diverse tasks
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:AI-based systems, including Large Language Models (LLMs), impact millions by supporting diverse tasks but face issues like misinformation, bias, and misuse. Ethical AI development is crucial as new technologies and concerns emerge, but objective, practical ethical guidance remains debated. This study examines LLMs in developing ethical AI systems, assessing how trustworthiness-enhancing techniques affect ethical AI output generation. Using the Design Science Research (DSR) method, we identify techniques for LLM trustworthiness: multi-agents, distinct roles, structured communication, and multiple rounds of debate. We design the multi-agent prototype LLM-BMAS, where agents engage in structured discussions on real-world ethical AI issues from the AI Incident Database. The prototype’s performance is evaluated through thematic analysis, hierarchical clustering, ablation studies, and source code execution. Our system generates around 2,000 lines per run, compared to only 80 lines in the ablation study. Discussions reveal terms like bias detection, transparency, accountability, user consent, GDPR compliance, fairness evaluation, and EU AI Act compliance, showing LLM-BMAS’s ability to generate thorough source code and documentation addressing often-overlooked ethical AI issues. However, practical challenges in source code integration and dependency management may limit smooth system adoption by practitioners. This study aims to shed light on enhancing trustworthiness in LLMs to support practitioners in developing ethical AI-based systems.

[AI-76] SMILE-UHURA Challenge – Small Vessel Segmentation at Mesoscopic Scale from Ultra-High Resolution 7T Magnetic Resonance Angiograms

Link: https://arxiv.org/abs/2411.09593
Authors: Soumick Chatterjee,Hendrik Mattern,Marc Dörner,Alessandro Sciarra,Florian Dubost,Hannes Schnurre,Rupali Khatun,Chun-Chih Yu,Tsung-Lin Hsieh,Yi-Shan Tsai,Yi-Zeng Fang,Yung-Ching Yang,Juinn-Dar Huang,Marshall Xu,Siyu Liu,Fernanda L. Ribeiro,Saskia Bollmann,Karthikesh Varma Chintalapati,Chethan Mysuru Radhakrishna,Sri Chandana Hudukula Ram Kumara,Raviteja Sutrave,Abdul Qayyum,Moona Mazher,Imran Razzak,Cristobal Rodero,Steven Niederren,Fengming Lin,Yan Xia,Jiacheng Wang,Riyu Qiu,Liansheng Wang,Arya Yazdan Panah,Rosana El Jurdi,Guanghui Fu,Janan Arslan,Ghislain Vaillant,Romain Valabregue,Didier Dormont,Bruno Stankoff,Olivier Colliot,Luisa Vargas,Isai Daniel Chacón,Ioannis Pitsiorlas,Pablo Arbeláez,Maria A. Zuluaga,Stefanie Schreiber,Oliver Speck,Andreas Nürnberger
Keywords: Small Vessel Diseases, Cerebral Small Vessel, human brain receives, brain receives nutrients, affecting small vessels
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:The human brain receives nutrients and oxygen through an intricate network of blood vessels. Pathology affecting small vessels, at the mesoscopic scale, represents a critical vulnerability within the cerebral blood supply and can lead to severe conditions, such as Cerebral Small Vessel Diseases. The advent of 7 Tesla MRI systems has enabled the acquisition of higher spatial resolution images, making it possible to visualise such vessels in the brain. However, the lack of publicly available annotated datasets has impeded the development of robust, machine learning-driven segmentation algorithms. To address this, the SMILE-UHURA challenge was organised. This challenge, held in conjunction with the ISBI 2023, in Cartagena de Indias, Colombia, aimed to provide a platform for researchers working on related topics. The SMILE-UHURA challenge addresses the gap in publicly available annotated datasets by providing an annotated dataset of Time-of-Flight angiography acquired with 7T MRI. This dataset was created through a combination of automated pre-segmentation and extensive manual refinement. In this manuscript, sixteen submitted methods and two baseline methods are compared both quantitatively and qualitatively on two different datasets: held-out test MRAs from the same dataset as the training data (with labels kept secret) and a separate 7T ToF MRA dataset where both input volumes and labels are kept secret. The results demonstrate that most of the submitted deep learning methods, trained on the provided training dataset, achieved reliable segmentation performance. Dice scores reached up to 0.838 ± 0.066 and 0.716 ± 0.125 on the respective datasets, with an average performance of up to 0.804 ± 0.15.

[AI-77] An Explainable Attention Model for Cervical Precancer Risk Classification using Colposcopic Images

Link: https://arxiv.org/abs/2411.09469
Authors: Smith K. Khare,Berit Bargum Booth,Victoria Blanes-Vidal,Lone Kjeld Petersen,Esmaeil S. Nadimi
Keywords: effective preventive interventions, playing critical roles, cervical precancer risk, assessment playing critical, worldwide health issue
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
*Comments: 19 pages, 9 figures, and 7 tables

Click to view abstract

Abstract:Cervical cancer remains a major worldwide health issue, with early identification and risk assessment playing critical roles in effective preventive interventions. This paper presents the Cervix-AID-Net model for cervical precancer risk classification. The study designs and evaluates the proposed Cervix-AID-Net model based on patients' colposcopy images. The model comprises a Convolutional Block Attention Module (CBAM) and convolutional layers that extract interpretable and representative features of colposcopic images to distinguish high-risk and low-risk cervical precancer. In addition, the proposed Cervix-AID-Net model integrates four explainable techniques, namely gradient class activation maps, Local Interpretable Model-agnostic Explanations, CartoonX, and pixel rate distortion explanation based on output feature maps and input features. The evaluation using holdout and ten-fold cross-validation techniques yielded a classification accuracy of 99.33% and 99.81%. The analysis revealed that CartoonX provides meticulous explanations for the decision of the Cervix-AID-Net model due to its ability to provide the relevant piece-wise smooth part of the image. The effect of Gaussian noise and blur on the input shows that the performance remains unchanged up to Gaussian noise of 3% and blur of 10%, while the performance reduces thereafter. A comparison study of the proposed model's performance compared to other deep learning approaches highlights the Cervix-AID-Net model's potential as a supplemental tool for increasing the effectiveness of cervical precancer risk assessment. The proposed method, which incorporates the CBAM and explainable artificial intelligence integration, has the potential to influence cervical cancer prevention and early detection, improving patient outcomes and lowering the worldwide burden of this preventable disease.

[AI-78] AI-driven inverse design of materials: Past present and future

Link: https://arxiv.org/abs/2411.09429
Authors: Xiao-Qi Han,Xin-De Wang,Meng-Yuan Xu,Zhen Feng,Bo-Wen Yao,Peng-Jie Guo,Ze-Feng Gao,Zhong-Yi Lu
Keywords: inverse design, materials, discovery of advanced, AI-driven inverse design, design
Subjects: Materials Science (cond-mat.mtrl-sci); Superconductivity (cond-mat.supr-con); Artificial Intelligence (cs.AI)
*Comments: 43 pages, 5 figures, 2 tables

Click to view abstract

Abstract:The discovery of advanced materials is the cornerstone of human technological development and progress. The structures of materials and their corresponding properties are essentially the result of a complex interplay of multiple degrees of freedom such as lattice, charge, spin, symmetry, and topology. This poses significant challenges for the inverse design methods of materials. Humans have long explored new materials through a large number of experiments and proposed corresponding theoretical systems to predict new material properties and structures. With the improvement of computational power, researchers have gradually developed various electronic structure calculation methods, particularly those based on density functional theory, as well as high-throughput computational methods. Recently, the rapid development of artificial intelligence technology in the field of computer science has enabled the effective characterization of the implicit association between material properties and structures, thus opening up an efficient paradigm for the inverse design of functional materials. Significant progress has been made in the inverse design of materials based on generative and discriminative models, attracting widespread attention from researchers. Considering this rapid technological progress, in this survey, we look back on the latest advancements in AI-driven inverse design of materials by introducing the background, key findings, and mainstream technological development routes. In addition, we summarize the remaining issues for future directions. This survey provides the latest overview of AI-driven inverse design of materials, which can serve as a useful resource for researchers.

[AI-79] Quantum Machine Learning: An Interplay Between Quantum Computing and Machine Learning

Link: https://arxiv.org/abs/2411.09403
Authors: Jun Qi,Chao-Han Yang,Samuel Yen-Chi Chen,Pin-Yu Chen
Keywords: rapidly growing field, machine learning, traditional machine learning, combines quantum computing, quantum computing principles
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
*Comments: In submission

Click to view abstract

Abstract:Quantum machine learning (QML) is a rapidly growing field that combines quantum computing principles with traditional machine learning. It seeks to revolutionize machine learning by harnessing the unique capabilities of quantum mechanics and employs machine learning techniques to advance quantum computing research. This paper introduces quantum computing for the machine learning paradigm, where variational quantum circuits (VQC) are used to develop QML architectures on noisy intermediate-scale quantum (NISQ) devices. We discuss machine learning for the quantum computing paradigm, showcasing our recent theoretical and empirical findings. In particular, we delve into future directions for studying QML, exploring the potential industrial impacts of QML research.
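As a toy illustration of the VQC building block mentioned above: a single-qubit circuit applying RY(θ) to |0⟩ has expectation ⟨Z⟩ = cos θ, and its gradient can be obtained exactly with the parameter-shift rule used to train VQCs on quantum hardware. This is a generic pure-Python sketch, not tied to the paper or to any QML framework.

```python
import math


def ry_expectation_z(theta):
    """<Z> after applying RY(theta) to |0>.
    State amplitudes: [cos(theta/2), sin(theta/2)],
    so <Z> = cos^2(theta/2) - sin^2(theta/2) = cos(theta)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return c * c - s * s


def parameter_shift_grad(f, theta, shift=math.pi / 2):
    """Exact gradient for gates generated by Pauli operators:
    df/dtheta = (f(theta + pi/2) - f(theta - pi/2)) / 2."""
    return (f(theta + shift) - f(theta - shift)) / 2


theta = 0.7
value = ry_expectation_z(theta)
grad = parameter_shift_grad(ry_expectation_z, theta)

assert abs(value - math.cos(theta)) < 1e-12
assert abs(grad - (-math.sin(theta))) < 1e-12  # d/dθ cos θ = -sin θ
```

The parameter-shift rule matters on NISQ devices precisely because, unlike finite differences, it evaluates the circuit at two well-separated points and is exact rather than an approximation.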

[AI-80] Automated Segmentation of Ischemic Stroke Lesions in Non-Contrast Computed Tomography Images for Enhanced Treatment and Prognosis MICCAI

Link: https://arxiv.org/abs/2411.09402
Authors: Toufiq Musah,Prince Ebenezer Adjei,Kojo Obed Otoo
Keywords: death worldwide, prevalent in low, middle-income countries, ischemic stroke, increasingly prevalent
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 7 pages, 3 figures, MICCAI Meets Africa Workshop

Click to view abstract

Abstract:Stroke is the second leading cause of death worldwide, and is increasingly prevalent in low- and middle-income countries (LMICs). Timely interventions can significantly influence stroke survivability and the quality of life after treatment. However, the standard and most widely available imaging method for confirming strokes and their sub-types, the NCCT, is more challenging and time-consuming to employ in cases of ischemic stroke. For this reason, we developed an automated method for ischemic stroke lesion segmentation in NCCTs using the nnU-Net framework, aimed at enhancing early treatment and improving the prognosis of ischemic stroke patients. We achieved Dice scores of 0.596 and Intersection over Union (IoU) scores of 0.501 on the sampled dataset. After adjusting for outliers, these scores improved to 0.752 for the Dice score and 0.643 for the IoU. Proper delineation of the region of infarction can help clinicians better assess the potential impact of the infarction, and guide treatment procedures.
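The Dice and IoU scores reported above are standard overlap metrics for binary segmentation masks. The generic implementation below (not the authors' code) shows how both are computed, and why Dice is always at least as large as IoU (Dice = 2·IoU / (1 + IoU)).

```python
def dice_and_iou(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|), IoU = |A∩B| / |A∪B|
    for binary masks given as flat 0/1 sequences of equal length."""
    inter = sum(p & t for p, t in zip(pred, target))
    p_sum, t_sum = sum(pred), sum(target)
    union = p_sum + t_sum - inter
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


pred   = [0, 1, 1, 1, 0, 0, 1, 0]
target = [0, 1, 1, 0, 0, 1, 1, 0]
dice, iou = dice_and_iou(pred, target)
# inter = 3, |pred| = 4, |target| = 4 -> dice = 6/8 = 0.75, iou = 3/5 = 0.6
assert abs(dice - 0.75) < 1e-12 and abs(iou - 0.6) < 1e-12
assert abs(dice - 2 * iou / (1 + iou)) < 1e-12
```

This monotone relation explains why the paper's Dice (0.596) exceeds its IoU (0.501) on the same predictions.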

[AI-81] Transferable Adversarial Attacks against ASR

Link: https://arxiv.org/abs/2411.09220
Authors: Xiaoxue Gao,Zexin Li,Yiming Chen,Cong Liu,Haizhou Li
Keywords: automatic speech recognition, minor input perturbations, ASR models, ASR, ASR model
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*Comments: IEEE SPL

Click to view abstract

Abstract:Given the extensive research and real-world applications of automatic speech recognition (ASR), ensuring the robustness of ASR models against minor input perturbations becomes a crucial consideration for maintaining their effectiveness in real-time scenarios. Previous explorations into ASR model robustness have predominantly revolved around evaluating accuracy on white-box settings with full access to ASR models. Nevertheless, full ASR model details are often not available in real-world applications. Therefore, evaluating the robustness of black-box ASR models is essential for a comprehensive understanding of ASR model resilience. In this regard, we thoroughly study the vulnerability of practical black-box attacks in cutting-edge ASR models and propose to employ two advanced time-domain-based transferable attacks alongside our differentiable feature extractor. We also propose a speech-aware gradient optimization approach (SAGO) for ASR, which forces mistranscription with minimal impact on human imperceptibility through voice activity detection rule and a speech-aware gradient-oriented optimizer. Our comprehensive experimental results reveal performance enhancements compared to baseline approaches across five models on two databases.

[AI-82] RibCageImp: A Deep Learning Framework for 3D Ribcage Implant Generation

Link: https://arxiv.org/abs/2411.09204
Authors: Gyanendra Chaubey,Aiman Farooq,Azad Singh,Deepak Mishra
Keywords: structures requires precise, resected ribcage structures, ribcage structures requires, requires precise, recovery of damaged
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
*Comments:

Click to view abstract

Abstract:The recovery of damaged or resected ribcage structures requires precise, custom-designed implants to restore the integrity and functionality of the thoracic cavity. Traditional implant design methods rely mainly on manual processes, making them time-consuming and susceptible to variability. In this work, we explore the feasibility of automated ribcage implant generation using deep learning. We present a framework based on 3D U-Net architecture that processes CT scans to generate patient-specific implant designs. To the best of our knowledge, this is the first investigation into automated thoracic implant generation using deep learning approaches. Our preliminary results, while moderate, highlight both the potential and the significant challenges in this complex domain. These findings establish a foundation for future research in automated ribcage reconstruction and identify key technical challenges that need to be addressed for practical implementation.

[AI-83] IDCIA: Immunocytochemistry Dataset for Cellular Image Analysis

Link: https://arxiv.org/abs/2411.08992
Authors: Abdurahman Ali Mohammed,Catherine Fonder,Donald S. Sakaguchi,Wallapak Tavanapong,Surya K. Mallapragada,Azeez Idris
Keywords: annotated microscopic cellular, cellular image analysis, cellular image, microscopic cellular image, improve the effectiveness
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:We present a new annotated microscopic cellular image dataset to improve the effectiveness of machine learning methods for cellular image analysis. Cell counting is an important step in cell analysis. Typically, domain experts manually count cells in a microscopic image. Automated cell counting can potentially eliminate this tedious, time-consuming process. However, a good, labeled dataset is required for training an accurate machine learning model. Our dataset includes microscopic images of cells, and for each image, the cell count and the location of individual cells. The data were collected as part of an ongoing study investigating the potential of electrical stimulation to modulate stem cell differentiation and possible applications for neural repair. Compared to existing publicly available datasets, our dataset has more images of cells stained with more variety of antibodies (protein components of immune responses against invaders) typically used for cell analysis. The experimental results on this dataset indicate that none of the five existing models under this study are able to achieve a sufficiently accurate count to replace the manual methods. The dataset is available at this https URL.

[AI-84] Fluoroformer: Scaling multiple instance learning to multiplexed images via attention-based channel fusion ML4H

Link: https://arxiv.org/abs/2411.08975
Authors: Marc Harary,Eliezer M. Van Allen,William Lotter
Keywords: multiple instance learning, instance learning, current approaches, hematoxylin and eosin, slide images
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 14 pages

Click to view abstract

Abstract:Though multiple instance learning (MIL) has been a foundational strategy in computational pathology for processing whole slide images (WSIs), current approaches are designed for traditional hematoxylin and eosin (H&E) slides rather than emerging multiplexed technologies. Here, we present an MIL strategy, the Fluoroformer module, that is specifically tailored to multiplexed WSIs by leveraging scaled dot-product attention (SDPA) to interpretably fuse information across disparate channels. On a cohort of 434 non-small cell lung cancer (NSCLC) samples, we show that the Fluoroformer both obtains strong prognostic performance and recapitulates immuno-oncological hallmarks of NSCLC. Our technique thereby provides a path for adapting state-of-the-art AI techniques to emerging spatial biology assays.
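The scaled dot-product attention (SDPA) at the core of the channel-fusion idea can be sketched without any deep-learning framework. The toy embeddings and dimensions below are invented; the actual Fluoroformer operates on learned per-channel embeddings inside a larger MIL pipeline.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def scaled_dot_product_attention(Q, K, V):
    """SDPA over a set of channel embeddings: softmax(QK^T / sqrt(d)) V."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [dot(q, k) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


# Three "tokens", e.g. one embedding per fluorescence channel (toy values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = scaled_dot_product_attention(tokens, tokens, tokens)
assert len(fused) == 3 and len(fused[0]) == 2
```

Because each output is a softmax-weighted (convex) combination of the channel values, the attention weights themselves can be inspected, which is what makes this style of fusion "interpretable" across channels.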

[AI-85] A Machine Learning based Hybrid Receiver for 5G NR PRACH

Link: https://arxiv.org/abs/2411.08919
Authors: Rohit Singh,Anil Kumar Yerrapragada,Radha Krishna Ganti
Keywords: Random Access Channel, Physical Random Access, Random Access, Base Station, Random Access starts
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
*Comments: 6 pages, 9 figures

Click to view abstract

Abstract:Random Access is a critical procedure using which a User Equipment (UE) identifies itself to a Base Station (BS). Random Access starts with the UE transmitting a random preamble on the Physical Random Access Channel (PRACH). In a conventional BS receiver, the UE’s specific preamble is identified by correlation with all the possible preambles. The PRACH signal is also used to estimate the timing advance which is induced by propagation delay. Correlation-based receivers suffer from false peaks and missed detection in scenarios dominated by high fading and low signal-to-noise ratio. This paper describes the design of a hybrid receiver that consists of an AI/ML model for preamble detection followed by conventional peak detection for the Timing Advance estimation. The proposed receiver combines the Power Delay Profiles of correlation windows across multiple antennas and uses the combination as input to a Neural Network model. The model predicts the presence or absence of a user in a particular preamble window, after which the timing advance is estimated by peak detection. Results show superior performance of the hybrid receiver compared to conventional receivers both for simulated and real hardware-captured datasets.
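The conventional correlation stage of such a receiver can be sketched as follows: compute a power delay profile (PDP) per antenna, combine them, and read the timing advance off the peak. The neural-network preamble-detection step that is the paper's contribution is omitted; the preamble (a binary stand-in for the actual Zadoff-Chu sequences), window length, antenna count, and noise level are all invented toy values.

```python
import random

random.seed(7)

PREAMBLE = [1, -1, 1, 1, -1, -1, 1, -1, 1, 1, -1, 1]
TRUE_DELAY = 5       # propagation delay in samples (the timing-advance target)
NUM_ANTENNAS = 4
WINDOW = 24

def received_signal(delay, noise=0.3):
    """Preamble delayed by `delay` samples plus white noise (one antenna)."""
    sig = [0.0] * WINDOW
    for i, chip in enumerate(PREAMBLE):
        sig[delay + i] = float(chip)
    return [s + random.gauss(0.0, noise) for s in sig]

def power_delay_profile(rx):
    """Squared correlation of the received signal with the known preamble."""
    lags = WINDOW - len(PREAMBLE) + 1
    return [sum(rx[lag + i] * PREAMBLE[i] for i in range(len(PREAMBLE))) ** 2
            for lag in range(lags)]

# Combine PDPs across antennas, then estimate the timing advance by peak search.
combined = [0.0] * (WINDOW - len(PREAMBLE) + 1)
for _ in range(NUM_ANTENNAS):
    for lag, p in enumerate(power_delay_profile(received_signal(TRUE_DELAY))):
        combined[lag] += p

estimated_delay = max(range(len(combined)), key=combined.__getitem__)
assert estimated_delay == TRUE_DELAY
```

Summing PDPs across antennas is what suppresses per-antenna noise and fading; in the hybrid receiver this combined profile is exactly what feeds the neural detector, with peak search retained only for the timing-advance estimate.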

[AI-86] RNA-GPT: Multimodal Generative System for RNA Sequence Understanding NEURIPS2024

Link: https://arxiv.org/abs/2411.08900
Authors: Yijia Xiao,Edward Sun,Yiqiao Jin,Wei Wang
Keywords: carry genetic information, genetic information vital, RNA, vital for life, development and biotechnology
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*Comments: Machine Learning for Structural Biology Workshop, NeurIPS 2024

Click to view abstract

Abstract:RNAs are essential molecules that carry genetic information vital for life, with profound implications for drug development and biotechnology. Despite this importance, RNA research is often hindered by the vast literature available on the topic. To streamline this process, we introduce RNA-GPT, a multi-modal RNA chat model designed to simplify RNA discovery by leveraging extensive RNA literature. RNA-GPT integrates RNA sequence encoders with linear projection layers and state-of-the-art large language models (LLMs) for precise representation alignment, enabling it to process user-uploaded RNA sequences and deliver concise, accurate responses. Built on a scalable training pipeline, RNA-GPT utilizes RNA-QA, an automated system that gathers RNA annotations from RNACentral using a divide-and-conquer approach with GPT-4o and latent Dirichlet allocation (LDA) to efficiently handle large datasets and generate instruction-tuning samples. Our experiments indicate that RNA-GPT effectively addresses complex RNA queries, thereby facilitating RNA research. Additionally, we present RNA-QA, a dataset of 407,616 RNA samples for modality alignment and instruction tuning, further advancing the potential of RNA research tools.

[AI-87] FinVision: A Multi-Agent Framework for Stock Market Prediction

Link: https://arxiv.org/abs/2411.08899
Authors: Sorouralsadat Fatemi,Yuheng Hu
Keywords: vast amounts, learning methods require, data, methods require large, require large training
Subjects: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI)
*Comments: Accepted at ICAIF 2024

Click to view abstract

Abstract:Financial trading has been a challenging task, as it requires the integration of vast amounts of data from various modalities. Traditional deep learning and reinforcement learning methods require large training data and often involve encoding various data types into numerical formats for model input, which limits the explainability of model behavior. Recently, LLM-based agents have demonstrated remarkable advancements in handling multi-modal data, enabling them to execute complex, multi-step decision-making tasks while providing insights into their thought processes. This research introduces a multi-modal multi-agent system designed specifically for financial trading tasks. Our framework employs a team of specialized LLM-based agents, each adept at processing and interpreting various forms of financial data, such as textual news reports, candlestick charts, and trading signal charts. A key feature of our approach is the integration of a reflection module, which conducts analyses of historical trading signals and their outcomes. This reflective process is instrumental in enhancing the decision-making capabilities of the system for future trading scenarios. Furthermore, the ablation studies indicate that the visual reflection module plays a crucial role in enhancing the decision-making capabilities of our framework.

[AI-88] Deep Learning-Based CKM Construction with Image Super-Resolution

Link: https://arxiv.org/abs/2411.08887
Authors: Shiyu Wang,Xiaoli Xu,Yong Zeng
Keywords: achieving environment awareness, environment awareness, wireless systems, Channel knowledge, Channel knowledge map
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Channel knowledge map (CKM) is a novel technique for achieving environment awareness, and thereby improving the communication and sensing performance for wireless systems. A fundamental problem associated with CKM is how to construct a complete CKM that provides channel knowledge for a large number of locations based solely on sparse data measurements. This problem bears similarities to the super-resolution (SR) problem in image processing. In this letter, we propose an effective deep learning-based CKM construction method that leverages the image SR network known as SRResNet. Unlike most existing studies, our approach does not require any additional input beyond the sparsely measured data. In addition to the conventional path loss map construction, our approach can also be applied to construct channel angle maps (CAMs), thanks to the use of a new dataset called CKMImageNet. The numerical results demonstrate that our method outperforms interpolation-based methods such as nearest neighbour and bicubic interpolation, as well as the SRGAN method in CKM construction. Furthermore, only 1/16 of the locations need to be measured in order to achieve a root mean square error (RMSE) of 1.1 dB in path loss.
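The nearest-neighbour baseline the paper compares against, and the RMSE metric it reports, can be sketched on a toy path-loss map. The ground-truth model, grid size, and transmitter location below are invented; only 1/16 of the grid is "measured", echoing the measurement ratio in the abstract.

```python
import math

def path_loss(x, y):
    # Toy ground truth: log-distance path loss from a transmitter at (0, 0).
    return 40.0 + 20.0 * math.log10(1.0 + math.hypot(x, y))

GRID = 8
# Measure only every 4th row/column: 4 of 64 locations (1/16 of the grid).
measured = {(x, y): path_loss(x, y) for x in (0, 4) for y in (0, 4)}

def nearest_neighbour_map(measured, grid):
    """Fill the full map by copying each cell's nearest measured value."""
    full = {}
    for x in range(grid):
        for y in range(grid):
            mx, my = min(measured,
                         key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
            full[(x, y)] = measured[(mx, my)]
    return full

def rmse(pred, grid):
    err2 = [(pred[(x, y)] - path_loss(x, y)) ** 2
            for x in range(grid) for y in range(grid)]
    return math.sqrt(sum(err2) / len(err2))

nn_map = nearest_neighbour_map(measured, GRID)
mean_value = sum(measured.values()) / len(measured)
mean_map = {(x, y): mean_value for x in range(GRID) for y in range(GRID)}

# On a smooth map, NN interpolation beats a constant-mean baseline;
# the paper's SRResNet-based method in turn beats NN and bicubic.
assert rmse(nn_map, GRID) < rmse(mean_map, GRID)
```

The interesting comparison in the paper is exactly this one, one level up: a learned super-resolution model exploits spatial structure that per-cell copying cannot, which is why it achieves 1.1 dB RMSE from the same sparse measurements.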

计算机视觉

[CV-0] MagicQuill: An Intelligent Interactive Image Editing System

链接: https://arxiv.org/abs/2411.09703
作者: Zichen Liu,Yue Yu,Hao Ouyang,Qiuyu Wang,Ka Leong Cheng,Wen Wang,Zhiheng Liu,Qifeng Chen,Yujun Shen
关键词-EN: precise manipulation techniques, manipulation techniques, Image editing involves, involves a variety, variety of complex
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code and demo available at this https URL

点击查看摘要

Abstract:Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas. Our system features a streamlined yet functionally robust interface, allowing for the articulation of editing operations (e.g., inserting elements, erasing objects, altering color) with minimal input. These interactions are monitored by a multimodal large language model (MLLM) to anticipate editing intentions in real time, bypassing the need for explicit prompt entry. Finally, we apply a powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process editing requests with precise control. Experimental results demonstrate the effectiveness of MagicQuill in achieving high-quality image edits. Please visit this https URL to try out our system.

[CV-1] CropCraft: Inverse Procedural Modeling for 3D Reconstruction of Crop Plants

链接: https://arxiv.org/abs/2411.09693
作者: Albert J. Zhai,Xinlei Wang,Kaiyuan Li,Zhao Jiang,Junxiong Zhou,Sheng Wang,Zhenong Jin,Kaiyu Guan,Shenlong Wang
关键词-EN: environmental science, automatically build, digital twins, ability to automatically, twins of plants
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint

点击查看摘要

Abstract:The ability to automatically build 3D digital twins of plants from images has countless applications in agriculture, environmental science, robotics, and other fields. However, current 3D reconstruction methods fail to recover complete shapes of plants due to heavy occlusion and complex geometries. In this work, we present a novel method for 3D reconstruction of agricultural crops based on optimizing a parametric model of plant morphology via inverse procedural modeling. Our method first estimates depth maps by fitting a neural radiance field and then employs Bayesian optimization to estimate plant morphological parameters that result in consistent depth renderings. The resulting 3D model is complete and biologically plausible. We validate our method on a dataset of real images of agricultural fields, and demonstrate that the reconstructions can be used for a variety of monitoring and simulation applications.

[CV-2] Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

链接: https://arxiv.org/abs/2411.09691
作者: Wei Wang,Zhaowei Li,Qi Xu,Linfeng Li,YiQing Cai,Botian Jiang,Hang Song,Xingcan Hu,Pengyu Wang,Li Xiao
关键词-EN: Multi-modal large language, achieved remarkable success, Multi-modal large, large language models, large language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have achieved remarkable success in fine-grained visual understanding across a range of tasks. However, they often encounter significant challenges due to inadequate alignment for fine-grained knowledge, which restricts their ability to accurately capture local details and attain a comprehensive global perception. While recent advancements have focused on aligning object expressions with grounding information, they typically lack explicit integration of object images, which contain rich information beyond mere texts or coordinates. To bridge this gap, we introduce a novel fine-grained visual knowledge alignment method that effectively aligns and integrates multi-scale knowledge of objects, including texts, coordinates, and images. This innovative method is underpinned by our multi-scale fine-grained enhancement data synthesis pipeline, which provides over 300K essential training samples to enhance alignment and improve overall performance. Furthermore, we present TinyGroundingGPT, a series of compact models optimized for high-level alignments. With a scale of approximately 3B parameters, TinyGroundingGPT achieves outstanding results in grounding tasks while delivering performance comparable to larger MLLMs in complex visual scenarios.

[CV-3] Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation

链接: https://arxiv.org/abs/2411.09572
作者: Zhenjun Yu,Wenqiang Xu,Pengfei Xie,Yutong Li,Cewu Lu
关键词-EN: visual-tactile framework, hand-object interaction, accurate contact modeling, distributed tactile sensing, hand-object
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present ViTaM-D, a novel visual-tactile framework for dynamic hand-object interaction reconstruction, integrating distributed tactile sensing for more accurate contact modeling. While existing methods focus primarily on visual inputs, they struggle with capturing detailed contact interactions such as object deformation. Our approach leverages distributed tactile sensors to address this limitation by introducing DF-Field. This distributed force-aware contact representation models both kinetic and potential energy in hand-object interaction. ViTaM-D first reconstructs hand-object interactions using a visual-only network, VDT-Net, and then refines contact details through a force-aware optimization (FO) process, enhancing object deformation modeling. To benchmark our approach, we introduce the HOT dataset, which features 600 sequences of hand-object interactions, including deformable objects, built in a high-precision simulation environment. Extensive experiments on both the DexYCB and HOT datasets demonstrate significant improvements in accuracy over previous state-of-the-art methods such as gSDF and HOTrack. Our results highlight the superior performance of ViTaM-D in both rigid and deformable object reconstruction, as well as the effectiveness of DF-Field in refining hand poses. This work offers a comprehensive solution to dynamic hand-object interaction reconstruction by seamlessly integrating visual and tactile data. Codes, models, and datasets will be available.

[CV-4] VPBSD:Vessel-Pattern-Based Semi-Supervised Distillation for Efficient 3D Microscopic Cerebrovascular Segmentation

链接: https://arxiv.org/abs/2411.09567
作者: Xi Lin,Shixuan Zhao,Xinxu Wei,Amir Shmuel,Yongjie Li
关键词-EN: presenting significant annotation, large data volumes, significant annotation challenges, microscopic cerebrovascular images, high resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D microscopic cerebrovascular images are characterized by their high resolution, presenting significant annotation challenges, large data volumes, and intricate variations in detail. Together, these factors make achieving high-quality, efficient whole-brain segmentation particularly demanding. In this paper, we propose a novel Vessel-Pattern-Based Semi-Supervised Distillation pipeline (VpbSD) to address the challenges of 3D microscopic cerebrovascular segmentation. This pipeline initially constructs a vessel-pattern codebook that captures diverse vascular structures from unlabeled data during the teacher model’s pretraining phase. In the knowledge distillation stage, the codebook facilitates the transfer of rich knowledge from a heterogeneous teacher model to a student model, while the semi-supervised approach further enhances the student model’s exposure to diverse learning samples. Experimental results on real-world data, including comparisons with state-of-the-art methods and ablation studies, demonstrate that our pipeline and its individual components effectively address the challenges inherent in microscopic cerebrovascular segmentation.
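
The vessel-pattern codebook can be understood as vector quantization over feature vectors: each feature is mapped to its nearest learned prototype. A hedged numpy sketch follows; the codebook contents, dimensions, and the `quantize` helper are illustrative, not the paper's implementation:

```python
import numpy as np

# Hypothetical "vessel-pattern codebook": K prototype feature vectors
# that would be learned from unlabeled data during teacher pretraining.
codebook = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector to the index of its nearest codebook entry."""
    # (N, 1, D) - (1, K, D) -> (N, K) squared distances
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

features = np.array([[0.9, 0.1],   # close to prototype 1
                     [0.1, 0.9]])  # close to prototype 2
codes = quantize(features, codebook)
print(codes)  # indices of the nearest vascular prototypes
```

During distillation, such codes give the student model a compact, shared vocabulary of vascular structures to match against the teacher.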

[CV-5] Adaptive Deviation Learning for Visual Anomaly Detection with Data Contamination WACV2025

链接: https://arxiv.org/abs/2411.09558
作者: Anindya Sundar Das,Guansong Pang,Monowar Bhuyan
关键词-EN: Visual anomaly detection, found extensive application, identifying defective parts, anomaly detection targets, Visual anomaly
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)

点击查看摘要

Abstract:Visual anomaly detection aims to detect images that differ notably from normal patterns, and it has found extensive application in identifying defective parts within the manufacturing industry. These anomaly detection paradigms predominantly focus on training detection models using only clean, unlabeled normal samples, assuming an absence of contamination; a condition often unmet in real-world scenarios. The performance of these methods significantly depends on the quality of the data and usually decreases when exposed to noise. We introduce a systematic adaptive method that employs deviation learning to compute anomaly scores end-to-end while addressing data contamination by assigning relative importance to the weights of individual instances. In this approach, the anomaly scores for normal instances are designed to approximate scalar scores obtained from the known prior distribution. Meanwhile, anomaly scores for anomalous examples are adjusted to exhibit statistically significant deviations from these reference scores. Our approach incorporates a constrained optimization problem within the deviation learning framework to update instance weights, resolving this problem for each mini-batch. Comprehensive experiments on the MVTec and VisA benchmark datasets indicate that our proposed method surpasses competing techniques and exhibits both stability and robustness in the presence of data contamination.
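
The core scoring idea, measuring how far a predicted anomaly score deviates from reference scores drawn from a known prior, can be illustrated in a few lines. This is a simplification under the assumption of a standard-normal prior (as in deviation-learning-style methods); the paper's scores are learned end-to-end, not computed by this helper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference scores drawn from the assumed prior (standard normal).
ref = rng.normal(size=5000)
mu, sigma = ref.mean(), ref.std()

def deviation(score: float) -> float:
    """Z-style deviation of a predicted anomaly score from the prior."""
    return (score - mu) / sigma

# Normal samples should score near the prior; anomalies should deviate
# by a statistically significant amount.
print(deviation(0.05), deviation(4.0))
```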

[CV-6] Image Processing for Motion Magnification

链接: https://arxiv.org/abs/2411.09555
作者: Nadaniela Egidi,Josephin Giacomini,Paolo Leonesi,Pierluigi Maponi,Federico Mearelli,Edin Trebovic
关键词-EN: relative recent techniques, Image Processing, Motion Magnification, collection of relative, relative recent
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Motion Magnification (MM) is a collection of relatively recent techniques within the realm of Image Processing. The main motivation for introducing these techniques is to support the human visual system in capturing relevant displacements of an object of interest; these motions can be in object color and in object location. In fact, the goal is to suitably process a video sequence to obtain as output a new video in which motions are magnified and visible to the viewer. We propose a numerical technique using Phase-Based Motion Magnification, which analyses the video sequence in the Fourier domain and relies on the Fourier shift property. We describe the mathematical foundation of this method and the corresponding implementation in a numerical algorithm. We present preliminary experiments, focusing on some basic tests made using synthetic images.
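
The Fourier shift property behind phase-based magnification can be demonstrated in one dimension: a circular shift of a signal multiplies its spectrum by a per-frequency phase factor, so amplifying the inter-frame phase difference scales the apparent displacement. A minimal sketch (the signal, shift, and magnification factor are arbitrary choices):

```python
import numpy as np

# Two "frames" of a 1-D signal: a Gaussian bump, then the same bump
# circularly shifted by 2 samples.
n = 128
t = np.arange(n)
frame1 = np.exp(-0.5 * ((t - 40) / 3.0) ** 2)
frame2 = np.roll(frame1, 2)

F1, F2 = np.fft.fft(frame1), np.fft.fft(frame2)

# Per-frequency phase difference encodes the motion (Fourier shift property).
dphi = np.angle(F2 * np.conj(F1))

# Magnify the motion: amplify the phase difference by a factor (1 + alpha).
alpha = 1
magnified = np.real(np.fft.ifft(F1 * np.exp(1j * (1 + alpha) * dphi)))

# Peak positions: original, observed motion, magnified motion.
print(np.argmax(frame1), np.argmax(frame2), np.argmax(magnified))  # 40 42 44
```

With alpha = 1 the 2-sample displacement is doubled to 4 samples, exactly reproducing a shift of the original bump.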

[CV-7] OOD-SEG: Out-Of-Distribution detection for image SEGmentation with sparse multi-class positive-only annotations

链接: https://arxiv.org/abs/2411.09553
作者: Junwen Wang,Zhonghao Wang,Oscar MacCormac,Jonathan Shapey,Tom Vercauteren
关键词-EN: deep neural networks, significant advancements, faces several challenges, based on deep, deep neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite significant advancements, segmentation based on deep neural networks in medical and surgical imaging faces several challenges, two of which we aim to address in this work. First, acquiring complete pixel-level segmentation labels for medical images is time-consuming and requires domain expertise. Second, typical segmentation pipelines cannot detect out-of-distribution (OOD) pixels, leaving them prone to spurious outputs during deployment. In this work, we propose a novel segmentation approach exploiting OOD detection that learns only from sparsely annotated pixels from multiple positive-only classes, but with no background class annotation. These multi-class positive annotations naturally fall within the in-distribution (ID) set. Unlabelled pixels may contain positive classes but also negative ones, including what is typically referred to as background in standard segmentation formulations. Here, we forgo the need for background annotation and consider these together with any other unseen classes as part of the OOD set. Our framework can integrate, at a pixel-level, any OOD detection approaches designed for classification tasks. To address the lack of existing OOD datasets and established evaluation metrics for medical image segmentation, we propose a cross-validation strategy that treats held-out labelled classes as OOD. Extensive experiments on both multi-class hyperspectral and RGB surgical imaging datasets demonstrate the robustness and generalisation capability of our proposed framework.

[CV-8] MFTIQ: Multi-Flow Tracker with Independent Matching Quality Estimation WACV2025

链接: https://arxiv.org/abs/2411.09551
作者: Jonas Serych,Michal Neoral,Jiri Matas
关键词-EN: dense long-term tracking, point-level visual tracking, framework to address, video sequences, dense long-term
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted to WACV 2025

点击查看摘要

Abstract:In this work, we present MFTIQ, a novel dense long-term tracking model that advances the Multi-Flow Tracker (MFT) framework to address challenges in point-level visual tracking in video sequences. MFTIQ builds upon the flow-chaining concepts of MFT, integrating an Independent Quality (IQ) module that separates correspondence quality estimation from optical flow computations. This decoupling significantly enhances the accuracy and flexibility of the tracking process, allowing MFTIQ to maintain reliable trajectory predictions even in scenarios of prolonged occlusions and complex dynamics. Designed to be “plug-and-play”, MFTIQ can be employed with any off-the-shelf optical flow method without the need for fine-tuning or architectural modifications. Experimental validations on the TAP-Vid Davis dataset show that MFTIQ with RoMa optical flow not only surpasses MFT but also performs comparably to state-of-the-art trackers while having substantially faster processing speed. Code and models available at this https URL .

[CV-9] Marker-free Human Gait Analysis using a Smart Edge Sensor System

链接: https://arxiv.org/abs/2411.09538
作者: Eva Katharina Bauer,Simon Bultmann,Sven Behnke
关键词-EN: physiological condition, gait analysis, complex interplay, neurological and physiological, gait
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted for SII 2025

点击查看摘要

Abstract:The human gait is a complex interplay between the neuronal and the muscular systems, reflecting an individual’s neurological and physiological condition. This makes gait analysis a valuable tool for biomechanics and medical experts. Traditional observational gait analysis is cost-effective but lacks reliability and accuracy, while instrumented gait analysis, particularly using marker-based optical systems, provides accurate data but is expensive and time-consuming. In this paper, we introduce a novel markerless approach for gait analysis using a multi-camera setup with smart edge sensors to estimate 3D body poses without fiducial markers. We propose a Siamese embedding network with triplet loss calculation to identify individuals by their gait pattern. This network effectively maps gait sequences to an embedding space that enables clustering sequences from the same individual or activity closely together while separating those of different ones. Our results demonstrate the potential of the proposed system for efficient automated gait analysis in diverse real-world environments, facilitating a wide range of applications.
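
The triplet loss used to shape the gait-embedding space penalizes an anchor-positive distance that is not at least `margin` smaller than the anchor-negative distance. A sketch with toy embeddings (the vectors and margin below are illustrative values, not outputs of the paper's Siamese network):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss on embedding vectors: pull same-gait sequences
    together, push different ones at least `margin` further apart."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # same person's gait embedding
n = np.array([2.0, 0.0])   # different person's gait embedding
print(triplet_loss(a, p, n))  # 0.0 -> margin constraint already satisfied
```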

[CV-10] Golden Noise for Diffusion Models: A Learning Framework

链接: https://arxiv.org/abs/2411.09502
作者: Zikai Zhou,Shitong Shao,Lichen Bai,Zhiqiang Xu,Bo Han,Zeke Xie
关键词-EN: random Gaussian noise, noise, golden noise, noise prompt, golden
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-image diffusion model is a popular paradigm that synthesizes personalized images by providing a text prompt and a random Gaussian noise. While people observe that some noises are "golden noises" that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those golden noises. To learn golden noises for diffusion sampling, we mainly make three contributions in this paper. First, we identify a new concept termed the noise prompt, which aims at turning a random Gaussian noise into a golden noise by adding a small desirable perturbation derived from the text prompt. Following the concept, we first formulate the noise prompt learning framework that systematically learns "prompted" golden noise associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale noise prompt dataset (NPD) that contains 100k pairs of random noises and golden noises with the associated text prompts. With the prepared NPD as the training dataset, we trained a small noise prompt network (NPNet) that can directly learn to transform a random noise into a golden noise. The learned golden noise perturbation can be considered as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the impressive effectiveness and generalization of NPNet on improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational costs, as it just provides a golden noise instead of a random noise without accessing the original pipeline.

[CV-11] Image Matching Filtering and Refinement by Planes and Beyond

链接: https://arxiv.org/abs/2411.09484
作者: Fabio Bellavia,Zhenjun Zhao,Luca Morelli,Fabio Remondino
关键词-EN: refining sparse correspondences, introduces a modular, paper introduces, filtering and refining, refining sparse
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL

点击查看摘要

Abstract:This paper introduces a modular, non-deep learning method for filtering and refining sparse correspondences in image matching. Assuming that motion flow within the scene can be approximated by local homography transformations, matches are aggregated into overlapping clusters corresponding to virtual planes using an iterative RANSAC-based approach, with non-conforming correspondences discarded. Moreover, the underlying planar structural design provides an explicit map between local patches associated with the matches, enabling optional refinement of keypoint positions through cross-correlation template matching after patch reprojection. Finally, to enhance robustness and fault-tolerance against violations of the piece-wise planar approximation assumption, a further strategy is designed for minimizing relative patch distortion in the plane reprojection by introducing an intermediate homography that projects both patches into a common plane. The proposed method is extensively evaluated on standard datasets and image matching pipelines, and compared with state-of-the-art approaches. Unlike other current comparisons, the proposed benchmark also takes into account the more general, real, and practical cases where camera intrinsics are unavailable. Experimental results demonstrate that our proposed non-deep learning, geometry-based approach achieves performances that are either superior to or on par with recent state-of-the-art deep learning methods. Finally, this study suggests that there is still development potential in current image matching solutions in the considered research direction, which could in the future be incorporated into novel deep image matching architectures.
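
The per-plane inlier test at the heart of such RANSAC-based clustering reduces to thresholding a homography transfer error. A small sketch under a hypothetical translation-only homography (the points, threshold, and `transfer_error` helper are illustrative, not the paper's pipeline):

```python
import numpy as np

def transfer_error(H, pts1, pts2):
    """One-way transfer error: project pts1 through the homography H
    and measure the Euclidean distance to the matched pts2."""
    ones = np.ones((len(pts1), 1))
    proj = np.hstack([pts1, ones]) @ H.T
    proj = proj[:, :2] / proj[:, 2:3]       # back from homogeneous coords
    return np.linalg.norm(proj - pts2, axis=1)

# Hypothetical local plane: a pure translation homography (+5, -3).
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, -3.0],
              [0.0, 0.0, 1.0]])
pts1 = np.array([[0.0, 0.0], [10.0, 2.0], [4.0, 7.0]])
pts2 = pts1 + np.array([5.0, -3.0])
pts2[2] += 8.0                              # an outlier match

err = transfer_error(H, pts1, pts2)
inliers = err < 1.0                         # RANSAC-style threshold
print(inliers)  # [ True  True False]
```

Matches failing the threshold for every candidate plane are the non-conforming correspondences that get discarded.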

[CV-12] SINETRA: a Versatile Framework for Evaluating Single Neuron Tracking in Behaving Animals

链接: https://arxiv.org/abs/2411.09462
作者: Raphael Reme,Alasdair Newson,Elsa Angelini,Jean-Christophe Olivo-Marin,Thibault Lagach
关键词-EN: Accurately tracking neuronal, presents significant challenges, significant challenges due, animals presents significant, tracking neuronal activity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 3 figures, submitted at 2025 IEEE International Symposium on Biomedical Imaging (ISBI)

点击查看摘要

Abstract:Accurately tracking neuronal activity in behaving animals presents significant challenges due to complex motions and background noise. The lack of annotated datasets limits the evaluation and improvement of such tracking algorithms. To address this, we developed SINETRA, a versatile simulator that generates synthetic tracking data for particles on a deformable background, closely mimicking live animal recordings. This simulator produces annotated 2D and 3D videos that reflect the intricate movements seen in behaving animals like Hydra Vulgaris. We evaluated four state-of-the-art tracking algorithms highlighting the current limitations of these methods in challenging scenarios and paving the way for improved cell tracking techniques in dynamic biological systems.

[CV-13] Long-Tailed Object Detection Pre-training: Dynamic Rebalancing Contrastive Learning with Dual Reconstruction NEURIPS2024

链接: https://arxiv.org/abs/2411.09453
作者: Chen-Long Duan,Yong Li,Xiu-Shen Wei,Lin Zhao
关键词-EN: plays a vital, vital role, Rebalancing Contrastive Learning, Contrastive Learning, Dynamic Rebalancing Contrastive
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Pre-training plays a vital role in various vision tasks, such as object recognition and detection. Commonly used pre-training methods, which typically rely on randomized approaches like uniform or Gaussian distributions to initialize model parameters, often fall short when confronted with long-tailed distributions, especially in detection tasks. This is largely due to extreme data imbalance and the issue of simplicity bias. In this paper, we introduce a novel pre-training framework for object detection, called Dynamic Rebalancing Contrastive Learning with Dual Reconstruction (2DRCL). Our method builds on a Holistic-Local Contrastive Learning mechanism, which aligns pre-training with object detection by capturing both global contextual semantics and detailed local patterns. To tackle the imbalance inherent in long-tailed data, we design a dynamic rebalancing strategy that adjusts the sampling of underrepresented instances throughout the pre-training process, ensuring better representation of tail classes. Moreover, Dual Reconstruction addresses simplicity bias by enforcing a reconstruction task aligned with the self-consistency principle, specifically benefiting underrepresented tail classes. Experiments on COCO and LVIS v1.0 datasets demonstrate the effectiveness of our method, particularly in improving the mAP/AP scores for tail classes.
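
One simple instance of frequency-based rebalancing (not necessarily the paper's exact rule, which adjusts sampling dynamically during pre-training) is to sample classes in inverse proportion to a power of their instance counts:

```python
import numpy as np

# Per-class instance counts in a long-tailed detection dataset
# (head classes first, tail classes last).
counts = np.array([10000, 3000, 500, 40, 8])

# Sample classes with probability proportional to an inverse power of
# their frequency; an exponent t < 1 softens the correction.
t = 0.5
weights = counts.astype(float) ** -t
probs = weights / weights.sum()

print(probs.round(3))  # tail classes receive higher per-class probability
```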

[CV-14] Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models

链接: https://arxiv.org/abs/2411.09449
作者: Chutian Meng,Fan Ma,Jiaxu Miao,Chi Zhang,Yi Yang,Yueting Zhuang
关键词-EN: playing crucial roles, Diffusion models, reference image, image, image generation domain
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion models have revitalized the image generation domain, playing crucial roles in both academic research and artistic expression. With the emergence of new diffusion models, assessing the performance of text-to-image models has become increasingly important. Current metrics focus on directly matching the input text with the generated image, but due to cross-modal information asymmetry, this leads to unreliable or incomplete assessment results. Motivated by this, we introduce the Image Regeneration task in this study to assess text-to-image models by tasking the T2I model with generating an image according to the reference image. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model, allowing T2I models to understand image content. This evaluation process is simplified as comparisons between the generated image and the reference image are straightforward. Two regeneration datasets, spanning content-diverse and style-diverse evaluations, are introduced to evaluate the leading diffusion models currently available. Additionally, we present the ImageRepainter framework to enhance the quality of generated images by improving content comprehension via MLLM-guided iterative generation and revision. Our comprehensive experiments have showcased the effectiveness of this framework in assessing the generative capabilities of models. By leveraging MLLM, we have demonstrated that a robust T2I model can produce images more closely resembling the reference image.

[CV-15] Spider: Any-to-Many Multimodal LLM

链接: https://arxiv.org/abs/2411.09439
作者: Jinxiang Lai,Jie Zhang,Jun Liu,Jian Li,Xiaocheng Lu,Song Guo
关键词-EN: Large Language Models, Large Language, extension of Large, Language Models, Text
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities ‘Text + X’ within a single response, such as Text + Image or Audio or Video. To address this limitation, we introduce Spider, a novel efficient Any-to-Many Modalities Generation (AMMG) framework, which can generate an arbitrary combination of modalities ‘Text + Xs’, such as Text + Image and Audio and Video. To achieve efficient AMMG, our Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents, and an Any-to-Many Instruction Template designed for producing Xs signal prompts. To train Spider, we constructed a novel Text-formatted Many-Modal (TMM) dataset, which facilitates the learning of the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Ultimately, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, enhancing the potential for AMMG task in future research. Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field.

[CV-16] ReMP: Reusable Motion Prior for Multi-domain 3D Human Pose Estimation and Motion Inbetweening WACV2025

链接: https://arxiv.org/abs/2411.09435
作者: Hojun Jang,Young Min Kim
关键词-EN: present Reusable Motion, Reusable Motion prior, present Reusable, effective motion prior, Reusable Motion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 main pages, WACV 2025

点击查看摘要

Abstract:We present Reusable Motion prior (ReMP), an effective motion prior that can accurately track the temporal evolution of motion in various downstream tasks. Inspired by the success of foundation models, we argue that a robust spatio-temporal motion prior can encapsulate underlying 3D dynamics applicable to various sensor modalities. We learn the rich motion prior from a sequence of complete parametric models of posed human body shape. Our prior can easily estimate poses in missing frames or noisy measurements despite significant occlusion by employing a temporal attention mechanism. More interestingly, our prior can guide the system with incomplete and challenging input measurements to quickly extract critical information to estimate the sequence of poses, significantly improving the training efficiency for mesh sequence recovery. ReMP consistently outperforms the baseline method on diverse and practical 3D motion data, including depth point clouds, LiDAR scans, and IMU sensor data. Project page is available at this https URL.

[CV-17] Mediffusion: Joint Diffusion for Self-Explainable Semi-Supervised Classification and Medical Image Generation

链接: https://arxiv.org/abs/2411.09434
作者: Joanna Kaleta,Paweł Skierś,Jan Dubiński,Przemysław Korzeniowski,Kamil Deja
关键词-EN: explainable classification based, joint diffusion model, learning with explainable, joint diffusion, introduce Mediffusion
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Mediffusion – a new method for semi-supervised learning with explainable classification based on a joint diffusion model. The medical imaging domain faces unique challenges due to scarce data labelling – insufficient for standard training, and critical nature of the applications that require high performance, confidence, and explainability of the models. In this work, we propose to tackle those challenges with a single model that combines standard classification with a diffusion-based generative task in a single shared parametrisation. By sharing representations, our model effectively learns from both labeled and unlabeled data while at the same time providing accurate explanations through counterfactual examples. In our experiments, we show that our Mediffusion achieves results comparable to recent semi-supervised methods while providing more reliable and precise explanations.

[CV-18] Building Height Estimation Using Shadow Length in Satellite Imagery

链接: https://arxiv.org/abs/2411.09411
作者: Mahd Qureshi,Shayaan Chaudhry,Sana Jabba,Murtaza Taj
关键词-EN: Estimating building height, poses significant challenges, imagery poses significant, Estimating building, building height
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 6 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Estimating building height from satellite imagery poses significant challenges, especially when monocular images are employed, resulting in a loss of essential 3D information during imaging. This loss of spatial depth further complicates the height estimation process. We addressed this issue by using shadow length as an additional cue to compensate for the lost depth information when estimating building height from single-view imagery. We proposed a novel method that first localizes a building and its shadow in the given satellite image. After localization, the shadow length is estimated using a regression model. To estimate the final height of each building, we utilize the principles of photogrammetry, specifically considering the relationship between the solar elevation angle, the vertical edge length of the building, and the length of the building's shadow. For the localization of buildings in our model, we utilized a modified YOLOv7 detector, and to regress the shadow length for each building we utilized ResNet18 as the backbone architecture. Finally, we estimated the associated building height using the solar elevation together with the shadow length through an analytical formulation. We evaluated our method on 42 different cities and the results showed that the proposed framework surpasses the state-of-the-art methods by a suitable margin.
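
The photogrammetric relationship in the final step reduces to a right triangle formed by the building's vertical edge, its shadow, and the sun ray, giving h = L_shadow * tan(elevation):

```python
import math

def building_height(shadow_len_m: float, solar_elevation_deg: float) -> float:
    """Height from shadow length: the vertical edge, the shadow, and the
    sun ray form a right triangle, so h = L_shadow * tan(elevation)."""
    return shadow_len_m * math.tan(math.radians(solar_elevation_deg))

print(building_height(10.0, 45.0))  # ~10.0 m: at 45 deg, height equals shadow
print(building_height(20.0, 30.0))  # ~11.55 m
```

In practice the detector supplies the building location and the regression model supplies the shadow length; the solar elevation comes from the image's acquisition metadata.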

[CV-19] Instruction-Driven Fusion of Infrared-Visible Images: Tailoring for Diverse Downstream Tasks

链接: https://arxiv.org/abs/2411.09387
作者: Zengyi Yang,Yafei Zhang,Huafeng Li,Yu Liu
关键词-EN: infrared and visible, lies in applying, downstream tasks, Dynamic Prompt Injection, Task-related Dynamic Prompt
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:The primary value of infrared and visible image fusion technology lies in applying the fusion results to downstream tasks. However, existing methods face challenges such as increased training complexity and significantly compromised performance of individual tasks when addressing multiple downstream tasks simultaneously. To tackle this, we propose Task-Oriented Adaptive Regulation (T-OAR), an adaptive mechanism specifically designed for multi-task environments. Additionally, we introduce the Task-related Dynamic Prompt Injection (T-DPI) module, which generates task-specific dynamic prompts from user-input text instructions and integrates them into target representations. This guides the feature extraction module to produce representations that are more closely aligned with the specific requirements of downstream tasks. By incorporating the T-DPI module into the T-OAR framework, our approach generates fusion images tailored to task-specific requirements without the need for separate training or task-specific weights. This not only reduces computational costs but also enhances adaptability and performance across multiple tasks. Experimental results show that our method excels in object detection, semantic segmentation, and salient object detection, demonstrating its strong adaptability, flexibility, and task specificity. This provides an efficient solution for image fusion in multi-task environments, highlighting the technology’s potential across diverse applications.

[CV-20] DSCformer: A Dual-Branch Network Integrating Enhanced Dynamic Snake Convolution and SegFormer for Crack Segmentation

链接: https://arxiv.org/abs/2411.09371
作者: Kaiwei Yu,I-Ming Chen,Jing Wu
关键词-EN: construction quality monitoring, quality monitoring, accurately detecting, safety and maintenance, construction quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In construction quality monitoring, accurately detecting and segmenting cracks in concrete structures is paramount for safety and maintenance. Current convolutional neural networks (CNNs) have demonstrated strong performance in crack segmentation tasks, yet they often struggle with complex backgrounds and fail to capture fine-grained tubular structures fully. In contrast, Transformers excel at capturing global context but lack precision in detailed feature extraction. We introduce DSCformer, a novel hybrid model that integrates an enhanced Dynamic Snake Convolution (DSConv) with a Transformer architecture for crack segmentation to address these challenges. Our key contributions include the enhanced DSConv through a pyramid kernel for adaptive offset computation and a simultaneous bi-directional learnable offset iteration, significantly improving the model’s performance to capture intricate crack patterns. Additionally, we propose a Weighted Convolutional Attention Module (WCAM), which refines channel attention, allowing for more precise and adaptive feature attention. We evaluate DSCformer on the Crack3238 and FIND datasets, achieving IoUs of 59.22% and 87.24%, respectively. The experimental results suggest that our DSCformer outperforms state-of-the-art methods across different datasets.

[CV-21] Time-to-Event Pretraining for 3D Medical Imaging

链接: https://arxiv.org/abs/2411.09361
作者: Zepeng Huo,Jason Alan Fries,Alejandro Lozano,Jeya Maria Jose Valanarasu,Ethan Steinberg,Louis Blankemeier,Akshay S. Chaudhari,Curtis Langlotz,Nigam H. Shah
关键词-EN: scalable pretraining techniques, pretraining techniques offer, medical imaging models, growing availability, techniques offer
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 34 pages, 19 figures

点击查看摘要

Abstract:With the rise of medical foundation models and the growing availability of imaging data, scalable pretraining techniques offer a promising way to identify imaging biomarkers predictive of future disease risk. While current self-supervised methods for 3D medical imaging models capture local structural features like organ morphology, they fail to link pixel biomarkers with long-term health outcomes due to a missing context problem. Current approaches lack the temporal context necessary to identify biomarkers correlated with disease progression, as they rely on supervision derived only from images and concurrent text descriptions. To address this, we introduce time-to-event pretraining, a pretraining framework for 3D medical imaging models that leverages large-scale temporal supervision from paired, longitudinal electronic health records (EHRs). Using a dataset of 18,945 CT scans (4.2 million 2D images) and time-to-event distributions across thousands of EHR-derived tasks, our method improves outcome prediction, achieving an average AUROC increase of 23.7% and a 29.4% gain in Harrell’s C-index across 8 benchmark tasks. Importantly, these gains are achieved without sacrificing diagnostic classification performance. This study lays the foundation for integrating longitudinal EHR and 3D imaging data to advance clinical risk prediction.
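Harrell's C-index, one of the metrics reported above, measures how often the model's risk ranking agrees with the observed ordering of event times. A simplified sketch that ignores ties in event times (our own illustration, not the paper's evaluation code):

```python
def c_index(times, events, risk_scores):
    """Harrell's concordance index: fraction of comparable pairs in which the
    subject with the higher predicted risk experiences the event earlier.
    events[i] is 1 if subject i's event was observed, 0 if censored."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half-concordant
    return concordant / comparable
```

A C-index of 1.0 means perfect ranking; 0.5 is no better than chance, which gives context to the reported 29.4% relative gain.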

[CV-22] Adaptively Augmented Consistency Learning: A Semi-supervised Segmentation Framework for Remote Sensing

链接: https://arxiv.org/abs/2411.09344
作者: Hui Ye,Haodong Chen,Xiaoming Chen,Vera Chung
关键词-EN: Remote sensing, manage resources, Augmented Consistency Learning, involves the acquisition, primarily to monitor
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Remote sensing (RS) involves the acquisition of data about objects or areas from a distance, primarily to monitor environmental changes, manage resources, and support planning and disaster response. A significant challenge in RS segmentation is the scarcity of high-quality labeled images due to the diversity and complexity of RS images, which makes pixel-level annotation difficult and hinders the development of effective supervised segmentation algorithms. To solve this problem, we propose Adaptively Augmented Consistency Learning (AACL), a semi-supervised segmentation framework designed to enhance RS segmentation accuracy under conditions of limited labeled data. AACL extracts additional information embedded in unlabeled images through the use of Uniform Strength Augmentation (USAug) and Adaptive Cut-Mix (AdaCM). Evaluations across various RS datasets demonstrate that AACL achieves competitive performance in semi-supervised segmentation, showing up to a 20% improvement in specific categories and a 2% increase in overall performance compared to state-of-the-art frameworks.
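AdaCM builds on the CutMix idea of pasting a rectangular patch from one image onto another; the abstract does not give the adaptive details, so the sketch below is the plain, non-adaptive CutMix operation for intuition (parameter names are ours):

```python
import numpy as np

def cutmix(img_a: np.ndarray, img_b: np.ndarray, lam: float, rng=None) -> np.ndarray:
    """Paste a rectangular patch of img_b onto img_a.

    lam is the fraction of img_a's area that is kept; the pasted patch covers
    roughly a (1 - lam) fraction, with side lengths scaled by sqrt(1 - lam)."""
    rng = rng or np.random.default_rng(0)
    h, w = img_a.shape[:2]
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    y = int(rng.integers(0, h - cut_h + 1))
    x = int(rng.integers(0, w - cut_w + 1))
    out = img_a.copy()
    out[y:y + cut_h, x:x + cut_w] = img_b[y:y + cut_h, x:x + cut_w]
    return out
```

In consistency-learning frameworks like AACL, the model is trained so that predictions on such mixed unlabeled images stay consistent with the correspondingly mixed predictions on the originals.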

[CV-23] Exploring Zero-Shot Anomaly Detection with CLIP in Medical Imaging: Are We There Yet? ALT

链接: https://arxiv.org/abs/2411.09310
作者: Aldo Marzullo,Marta Bianca Maria Ranzini
关键词-EN: Zero-shot anomaly detection, Zero-shot anomaly, offers potential, task-specific training, potential for identifying
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted at 3rd AIxIA Workshop on Artificial Intelligence for Healthcare and 5th Data4SmartHealth

点击查看摘要

Abstract:Zero-shot anomaly detection (ZSAD) offers potential for identifying anomalies in medical imaging without task-specific training. In this paper, we evaluate CLIP-based models, originally developed for industrial tasks, on brain tumor detection using the BraTS-MET dataset. Our analysis examines their ability to detect medical-specific anomalies with no or minimal supervision, addressing the challenges posed by limited data annotation. While these models show promise in transferring general knowledge to medical tasks, their performance falls short of the precision required for clinical use. Our findings highlight the need for further adaptation before CLIP-based models can be reliably applied to medical anomaly detection.
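CLIP-based zero-shot anomaly detection typically scores an image by comparing its embedding against text embeddings of "normal" and "anomalous" prompts. A minimal numpy sketch of that scoring step with mock embeddings (the scale factor mimics CLIP's logit temperature; all names here are illustrative, not from the evaluated models):

```python
import numpy as np

def zsad_score(image_emb, normal_emb, anomaly_emb, scale=100.0):
    """Softmax over scaled cosine similarities to the two text prompts;
    returns the probability assigned to 'anomalous'."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = scale * np.array([cos(image_emb, normal_emb),
                               cos(image_emb, anomaly_emb)])
    logits -= logits.max()  # numerical stability before exponentiation
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[1])
```

The paper's finding is that this kind of prompt-based scoring, tuned on industrial defects, transfers only partially to medical anomalies such as brain tumors.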

[CV-24] LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

链接: https://arxiv.org/abs/2411.09301
作者: Zhenshi Li,Dilxat Muhtar,Feng Gu,Xueliang Zhang,Pengfeng Xiao,Guangjun He,Xiaoxiang Zhu
关键词-EN: Automatically and rapidly, rapidly understanding Earth, Earth surface, analyzing Earth surface, understanding Earth surface
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automatically and rapidly understanding Earth’s surface is fundamental to our grasp of the living environment and informed decision-making. This underscores the need for a unified system with comprehensive capabilities in analyzing Earth’s surface to address a wide range of human needs. The emergence of multimodal large language models (MLLMs) has great potential in boosting the efficiency and convenience of intelligent Earth observation. These models can engage in human-like conversations, serve as unified platforms for understanding images, follow diverse instructions, and provide insightful feedbacks. In this study, we introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images, designed to expertly perform a wide range of RS understanding tasks aligned with human instructions. LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment. To further enhance RS-oriented vision-language alignment, we propose a large-scale RS image-caption dataset, generated through feature-guided image recaptioning. Additionally, we introduce an instruction dataset specifically designed to improve spatial recognition abilities. Extensive experiments demonstrate superior performance of LHRS-Bot-Nova across various RS image understanding tasks. We also evaluate different MLLM performances in complex RS perception and instruction following using a complicated multi-choice question evaluation benchmark, providing a reliable guide for future model selection and improvement. Data, code, and models will be available at this https URL.

[CV-25] LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution

链接: https://arxiv.org/abs/2411.09293
作者: Chenyang Wang,Wenjie An,Kui Jiang,Xianming Liu,Junjun Jiang
关键词-EN: made significant advancements, Existing face super-resolution, limited visual information, primarily super-resolve face, Existing face
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing face super-resolution (FSR) methods have made significant advancements, but they primarily super-resolve face with limited visual information, original pixel-wise space in particular, commonly overlooking the pluralistic clues, like the higher-order depth and semantics, as well as non-visual inputs (text caption and description). Consequently, these methods struggle to produce a unified and meaningful representation from the input face. We suppose that introducing the language-vision pluralistic representation into unexplored potential embedding space could enhance FSR by encoding and exploiting the complementarity across language-vision prior. This motivates us to propose a new framework called LLV-FSR, which marries the power of large vision-language model and higher-order visual prior with the challenging task of FSR. Specifically, besides directly absorbing knowledge from original input, we introduce the pre-trained vision-language model to generate pluralistic priors, involving the image caption, descriptions, face semantic mask and depths. These priors are then employed to guide the more critical feature representation, facilitating realistic and high-quality face super-resolution. Experimental results demonstrate that our proposed framework significantly improves both the reconstruction quality and perceptual quality, surpassing the SOTA by 0.43dB in terms of PSNR on the MMCelebA-HQ dataset.
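PSNR, the metric in which the reported 0.43 dB gain is measured, is a standard function of mean squared error; a conventional implementation for reference (not code from the paper):

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB for images with intensities in [0, max_val]."""
    mse = float(np.mean((x - y) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Because PSNR is logarithmic in MSE, a 0.43 dB improvement corresponds to roughly a 9% reduction in mean squared reconstruction error.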

[CV-26] LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space

链接: https://arxiv.org/abs/2411.09268
作者: Guanwen Feng,Zhihao Qian,Yunan Li,Siyu Jin,Qiguang Miao,Chi-Man Pun
关键词-EN: talking head generation, fine-grained emotion editing, one-shot talking head, existing one-shot talking, emotion editing
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While existing one-shot talking head generation models have achieved progress in coarse-grained emotion editing, there is still a lack of fine-grained emotion editing models with high interpretability. We argue that for an approach to be considered fine-grained, it needs to provide clear definitions and sufficiently detailed differentiation. We present LES-Talker, a novel one-shot talking head generation model with high interpretability, to achieve fine-grained emotion editing across emotion types, emotion levels, and facial units. We propose a Linear Emotion Space (LES) definition based on Facial Action Units to characterize emotion transformations as vector transformations. We design the Cross-Dimension Attention Net (CDAN) to deeply mine the correlation between LES representation and 3D model representation. Through mining multiple relationships across different feature and structure dimensions, we enable LES representation to guide the controllable deformation of 3D model. In order to adapt the multimodal data with deviations to the LES and enhance visual quality, we utilize specialized network design and training strategies. Experiments show that our method provides high visual quality along with multilevel and interpretable fine-grained emotion editing, outperforming mainstream methods.

[CV-27] BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation

链接: https://arxiv.org/abs/2411.09265
作者: Zheng Zhou,Wenquan Feng,Shuchang Lyu,Guangliang Cheng,Xiaowei Huang,Qi Zhao
关键词-EN: significantly smaller synthesized, preserving high test, high test performance, compresses large-scale datasets, smaller synthesized datasets
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Dataset Distillation (DD) is an emerging technique that compresses large-scale datasets into significantly smaller synthesized datasets while preserving high test performance and enabling the efficient training of large models. However, current research primarily focuses on enhancing evaluation accuracy under limited compression ratios, often overlooking critical security concerns such as adversarial robustness. A key challenge in evaluating this robustness lies in the complex interactions between distillation methods, model architectures, and adversarial attack strategies, which complicate standardized assessments. To address this, we introduce BEARD, an open and unified benchmark designed to systematically assess the adversarial robustness of DD methods, including DM, IDM, and BACON. BEARD encompasses a variety of adversarial attacks (e.g., FGSM, PGD, CW) on distilled datasets like CIFAR-10/100 and TinyImageNet. Utilizing an adversarial game framework, it introduces three key metrics: Robustness Ratio (RR), Attack Efficiency Ratio (AE), and Comprehensive Robustness-Efficiency Index (CREI). Our analysis includes unified benchmarks, various Images Per Class (IPC) settings, and the effects of adversarial training. Results are available on the BEARD Leaderboard, along with a library providing model and dataset pools to support reproducible research. Access the code at BEARD.

[CV-28] Rethinking Weight-Averaged Model-merging

链接: https://arxiv.org/abs/2411.09263
作者: Hu Wang,Congbo Ma,Ibrahim Almakky,Ian Reid,Gustavo Carneiro,Mohammad Yaqub
关键词-EN: enhancing model performance, capable of enhancing, fine-tuning or retraining, powerful approach, approach in deep
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Weight-averaged model-merging has emerged as a powerful approach in deep learning, capable of enhancing model performance without fine-tuning or retraining. However, the underlying mechanisms that explain its effectiveness remain largely unexplored. In this paper, we investigate this technique from three novel perspectives to provide deeper insights into how and why weight-averaged model-merging works: (1) we examine the intrinsic patterns captured by the learning of the model weights, through the visualizations of their patterns on several datasets, showing that these weights often encode structured and interpretable patterns; (2) we investigate model ensemble merging strategies based on averaging on weights versus averaging on features, providing detailed analyses across diverse architectures and datasets; and (3) we explore the impact on model-merging prediction stability in terms of changing the parameter magnitude, revealing insights into the way of weight averaging works as regularization by showing the robustness across different parameter scales. Our findings shed light on the “black box” of weight-averaged model-merging, offering valuable insights and practical recommendations that advance the model-merging process.
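The operation being analyzed, uniform weight averaging across checkpoints that share an architecture, is a one-liner in practice; a minimal sketch using numpy arrays in place of framework tensors (not the paper's code):

```python
import numpy as np

def average_weights(state_dicts):
    """Uniform weight-averaged model merging: element-wise mean of each
    parameter across checkpoints with identical keys and shapes."""
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0)
            for k in state_dicts[0]}
```

The paper's second perspective contrasts exactly this averaging in weight space with averaging the models' output features at inference time (an ensemble), which is generally not equivalent for nonlinear networks.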

[CV-29] Embedding Space Allocation with Angle-Norm Joint Classifiers for Few-Shot Class-Incremental Learning

链接: https://arxiv.org/abs/2411.09250
作者: Dunwei Tu,Huiyu Yi,Tieyi Zhang,Ruotong Li,Furao Shen,Jian Zhao
关键词-EN: requiring intelligent agents, Few-shot class-incremental learning, aims to continually, requiring intelligent, dynamic environments
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Few-shot class-incremental learning (FSCIL) aims to continually learn new classes from only a few samples without forgetting previous ones, requiring intelligent agents to adapt to dynamic environments. FSCIL combines the characteristics and challenges of class-incremental learning and few-shot learning: (i) Current classes occupy the entire feature space, which is detrimental to learning new classes. (ii) The small number of samples in incremental rounds is insufficient for fully training. In existing mainstream virtual class methods, for addressing the challenge (i), they attempt to use virtual classes as placeholders. However, new classes may not necessarily align with the virtual classes. For the challenge (ii), they replace trainable fully connected layers with Nearest Class Mean (NCM) classifiers based on cosine similarity, but NCM classifiers do not account for sample imbalance issues. To address these issues in previous methods, we propose the class-center guided embedding Space Allocation with Angle-Norm joint classifiers (SAAN) learning framework, which provides balanced space for all classes and leverages norm differences caused by sample imbalance to enhance classification criteria. Specifically, for challenge (i), SAAN divides the feature space into multiple subspaces and allocates a dedicated subspace for each session by guiding samples with the pre-set category centers. For challenge (ii), SAAN establishes a norm distribution for each class and generates angle-norm joint logits. Experiments demonstrate that SAAN can achieve state-of-the-art performance and it can be directly embedded into other SOTA methods as a plug-in, further enhancing their performance.
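The NCM baseline criticized above assigns a sample to the class whose mean has the highest cosine similarity, which by construction discards feature-norm information; a sketch of that baseline, which SAAN's angle-norm joint logits extend (illustrative code, not from the paper):

```python
import numpy as np

def ncm_predict(x: np.ndarray, class_means: list) -> int:
    """Nearest Class Mean classifier with cosine similarity: only the
    direction of x matters, so norm differences between classes are ignored."""
    x = x / np.linalg.norm(x)
    sims = [float(x @ (m / np.linalg.norm(m))) for m in class_means]
    return int(np.argmax(sims))
```

Because the prediction is invariant to the scale of `x`, norm differences caused by sample imbalance carry no signal here; SAAN's contribution is to reintroduce them as part of the classification criterion.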

[CV-30] Harnessing Vision Foundation Models for High-Performance Training-Free Open Vocabulary Segmentation

链接: https://arxiv.org/abs/2411.09219
作者: Yuheng Shi,Minjing Dong,Chang Xu
关键词-EN: Contrastive Language-Image Pre-training, advanced open-vocabulary predictions, Language-Image Pre-training, Contrastive Language-Image, open-vocabulary predictions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. While previous adaptations addressed spatial invariance by modifying the self-attention in CLIP’s image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment-Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM’s encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. Besides, we propose a refinement strategy for CLIP’s coarse segmentation outputs by transforming them into prompts for SAM, further enhancing the segmentation performance. Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, increasing from a baseline of 44.4. Code is available at this https URL.

[CV-31] JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation

链接: https://arxiv.org/abs/2411.09209
作者: Xuyang Cao,Sheng Shi,Jun Zhao,Yang Yao,Jintao Fei,Minyu Gao,Guoxin Wang
关键词-EN: made significant advances, lipsync accuracy, facial representation, made significant, significant advances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Audio-driven portrait animation has made significant advances with diffusion-based models, improving video quality and lipsync accuracy. However, the increasing complexity of these models has led to inefficiencies in training and inference, as well as constraints on video length and inter-frame continuity. In this paper, we propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. Then, in the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity. Finally, a generator trained in the first stage uses the 3D facial representation and the generated motion sequences as inputs to render high-quality animations. With the decoupled facial representation and the identity-independent motion generation process, JoyVASA extends beyond human portraits to animate animal faces seamlessly. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the applications in portrait animation. The code will be available at: this https URL.

[CV-32] DyGASR: Dynamic Generalized Exponential Splatting with Surface Alignment for Accelerated 3D Mesh Reconstruction

链接: https://arxiv.org/abs/2411.09156
作者: Shengchao Zhao,Yundong Li
关键词-EN: Recent advancements, radiance field reconstruction, Generalized Exponential Splatting, accelerated rendering, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Recent advancements in 3D Gaussian Splatting (3DGS), which lead to high-quality novel view synthesis and accelerated rendering, have remarkably improved the quality of radiance field reconstruction. However, the extraction of mesh from a massive number of minute 3D Gaussian points remains great challenge due to the large volume of Gaussians and difficulty of representation of sharp signals caused by their inherent low-pass characteristics. To address this issue, we propose DyGASR, which utilizes generalized exponential function instead of traditional 3D Gaussian to decrease the number of particles and dynamically optimize the representation of the captured signal. In addition, it is observed that reconstructing mesh with Generalized Exponential Splatting(GES) without modifications frequently leads to failures since the generalized exponential distribution centroids may not precisely align with the scene surface. To overcome this, we adopt Sugar’s approach and introduce Generalized Surface Regularization (GSR), which reduces the smallest scaling vector of each point cloud to zero and ensures normal alignment perpendicular to the surface, facilitating subsequent Poisson surface mesh reconstruction. Additionally, we propose a dynamic resolution adjustment strategy that utilizes a cosine schedule to gradually increase image resolution from low to high during the training stage, thus avoiding constant full resolution, which significantly boosts the reconstruction speed. Our approach surpasses existing 3DGS-based mesh reconstruction methods, as evidenced by extensive evaluations on various scene datasets, demonstrating a 25% increase in speed, and a 30% reduction in memory usage.

[CV-33] VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation NEURIPS2024

链接: https://arxiv.org/abs/2411.09153
作者: Youpeng Wen,Junfan Lin,Yi Zhu,Jianhua Han,Hang Xu,Shen Zhao,Xiaodan Liang
关键词-EN: Recent advancements utilizing, advancements utilizing large-scale, understanding complex physical, Recent advancements, utilizing large-scale video
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Recent advancements utilizing large-scale video data for learning video generation models demonstrate significant potential in understanding complex physical dynamics. It suggests the feasibility of leveraging diverse robot trajectory data to develop a unified, dynamics-aware model to enhance robot manipulation. However, given the relatively small amount of available robot data, directly fitting data without considering the relationship between visual observations and actions could lead to suboptimal data utilization. To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two-stage training mechanism inspired by dual-process theory from neuroscience to enhance stability and improve data utilization efficiency. Specifically, in the first stage, VidMan is pre-trained on the Open X-Embodiment dataset (OXE) for predicting future visual trajectories in a video denoising diffusion manner, enabling the model to develop a long horizontal awareness of the environment’s dynamics. In the second stage, a flexible yet effective layer-wise self-attention adapter is introduced to transform VidMan into an efficient inverse dynamics model that predicts action modulated by the implicit dynamics knowledge via parameter sharing. Our VidMan framework outperforms state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving a 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset. These results provide compelling evidence that world models can significantly enhance the precision of robot action prediction. Codes and models will be public.

[CV-34] Mono2Stereo: Monocular Knowledge Transfer for Enhanced Stereo Matching

链接: https://arxiv.org/abs/2411.09151
作者: Yuran Wang,Yingping Liang,Hesong Li,Ying Fu
关键词-EN: existing synthetic datasets, stereo matching networks, stereo matching, networks are limited, limited due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:The generalization and performance of stereo matching networks are limited due to the domain gap of the existing synthetic datasets and the sparseness of GT labels in the real datasets. In contrast, monocular depth estimation has achieved significant advancements, benefiting from large-scale depth datasets and self-supervised strategies. To bridge the performance gap between monocular depth estimation and stereo matching, we propose leveraging monocular knowledge transfer to enhance stereo matching, namely Mono2Stereo. We introduce knowledge transfer with a two-stage training process, comprising synthetic data pre-training and real-world data fine-tuning. In the pre-training stage, we design a data generation pipeline that synthesizes stereo training data from monocular images. This pipeline utilizes monocular depth for warping and novel view synthesis and employs our proposed Edge-Aware (EA) inpainting module to fill in missing contents in the generated images. In the fine-tuning stage, we introduce a Sparse-to-Dense Knowledge Distillation (S2DKD) strategy encouraging the distributions of predictions to align with dense monocular depths. This strategy mitigates issues with edge blurring in sparse real-world labels and enhances overall consistency. Experimental results demonstrate that our pre-trained model exhibits strong zero-shot generalization capabilities. Furthermore, domain-specific fine-tuning using our pre-trained model and S2DKD strategy significantly increments in-domain performance. The code will be made available soon.

[CV-35] UniHOI: Learning Fast Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos

链接: https://arxiv.org/abs/2411.09145
作者: Chengbo Yuan,Geng Chen,Li Yi,Yang Gao
关键词-EN: Hand Object Interaction, Object Interaction, attracting growing interest, Egocentric Hand Object, provide valuable insights
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Egocentric Hand Object Interaction (HOI) videos provide valuable insights into human interactions with the physical world, attracting growing interest from the computer vision and robotics communities. A key task in fully understanding the geometry and dynamics of HOI scenes is dense pointclouds sequence reconstruction. However, the inherent motion of both hands and the camera makes this challenging. Current methods often rely on time-consuming test-time optimization, making them impractical for reconstructing internet-scale videos. To address this, we introduce UniHOI, a model that unifies the estimation of all variables necessary for dense 4D reconstruction, including camera intrinsic, camera poses, and video depth, for egocentric HOI scene in a fast feed-forward manner. We end-to-end optimize all these variables to improve their consistency in 3D space. Furthermore, our model could be trained solely on large-scale monocular video dataset, overcoming the limitation of scarce labeled HOI data. We evaluate UniHOI with both in-domain and zero-shot generalization setting, surpassing all baselines in pointclouds sequence reconstruction and long-term 3D scene flow recovery. UniHOI is the first approach to offer fast, dense, and generalizable monocular egocentric HOI scene reconstruction in the presence of motion. Code and trained model will be released in the future.

[CV-36] Adversarial Vessel-Unveiling Semi-Supervised Segmentation for Retinopathy of Prematurity Diagnosis

链接: https://arxiv.org/abs/2411.09140
作者: Gozde Merve Demirci,Jiachen Yao,Ming-Chih Ho,Xiaoling Hu,Wei-Chi Wu,Chao Chen,Chia-Ling Tsai
关键词-EN: retinal images plays, Accurate segmentation, retinopathy of prematurity, assessing its severity, plays a crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Accurate segmentation of retinal images plays a crucial role in aiding ophthalmologists in diagnosing retinopathy of prematurity (ROP) and assessing its severity. However, due to their underdeveloped, thinner vessels, manual annotation in infant fundus images is very complex, and this presents challenges for fully supervised learning. To address the scarcity of annotations, we propose a semi-supervised segmentation framework designed to advance ROP studies without the need for extensive manual vessel annotation. Unlike previous methods that rely solely on limited labeled data, our approach leverages teacher-student learning by integrating two powerful components: an uncertainty-weighted vessel-unveiling module and domain adversarial learning. The vessel-unveiling module helps the model effectively reveal obscured and hard-to-detect vessel structures, while adversarial training aligns feature representations across different domains, ensuring robust and generalizable vessel segmentations. We validate our approach on public datasets (CHASEDB, STARE) and an in-house ROP dataset, demonstrating its superior performance across multiple evaluation metrics. Additionally, we extend the model’s utility to a downstream task of ROP multi-stage classification, where vessel masks extracted by our segmentation model improve diagnostic accuracy. The promising results in classification underscore the model’s potential for clinical application, particularly in early-stage ROP diagnosis and intervention. Overall, our work offers a scalable solution for leveraging unlabeled data in pediatric ophthalmology, opening new avenues for biomarker discovery and clinical research.

[CV-37] SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency

链接: https://arxiv.org/abs/2411.09126
作者: Yangyang Guo,Mohan Kankanhalli
关键词-EN: data efficiency problem, widely employed, efficiency problem, problem has remained, remained relatively under-explored
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While contrastive pre-training is widely employed, its data efficiency problem has remained relatively under-explored thus far. Existing methods often rely on static coreset selection algorithms to pre-identify important data for training. However, this static nature renders them unable to dynamically track the data usefulness throughout pre-training, leading to subpar pre-trained models. To address this challenge, our paper introduces a novel dynamic bootstrapping dataset pruning method. It involves pruning data preparation followed by dataset mutation operations, both of which undergo iterative and dynamic updates. We apply this method to two prevalent contrastive pre-training frameworks: CLIP and MoCo, representing vision-language and vision-centric domains, respectively. In particular, we individually pre-train seven CLIP models on two large-scale image-text pair datasets, and two MoCo models on the ImageNet dataset, resulting in a total of 16 pre-trained models. With a data pruning rate of 30-35% across all 16 models, our method exhibits only marginal performance degradation (less than 1% on average) compared to corresponding models trained on the full dataset counterparts across various downstream datasets, and also surpasses several baselines with a large performance margin. Additionally, the byproduct from our method, i.e., coresets derived from the original datasets after pre-training, also demonstrates significant superiority in terms of downstream performance over other static coreset selection approaches.
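One round of "prune then mutate" can be sketched as below; the keep/mutate ratios and the scoring mechanism are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_prune(scores, keep_frac=0.7, mutate_frac=0.1):
    """One round of dynamic dataset pruning with mutation.

    Keep the highest-scoring fraction of samples, then 'mutate' the kept
    set by swapping a small fraction back in from the pruned pool, so the
    coreset can change as usefulness estimates evolve during pre-training.
    """
    n = len(scores)
    order = np.argsort(scores)[::-1]          # most useful first
    n_keep = int(n * keep_frac)
    kept, pruned = order[:n_keep].copy(), order[n_keep:]
    n_swap = min(int(n_keep * mutate_frac), len(pruned))
    if n_swap > 0:
        drop = rng.choice(n_keep, size=n_swap, replace=False)
        add = rng.choice(len(pruned), size=n_swap, replace=False)
        kept[drop] = pruned[add]              # mutation step
    return np.sort(kept)

scores = rng.random(100)
coreset = bootstrap_prune(scores)             # indices of the retained samples
```

Re-running this each epoch with refreshed scores gives the dynamic tracking that static coreset selection lacks.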

[CV-38] A multidimensional measurement of photorealistic avatar quality of experience

链接: https://arxiv.org/abs/2411.09066
作者: Ross Cutler,Babak Naderi,Vishak Gopal,Dharmendar Palle
关键词-EN: Photorealistic avatars, Photorealistic, avatars, test framework, SSIM
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: arXiv admin note: text overlap with arXiv:2204.06784

点击查看摘要

Abstract:Photorealistic avatars are human avatars that look, move, and talk like real people. The performance of photorealistic avatars has significantly improved recently based on objective metrics such as PSNR, SSIM, LPIPS, FID, and FVD. However, recent photorealistic avatar publications do not provide subjective tests of the avatars to measure human usability factors. We provide an open source test framework to subjectively measure photorealistic avatar performance in ten dimensions: realism, trust, comfortableness using, comfortableness interacting with, appropriateness for work, creepiness, formality, affinity, resemblance to the person, and emotion accuracy. We show that the correlation of nine of these subjective metrics with PSNR, SSIM, LPIPS, FID, and FVD is weak, and moderate for emotion accuracy. The crowdsourced subjective test framework is highly reproducible and accurate when compared to a panel of experts. We analyze a wide range of avatars from photorealistic to cartoon-like and show that some photorealistic avatars are approaching real video performance based on these dimensions. We also find that for avatars above a certain level of realism, eight of these measured dimensions are strongly correlated. In particular, for photorealistic avatars there is a linear relationship between avatar affinity and realism; in other words, there is no uncanny valley effect for photorealistic avatars in the telecommunication scenario. We provide several extensions of this test framework for future work and discuss design implications for telecommunication systems. The test framework is available at this https URL.

[CV-39] A Transformer-Based Visual Piano Transcription Algorithm

链接: https://arxiv.org/abs/2411.09037
作者: Uros Zivanovic,Carlos Eduardo Cancino-Chacón
关键词-EN: Music Information Retrieval, Automatic music transcription, long standing problem, Automatic music, Information Retrieval
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:Automatic music transcription (AMT) for musical performances is a long standing problem in the field of Music Information Retrieval (MIR). Visual piano transcription (VPT) is a multimodal subproblem of AMT which focuses on extracting a symbolic representation of a piano performance from visual information only (e.g., from a top-down video of the piano keyboard). Inspired by the success of Transformers for audio-based AMT, as well as their recent successes in other computer vision tasks, in this paper we present a Transformer based architecture for VPT. The proposed VPT system combines a piano bounding box detection model with an onset and pitch detection model, allowing our system to perform well in more naturalistic conditions like imperfect image crops around the piano and slightly tilted images.

[CV-40] CoMiX: Cross-Modal Fusion with Deformable Convolutions for HSI-X Semantic Segmentation

链接: https://arxiv.org/abs/2411.09023
作者: Xuming Zhang,Xingfa Gu,Qingjiu Tian,Lorenzo Bruzzone
关键词-EN: Improving hyperspectral image, referred to X-modality, Improving hyperspectral, supplementary data type, image content
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Improving hyperspectral image (HSI) semantic segmentation by exploiting complementary information from a supplementary data type (referred to as X-modality) is promising but challenging due to differences in imaging sensors, image content, and resolution. Current techniques struggle to enhance modality-specific and modality-shared information, as well as to capture dynamic interaction and fusion between different modalities. In response, this study proposes CoMiX, an asymmetric encoder-decoder architecture with deformable convolutions (DCNs) for HSI-X semantic segmentation. CoMiX is designed to extract, calibrate, and fuse information from HSI and X data. Its pipeline includes an encoder with two parallel and interacting backbones and a lightweight all-multilayer perceptron (ALL-MLP) decoder. The encoder consists of four stages, each incorporating 2D DCN blocks for the X model to accommodate geometric variations and 3D DCN blocks for HSIs to adaptively aggregate spatial-spectral features. Additionally, each stage includes a Cross-Modality Feature enhancement and eXchange (CMFeX) module and a feature fusion module (FFM). CMFeX is designed to exploit spatial-spectral correlations from different modalities to recalibrate and enhance modality-specific and modality-shared features while adaptively exchanging complementary information between them. Outputs from CMFeX are fed into the FFM for fusion and passed to the next stage for further information learning. Finally, the outputs from each FFM are integrated by the ALL-MLP decoder for final prediction. Extensive experiments demonstrate that our CoMiX achieves superior performance and generalizes well to various multimodal recognition tasks. The CoMiX code will be released.

[CV-41] Scale Contrastive Learning with Selective Attentions for Blind Image Quality Assessment

链接: https://arxiv.org/abs/2411.09007
作者: Zihao Huang,Xudong Li,Bohan Fu,Xiaohui Chu,Ke Li,Yunhang Shen,Yan Zhang
关键词-EN: human subjective perception, Blind image quality, Blind image, subjective perception, fundamental task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Blind image quality assessment (BIQA) serves as a fundamental task in computer vision, yet it often fails to consistently align with human subjective perception. Recent advances show that multi-scale evaluation strategies are promising due to their ability to replicate the hierarchical structure of human vision. However, the effectiveness of these strategies is limited by a lack of understanding of how different image scales influence perceived quality. This paper addresses two primary challenges: the significant redundancy of information across different scales, and the confusion caused by combining features from these scales, which may vary widely in quality. To this end, a new multi-scale BIQA framework is proposed, namely Contrast-Constrained Scale-Focused IQA Framework (CSFIQA). CSFIQA features a selective focus attention mechanism to minimize information redundancy and highlight critical quality-related information. Additionally, CSFIQA includes a scale-level contrastive learning module equipped with a noise sample matching mechanism to identify quality discrepancies across the same image content at different scales. By exploring the intrinsic relationship between image scales and the perceived quality, the proposed CSFIQA achieves leading performance on eight benchmark datasets, e.g., achieving SRCC values of 0.967 (versus 0.947 in CSIQ) and 0.905 (versus 0.876 in LIVEC).

[CV-42] Computed tomography using meta-optics

链接: https://arxiv.org/abs/2411.08995
作者: Maksym Zhelyeznuyakov,Johannes E. Fröch,Shane Colburn,Steven L. Brunton,Arka Majumdar
关键词-EN: Computer vision tasks, tasks require processing, vision tasks require, processing large amounts, require processing large
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
*备注:

点击查看摘要

Abstract:Computer vision tasks require processing large amounts of data to perform image classification, segmentation, and feature extraction. Optical preprocessors can potentially reduce the number of floating point operations required by computer vision tasks, enabling low-power and low-latency operation. However, existing optical preprocessors are mostly learned and hence strongly depend on the training data, and thus lack universal applicability. In this paper, we present a metaoptic imager, which implements the Radon transform, obviating the need for training the optics. High-quality image reconstruction with a large compression ratio of 0.6% is presented through the use of the Simultaneous Algebraic Reconstruction Technique. Image classification with 90% accuracy is presented on an experimentally measured Radon dataset through a neural network trained on digitally transformed images.
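The algebraic reconstruction step can be illustrated on a toy system. Below, a 2x2 "image" is observed through four line integrals (row and column sums), and Kaczmarz-style sweeps, the iteration underlying ART/SART, recover it; a real Radon setup has many angles and detector bins, so this is a minimal sketch only:

```python
import numpy as np

# Measurement matrix: each row is one line integral over the flattened 2x2 image.
A = np.array([[1., 1., 0., 0.],   # row 0 sum
              [0., 0., 1., 1.],   # row 1 sum
              [1., 0., 1., 0.],   # col 0 sum
              [0., 1., 0., 1.]])  # col 1 sum
x_true = np.array([1., 2., 3., 4.])
b = A @ x_true                    # the "sinogram"

# Kaczmarz iteration: project the estimate onto each measurement hyperplane.
x = np.zeros(4)
for _ in range(200):
    for a_i, b_i in zip(A, b):
        x += (b_i - a_i @ x) / (a_i @ a_i) * a_i
```

Starting from zero, the iteration converges to the minimum-norm solution consistent with all projections, which here coincides with the original image.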

[CV-43] Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head

链接: https://arxiv.org/abs/2411.08937
作者: Penghui Yang,Chen-Chen Zong,Sheng-Jun Huang,Lei Feng,Bo An
关键词-EN: student predicted probabilities, teacher predicted probabilities, predicted probabilities, Traditional knowledge distillation, Traditional knowledge
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Traditional knowledge distillation focuses on aligning the student’s predicted probabilities with both ground-truth labels and the teacher’s predicted probabilities. However, the transition to predicted probabilities from logits would obscure certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, for exploiting the latent information of logits. Unfortunately, we empirically find that the amalgamation of the newly introduced logit-level loss and the previous probability-level loss will lead to performance degeneration, even trailing behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on the neural collapse theory. Specifically, the gradients of the two loss functions exhibit contradictions in the linear classifier yet display no such conflict within the backbone. Drawing from the theoretical analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance against state-of-the-art counterparts.
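The head-splitting idea can be sketched numerically: route the probability-level loss to one linear head and the logit-level loss to the other, so their conflicting classifier gradients never meet. The loss choices below (KL + CE on the main head, MSE on the auxiliary head) are plausible stand-ins, not necessarily the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dual_head_losses(feat, W_main, W_aux, teacher_logits, label):
    """Compute the two losses of a dual-head distillation step.

    `feat` is the shared backbone feature; `W_main`/`W_aux` are the two
    linear classification heads. The main head receives the
    probability-level KD loss (KL) plus the ground-truth CE; the
    auxiliary head receives the logit-level loss (MSE).
    """
    main_logits = W_main @ feat
    aux_logits = W_aux @ feat
    p_s, p_t = softmax(main_logits), softmax(teacher_logits)
    kl = float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))      # prob-level
    ce = float(-np.log(softmax(main_logits)[label]))           # ground truth
    mse = float(np.mean((aux_logits - teacher_logits) ** 2))   # logit-level
    return kl + ce, mse
```

At inference the auxiliary head is simply discarded; only the backbone benefited from the logit-level signal.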

[CV-44] Classification of Keratitis from Eye Corneal Photographs using Deep Learning

链接: https://arxiv.org/abs/2411.08935
作者: Maria Miguel Beirão,João Matos,Tiago Gonçalves,Camila Kase,Luis Filipe Nakayama,Denise de Freitas,Jaime S. Cardoso
关键词-EN: inflammatory corneal condition, corneal condition responsible, common infection etiologies, impairment in low, middle-income countries
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 6 pages; Accepted at IEEE’s International Conference on Bioinformatics and Biomedicine (2024)

点击查看摘要

Abstract:Keratitis is an inflammatory corneal condition responsible for 10% of visual impairment in low- and middle-income countries (LMICs), with bacteria, fungi, or amoeba as the most common infection etiologies. While an accurate and timely diagnosis is crucial for the selected treatment and the patients’ sight outcomes, due to the high cost and limited availability of laboratory diagnostics in LMICs, diagnosis is often made by clinical observation alone, despite its lower accuracy. In this study, we investigate and compare different deep learning approaches to diagnose the source of infection: 1) three separate binary models for infection type predictions; 2) a multitask model with a shared backbone and three parallel classification layers (Multitask V1); and, 3) a multitask model with a shared backbone and a multi-head classification layer (Multitask V2). We used a private Brazilian cornea dataset to conduct the empirical evaluation. We achieved the best results with Multitask V2, with an area under the receiver operating characteristic curve (AUROC) confidence intervals of 0.7413-0.7740 (bacteria), 0.8395-0.8725 (fungi), and 0.9448-0.9616 (amoeba). A statistical analysis of the impact of patient features on models’ performance revealed that sex significantly affects amoeba infection prediction, and age seems to affect fungi and bacteria predictions.

[CV-45] Predicting household socioeconomic position in Mozambique using satellite and household imagery

链接: https://arxiv.org/abs/2411.08934
作者: Carles Milà,Teodimiro Matsena,Edgar Jamisse,Jovito Nunes,Quique Bassat,Paula Petrone,Elisa Sicuri,Charfudin Sacoor,Cathryn Tonne
关键词-EN: predicted SocioEconomic Position, aggregated spatial units, SocioEconomic Position, SEP, income SEP data
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many studies have predicted SocioEconomic Position (SEP) for aggregated spatial units such as villages using satellite data, but SEP prediction at the household level and other sources of imagery have not been yet explored. We assembled a dataset of 975 households in a semi-rural district in southern Mozambique, consisting of self-reported asset, expenditure, and income SEP data, as well as multimodal imagery including satellite images and a ground-based photograph survey of 11 household elements. We fine-tuned a convolutional neural network to extract feature vectors from the images, which we then used in regression analyses to model household SEP using different sets of image types. The best prediction performance was found when modeling asset-based SEP using random forest models with all image types, while the performance for expenditure- and income-based SEP was lower. Using SHAP, we observed clear differences between the images with the largest positive and negative effects, as well as identified the most relevant household elements in the predictions. Finally, we fitted an additional reduced model using only the identified relevant household elements, which had only slightly lower performance compared to models using all images. Our results show how ground-based household photographs allow us to zoom in from an area-level to an individual-household prediction while minimizing the data collection effort by using explainable machine learning. The developed workflow can be potentially integrated into routine household surveys, where the collected household imagery could be used for other purposes, such as refined asset characterization and environmental exposure assessment.

[CV-46] Structured Pattern Expansion with Diffusion Models

链接: https://arxiv.org/abs/2411.08930
作者: Marzia Riso,Giuseppe Vecchio,Fabio Pellacini
关键词-EN: Recent advances, significantly improved, diffusion models, improved the synthesis, Recent
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注:

点击查看摘要

Abstract:Recent advances in diffusion models have significantly improved the synthesis of materials, textures, and 3D shapes. By conditioning these models via text or images, users can guide the generation, reducing the time required to create digital assets. In this paper, we address the synthesis of structured, stationary patterns, where diffusion models are generally less reliable and, more importantly, less controllable. Our approach leverages the generative capabilities of diffusion models specifically adapted for the pattern domain. It enables users to exercise direct control over the synthesis by expanding a partially hand-drawn pattern into a larger design while preserving the structure and details of the input. To enhance pattern quality, we fine-tune an image-pretrained diffusion model on structured patterns using Low-Rank Adaptation (LoRA), apply a noise rolling technique to ensure tileability, and utilize a patch-based approach to facilitate the generation of large-scale assets. We demonstrate the effectiveness of our method through a comprehensive set of experiments, showing that it outperforms existing models in generating diverse, consistent patterns that respond directly to user input.
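The noise rolling idea can be sketched as follows: before each denoising step, roll the image by a random offset and roll back afterwards, so the borders land somewhere new every step and no fixed seam gets baked in. `circular_blur` below is a toy wrap-around stand-in for the diffusion model's denoiser, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def circular_blur(x):
    # Toy stand-in for one denoising step: average each pixel with its
    # 4 wrap-around neighbours.
    return (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
              + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 5.0

def roll_denoise_step(x, step_fn):
    """One 'noise rolling' step: randomly roll, denoise, roll back.

    If step_fn treats the canvas as wrap-around, the result tiles
    seamlessly because seams never stay in one place across steps.
    """
    dy = int(rng.integers(0, x.shape[0]))
    dx = int(rng.integers(0, x.shape[1]))
    rolled = np.roll(x, (dy, dx), axis=(0, 1))
    return np.roll(step_fn(rolled), (-dy, -dx), axis=(0, 1))

x0 = rng.standard_normal((8, 8))
x = x0
for _ in range(3):
    x = roll_denoise_step(x, circular_blur)
```

For a fully wrap-around operator, rolling is an exact no-op, which is precisely why the output can be repeated edge-to-edge without visible seams.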

[CV-47] Aligning Visual Contrastive learning models via Preference Optimization

链接: https://arxiv.org/abs/2411.08923
作者: Amirabbas Afzali,Borna Khodabandeh,Ali Rasekh,Mahyar JafariNodeh,Sepehr kazemi,Simon Gottschalk
关键词-EN: demonstrated impressive abilities, capture semantic similarities, Direct Preference Optimization, Contrastive learning, Preference Optimization
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Contrastive learning models have demonstrated impressive abilities to capture semantic similarities by aligning representations in the embedding space. However, their performance can be limited by the quality of the training data and its inherent biases. While Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been applied to generative models to align them with human preferences, their use in contrastive learning has yet to be explored. This paper introduces a novel method for training contrastive learning models using Preference Optimization (PO) to break down complex concepts. Our method systematically aligns model behavior with desired preferences, enhancing performance on the targeted task. In particular, we focus on enhancing model robustness against typographic attacks, commonly seen in contrastive models like CLIP. We further apply our method to disentangle gender understanding and mitigate gender biases, offering a more nuanced control over these sensitive attributes. Our experiments demonstrate that models trained using PO outperform standard contrastive learning techniques while retaining their ability to handle adversarial challenges and maintain accuracy on other downstream tasks. This makes our method well-suited for tasks requiring fairness, robustness, and alignment with specific preferences. We evaluate our method on several vision-language tasks, tackling challenges such as typographic attacks. Additionally, we explore the model’s ability to disentangle gender concepts and mitigate gender bias, showcasing the versatility of our approach.
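A DPO-style preference loss transfers naturally to contrastive similarity scores: push the embedding similarity of the preferred pair (e.g., the clean image with the correct caption) above that of the rejected pair (e.g., a typographically attacked one). The sketch below is a generic preference-optimization objective, not necessarily the paper's exact loss:

```python
import numpy as np

def po_loss(sim_preferred, sim_rejected, beta=1.0):
    """-log(sigmoid(beta * (s_preferred - s_rejected))).

    Minimizing this increases the similarity margin between the
    preferred and rejected pairs; beta scales the margin's sharpness.
    """
    margin = beta * (sim_preferred - sim_rejected)
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))
```

When the two similarities are equal, the loss sits at log 2 and decays toward zero as the preferred pair pulls ahead.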

[CV-48] Assessing the Performance of the DINOv2 Self-supervised Learning Vision Transformer Model for the Segmentation of the Left Atrium from MRI Images

链接: https://arxiv.org/abs/2411.09598
作者: Bipasha Kundu,Bidur Khanal,Richard Simon,Cristian A. Linte
关键词-EN: diagnosing atrial fibrillation, Accurate left atrium, supporting surgical interventions, treatment planning, left atrium
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages, 3 figures, SPIE Medical Imaging, 2025

点击查看摘要

Abstract:Accurate left atrium (LA) segmentation from pre-operative scans is crucial for diagnosing atrial fibrillation, treatment planning, and supporting surgical interventions. While deep learning models are key in medical image segmentation, they often require extensive manually annotated data. Foundation models trained on larger datasets have reduced this dependency, enhancing generalizability and robustness through transfer learning. We explore DINOv2, a self-supervised learning vision transformer trained on natural images, for LA segmentation using MRI. The challenges of LA’s complex anatomy, thin boundaries, and limited annotated data make accurate segmentation difficult before and during the image-guided intervention. We demonstrate DINOv2’s ability to provide accurate and consistent segmentation, achieving a mean Dice score of .871 and a Jaccard Index of .792 for end-to-end fine-tuning. Through few-shot learning across various data sizes and patient counts, DINOv2 consistently outperforms baseline models. These results suggest that DINOv2 effectively adapts to MRI with limited data, highlighting its potential as a competitive tool for segmentation and encouraging broader use in medical imaging.
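The two overlap metrics reported here are standard and easy to compute from binary masks; Dice = 2|A∩B|/(|A|+|B|) and Jaccard = |A∩B|/|A∪B|, related by J = D/(2-D):

```python
import numpy as np

def dice_and_jaccard(pred, target):
    """Overlap metrics between two binary segmentation masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = 2.0 * inter / (pred.sum() + target.sum())
    jaccard = inter / union
    return float(dice), float(jaccard)
```

For example, masks [1,1,0,0] and [1,0,1,0] overlap in one pixel out of three occupied positions, giving Dice 0.5 and Jaccard 1/3.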

[CV-49] GAN-Based Architecture for Low-dose Computed Tomography Imaging Denoising

链接: https://arxiv.org/abs/2411.09512
作者: Yunuo Wang,Ningning Yang,Jialin Li
关键词-EN: Generative Adversarial Networks, low-dose computed tomography, reconciling radiation exposure, Generative Adversarial, Adversarial Networks
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) have surfaced as a revolutionary element within the domain of low-dose computed tomography (LDCT) imaging, providing an advanced resolution to the enduring issue of reconciling radiation exposure with image quality. This comprehensive review synthesizes the rapid advancements in GAN-based LDCT denoising techniques, examining the evolution from foundational architectures to state-of-the-art models incorporating advanced features such as anatomical priors, perceptual loss functions, and innovative regularization strategies. We critically analyze various GAN architectures, including conditional GANs (cGANs), CycleGANs, and Super-Resolution GANs (SRGANs), elucidating their unique strengths and limitations in the context of LDCT denoising. The evaluation provides both qualitative and quantitative results related to the improvements in performance in benchmark and clinical datasets with metrics such as PSNR, SSIM, and LPIPS. After highlighting the positive results, we discuss some of the challenges preventing a wider clinical use, including the interpretability of the images generated by GANs, synthetic artifacts, and the need for clinically relevant metrics. The review concludes by highlighting the essential significance of GAN-based methodologies in the progression of precision medicine via tailored LDCT denoising models, underlining the transformative possibilities presented by artificial intelligence within contemporary radiological practice.
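Of the metrics mentioned, PSNR is the simplest to state exactly: PSNR = 10 · log10(data_range² / MSE), with higher values meaning the denoised CT is closer to the reference:

```python
import numpy as np

def psnr(reference, test, data_range=255.0):
    """Peak signal-to-noise ratio between a reference image and a
    (e.g., denoised) test image, in decibels."""
    mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))
```

SSIM and LPIPS, by contrast, are structural/perceptual metrics and need dedicated implementations.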

[CV-50] Are nuclear masks all you need for improved out-of-domain generalisation? A closer look at cancer classification in histopathology NEURIPS2024

链接: https://arxiv.org/abs/2411.09373
作者: Dhananjay Tomar,Alexander Binder,Andreas Kleppe
关键词-EN: Domain generalisation, imaging equipment, OOD generalisation, improve OOD generalisation, computational histopathology
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Poster at NeurIPS 2024

点击查看摘要

Abstract:Domain generalisation in computational histopathology is challenging because the images are substantially affected by differences among hospitals due to factors like fixation and staining of tissue and imaging equipment. We hypothesise that focusing on nuclei can improve the out-of-domain (OOD) generalisation in cancer detection. We propose a simple approach to improve OOD generalisation for cancer detection by focusing on nuclear morphology and organisation, as these are domain-invariant features critical in cancer detection. Our approach integrates original images with nuclear segmentation masks during training, encouraging the model to prioritise nuclei and their spatial arrangement. Going beyond mere data augmentation, we introduce a regularisation technique that aligns the representations of masks and original images. We show, using multiple datasets, that our method improves OOD generalisation and also leads to increased robustness to image corruptions and adversarial attacks. The source code is available at this https URL
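The mask-image alignment regularisation can be sketched as a simple penalty between the backbone's representations of an image and its mask-based counterpart; the MSE distance and the weight `lam` below are illustrative choices, not the paper's exact regulariser:

```python
import numpy as np

def mask_alignment_loss(feat_image, feat_masked, lam=0.1):
    """Penalty pulling the representation of the original image toward
    that of its nuclear-mask version, encouraging the model to rely on
    domain-invariant nuclear morphology rather than stain or scanner
    cues."""
    return lam * float(np.mean((feat_image - feat_masked) ** 2))
```

Added to the classification loss, this term is zero only when both views map to the same feature vector.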

[CV-51] DT-JRD: Deep Transformer based Just Recognizable Difference Prediction Model for Video Coding for Machines

链接: https://arxiv.org/abs/2411.09308
作者: Junqi Liu,Yun Zhang,Xiaoqi Wang,Xu Long,Sam Kwong
关键词-EN: minimum visual difference, visual signal processing, Recognizable Difference, oriented visual signal, vision oriented visual
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to IEEE Transactions on Multimedia

点击查看摘要

Abstract:Just Recognizable Difference (JRD) represents the minimum visual difference that is detectable by machine vision, which can be exploited to promote machine vision oriented visual signal processing. In this paper, we propose a Deep Transformer based JRD (DT-JRD) prediction model for Video Coding for Machines (VCM), where the accurately predicted JRD can be used to reduce the coding bit rate while maintaining the accuracy of machine tasks. Firstly, we model the JRD prediction as a multi-class classification and propose a DT-JRD prediction model that integrates an improved embedding, a content and distortion feature extraction, a multi-class classification and a novel learning strategy. Secondly, inspired by the perception property that machine vision exhibits a similar response to distortions near JRD, we propose an asymptotic JRD loss by using Gaussian Distribution-based Soft Labels (GDSL), which significantly extends the number of training labels and relaxes classification boundaries. Finally, we propose a DT-JRD based VCM to reduce the coding bits while maintaining the accuracy of object detection. Extensive experimental results demonstrate that the mean absolute error of the predicted JRD by the DT-JRD is 5.574, outperforming the state-of-the-art JRD prediction model by 13.1%. Coding experiments show that compared with the VVC, the DT-JRD based VCM achieves an average of 29.58% bit rate reduction while maintaining the object detection accuracy.
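The Gaussian Distribution-based Soft Labels (GDSL) idea can be sketched directly: instead of a one-hot target, neighbouring JRD classes receive probability mass from a Gaussian centred on the true class, reflecting that machine vision responds similarly to distortions near the JRD. The parameterization below (a single `sigma`) is an illustrative assumption:

```python
import numpy as np

def gaussian_soft_labels(true_class, n_classes, sigma=1.0):
    """Soft-label vector: a normalized Gaussian over class indices,
    centred on the true class."""
    classes = np.arange(n_classes)
    weights = np.exp(-0.5 * ((classes - true_class) / sigma) ** 2)
    return weights / weights.sum()
```

Training against these targets (e.g., with a cross-entropy/KL loss) relaxes the hard classification boundaries between adjacent JRD levels.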

[CV-52] Leveraging Auxiliary Classification for Rib Fracture Segmentation

链接: https://arxiv.org/abs/2411.09283
作者: Harini G.,Aiman Farooq,Deepak Mishra
关键词-EN: Thoracic trauma, effective treatment, demand swift, swift and accurate, accurate diagnosis
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ICVGIP’24

点击查看摘要

Abstract:Thoracic trauma often results in rib fractures, which demand swift and accurate diagnosis for effective treatment. However, detecting these fractures on rib CT scans poses considerable challenges, involving the analysis of many image slices in sequence. Despite notable advancements in algorithms for automated fracture segmentation, the persisting challenges stem from the diverse shapes and sizes of these fractures. To address these issues, this study introduces a sophisticated deep-learning model with an auxiliary classification task designed to enhance the accuracy of rib fracture segmentation. The auxiliary classification task is crucial in distinguishing between fractured ribs and negative regions, encompassing non-fractured ribs and surrounding tissues, from the patches obtained from CT scans. By leveraging this auxiliary task, the model aims to improve feature representation at the bottleneck layer by highlighting the regions of interest. Experimental results on the RibFrac dataset demonstrate significant improvement in segmentation performance.

[CV-53] Fast probabilistic snake algorithm

链接: https://arxiv.org/abs/2411.09137
作者: Jérôme Gilles,Bertrand Collin
关键词-EN: achieve image segmentation, snake models, theory in order, order to achieve, achieve image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Few people use probability theory to achieve image segmentation with snake models. In this article, we present an active contour algorithm based on a probabilistic approach inspired by A. Blake’s work and the research of P. Réfrégier’s team in France. Our algorithm, both very fast and highly accurate as far as contour description is concerned, is easily adaptable to any specific application.

[CV-54] Computational metaoptics for imaging

链接: https://arxiv.org/abs/2411.09133
作者: Charles Roques-Carmes,Kai Wang,Yuanmu Yang,Arka Majumdar,Zin Lin
关键词-EN: ultrathin structures composed, electromagnetic waves’ amplitude, enabling precise control, subwavelength optical elements, revolutionized light manipulation
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Metasurfaces – ultrathin structures composed of subwavelength optical elements – have revolutionized light manipulation by enabling precise control over electromagnetic waves’ amplitude, phase, polarization, and spectral properties. Concurrently, computational imaging leverages algorithms to reconstruct images from optically processed signals, overcoming limitations of traditional imaging systems. This review explores the synergistic integration of metaoptics and computational imaging, “computational metaoptics,” which combines the physical wavefront shaping ability of metasurfaces with advanced computational algorithms to enhance imaging performance beyond conventional limits. We discuss how computational metaoptics addresses the inherent limitations of single-layer metasurfaces in achieving multifunctionality without compromising efficiency. By treating metasurfaces as physical preconditioners and co-designing them with reconstruction algorithms through end-to-end (inverse) design, it is possible to jointly optimize the optical hardware and computational software. This holistic approach allows for the automatic discovery of optimal metasurface designs and reconstruction methods that significantly improve imaging capabilities. Advanced applications enabled by computational metaoptics are highlighted, including phase imaging and quantum state measurement, which benefit from the metasurfaces’ ability to manipulate complex light fields and the computational algorithms’ capacity to reconstruct high-dimensional information. We also examine performance evaluation challenges, emphasizing the need for new metrics that account for the combined optical and computational nature of these systems. Finally, we identify new frontiers in computational metaoptics which point toward a future where computational metaoptics may play a central role in advancing imaging science and technology.

[CV-55] Clustered Patch Embeddings for Permutation-Invariant Classification of Whole Slide Images

链接: https://arxiv.org/abs/2411.08936
作者: Ravi Kant Gupta,Shounak Das,Amit Sethi
关键词-EN: offering detailed insights, detailed insights critical, Slide Imaging, digital pathology, offering detailed
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2411.08530

点击查看摘要

Abstract:Whole Slide Imaging (WSI) is a cornerstone of digital pathology, offering detailed insights critical for diagnosis and research. Yet, the gigapixel size of WSIs imposes significant computational challenges, limiting their practical utility. Our novel approach addresses these challenges by leveraging various encoders for intelligent data reduction and employing a different classification model to ensure robust, permutation-invariant representations of WSIs. A key innovation of our method is the ability to distill the complex information of an entire WSI into a single vector, effectively capturing the essential features needed for accurate analysis. This approach significantly enhances the computational efficiency of WSI analysis, enabling more accurate pathological assessments without the need for extensive computational resources. This breakthrough equips us with the capability to effectively address the challenges posed by large image resolutions in whole-slide imaging, paving the way for more scalable and effective utilization of WSIs in medical diagnostics and research, marking a significant advancement in the field.

[CV-56] DG-PPU: Dynamical Graphs based Post-processing of Point Clouds extracted from Knee Ultrasounds

链接: https://arxiv.org/abs/2411.08926
作者: Injune Hwang,Karthik Saravanan,Caterina V Coralli,S Jack Tu,Sthephen J Mellon
关键词-EN: Patients undergoing total, experience non-specific anterior, Patients undergoing, abnormal patellofemoral joint, total knee arthroplasty
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper was submitted to the IEEE International Symposium on Biomedical Imaging (ISBI). This is a preprint version and may be subject to copyright

点击查看摘要

Abstract:Patients undergoing total knee arthroplasty (TKA) often experience non-specific anterior knee pain, arising from abnormal patellofemoral joint (PFJ) instability. Tracking PFJ motion is challenging since static imaging modalities like CT and MRI are limited by field of view and metal artefact interference. Ultrasounds offer an alternative modality for dynamic musculoskeletal imaging. We aim to achieve accurate visualisation of patellar tracking and PFJ motion, using 3D registration of point clouds extracted from ultrasound scans across different angles of joint flexion. Ultrasound images containing soft tissue are often mislabeled as bone during segmentation, resulting in noisy 3D point clouds that hinder accurate registration of the bony joint anatomy. Machine learning the intrinsic geometry of the knee bone may help us eliminate these false positives. As the intrinsic geometry of the knee does not change during PFJ motion, one may expect this to be robust across multiple angles of joint flexion. Our dynamical graphs-based post-processing algorithm (DG-PPU) is able to achieve this, creating smoother point clouds that accurately represent bony knee anatomy across different angles. After inverting these point clouds back to their original ultrasound images, we found that DG-PPU outperformed manual data cleaning done by our lab technician, deleting false positives and noise with 98.2% precision across three different angles of joint flexion. DG-PPU is the first algorithm to automate the post-processing of 3D point clouds extracted from ultrasound scans. With DG-PPU, we contribute towards the development of a novel patellar mal-tracking assessment system with ultrasound, which currently does not exist.

机器学习

[LG-0] How do Machine Learning Models Change?

链接: https://arxiv.org/abs/2411.09645
作者: Joel Castaño,Rafael Cabañas,Antonio Salmerón,David Lo,Silverio Martínez-Fernández
关键词-EN: transformed Artificial Intelligence, Artificial Intelligence research, Machine Learning, Artificial Intelligence, proliferation of Machine
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The proliferation of Machine Learning (ML) models and their open-source implementations has transformed Artificial Intelligence research and applications. Platforms like Hugging Face (HF) enable the development, sharing, and deployment of these models, fostering an evolving ecosystem. While previous studies have examined aspects of models hosted on platforms like HF, a comprehensive longitudinal study of how these models change remains underexplored. This study addresses this gap by utilizing both repository mining and longitudinal analysis methods to examine over 200,000 commits and 1,200 releases from over 50,000 models on HF. We replicate and extend an ML change taxonomy for classifying commits and utilize Bayesian networks to uncover patterns in commit and release activities over time. Our findings indicate that commit activities align with established data science methodologies, such as CRISP-DM, emphasizing iterative refinement and continuous improvement. Additionally, release patterns tend to consolidate significant updates, particularly in documentation, distinguishing between granular changes and milestone-based releases. Furthermore, projects with higher popularity prioritize infrastructure enhancements early in their lifecycle, and those with intensive collaboration practices exhibit improved documentation standards. These and other insights enhance the understanding of model changes on community platforms and provide valuable guidance for best practices in model maintenance.

[LG-1] MCCE: Missingness-aware Causal Concept Explainer

链接: https://arxiv.org/abs/2411.09639
作者: Jifan Gao,Guanhua Chen
关键词-EN: Causal concept effect, gaining increasing interest, Causal concept, interpretable machine learning, machine learning models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal concept effect estimation is gaining increasing interest in the field of interpretable machine learning. This general approach explains the behaviors of machine learning models by estimating the causal effect of human-understandable concepts, which represent high-level knowledge more comprehensibly than raw inputs like tokens. However, existing causal concept effect explanation methods assume complete observation of all concepts involved within the dataset, which can fail in practice due to incomplete annotations or missing concept data. We theoretically demonstrate that unobserved concepts can bias the estimation of the causal effects of observed concepts. To address this limitation, we introduce the Missingness-aware Causal Concept Explainer (MCCE), a novel framework specifically designed to estimate causal concept effects when not all concepts are observable. Our framework learns to account for residual bias resulting from missing concepts and utilizes a linear predictor to model the relationships between these concepts and the outputs of black-box machine learning models. It can offer explanations on both local and global levels. We conduct validations using a real-world dataset, demonstrating that MCCE achieves promising performance compared to state-of-the-art explanation methods in causal concept effect estimation.
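The omitted-concept bias that motivates MCCE can be reproduced with a toy linear surrogate. The sketch below is illustrative only and is not the MCCE algorithm: the two concepts, the synthetic "black-box" output, and all coefficients are invented for the example.

```python
import numpy as np

# Toy illustration of why unobserved concepts bias concept-effect estimates:
# fit a linear surrogate from concepts to a (synthetic) black-box output,
# once with both concepts observed and once with one concept missing.
rng = np.random.default_rng(1)
n = 5000
c1 = rng.normal(size=n)                     # observed concept
c2 = 0.8 * c1 + 0.6 * rng.normal(size=n)    # correlated, possibly unobserved concept
output = 2.0 * c1 + 1.5 * c2 + 0.1 * rng.normal(size=n)  # stand-in model output

def linear_effects(X, y):
    """Least-squares coefficients (excluding the intercept) of y on the columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0][1:]

full = linear_effects(np.column_stack([c1, c2]), output)  # recovers ~[2.0, 1.5]
partial = linear_effects(c1[:, None], output)             # inflated: ~2.0 + 1.5 * 0.8
```

With both concepts observed, the surrogate recovers the true coefficients; dropping `c2` shifts the estimated effect of `c1` toward 3.2, which is exactly the omitted-variable bias the paper sets out to account for.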

[LG-2] Local deployment of large-scale music AI models on commodity hardware

链接: https://arxiv.org/abs/2411.09625
作者: Xun Zhou,Charlie Ruan,Zihe Zhao,Tianqi Chen,Chris Donahue
关键词-EN: generating symbolic music, Machine Learning Compilation, Anticipatory Music Transformer, Lakh MIDI dataset, present the MIDInfinite
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 2 pages

点击查看摘要

Abstract:We present the MIDInfinite, a web application capable of generating symbolic music using a large-scale generative AI model locally on commodity hardware. Creating this demo involved porting the Anticipatory Music Transformer, a large language model (LLM) pre-trained on the Lakh MIDI dataset, to the Machine Learning Compilation (MLC) framework. Once the model is ported, MLC facilitates inference on a variety of runtimes including C++, mobile, and the browser. We envision that MLC has the potential to bridge the gap between the landscape of increasingly capable music AI models and technology more familiar to music software developers. As a proof of concept, we build a web application that allows users to generate endless streams of multi-instrumental MIDI in the browser, either from scratch or conditioned on a prompt. On commodity hardware (an M3 Macbook Pro), our demo can generate 51 notes per second, which is faster than real-time playback for 72.9% of generations, and increases to 86.3% with 2 seconds of upfront buffering.

[LG-3] Latency Optimization in LEO Satellite Communications with Hybrid Beam Pattern and Interference Control

链接: https://arxiv.org/abs/2411.09600
作者: Qianqian Zhang,Ye Hu,Minchae Jung
关键词-EN: low Earth orbit, enhanced global connectivity, significantly enhanced global, low-latency services crucial, Earth orbit
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid advancement of low Earth orbit (LEO) satellite communication systems has significantly enhanced global connectivity, offering high-capacity, low-latency services crucial for next-generation applications. However, the dense configuration of LEO constellations poses challenges in resource allocation optimization and interference management, complicating coexistence with other communication systems. To address these limitations, this paper proposes a novel framework for optimizing the beam scheduling and resource allocation in multi-beam LEO systems. To satisfy the uneven terrestrial traffic demand, a hybrid beam pattern is employed to enhance the downlink quality of service and minimize the transmission latency from LEO satellites to ground user terminals. Additionally, a dynamic co-channel interference (CCI) control mechanism is developed to mitigate inter-beam interference within the LEO constellation and limit cross-system interference affecting protected users from other networks. The problem of user-beam-frequency allocation with power optimization is formulated as a mixed-integer dynamic programming model and solved using a low-complexity neural network-based graph generation algorithm. Simulation results show that the proposed approach outperforms the baseline methods of full frequency reuse and single-channel transmission, and highlights the potential for further performance improvement with multi-user transmissions.

[LG-4] Expert Study on Interpretable Machine Learning Models with Missing Data ALT ML4H

链接: https://arxiv.org/abs/2411.09591
作者: Lena Stempfle,Arthur James,Julie Josse,Tobias Gauss,Fredrik D. Johansson
关键词-EN: Inherently interpretable machine, Inherently interpretable, interpretable machine learning, decision-making but face, face challenges
类目: Machine Learning (cs.LG)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 13 pages

点击查看摘要

Abstract:Inherently interpretable machine learning (IML) models provide valuable insights for clinical decision-making but face challenges when features have missing values. Classical solutions like imputation or excluding incomplete records are often unsuitable in applications where values are missing at test time. In this work, we conducted a survey with 71 clinicians from 29 trauma centers across France, including 20 complete responses to study the interaction between medical professionals and IML applied to data with missing values. This provided valuable insights into how missing data is interpreted in clinical machine learning. We used the prediction of hemorrhagic shock as a concrete example to gauge the willingness and readiness of the participants to adopt IML models from three classes of methods. Our findings show that, while clinicians value interpretability and are familiar with common IML methods, classical imputation techniques often misalign with their intuition, and that models that natively handle missing values are preferred. These results emphasize the need to integrate clinical intuition into future IML models for better human-computer interaction.

[LG-5] Randomized Truthful Auctions with Learning Agents

链接: https://arxiv.org/abs/2411.09517
作者: Gagan Aggarwal,Anupam Gupta,Andres Perlroth,Grigoris Velegkas
关键词-EN: no-regret bidding algorithms, no-regret learning algorithms, auctions, participate in repeated, revenue
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
*备注:

点击查看摘要

Abstract:We study a setting where agents use no-regret learning algorithms to participate in repeated auctions. Kolumbus and Nisan (2022) showed, rather surprisingly, that when bidders participate in second-price auctions using no-regret bidding algorithms, no matter how large the number of interactions T is, the runner-up bidder may not converge to bidding truthfully. Our first result shows that this holds for general deterministic truthful auctions. We also show that the ratio of the learning rates of the bidders can qualitatively affect the convergence of the bidders. Next, we consider the problem of revenue maximization in this environment. In the setting with fully rational bidders, Myerson (1981) showed that revenue can be maximized by using a second-price auction with reserves. We show that, in stark contrast, in our setting with learning bidders, randomized auctions can have strictly better revenue guarantees than second-price auctions with reserves, when T is large enough. Finally, we study revenue maximization in the non-asymptotic regime. We define a notion of auctioneer regret comparing the revenue generated to the revenue of a second-price auction with truthful bids. When the auctioneer has to use the same auction throughout the interaction, we show an (almost) tight regret bound of $\widetilde{\Theta}(T^{3/4})$. If the auctioneer can change auctions during the interaction, but in a way that is oblivious to the bids, we show an (almost) tight bound of $\widetilde{\Theta}(\sqrt{T})$.
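The paper's basic setting — bidders running a no-regret algorithm in repeated second-price auctions — can be simulated in a few lines. This is a generic sketch of the setup, not the paper's construction: it uses Hedge (multiplicative weights) with full-information feedback over a discrete bid grid, and the values, grid, and learning rate are arbitrary choices for the example.

```python
import numpy as np

def hedge_second_price(values, bid_grid, T=2000, eta=0.2, seed=0):
    """Repeated second-price auction where each bidder runs Hedge
    (multiplicative weights) over a discrete bid grid with full-information
    feedback; returns the auctioneer's average per-round revenue."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = len(values)
    weights = np.ones((n, len(bid_grid)))
    revenue = 0.0
    for _ in range(T):
        probs = weights / weights.sum(axis=1, keepdims=True)
        bids = np.array([rng.choice(bid_grid, p=p) for p in probs])
        order = np.argsort(bids)
        revenue += bids[order[-2]]          # winner pays the second-highest bid
        for i in range(n):                  # Hedge update against the others' bids
            top_other = np.delete(bids, i).max()
            utils = np.where(bid_grid > top_other, values[i] - top_other, 0.0)
            weights[i] *= np.exp(eta * utils)
        weights /= weights.max(axis=1, keepdims=True)  # rescale for numerical stability
    return revenue / T

avg_revenue = hedge_second_price(values=[1.0, 0.8], bid_grid=np.linspace(0.0, 1.0, 11))
```

Varying the two bidders' learning rates in a simulation like this is one way to observe the convergence effects the abstract describes.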

[LG-6] Developement of Reinforcement Learning based Optimisation Method for Side-Sill Design

链接: https://arxiv.org/abs/2411.09499
作者: Aditya Borse,Rutwik Gulakala,Marcus Stoffel
关键词-EN: vehicle development process, development process, vehicle development, Optimisation, critical part
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimisation for crashworthiness is a critical part of the vehicle development process. Due to stringent regulations and increasing market demands, multiple factors must be considered within a limited timeframe. However, for optimal crashworthiness design, multiobjective optimisation is necessary, and for complex parts, multiple design parameters must be evaluated. This crashworthiness analysis requires computationally intensive finite element simulations. This challenge leads to the need for multi-parameter, multi-objective inverse optimisation. This article investigates a machine learning-based method for this type of optimisation, focusing on the design optimisation of a multi-cell side sill to improve crashworthiness results. Furthermore, the optimiser is coupled with an FE solver to achieve improved results.

[LG-7] What makes a good BIM design: quantitative linking between design behavior and quality

链接: https://arxiv.org/abs/2411.09481
作者: Xiang-Rui Ni,Peng Pan,Jia-Rui Lin
关键词-EN: Architecture Engineering Construction, Building Information Modeling, Engineering Construction, quality remains unclear, Architecture Engineering
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the Architecture Engineering Construction (AEC) industry, how design behaviors impact design quality remains unclear. This study proposes a novel approach, which, for the first time, identifies and quantitatively describes the relationship between design behaviors and quality of design based on Building Information Modeling (BIM). Real-time collection and log mining are integrated to collect raw data of design behaviors. Feature engineering and various machine learning models are then utilized for quantitative modeling and interpretation. Results confirm an existing quantifiable relationship which can be learned by various models. The best-performing model using Extremely Random Trees achieved an R2 value of 0.88 on the test set. Behavioral features related to designer’s skill level and changes of design intentions are identified to have significant impacts on design quality. These findings deepen our understanding of the design process and help forming BIM designs with better quality.

[LG-8] Harnessing Machine Learning for Single-Shot Measurement of Free Electron Laser Pulse Power NEURIPS2024 NEURIPS

链接: https://arxiv.org/abs/2411.09468
作者: Till Korten(1),Vladimir Rybnikov(2),Mathias Vogt(2),Juliane Roensch-Schulenburg(2),Peter Steinbach(1),Najmeh Mirian(1) ((1) Helmholtz-Zentrum Dresden-Rossendorf HZDR, (2) Deutsches Elektronen-Synchrotron DESY)
关键词-EN: Electron beam accelerators, Electron beam, technological fields, accelerators are essential, scientific and technological
类目: Machine Learning (cs.LG); Accelerator Physics (physics.acc-ph)
*备注: 10 pages, 4 figures, Machine Learning and the Physical Sciences Workshop, NeurIPS 2024 this https URL

点击查看摘要

Abstract:Electron beam accelerators are essential in many scientific and technological fields. Their operation relies heavily on the stability and precision of the electron beam. Traditional diagnostic techniques encounter difficulties in addressing the complex and dynamic nature of electron beams. Particularly in the context of free-electron lasers (FELs), it is fundamentally impossible to measure the lasing-on and lasing-off electron power profiles for a single electron bunch. This is a crucial hurdle in the exact reconstruction of the photon pulse profile. To overcome this hurdle, we developed a machine learning model that predicts the temporal power profile of the electron bunch in the lasing-off regime using machine parameters that can be obtained when lasing is on. The model was statistically validated and showed superior predictions compared to the state-of-the-art batch calibrations. The work we present here is a critical element for a virtual pulse reconstruction diagnostic (VPRD) tool designed to reconstruct the power profile of individual photon pulses without requiring repeated measurements in the lasing-off regime. This promises to significantly enhance the diagnostic capabilities in FELs at large.

[LG-9] Caravan MultiMet: Extending Caravan with Multiple Weather Nowcasts and Forecasts

链接: https://arxiv.org/abs/2411.09459
作者: Guy Shalev,Frederik Kratzert
关键词-EN: harmonize streamflow data, https URL, combined with globally, catchment attributes, ECMWF IFS HRES
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Caravan large-sample hydrology dataset (Kratzert et al., 2023) was created to standardize and harmonize streamflow data from various regional datasets, combined with globally available meteorological forcing and catchment attributes. This community-driven project also allows researchers to conveniently extend the dataset for additional basins, as done 6 times to date (see this https URL). We present a novel extension to Caravan, focusing on enriching the meteorological forcing data. Our extension adds three precipitation nowcast products (CPC, IMERG v07 Early, and CHIRPS) and three weather forecast products (ECMWF IFS HRES, GraphCast, and CHIRPS-GEFS) to the existing ERA5-Land reanalysis data. The inclusion of diverse data sources, particularly weather forecasts, enables more robust evaluation and benchmarking of hydrological models, especially for real-time forecasting scenarios. To the best of our knowledge, this extension makes Caravan the first large-sample hydrology dataset to incorporate weather forecast data, significantly enhancing its capabilities and fostering advancements in hydrological research, benchmarking, and real-time hydrologic forecasting. The data is publicly available under a CC-BY-4.0 license on Zenodo in two parts (this https URL, this https URL) and on Google Cloud Platform (GCP) - see more under the Data Availability chapter.

[LG-10] Learning efficient and provably convergent splitting methods

链接: https://arxiv.org/abs/2411.09444
作者: L. M. Kreusser,H. E. Lockyer,E. H. Müller,P. Singh
关键词-EN: simplify complicated evolutions, initial value problems, methods, solving initial, ability to simplify
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Splitting methods are widely used for solving initial value problems (IVPs) due to their ability to simplify complicated evolutions into more manageable subproblems which can be solved efficiently and accurately. Traditionally, these methods are derived using analytic and algebraic techniques from numerical analysis, including truncated Taylor series and their Lie algebraic analogue, the Baker–Campbell–Hausdorff formula. These tools enable the development of high-order numerical methods that provide exceptional accuracy for small timesteps. Moreover, these methods often (nearly) conserve important physical invariants, such as mass, unitarity, and energy. However, in many practical applications the computational resources are limited. Thus, it is crucial to identify methods that achieve the best accuracy within a fixed computational budget, which might require taking relatively large timesteps. In this regime, high-order methods derived with traditional methods often exhibit large errors since they are only designed to be asymptotically optimal. Machine Learning techniques offer a potential solution since they can be trained to efficiently solve a given IVP with less computational resources. However, they are often purely data-driven, come with limited convergence guarantees in the small-timestep regime and do not necessarily conserve physical invariants. In this work, we propose a framework for finding machine learned splitting methods that are computationally efficient for large timesteps and have provable convergence and conservation guarantees in the small-timestep limit. We demonstrate numerically that the learned methods, which by construction converge quadratically in the timestep size, can be significantly more efficient than established methods for the Schrödinger equation if the computational budget is limited.
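Before any learning enters the picture, the splitting idea itself fits in a few lines: the classical Strang composition for u'(t) = (A + B)u(t), verified to converge at second order in the timestep. This is a generic textbook sketch, not the learned methods of the paper; the matrices are arbitrary non-commuting examples, and `expm` is a small eigendecomposition-based helper rather than a library call.

```python
import numpy as np

def expm(M):
    """Matrix exponential via eigendecomposition (adequate for the small
    diagonalizable matrices used in this sketch)."""
    w, V = np.linalg.eig(M)
    return (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real

# Strang splitting for u'(t) = (A + B) u(t): one step of size h is
#   u <- exp(h/2 * A) exp(h * B) exp(h/2 * A) u,   a second-order method.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # rotation part
B = np.array([[-0.5, 0.0], [0.0, -0.1]])  # damping part (A and B do not commute)
u0 = np.array([1.0, 0.0])
t_end = 1.0

def strang(n_steps):
    h = t_end / n_steps
    half_A, full_B = expm(0.5 * h * A), expm(h * B)
    u = u0.copy()
    for _ in range(n_steps):
        u = half_A @ full_B @ half_A @ u
    return u

exact = expm(t_end * (A + B)) @ u0
err_coarse = np.linalg.norm(strang(20) - exact)
err_fine = np.linalg.norm(strang(40) - exact)
observed_order = np.log2(err_coarse / err_fine)   # should be close to 2
```

Halving the timestep roughly quarters the error, which is the quadratic convergence the paper's learned methods are constructed to retain in the small-timestep limit.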

[LG-11] Inherently Interpretable and Uncertainty-Aware Models for Online Learning in Cyber-Security Problems

链接: https://arxiv.org/abs/2411.09393
作者: Benjamin Kolicic,Alberto Caron,Chris Hicks,Vasilios Mavroudis
关键词-EN: uncertainty-aware machine learning, Additive Gaussian Processes, machine learning models, high-risk industries, address the critical
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we address the critical need for interpretable and uncertainty-aware machine learning models in the context of online learning for high-risk industries, particularly cyber-security. While deep learning and other complex models have demonstrated impressive predictive capabilities, their opacity and lack of uncertainty quantification present significant questions about their trustworthiness. We propose a novel pipeline for online supervised learning problems in cyber-security, that harnesses the inherent interpretability and uncertainty awareness of Additive Gaussian Processes (AGPs) models. Our approach aims to balance predictive performance with transparency while improving the scalability of AGPs, which represents their main drawback, potentially enabling security analysts to better validate threat detection, troubleshoot and reduce false positives, and generally make trustworthy, informed decisions. This work contributes to the growing field of interpretable AI by proposing a class of models that can be significantly beneficial for high-stake decision problems such as the ones typical of the cyber-security domain. The source code is available.

[LG-12] A survey of probabilistic generative frameworks for molecular simulations

链接: https://arxiv.org/abs/2411.09388
作者: Richard John,Lukas Herron,Pratyush Tiwary
关键词-EN: Neural Spline Flows, Conditional Flow Matching, Generative artificial intelligence, Denoising Diffusion Probabilistic, Diffusion Probabilistic Models
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Soft Condensed Matter (cond-mat.soft); Statistical Mechanics (cond-mat.stat-mech)
*备注:

点击查看摘要

Abstract:Generative artificial intelligence is now a widely used tool in molecular science. Despite the popularity of probabilistic generative models, numerical experiments benchmarking their performance on molecular data are lacking. In this work, we introduce and explain several classes of generative models, broadly sorted into two categories: flow-based models and diffusion models. We select three representative models: Neural Spline Flows, Conditional Flow Matching, and Denoising Diffusion Probabilistic Models, and examine their accuracy, computational cost, and generation speed across datasets with tunable dimensionality, complexity, and modal asymmetry. Our findings are varied, with no one framework being the best for all purposes. In a nutshell, (i) Neural Spline Flows do best at capturing mode asymmetry present in low-dimensional data, (ii) Conditional Flow Matching outperforms other models for high-dimensional data with low complexity, and (iii) Denoising Diffusion Probabilistic Models appear the best for low-dimensional data with high complexity. Our datasets include a Gaussian mixture model and the dihedral torsion angle distribution of the Aib9 peptide, generated via a molecular dynamics simulation. We hope our taxonomy of probabilistic generative frameworks and numerical results may guide model selection for a wide range of molecular tasks.

[LG-13] Stability and Generalization for Distributed SGDA

链接: https://arxiv.org/abs/2411.09365
作者: Miaoxi Zhu,Yan Sun,Li Shen,Bo Du,Dacheng Tao
关键词-EN: machine learning applications, gaining increasing attention, modern machine learning, Local Decentralized SGDA, Gradient Descent Ascent
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Minimax optimization is gaining increasing attention in modern machine learning applications. Driven by large-scale models and massive volumes of data collected from edge devices, as well as the concern to preserve client privacy, communication-efficient distributed minimax optimization algorithms become popular, such as Local Stochastic Gradient Descent Ascent (Local-SGDA), and Local Decentralized SGDA (Local-DSGDA). While most existing research on distributed minimax algorithms focuses on convergence rates, computation complexity, and communication efficiency, the generalization performance remains underdeveloped, whereas generalization ability is a pivotal indicator for evaluating the holistic performance of a model when fed with unknown data. In this paper, we propose the stability-based generalization analytical framework for Distributed-SGDA, which unifies two popular distributed minimax algorithms including Local-SGDA and Local-DSGDA, and conduct a comprehensive analysis of stability error, generalization gap, and population risk across different metrics under various settings, e.g., (S)C-(S)C, PL-SC, and NC-NC cases. Our theoretical results reveal the trade-off between the generalization gap and optimization error and suggest hyperparameters choice to obtain the optimal population risk. Numerical experiments for Local-SGDA and Local-DSGDA validate the theoretical results.
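A bare-bones version of the Local-SGDA scheme the paper analyzes — K local stochastic gradient descent-ascent steps per client followed by server averaging — can be sketched on a toy strongly-convex-strongly-concave objective. The per-client objective, heterogeneity values, and step sizes below are invented for illustration and are not from the paper.

```python
import numpy as np

# Local-SGDA sketch on f_m(x, y) = 0.5*(x - a_m)**2 + x*y - 0.5*y**2:
# each of M clients runs K local stochastic gradient descent-ascent steps,
# then the server averages the iterates (one communication round).
rng = np.random.default_rng(0)
a = np.array([1.0, -1.0, 2.0, 0.0])           # per-client heterogeneity
M, K, rounds, lr, noise = len(a), 5, 200, 0.05, 0.1
x, y = np.zeros(M), np.zeros(M)
for _ in range(rounds):
    for _ in range(K):
        gx = (x - a) + y + noise * rng.standard_normal(M)  # stochastic grad in x
        gy = x - y + noise * rng.standard_normal(M)        # stochastic grad in y
        x, y = x - lr * gx, y + lr * gy                    # descent in x, ascent in y
    x[:], y[:] = x.mean(), y.mean()                        # server averaging

# Saddle point of the averaged objective: y* = x* and x* = mean(a) / 2 = 0.25.
x_hat, y_hat = x[0], y[0]
```

The iterates settle near the saddle point of the averaged objective; the stability-based analysis in the paper concerns how well such iterates generalize from the sampled gradients to the population objective.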

[LG-14] Approximated Variational Bayesian Inverse Reinforcement Learning for Large Language Model Alignment

链接: https://arxiv.org/abs/2411.09341
作者: Yuang Cai,Yuyu Yuan,Jinsheng Shi,Qinhong Lin
关键词-EN: large language models, LLM alignment, reward, language models, harmless content
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The alignment of large language models (LLMs) is crucial for generating helpful and harmless content. Existing approaches leverage preference-based human feedback data to learn the reward function and align the LLM with the feedback data. However, these approaches focus on modeling the reward difference between the chosen and rejected demonstrations, rather than directly modeling the true reward from each demonstration. Moreover, these approaches assume that the reward is only obtained at the end of the sentence, which overlooks the modeling of intermediate rewards. These issues lead to insufficient use of training signals in the feedback data, limiting the representation and generalization ability of the reward and potentially resulting in reward hacking. In this paper, we formulate LLM alignment as a Bayesian Inverse Reinforcement Learning (BIRL) problem and propose a novel training objective, Approximated Variational Alignment (AVA), to perform LLM alignment through Approximated Variational Reward Imitation Learning (AVRIL). The BIRL formulation facilitates intermediate reward modeling and direct reward modeling on each single demonstration, which enhances the utilization of training signals in the feedback data. Experiments show that AVA outperforms existing LLM alignment approaches in reward modeling, RL fine-tuning, and direct optimization.

[LG-15] Improving hp-Variational Physics-Informed Neural Networks for Steady-State Convection-Dominated Problems

链接: https://arxiv.org/abs/2411.09329
作者: Thivin Anandh,Divij Ghose,Himanshu Jain,Pratham Sunkad,Sashikumaar Ganesan,Volker John
关键词-EN: applying hp-variational physics-informed, hp-variational physics-informed neural, physics-informed neural networks, FastVPINNs framework, paper proposes
类目: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 25 pages, 11 figures, 8 tables

点击查看摘要

Abstract:This paper proposes and studies two extensions of applying hp-variational physics-informed neural networks, more precisely the FastVPINNs framework, to convection-dominated convection-diffusion-reaction problems. First, a term in the spirit of a SUPG stabilization is included in the loss functional and a network architecture is proposed that predicts spatially varying stabilization parameters. Having observed that the selection of the indicator function in hard-constrained Dirichlet boundary conditions has a big impact on the accuracy of the computed solutions, the second novelty is the proposal of a network architecture that learns good parameters for a class of indicator functions. Numerical studies show that both proposals lead to noticeably more accurate results than approaches that can be found in the literature.

[LG-16] Pie: Pooling CPU Memory for LLM Inference

链接: https://arxiv.org/abs/2411.09317
作者: Yi Xu,Ziming Mao,Xiangxi Mo,Shu Liu,Ion Stoica
关键词-EN: revolutionized natural language, natural language processing, demands present significant, present significant challenges, memory demands present
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory swapping often results in higher latency and lower throughput. This paper introduces Pie, an LLM inference framework that addresses these challenges with performance-transparent swapping and adaptive expansion. By leveraging predictable memory access patterns and the high bandwidth of modern hardware like the NVIDIA GH200 Grace Hopper Superchip, Pie enables concurrent data swapping without affecting foreground computation, expanding effective memory without added latency. Adaptive expansion dynamically adjusts CPU memory allocation based on real-time information, optimizing memory usage and performance under varying conditions. Pie maintains low computation latency, high throughput, and high elasticity. Our experimental evaluation demonstrates that Pie achieves optimal swapping policy during cache warmup and effectively balances increased memory capacity with negligible impact on computation. With its extended capacity, Pie outperforms vLLM by up to 1.9X in throughput and 2X in latency. Additionally, Pie can reduce GPU memory usage by up to 1.67X while maintaining the same performance. Compared to FlexGen, an offline profiling-based swapping solution, Pie achieves magnitudes lower latency and 9.4X higher throughput.

[LG-17] Approximate Probabilistic Inference for Time-Series Data: A Robust Latent Gaussian Model With Temporal Awareness

链接: https://arxiv.org/abs/2411.09312
作者: Anton Johansson,Arunselvan Ramaswamy
关键词-EN: Deep Latent Gaussian, Latent Gaussian Model, Time Deep Latent, highly varied non-stationary, Latent Gaussian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The development of robust generative models for highly varied non-stationary time series data is a complex yet important problem. Traditional models for time series data prediction, such as Long Short-Term Memory (LSTM), are inefficient and generalize poorly as they cannot capture complex temporal relationships. In this paper, we present a probabilistic generative model that can be trained to capture temporal information and that is robust to data errors. We call it the Time Deep Latent Gaussian Model (tDLGM). Its novel architecture is inspired by the Deep Latent Gaussian Model (DLGM). Our model is trained to minimize a loss function based on the negative log loss. One factor contributing to tDLGM's robustness is our regularizer, which accounts for data trends. Experiments show that tDLGM is able to reconstruct and generate complex time series data, and that it is robust to noise and faulty data.

[LG-18] Compression Method for Solar Polarization Spectra Collected from Hinode SOT/SP Observations

链接: https://arxiv.org/abs/2411.09311
作者: Jargalmaa Batmunkh,Yusuke Iida,Takayoshi Oba,Haruhisa Iijima
关键词-EN: present significant processing, significant processing challenges, surge in volume, present significant, processing challenges
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR)
*备注:

点击查看摘要

Abstract:The complex structure and extensive details of solar spectral data, combined with a recent surge in volume, present significant processing challenges. To address this, we propose a deep learning-based compression technique using deep autoencoder (DAE) and 1D-convolutional autoencoder (CAE) models developed with Hinode SOT/SP data. We focused on compressing Stokes I and V polarization spectra from the quiet Sun, as well as from active regions, providing a novel insight into comprehensive spectral analysis by incorporating spectra from extreme magnetic fields. The results indicate that the CAE model outperforms the DAE model in reconstructing Stokes profiles, demonstrating greater robustness and achieving reconstruction errors around the observational noise level. The proposed method has proven effective in compressing Stokes I and V spectra from both the quiet Sun and active regions, highlighting its potential for impactful applications in solar spectral analysis, such as detection of unusual spectral signals.
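The paper's DAE and CAE models are deep networks, but the core compress-and-reconstruct idea can be illustrated with the linear-autoencoder optimum, which is exactly PCA via SVD. The array shapes, component count, and random data below are made-up stand-ins for real Stokes I/V spectra, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for Stokes I/V polarization spectra: 200 profiles,
# 112 wavelength samples each (shapes are illustrative only).
spectra = rng.normal(size=(200, 112))

# A linear autoencoder's optimum is PCA: encode each profile into k latent
# coefficients, decode by projecting back onto the principal components.
k = 16
mean = spectra.mean(axis=0)
centered = spectra - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
basis = vt[:k]                       # (k, 112): shared encoder/decoder weights

codes = centered @ basis.T           # compression: 112 values -> k values
recon = codes @ basis + mean         # reconstruction from the compressed codes
err = np.mean((spectra - recon) ** 2)
```

A trained convolutional autoencoder replaces the single linear map with stacked nonlinear encoder/decoder layers, which is what lets it reach reconstruction errors near the observational noise level on real spectra.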

[LG-19] A Centralized-Distributed Transfer Model for Cross-Domain Recommendation Based on Multi-Source Heterogeneous Transfer Learning ICDM

链接: https://arxiv.org/abs/2411.09286
作者: Ke Xu,Ziliang Wang,Wei Zheng,Yuhao Ma,Chenglin Wang,Nengxue Jiang,Cai Cao
关键词-EN: Cross-domain recommendation, Existing CDR methods, CDR methods directly, click through rate, proposed to tackle
类目: Machine Learning (cs.LG)
*备注: Published in: 2022 IEEE International Conference on Data Mining (ICDM) (The authors were affiliated Hangzhou NetEase Cloud Music Technology Co., Ltd.)

点击查看摘要

Abstract:Cross-domain recommendation (CDR) methods are proposed to tackle the sparsity problem in click through rate (CTR) estimation. Existing CDR methods directly transfer knowledge from the source domains to the target domain and ignore the heterogeneities among domains, including feature dimensional heterogeneity and latent space heterogeneity, which may lead to negative transfer. Besides, most of the existing methods are based on single-source transfer, which cannot simultaneously utilize knowledge from multiple source domains to further improve the model performance in the target domain. In this paper, we propose a centralized-distributed transfer model (CDTM) for CDR based on multi-source heterogeneous transfer learning. To address the issue of feature dimension heterogeneity, we build a dual embedding structure: domain specific embedding (DSE) and global shared embedding (GSE) to model the feature representation in the single domain and the commonalities in the global space, separately. To solve the latent space heterogeneity, the transfer matrix and attention mechanism are used to map and combine DSE and GSE adaptively. Extensive offline and online experiments demonstrate the effectiveness of our model.

[LG-20] Towards efficient compression and communication for prototype-based decentralized learning

链接: https://arxiv.org/abs/2411.09267
作者: Pablo Fernández-Piñeiro,Manuel Fernández-Veiga,Rebeca P. Díaz-Redondo,Ana Fernández-Vilas,Martín González-Soto
关键词-EN: prototype-based federated learning, master server, aggregation server, prototype-based federated, exchange of model
类目: Machine Learning (cs.LG)
*备注: 15 pages, 2 tables, 7 figures, 6 algorithms

点击查看摘要

Abstract:In prototype-based federated learning, the exchange of model parameters between clients and the master server is replaced by transmission of prototypes or quantized versions of the data samples to the aggregation server. A fully decentralized deployment of prototype-based learning, without a central aggregator of prototypes, is more robust upon network failures and reacts faster to changes in the statistical distribution of the data, suggesting potential advantages and quick adaptation in dynamic learning tasks, e.g., when the data sources are IoT devices or when data is non-iid. In this paper, we consider the problem of designing a communication-efficient decentralized learning system based on prototypes. We address the challenge of prototype redundancy by leveraging a twofold data compression technique: sending update messages only if the prototypes are information-theoretically useful (via the Jensen-Shannon distance), and using clustering on the prototypes to compress the update messages used in the gossip protocol. We also use parallel instead of sequential gossiping, and present an analysis of its age-of-information (AoI). Our experimental results show that, with these improvements, the communication load can be substantially reduced without decreasing the convergence rate of the learning algorithm.
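The information-theoretic gate on updates can be sketched as follows: a client compares its current prototype (treated here as a histogram) against the last transmitted one and gossips only when the Jensen-Shannon distance exceeds a threshold. The prototypes and the 0.1 threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance (square root of the JS divergence, base 2)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def should_send(old_proto, new_proto, threshold=0.1):
    """Gossip an update only when the prototype moved enough to be informative."""
    return js_distance(old_proto, new_proto) > threshold

old  = [0.50, 0.30, 0.20]   # previously transmitted prototype (as a histogram)
near = [0.49, 0.31, 0.20]   # tiny drift: update suppressed
far  = [0.10, 0.10, 0.80]   # large drift: update transmitted
```

Suppressing low-information updates in this way is what reduces the gossip traffic without touching the learning algorithm itself.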

[LG-21] FluidML: Fast and Memory Efficient Inference Optimization

链接: https://arxiv.org/abs/2411.09242
作者: Jinjie Liu,Hang Qiu
关键词-EN: Machine learning models, enabled numerous exciting, Machine learning, learning models deployed, exciting new applications
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices are not catching up with the ever-growing number of parameters in these models. As the models become bigger and more complicated, the novel yet sophisticated structure challenges the inference runtime optimization. We present FluidML, a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference. Evaluations across different platforms show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models and reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches. FluidML comprises ~30K lines of code, is built for general-purpose usage, and will be released as an open-source inference runtime optimization framework to the community.

[LG-22] Rethinking the “Heatmap + Monte Carlo Tree Search” Paradigm for Solving Large Scale TSP

链接: https://arxiv.org/abs/2411.09238
作者: Xuanhao Pan,Chenguang Wang,Chaolong Ying,Ye Xue,Tianshu Yu
关键词-EN: Travelling Salesman Problem, Salesman Problem, inspiring diverse algorithmic, Travelling Salesman, Monte Carlo Tree
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Travelling Salesman Problem (TSP) remains a fundamental challenge in combinatorial optimization, inspiring diverse algorithmic strategies. This paper revisits the “heatmap + Monte Carlo Tree Search (MCTS)” paradigm that has recently gained traction for learning-based TSP solutions. Within this framework, heatmaps encode the likelihood of edges forming part of the optimal tour, and MCTS refines this probabilistic guidance to discover optimal solutions. Contemporary approaches have predominantly emphasized the refinement of heatmap generation through sophisticated learning models, inadvertently sidelining the critical role of MCTS. Our extensive empirical analysis reveals two pivotal insights: 1) The configuration of MCTS strategies profoundly influences the solution quality, demanding meticulous tuning to leverage their full potential; 2) Our findings demonstrate that a rudimentary and parameter-free heatmap, derived from the intrinsic k -nearest nature of TSP, can rival or even surpass the performance of complicated heatmaps, with strong generalizability across various scales. Empirical evaluations across various TSP scales underscore the efficacy of our approach, achieving competitive results. These observations challenge the prevailing focus on heatmap sophistication, advocating a reevaluation of the paradigm to harness both components synergistically. Our code is available at: this https URL.
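One plausible reading of the "rudimentary and parameter-free" heatmap derived from the k-nearest nature of TSP is sketched below: edge (i, j) is marked promising iff j is among the k nearest neighbours of i, with uniform scores and symmetrisation. The exact construction in the paper's code may differ.

```python
import numpy as np

def knn_heatmap(coords, k=5):
    """Parameter-free TSP heatmap: edge (i, j) gets a uniform score iff
    j is among the k nearest neighbours of i, then symmetrised."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # a city is not its own neighbour
    nearest = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours per city
    heat = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    heat[rows, nearest.ravel()] = 1.0 / k
    return np.maximum(heat, heat.T)         # symmetrise for the undirected tour

rng = np.random.default_rng(1)
pts = rng.random((20, 2))                   # 20 random cities in the unit square
H = knn_heatmap(pts, k=5)
```

MCTS would then use H as its edge-selection prior in place of a learned heatmap; the point of the paper is that this prior alone is often competitive.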

[LG-23] Ghost-Connect Net: A Generalization-Enhanced Guidance For Sparse Deep Networks Under Distribution Shifts

链接: https://arxiv.org/abs/2411.09199
作者: Mary Isabelle Wisell,Salimeh Yasaei Sekeh
关键词-EN: Sparse deep neural, reducing computational demands, Sparse deep, deep neural networks, excel in real-world
类目: Machine Learning (cs.LG)
*备注: 21 pages, 4 figures, 3 subfigures, 42 tables

点击查看摘要

Abstract:Sparse deep neural networks (DNNs) excel in real-world applications like robotics and computer vision, by reducing computational demands that hinder usability. However, recent studies aim to boost DNN efficiency by trimming redundant neurons or filters based on task relevance, but neglect their adaptability to distribution shifts. We aim to enhance these existing techniques by introducing a companion network, Ghost Connect-Net (GC-Net), to monitor the connections in the original network with distribution generalization advantage. GC-Net’s weights represent connectivity measurements between consecutive layers of the original network. After pruning GC-Net, the pruned locations are mapped back to the original network as pruned connections, allowing for the combination of magnitude and connectivity-based pruning methods. Experimental results using common DNN benchmarks, such as CIFAR-10, Fashion MNIST, and Tiny ImageNet, show promising results for the hybrid method, which uses GC-Net guidance for the later layers of a network and direct pruning on the earlier layers. We provide theoretical foundations for GC-Net’s approach to improving generalization under distribution shifts.

[LG-24] SAFES: Sequential Privacy and Fairness Enhancing Data Synthesis for Responsible AI

链接: https://arxiv.org/abs/2411.09178
作者: Spencer Giddens,Fang Liu
关键词-EN: AI-based decision making, decision making gains, making gains widespread, gains widespread adoption, appropriately addressed
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:As data-driven and AI-based decision making gains widespread adoption in most disciplines, it is crucial that both data privacy and decision fairness are appropriately addressed. While differential privacy (DP) provides a robust framework for guaranteeing privacy and several widely accepted methods have been proposed for improving fairness, the vast majority of existing literature treats the two concerns independently. For methods that do consider privacy and fairness simultaneously, they often only apply to a specific machine learning task, limiting their generalizability. In response, we introduce SAFES, a Sequential PrivAcy and Fairness Enhancing data Synthesis procedure that sequentially combines DP data synthesis with a fairness-aware data transformation. SAFES allows full control over the privacy-fairness-utility trade-off via tunable privacy and fairness parameters. We illustrate SAFES by combining AIM, a graphical model-based DP data synthesizer, with a popular fairness-aware data pre-processing transformation. Empirical evaluations on the Adult and COMPAS datasets demonstrate that for reasonable privacy loss, SAFES-generated synthetic data achieve significantly improved fairness metrics with relatively low utility loss.

[LG-25] GRAINRec: Graph and Attention Integrated Approach for Real-Time Session-Based Item Recommendations

链接: https://arxiv.org/abs/2411.09152
作者: Bhavtosh Rath,Pushkar Chennu,David Relyea,Prathyusha Kanmanth Reddy,Amit Pande
关键词-EN: deep learning techniques, Recent advancements, demonstrated significant performance, Attention Integrated session-based, deep learning
类目: Machine Learning (cs.LG)
*备注: Accepted to the 2024 IEEE International Conference on Big Data (IEEE BigData 2024)

点击查看摘要

Abstract:Recent advancements in session-based recommendation models using deep learning techniques have demonstrated significant performance improvements. While they can enhance model sophistication and improve the relevance of recommendations, they also make it challenging to implement a scalable real-time solution. To address this challenge, we propose GRAINRec, a Graph and Attention Integrated session-based recommendation model that generates recommendations in real time. Our scope of work is item recommendations in online retail, where a session is defined as an ordered sequence of digital guest actions, such as page views or adds to cart. The proposed model considers the importance of all items in the session together, letting us predict relevant recommendations dynamically as the session evolves rather than relying on pre-computed recommendations for each item. We also propose a heuristic approach to implement real-time inferencing that meets Target platform’s service level agreement (SLA). Evaluation results of the proposed model show an average improvement of 1.5% across all offline evaluation metrics. A/B tests done over a 2 week duration showed a 10% increase in click-through rate and a 9% increase in attributable demand. Extensive ablation studies are also done to understand our model performance for different parameters.

[LG-26] Laplace Transform Interpretation of Differential Privacy

链接: https://arxiv.org/abs/2411.09142
作者: Rishav Chourasia,Uzair Javaid,Biplap Sikdar
关键词-EN: privacy loss distribution, Differential Privacy, privacy loss, loss distribution, introduce a set
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We introduce a set of useful expressions of Differential Privacy (DP) notions in terms of the Laplace transform of the privacy loss distribution. Its bare form expression appears in several related works on analyzing DP, either as an integral or an expectation. We show that recognizing the expression as a Laplace transform unlocks a new way to reason about DP properties by exploiting the duality between time and frequency domains. Leveraging our interpretation, we connect the (q, \rho(q)) -Rényi DP curve and the (\epsilon, \delta(\epsilon)) -DP curve as being the Laplace and inverse-Laplace transforms of one another. This connection shows that the Rényi divergence is well-defined for complex orders q = \gamma + i \omega . Using our Laplace transform-based analysis, we also prove an adaptive composition theorem for (\epsilon, \delta) -DP guarantees that is exactly tight (i.e., matches even in constants) for all values of \epsilon . Additionally, we resolve an issue regarding symmetry of f -DP on subsampling that prevented equivalence across all functional DP notions.
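The duality the abstract describes can be made concrete with standard privacy-loss-distribution (PLD) notation; the symbols below (\(\omega\) for the PLD density, \(L\) for the privacy loss random variable) are standard in this literature rather than taken from the paper itself.

```latex
% Rényi divergence as a (two-sided) Laplace transform of the PLD density:
e^{(q-1)\,D_q} \;=\; \mathbb{E}\!\left[e^{(q-1)L}\right]
  \;=\; \int_{-\infty}^{\infty} e^{(q-1)\ell}\,\omega(\ell)\,d\ell
  \;=\; \mathcal{L}\{\omega\}\bigl(-(q-1)\bigr),
% while the (\epsilon, \delta)-DP curve is a different functional of the same density:
\delta(\epsilon) \;=\; \int_{\epsilon}^{\infty}\bigl(1 - e^{\epsilon-\ell}\bigr)\,\omega(\ell)\,d\ell .
```

Viewing the first identity as a Laplace transform in \(s = -(q-1)\) is what lets the paper extend the Rényi divergence to complex orders \(q = \gamma + i\omega\) and move between the two DP curves.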

[LG-27] Complexity-Aware Training of Deep Neural Networks for Optimal Structure Discovery

链接: https://arxiv.org/abs/2411.09127
作者: Valentin Frank Ingmar Guenter,Athanasios Sideris
关键词-EN: deep neural networks, pruning, deep neural, Random Variables scaling, network
类目: Machine Learning (cs.LG)
*备注: 28 pages, 4 figures, 5 tables

点击查看摘要

Abstract:We propose a novel algorithm for combined unit/filter and layer pruning of deep neural networks that functions during training and without requiring a pre-trained network to apply. Our algorithm optimally trades-off learning accuracy and pruning levels while balancing layer vs. unit/filter pruning and computational vs. parameter complexity using only three user-defined parameters, which are easy to interpret and tune. The optimal network structure is found as the solution of a stochastic optimization problem over the network weights and the parameters of variational Bernoulli distributions for 0/1 Random Variables scaling the units and layers of the network. Pruning occurs when a variational parameter converges to 0 rendering the corresponding structure permanently inactive, thus saving computations during training and prediction. A key contribution of our approach is to define a cost function that combines the objectives of prediction accuracy and network pruning in a computational/parameter complexity-aware manner and the automatic selection of the many regularization parameters. We show that the solutions of the optimization problem to which the algorithm converges are deterministic networks. We analyze the ODE system that underlies our stochastic optimization algorithm and establish domains of attraction around zero for the dynamics of the network parameters. These results provide theoretical support for safely pruning units/filters and/or layers during training and lead to practical pruning conditions. We evaluate our method on the CIFAR-10/100 and ImageNet datasets using ResNet architectures and demonstrate that our method improves upon layer only or unit only pruning and favorably competes with combined unit/filter and layer pruning algorithms requiring pre-trained networks with respect to pruning ratios and test accuracy.
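A toy version of the gate-and-prune mechanic: units are scaled by variational Bernoulli parameters and permanently dropped once a parameter converges to (near) zero, matching the claim that the converged solutions are deterministic networks. All names, shapes, and the tolerance below are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical gate parameters theta_i = P(xi_i = 1), one per unit of a layer.
theta = rng.uniform(0.2, 1.0, size=8)
theta[[2, 5]] = 1e-4            # pretend training drove these two gates to ~0

def forward(x, weights, theta):
    """The deterministic network the stochastic problem converges to:
    each unit's output is scaled by its gate probability."""
    return (x @ weights) * theta

def prune(weights, theta, tol=1e-3):
    """Permanently drop units whose gate parameter converged below tol,
    saving their computation in later training and prediction."""
    keep = theta >= tol
    return weights[:, keep], theta[keep]

W = rng.normal(size=(4, 8))
W_pruned, theta_pruned = prune(W, theta)
y = forward(np.ones((1, 4)), W_pruned, theta_pruned)
```

Layer pruning works the same way with a single gate per layer; the paper's contribution is the complexity-aware cost function and the convergence analysis that make dropping a near-zero gate safe during training.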

[LG-28] Neural Graph Simulator for Complex Systems

链接: https://arxiv.org/abs/2411.09120
作者: Hoyun Choi,Sungyeop Lee,B. Kahng,Junghyo Jo
关键词-EN: Neural Graph Simulator, large-scale simulations, predominant tool, tool for studying, studying the dynamics
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Numerical simulation is a predominant tool for studying the dynamics in complex systems, but large-scale simulations are often intractable due to computational limitations. Here, we introduce the Neural Graph Simulator (NGS) for simulating time-invariant autonomous systems on graphs. Utilizing a graph neural network, the NGS provides a unified framework to simulate diverse dynamical systems with varying topologies and sizes without constraints on evaluation times through its non-uniform time step and autoregressive approach. The NGS offers significant advantages over numerical solvers by not requiring prior knowledge of governing equations and effectively handling noisy or missing data with a robust training scheme. It demonstrates superior computational efficiency over conventional methods, improving performance by over 10^5 times in stiff problems. Furthermore, it is applied to real traffic data, forecasting traffic flow with state-of-the-art accuracy. The versatility of the NGS extends beyond the presented cases, offering numerous potential avenues for enhancement.

[LG-29] Efficiently learning and sampling multimodal distributions with data-based initialization

链接: https://arxiv.org/abs/2411.09117
作者: Frederic Koehler,Holden Lee,Thuy-Duong Vuong
关键词-EN: Markov chain, stationary measure, problem of sampling, sampling a multimodal, small number
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure. Although mixing can be arbitrarily slow, we show that if the Markov chain has a k th order spectral gap, initialization from a set of \tilde O(k/\varepsilon^2) samples from the stationary distribution will, with high probability over the samples, efficiently generate a sample whose conditional law is \varepsilon -close in TV distance to the stationary measure. In particular, this applies to mixtures of k distributions satisfying a Poincaré inequality, with faster convergence when they satisfy a log-Sobolev inequality. Our bounds are stable to perturbations to the Markov chain, and in particular work for Langevin diffusion over \mathbb R^d with score estimation error, as well as Glauber dynamics combined with approximation error from pseudolikelihood estimation. This justifies the success of data-based initialization for score matching methods despite slow mixing for the data distribution, and improves and generalizes the results of Koehler and Vuong (2023) to have linear, rather than exponential, dependence on k and apply to arbitrary semigroups. As a consequence of our results, we show for the first time that a natural class of low-complexity Ising measures can be efficiently learned from samples.

[LG-30] Reducing Reasoning Costs - The Path of Optimization for Chain of Thought via Sparse Attention Mechanism NEURIPS2024

链接: https://arxiv.org/abs/2411.09111
作者: Libo Wang
关键词-EN: inference cost surge, large language model, language model inference, model inference cost, sparse attention mechanism
类目: Machine Learning (cs.LG)
*备注: The main text is 9 pages, totaling 13 pages; 5 figures, 3 tables; preprints have been submitted to NeurIPS 2024 Workshop MusIML and OpenReview

点击查看摘要

Abstract:To address the surge in inference cost caused by chain-of-thought reasoning in large language models, this research proposes a sparse attention mechanism that focuses on only a few relevant tokens. The researcher constructed a new attention mechanism and used GiantRabbit, trained with custom GPTs, as an experimental tool. The experiment tested and compared the reasoning time, correctness score, and chain-of-thought length of this model and o1 Preview in solving the linear algebra test questions of MIT OpenCourseWare. The results show that GiantRabbit’s reasoning time and chain-of-thought length are significantly lower than o1 Preview’s, confirming the feasibility of the sparse attention mechanism for reducing the cost of chain-of-thought reasoning. Detailed architectural details and the experimental process have been uploaded to GitHub; the link is: this https URL.

[LG-31] Continuous GNN-based Anomaly Detection on Edge using Efficient Adaptive Knowledge Graph Learning DATE2025

链接: https://arxiv.org/abs/2411.09072
作者: Sanggeon Yun,Ryozo Masukawa,William Youngwoo Chung,Minhyoung Na,Nathaniel Bastian,Mohsen Imani
关键词-EN: made Video Anomaly, robust security solutions, made Video, Video Anomaly Detection, evidence investigation
类目: Machine Learning (cs.LG)
*备注: Accepted to DATE 2025

点击查看摘要

Abstract:The increasing demand for robust security solutions across various industries has made Video Anomaly Detection (VAD) a critical task in applications such as intelligent surveillance, evidence investigation, and violence detection. Traditional approaches to VAD often rely on finetuning large pre-trained models, which can be computationally expensive and impractical for real-time or resource-constrained environments. To address this, MissionGNN introduced a more efficient method by training a graph neural network (GNN) using a fixed knowledge graph (KG) derived from large language models (LLMs) like GPT-4. While this approach demonstrated significant efficiency in computational power and memory, it faces limitations in dynamic environments where frequent updates to the KG are necessary due to evolving behavior trends and shifting data patterns. These updates typically require cloud-based computation, posing challenges for edge computing applications. In this paper, we propose a novel framework that facilitates continuous KG adaptation directly on edge devices, overcoming the limitations of cloud dependency. Our method dynamically modifies the KG through a three-phase process: pruning, alternating, and creating nodes, enabling real-time adaptation to changing data trends. This continuous learning approach enhances the robustness of anomaly detection models, making them more suitable for deployment in dynamic and resource-constrained environments.
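The three-phase update loop (prune, alternate, create) might look roughly like this on a set-of-concepts KG; the node names, usage scores, and thresholds are invented for illustration and are not from the paper.

```python
def adapt_kg(kg, usage, candidates, prune_below=0.1, create_above=0.8):
    """Hypothetical sketch of continuous on-device KG adaptation."""
    kg = set(kg)
    # Phase 1 -- prune: drop concepts that stopped contributing to detections.
    kg -= {n for n in kg if usage.get(n, 0.0) < prune_below}
    # Phase 2 -- alternate: swap an existing concept for a stronger variant.
    for old, (new, score) in candidates.get("alternates", {}).items():
        if old in kg and score > usage.get(old, 0.0):
            kg.discard(old)
            kg.add(new)
    # Phase 3 -- create: add newly observed high-salience concepts.
    kg |= {n for n, s in candidates.get("new", {}).items() if s > create_above}
    return kg

kg0 = {"running", "fighting", "loitering"}
usage = {"running": 0.5, "fighting": 0.9, "loitering": 0.05}
cands = {"alternates": {"running": ("sprinting", 0.7)},
         "new": {"crowding": 0.85, "sitting": 0.2}}
kg1 = adapt_kg(kg0, usage, cands)
```

Because each phase touches only a handful of nodes, an update of this shape can run on the edge device itself, which is the point of removing the cloud dependency.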

[LG-32] Optimisation Strategies for Ensuring Fairness in Machine Learning: With and Without Demographics

链接: https://arxiv.org/abs/2411.09056
作者: Quan Zhou
关键词-EN: machine learning fairness, machine learning, learning fairness, primary concerns, learning
类目: Machine Learning (cs.LG)
*备注: PhD thesis. arXiv admin note: text overlap with arXiv:2310.11407

点击查看摘要

Abstract:Ensuring fairness has emerged as one of the primary concerns in AI and its related algorithms. Over time, the field of machine learning fairness has evolved to address these issues. This paper provides an extensive overview of this field and introduces two formal frameworks to tackle open questions in machine learning fairness. In one framework, operator-valued optimisation and min-max objectives are employed to address unfairness in time-series problems. This approach showcases state-of-the-art performance on the notorious COMPAS benchmark dataset, demonstrating its effectiveness in real-world scenarios. In the second framework, the challenge of lacking sensitive attributes, such as gender and race, in commonly used datasets is addressed. This issue is particularly pressing because existing algorithms in this field predominantly rely on the availability or estimations of such attributes to assess and mitigate unfairness. Here, a framework for a group-blind bias-repair is introduced, aiming to mitigate bias without relying on sensitive attributes. The efficacy of this approach is showcased through analyses conducted on the Adult Census Income dataset. Additionally, detailed algorithmic analyses for both frameworks are provided, accompanied by convergence guarantees, ensuring the robustness and reliability of the proposed methodologies.

[LG-33] ClevrSkills: Compositional Language and Visual Reasoning in Robotics NEURIPS2024

链接: https://arxiv.org/abs/2411.09052
作者: Sanjay Haresh,Daniel Dijkman,Apratim Bhattacharyya,Roland Memisevic
关键词-EN: Robotics tasks, tasks, cleaning the table, table, Robotics
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: To appear at NeurIPS 2024 (DB track)

点击查看摘要

Abstract:Robotics tasks are highly compositional by nature. For example, to perform a high-level task like cleaning the table a robot must employ low-level capabilities of moving the effectors to the objects on the table, pick them up and then move them off the table one-by-one, while re-evaluating the consequently dynamic scenario in the process. Given that large vision language models (VLMs) have shown progress on many tasks that require high level, human-like reasoning, we ask the question: if the models are taught the requisite low-level capabilities, can they compose them in novel ways to achieve interesting high-level tasks like cleaning the table without having to be explicitly taught so? To this end, we present ClevrSkills - a benchmark suite for compositional reasoning in robotics. ClevrSkills is an environment suite developed on top of the ManiSkill2 simulator and an accompanying dataset. The dataset contains trajectories generated on a range of robotics tasks with language and visual annotations as well as multi-modal prompts as task specification. The suite includes a curriculum of tasks with three levels of compositional understanding, starting with simple tasks requiring basic motor skills. We benchmark multiple different VLM baselines on ClevrSkills and show that even after being pre-trained on large numbers of tasks, these models fail on compositional reasoning in robotics tasks.

[LG-34] Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset

链接: https://arxiv.org/abs/2411.09047
作者: Mohammad Saiful Islam,Mohamed Sami Rakha,William Pourmajidi,Janakan Sivaloganathan,John Steinbacher,Andriy Miranskyy
关键词-EN: ensuring system reliability, IBM Cloud Console, effective anomaly detection, increasingly complex, anomaly detection
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:As Large-Scale Cloud Systems (LCS) become increasingly complex, effective anomaly detection is critical for ensuring system reliability and performance. However, there is a shortage of large-scale, real-world datasets available for benchmarking anomaly detection methods. To address this gap, we introduce a new high-dimensional dataset from IBM Cloud, collected over 4.5 months from the IBM Cloud Console. This dataset comprises 39,365 rows and 117,448 columns of telemetry data. Additionally, we demonstrate the application of machine learning models for anomaly detection and discuss the key challenges faced in this process. This study and the accompanying dataset provide a resource for researchers and practitioners in cloud system monitoring. It facilitates more efficient testing of anomaly detection methods in real-world data, helping to advance the development of robust solutions to maintain the health and performance of large-scale cloud infrastructures.

[LG-35] Transformer-based Time-Series Biomarker Discovery for COPD Diagnosis NEURIPS2024

链接: https://arxiv.org/abs/2411.09027
作者: Soham Gadgil,Joshua Galanter,Mohammadreza Negahdar
关键词-EN: Chronic Obstructive Pulmonary, Obstructive Pulmonary Disorder, Chronic Obstructive, Pulmonary Disorder, Obstructive Pulmonary
类目: Machine Learning (cs.LG)
*备注: Accepted as a workshop paper to NeurIPS 2024

点击查看摘要

Abstract:Chronic Obstructive Pulmonary Disorder (COPD) is an irreversible and progressive disease which is highly heritable. Clinically, COPD is defined using the summary measures derived from a spirometry test but these are not always adequate. Here we show that using the high-dimensional raw spirogram can provide a richer signal compared to just using the summary measures. We design a transformer-based deep learning technique to process the raw spirogram values along with demographic information and predict clinically-relevant endpoints related to COPD. Our method is able to perform better than prior works while being more computationally efficient. Using the weights learned by the model, we make the framework more interpretable by identifying parts of the spirogram that are important for the model predictions. Pairing up with a board-certified pulmonologist, we also provide clinical insights into the different aspects of the spirogram and show that the explanations obtained from the model align with underlying medical knowledge.

[LG-36] Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

链接: https://arxiv.org/abs/2411.08982
作者: Vima Gupta,Kartik Sinha,Ada Gavrilovska,Anand Padmanabha Iyer
关键词-EN: recently gained popularity, enabling efficient scaling, architectures have recently, large language models, recently gained
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures have recently gained popularity in enabling efficient scaling of large language models. However, we uncover a fundamental tension: while MoEs are designed for selective expert activation, production serving requires request batching, which forces the activation of all experts and negates MoE’s efficiency benefits during the decode phase. We present Lynx, a system that enables efficient MoE inference through dynamic, batch-aware expert selection. Our key insight is that expert importance varies significantly across tokens and inference phases, creating opportunities for runtime optimization. Lynx leverages this insight through a lightweight framework that dynamically reduces active experts while preserving model accuracy. Our evaluations show that Lynx achieves up to 1.55x reduction in inference latency while maintaining negligible accuracy loss relative to the baseline model across complex code generation and mathematical reasoning tasks.
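
批级(batch-aware)专家裁剪的思路可以用如下草图示意(这是假设性的简化,并非 Lynx 的实际算法;重要性启发式与所有命名均为示意):

```python
import numpy as np

def batch_aware_expert_selection(router_logits, k_token=2, batch_keep=4):
    """Select a reduced global expert set for one batch, then re-route tokens.

    router_logits: (num_tokens, num_experts) routing scores.
    k_token: experts each token would activate under standard top-k routing.
    batch_keep: maximum experts kept active for the whole batch.
    """
    num_tokens, num_experts = router_logits.shape
    # Per-token top-k routing, as in a standard MoE layer.
    topk = np.argsort(router_logits, axis=1)[:, -k_token:]
    # Aggregate expert importance across the batch.
    importance = np.zeros(num_experts)
    for t in range(num_tokens):
        for e in topk[t]:
            importance[e] += router_logits[t, e]
    # Keep only the globally most important experts for this batch.
    active = set(np.argsort(importance)[-batch_keep:].tolist())
    # Re-route each token within the reduced expert set.
    masked = np.where(np.isin(np.arange(num_experts), list(active)),
                      router_logits, -np.inf)
    rerouted = np.argsort(masked, axis=1)[:, -k_token:]
    return active, rerouted
```

核心权衡在于:整批只加载 `batch_keep` 个专家的权重,从而在解码阶段恢复 MoE 的稀疏性收益。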

[LG-37] SoccerGuard: Investigating Injury Risk Factors for Professional Soccer Players with Machine Learning

链接: https://arxiv.org/abs/2411.08901
作者: Finn Bartels,Lu Xing,Cise Midoglu,Matthias Boeker,Toralf Kirsten,Pål Halvorsen
关键词-EN: Machine Learning, soccer using Machine, present SoccerGuard, predicting injuries, injuries in women
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present SoccerGuard, a novel framework for predicting injuries in women’s soccer using Machine Learning (ML). This framework can ingest data from multiple sources, including subjective wellness and training load reports from players, objective GPS sensor measurements, third-party player statistics, and injury reports verified by medical personnel. We experiment with a number of different settings related to synthetic data generation, input and output window sizes, and ML models for prediction. Our results show that, given the right configurations and feature combinations, injury event prediction can be undertaken with considerable accuracy. The optimal results are achieved when input windows are reduced and larger combined output windows are defined, in combination with an ideally balanced data set. The framework also includes a dashboard with a user-friendly Graphical User Interface (GUI) to support interactive analysis and visualization.
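
摘要中提到的输入/输出窗口构造可以用如下草图示意(示意性说明,窗口语义与命名均为假设,并非 SoccerGuard 的实际代码):

```python
import numpy as np

def make_windows(daily_features, injury_labels, in_win=7, out_win=3):
    """Build (input window, output window) pairs: predict whether any injury
    occurs in the next out_win days from the past in_win days of features."""
    X, y = [], []
    for t in range(in_win, len(daily_features) - out_win + 1):
        X.append(daily_features[t - in_win:t].ravel())      # flattened history
        y.append(int(injury_labels[t:t + out_win].any()))   # any injury ahead
    return np.array(X), np.array(y)
```

摘要指出较小的输入窗口配合较大的合并输出窗口效果最好,上述 `in_win` / `out_win` 即对应这两个可调参数。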

[LG-38] Conditional regression for the Nonlinear Single-Variable Model

链接: https://arxiv.org/abs/2411.09686
作者: Yantao Wu,Mauro Maggioni
关键词-EN: exploiting geometric assumptions, strong smoothness assumptions, mathbb, gamma, special structure
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 55 pages, 10 figures

点击查看摘要

Abstract:Several statistical models for regression of a function $F$ on $\mathbb{R}^d$ without the statistical and computational curse of dimensionality exist, for example by imposing and exploiting geometric assumptions on the distribution of the data (e.g. that its support is low-dimensional), or strong smoothness assumptions on $F$, or a special structure of $F$. Among the latter, compositional models, which assume $F = f \circ g$ with $g$ mapping to $\mathbb{R}^r$ with $r \ll d$, have been studied, and include classical single- and multi-index models and recent works on neural networks. While the case where $g$ is linear is rather well-understood, much less is known when $g$ is nonlinear, and in particular for which $g$'s the curse of dimensionality in estimating $F$, or both $f$ and $g$, may be circumvented. In this paper, we consider a model $F(X) := f(\Pi_\gamma X)$ where $\Pi_\gamma : \mathbb{R}^d \to [0, \mathrm{len}_\gamma]$ is the closest-point projection onto the parameter of a regular curve $\gamma : [0, \mathrm{len}_\gamma] \to \mathbb{R}^d$ and $f : [0, \mathrm{len}_\gamma] \to \mathbb{R}^1$. The input data $X$ is not low-dimensional, and may be far from $\gamma$, conditioned on $\Pi_\gamma(X)$ being well-defined. The distribution of the data, $\gamma$, and $f$ are unknown. This model is a natural nonlinear generalization of the single-index model, which corresponds to $\gamma$ being a line. We propose a nonparametric estimator, based on conditional regression, and show that under suitable assumptions, the strongest of which being that $f$ is coarsely monotone, it can achieve the one-dimensional optimal min-max rate for non-parametric regression, up to the level of noise in the observations, and be constructed in time $\mathcal{O}(d^2 n \log n)$. All the constants in the learning bounds, in the minimal number of samples required for our bounds to hold, and in the computational complexity are at most low-order polynomials in $d$.
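
一个玩具版的估计器:先把样本投影到离散化曲线上最近的点,再沿弧长参数做一维分箱回归。以下草图仅为示意(在简化假设下给出,并非论文的估计器):

```python
import numpy as np

def conditional_regression_on_curve(X, y, curve, n_bins=20):
    """Project each sample onto its closest point of a discretized curve,
    then run 1-D binned regression along the arc-length parameter."""
    seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])          # arc-length params
    # Closest-point projection Pi_gamma(X) onto the discretized curve.
    d2 = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(-1)
    t = s[d2.argmin(axis=1)]
    # 1-D nonparametric regression: bin averages of y along the parameter.
    edges = np.linspace(0.0, s[-1], n_bins + 1)
    idx = np.clip(np.digitize(t, edges) - 1, 0, n_bins - 1)
    f_hat = np.array([y[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)])
    return t, edges, f_hat
```

当曲线是一条直线时,该草图退化为经典单指标模型的分箱估计,对应摘要中"$\gamma$ 为直线"的特例。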

[LG-39] Neural Operators Can Play Dynamic Stackelberg Games

链接: https://arxiv.org/abs/2411.09644
作者: Guillermo Alvarez,Ibrahim Ekren,Anastasis Kratsios,Xuwei Yang
关键词-EN: Dynamic Stackelberg games, Dynamic Stackelberg, follower best-response operator, Stackelberg games, stylized Stackelberg games
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Computational Finance (q-fin.CP)
*备注:

点击查看摘要

Abstract:Dynamic Stackelberg games are a broad class of two-player games in which the leader acts first, and the follower chooses a response strategy to the leader’s strategy. Unfortunately, only stylized Stackelberg games are explicitly solvable since the follower’s best-response operator (as a function of the control of the leader) is typically analytically intractable. This paper addresses this issue by showing that the *follower’s best-response operator* can be approximately implemented by an *attention-based neural operator*, uniformly on compact subsets of adapted open-loop controls for the leader. We further show that the value of the Stackelberg game where the follower uses the approximate best-response operator approximates the value of the original Stackelberg game. Our main result is obtained using our universal approximation theorem for attention-based neural operators between spaces of square-integrable adapted stochastic processes, as well as stability results for a general class of Stackelberg games.

[LG-40] Counterfactual Uncertainty Quantification of Factual Estimand of Efficacy from Before-and-After Treatment Repeated Measures Randomized Controlled Trials

链接: https://arxiv.org/abs/2411.09635
作者: Xingya Wang,Yang Han,Yushi Liu,Szu-Yu Tang,Jason C. Hsu
关键词-EN: Randomized Controlled Trials, expected differential outcome, ideal estimand, estimand for comparing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 39 pages, 7 figures

点击查看摘要

Abstract:The ideal estimand for comparing a new treatment Rx with a control C is the *counterfactual* efficacy Rx:C, the expected differential outcome between Rx and C if each patient were given *both*. While counterfactual *point estimation* from *factual* Randomized Controlled Trials (RCTs) has been available, this article shows that *counterfactual uncertainty quantification* (CUQ), quantifying uncertainty for factual point estimates but in a counterfactual setting, is surprisingly achievable. We achieve CUQ, whose variability is typically smaller than factual UQ, by creating a new statistical modeling principle called ETZ, which is applicable to RCTs with *Before-and-After treatment Repeated Measures*, common in many therapeutic areas. We urge caution when the estimate of the unobservable true condition of a patient before treatment has measurement error, because that violation of a standard regression assumption can cause attenuation in estimating treatment effects. Fortunately, we prove that, for traditional medicine in general, and for targeted therapy with efficacy defined as averaged over the population, counterfactual point estimation is unbiased. However, for targeted therapy, both Real Human and Digital Twins approaches should respect this limitation, lest the predicted treatment effect in *subgroups* be biased.

[LG-41] MICCAI-CDMRI 2023 QuantConn Challenge Findings on Achieving Robust Quantitative Connectivity through Harmonized Preprocessing of Diffusion MRI

链接: https://arxiv.org/abs/2411.09618
作者: Nancy R. Newlin,Kurt Schilling,Serge Koudoro,Bramsh Qamar Chandio,Praitayini Kanakaraj,Daniel Moyer,Claire E. Kelly,Sila Genc,Jian Chen,Joseph Yuan-Mou Yang,Ye Wu,Yifei He,Jiawei Zhang,Qingrun Zeng,Fan Zhang,Nagesh Adluru,Vishwesh Nath,Sudhir Pathak,Walter Schneider,Anurag Gade,Yogesh Rathi,Tom Hendriks,Anna Vilanova,Maxime Chamberland,Tomasz Pieciak,Dominika Ciupek,Antonio Tristán Vega,Santiago Aja-Fernández,Maciej Malawski,Gani Ouedraogo,Julia Machnio,Christian Ewert,Paul M. Thompson,Neda Jahanshad,Eleftherios Garyfallidis,Bennett A. Landman
关键词-EN: White matter alterations, White matter, white matter microstructure, alterations are increasingly, increasingly implicated
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
*备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

点击查看摘要

Abstract:White matter alterations are increasingly implicated in neurological diseases and their progression. International-scale studies use diffusion-weighted magnetic resonance imaging (DW-MRI) to qualitatively identify changes in white matter microstructure and connectivity. Yet, quantitative analysis of DW-MRI data is hindered by inconsistencies stemming from varying acquisition protocols. There is a pressing need to harmonize the preprocessing of DW-MRI datasets to ensure the derivation of robust quantitative diffusion metrics across acquisitions. In the MICCAI-CDMRI 2023 QuantConn challenge, participants were provided raw data from the same individuals collected on the same scanner but with two different acquisitions and tasked with preprocessing the DW-MRI to minimize acquisition differences while retaining biological variation. Submissions are evaluated on the reproducibility and comparability of cross-acquisition bundle-wise microstructure measures, bundle shape features, and connectomics. The key innovations of the QuantConn challenge are that (1) we assess bundles and tractography in the context of harmonization for the first time, (2) we assess connectomics in the context of harmonization for the first time, and (3) we have 10x more subjects than the prior harmonization challenge, MUSHAC, and 100x more than SuperMUDI. We find that bundle surface area, fractional anisotropy, connectome assortativity, betweenness centrality, edge count, modularity, nodal strength, and participation coefficient measures are most biased by acquisition and that machine learning voxel-wise correction, RISH mapping, and NeSH methods effectively reduce these biases. In addition, microstructure measures AD, MD, RD, bundle length, connectome density, efficiency, and path length are least biased by these acquisition differences.

[LG-42] Equation-informed data-driven identification of flow budgets and dynamics

链接: https://arxiv.org/abs/2411.09545
作者: Nataliya Sevryugina,Serena Costanzo,Steve de Bruyn Kops,Colm-cille Caulfield,Iraj Mortazavi,Taraneh Sayadi
关键词-EN: Computational Fluid Dynamics, Computational Fluid, fluid modelling, Fluid Dynamics, engineering applications
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Computational Fluid Dynamics (CFD) is an indispensable method of fluid modelling in engineering applications, reducing the need for physical prototypes and testing for tasks such as design optimisation and performance analysis. Depending on the complexity of the system under consideration, models ranging from low to high fidelity can be used for prediction, allowing significant speed-up. However, the choice of model requires information about the actual dynamics of the flow regime. Correctly identifying the regions/clusters of flow that share the same dynamics has been a challenging research topic to date. In this study, we propose a novel hybrid approach to flow clustering. It consists of characterising each sample point of the system with equation-based features, i.e. features are budgets that represent the contribution of each term from the original governing equation to the local dynamics at each sample point. This was achieved by applying the Sparse Identification of Nonlinear Dynamical systems (SINDy) method pointwise to time evolution data. The method proceeds with equation-based clustering using the Girvan-Newman algorithm. This allows the detection of communities that share the same physical dynamics. The algorithm is implemented in both Eulerian and Lagrangian frameworks. In the Lagrangian, i.e. dynamic approach, the clustering is performed on the trajectory of each point, allowing the change of clusters to be represented also in time. The performance of the algorithm is first tested on a flow around a cylinder. The construction of the dynamic clusters in this test case clearly shows the evolution of the wake from the steady state solution through the transient to the oscillatory solution. Dynamic clustering was then successfully tested on turbulent flow data. Two distinct and well-defined clusters were identified and their temporal evolution was reconstructed.
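
"方程预算"(equation budgets)的思路可以在一维对流-扩散模型方程上示意;这里把聚类步骤简化为按主导项给每个点打标签,而不是运行 Girvan-Newman 社区检测(模型方程与所有命名均为示意性假设):

```python
import numpy as np

def equation_budgets(u, dx, c=1.0, nu=1.0):
    """Pointwise budgets of the model equation du/dt = -c*du/dx + nu*d2u/dx2:
    the contribution of each governing-equation term at each sample point."""
    dudx = np.gradient(u, dx)
    d2udx2 = np.gradient(dudx, dx)
    return np.stack([-c * dudx, nu * d2udx2], axis=-1)   # (n, n_terms)

def cluster_by_dominant_term(budgets):
    # Simplified stand-in for the paper's graph-community clustering:
    # label each point by the term with the largest absolute contribution.
    return np.abs(budgets).argmax(axis=-1)
```

在论文的方法中,这些逐点预算由 SINDy 拟合得到,再在其上做基于图的社区检测;上面的草图只演示"每个采样点由各项贡献刻画"这一核心特征化步骤。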

[LG-43] Sparse Bayesian Generative Modeling for Compressive Sensing

链接: https://arxiv.org/abs/2411.09483
作者: Benedikt Böck,Sadaf Syed,Wolfgang Utschick
关键词-EN: fundamental linear inverse, compressive sensing, linear inverse problem, work addresses, addresses the fundamental
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This work addresses the fundamental linear inverse problem in compressive sensing (CS) by introducing a new type of regularizing generative prior. Our proposed method utilizes ideas from classical dictionary-based CS and, in particular, sparse Bayesian learning (SBL), to integrate a strong regularization towards sparse solutions. At the same time, by leveraging the notion of conditional Gaussianity, it also incorporates the adaptability from generative models to training data. However, unlike most state-of-the-art generative models, it is able to learn from a few compressed and noisy data samples and requires no optimization algorithm for solving the inverse problem. Additionally, similar to Dirichlet prior networks, our model parameterizes a conjugate prior enabling its application for uncertainty quantification. We support our approach theoretically through the concept of variational inference and validate it empirically using different types of compressible signals.
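
作为背景,该方法所依托的经典稀疏贝叶斯学习(SBL)成分可以用标准 EM 更新来示意(教科书式草图,并非论文提出的生成模型):

```python
import numpy as np

def sbl_recover(A, y, noise_var=1e-4, n_iter=60):
    """Classical sparse Bayesian learning (EM) for y = A x + noise:
    per-coefficient prior variances gamma_i shrink toward zero for
    coefficients not supported by the data, yielding sparse solutions."""
    m, n = A.shape
    gamma = np.ones(n)                           # prior variances of x
    mu = np.zeros(n)
    for _ in range(n_iter):
        Gamma = np.diag(gamma)
        Sigma_y = noise_var * np.eye(m) + A @ Gamma @ A.T
        K = Gamma @ A.T @ np.linalg.inv(Sigma_y)
        mu = K @ y                               # posterior mean of x
        Sigma = Gamma - K @ A @ Gamma            # posterior covariance
        gamma = mu ** 2 + np.diag(Sigma)         # EM update of the prior
    return mu
```

论文将这种朝稀疏解的强正则化与条件高斯生成模型结合;上面只给出其经典 SBL 基线的一个最小实现。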

[LG-44] Graph Neural Networks and Differential Equations: A hybrid approach for data assimilation of fluid flows

链接: https://arxiv.org/abs/2411.09476
作者: M. Quattromini,M.A. Bucci,S. Cherubini,O. Semeraro
关键词-EN: Reynolds-Averaged Navier Stokes, Graph Neural Networks, combines Graph Neural, Navier Stokes, data-driven Neural Networks
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study presents a novel hybrid approach that combines Graph Neural Networks (GNNs) with Reynolds-Averaged Navier Stokes (RANS) equations to enhance the accuracy of mean flow reconstruction across a range of fluid dynamics applications. Traditional, purely data-driven Neural Network (NN) models often struggle to maintain physical consistency. Moreover, they typically require large datasets to achieve reliable performance. The GNN framework, which naturally handles unstructured data such as complex geometries in Computational Fluid Dynamics (CFD), is here integrated with RANS equations as a physical baseline model. The methodology leverages the adjoint method, enabling the use of RANS-derived gradients as optimization terms in the GNN training process. This ensures that the learned model adheres to the governing physics, maintaining physical consistency while improving the prediction accuracy. We test our approach on multiple CFD scenarios, including cases involving generalization with respect to the Reynolds number, sparse measurements, denoising and inpainting of missing portions of the mean flow. The results demonstrate significant improvements in the accuracy of the reconstructed mean flow compared to purely data-driven models, using limited amounts of data in the training dataset. The key strengths of this study are the integration of physical laws into the training process of the GNN, and the ability to achieve high-accuracy predictions with a limited amount of data, making this approach particularly valuable for applications in fluid dynamics where data is often scarce.

[LG-45] Enhancing generalization in high energy physics using white-box adversarial attacks

链接: https://arxiv.org/abs/2411.09296
作者: Franck Rothen,Samuel Klein,Matthew Leigh,Tobias Golling
关键词-EN: Monte Carlo simulations, Machine learning, Monte Carlo, labeled Monte Carlo, particle physics
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, 8 tables, 3 algorithms, to be published in Physical Review D (PRD), presented at the ML4Jets 2024 conference

点击查看摘要

Abstract:Machine learning is becoming increasingly popular in the context of particle physics. Supervised learning, which uses labeled Monte Carlo (MC) simulations, remains one of the most widely used methods for discriminating signals beyond the Standard Model. However, this paper suggests that supervised models may depend excessively on artifacts and approximations from Monte Carlo simulations, potentially limiting their ability to generalize well to real data. This study aims to enhance the generalization properties of supervised models by reducing the sharpness of local minima. It reviews the application of four distinct white-box adversarial attacks in the context of classifying Higgs boson decay signals. The attacks are divided into weight space attacks and feature space attacks. To study and quantify the sharpness of different local minima, this paper presents two analysis methods: gradient ascent and reduced Hessian eigenvalue analysis. The results show that white-box adversarial attacks significantly improve generalization performance, albeit with increased computational complexity.
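
摘要提到的梯度上升锐度探测可以在一个一般的损失面上示意(假设性简化,步长与命名均为示意,并非论文的实现):

```python
import numpy as np

def sharpness_via_gradient_ascent(loss_fn, grad_fn, w, step=1e-2, n_steps=5):
    """Estimate local-minimum sharpness by ascending the loss surface from w
    and reporting the loss increase: flatter minima increase more slowly."""
    w_adv = w.copy()
    for _ in range(n_steps):
        g = grad_fn(w_adv)
        # Normalized gradient-ascent step away from the minimum.
        w_adv = w_adv + step * g / (np.linalg.norm(g) + 1e-12)
    return loss_fn(w_adv) - loss_fn(w)
```

直觉上,权重空间攻击寻找使损失上升最快的扰动;沿该方向损失上升越快,极小点越"尖",泛化往往越差。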

[LG-46] Classical Verification of Quantum Learning Advantages with Noises

链接: https://arxiv.org/abs/2411.09210
作者: Yinghao Ma,Jiaxi Su,Dong-Ling Deng
关键词-EN: untrusted quantum servers, reliably leverage quantum, leverage quantum computing, Classical verification, quantum
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 13 pages 1 figure

点击查看摘要

Abstract:Classical verification of quantum learning allows classical clients to reliably leverage quantum computing advantages by interacting with untrusted quantum servers. Yet, current quantum devices available in practice suffer from a variety of noises, and whether existing classical verification protocols carry over to noisy scenarios remains unclear. Here, we propose an efficient classical error rectification algorithm to reconstruct the noise-free results given by the quantum Fourier sampling circuit with practical constant-level noises. In particular, we prove that the error rectification algorithm can restore the heavy Fourier coefficients by using a small number of noisy samples that scales logarithmically with the problem size. We apply this algorithm to the agnostic parity learning task with uniform input marginal and prove that this task can be accomplished in an efficient way on noisy quantum devices with our algorithm. In addition, we prove that a classical client with access to the random example oracle can verify the agnostic parity learning results from the noisy quantum prover in an efficient way, under the condition that the Fourier coefficients are sparse. Our results demonstrate the feasibility of classical verification of quantum learning advantages with noises, which provide a valuable guide for both theoretical studies and practical applications with current noisy intermediate scale quantum devices.

[LG-47] Hybrid deep additive neural networks

链接: https://arxiv.org/abs/2411.09175
作者: Gyu Min Kim,Jeong Min Jeon
关键词-EN: data science due, neural networks, Traditional neural networks, multi-layer perceptrons, range of tasks
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 29 pages, 13 figures

点击查看摘要

Abstract:Traditional neural networks (multi-layer perceptrons) have become an important tool in data science due to their success across a wide range of tasks. However, their performance is sometimes unsatisfactory, and they often require a large number of parameters, primarily due to their reliance on the linear combination structure. Meanwhile, additive regression has been a popular alternative to linear regression in statistics. In this work, we introduce novel deep neural networks that incorporate the idea of additive regression. Our neural networks share architectural similarities with Kolmogorov-Arnold networks but are based on simpler yet flexible activation and basis functions. Additionally, we introduce several hybrid neural networks that combine this architecture with that of traditional neural networks. We derive their universal approximation properties and demonstrate their effectiveness through simulation studies and a real-data application. The numerical results indicate that our neural networks generally achieve better performance than traditional neural networks while using fewer parameters.
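
"加性"思想,即每个输出是各输入的单变量函数之和而非线性混合,可以用单层草图示意(示意性实现,基函数选择与命名均为假设,并非论文的架构):

```python
import numpy as np

def additive_layer(x, weights, centers, scale=1.0):
    """Additive layer: out_j = sum_i g_ij(x_i), where each g_ij is a small
    Gaussian-basis expansion of the single input x_i (no linear mixing of
    inputs, unlike a dense layer).

    x: (n, d) inputs; centers: (k,) basis centers; weights: (d, k, m)."""
    phi = np.exp(-((x[:, :, None] - centers) ** 2) / (2.0 * scale ** 2))
    return np.einsum('ndk,dkm->nm', phi, weights)   # sum over inputs i, bases k
```

与全连接层相比,这种结构的参数量为 `d*k*m` 且每个输出可分解为 `d` 条单变量曲线之和,这正是加性回归可解释性的来源。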

[LG-48] FxTS-Net: Fixed-Time Stable Learning Framework for Neural ODEs

链接: https://arxiv.org/abs/2411.09118
作者: Chaoyang Luo,Yan Zou,Wanying Li,Nanjing Huang
关键词-EN: Ordinary Differential Equations, Neural Ordinary Differential, Differential Equations, Ordinary Differential, cleverly link traditional
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural Ordinary Differential Equations (Neural ODEs), as a novel category of modeling big data methods, cleverly link traditional neural networks and dynamical systems. However, it is challenging to ensure the dynamical system reaches a correctly predicted state within a user-defined fixed time. To address this problem, we propose a new method for training Neural ODEs using fixed-time stability (FxTS) Lyapunov conditions. Our framework, called FxTS-Net, is based on the novel FxTS loss (FxTS-Loss) designed on Lyapunov functions, which aims to encourage convergence to accurate predictions in a user-defined fixed time. We also provide an innovative approach for constructing Lyapunov functions to meet various tasks and network architecture requirements, achieved by leveraging supervised information during training. By developing a more precise time upper bound estimation for bounded non-vanishingly perturbed systems, we demonstrate that minimizing FxTS-Loss not only guarantees FxTS behavior of the dynamics but also input perturbation robustness. For optimising FxTS-Loss, we also propose a learning algorithm, in which the simulated perturbation sampling method can capture sample points in critical regions to approximate FxTS-Loss. Experimentally, we find that FxTS-Net provides better prediction performance and better robustness under input perturbation.
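
FxTS-Loss 的思路可以从标准的固定时间稳定性 Lyapunov 条件出发来示意(谨慎的草图:罚项形式是基于该条件的假设,并非论文的精确损失):

```python
import numpy as np

def fxts_loss(V, Vdot, alpha1=1.0, alpha2=1.0, p=1.5, q=0.5):
    """Penalize violations of the fixed-time-stability Lyapunov condition
    dV/dt <= -alpha1*V^p - alpha2*V^q  (p > 1, 0 < q < 1),
    which guarantees convergence within a fixed time
    T <= 1/(alpha1*(p-1)) + 1/(alpha2*(1-q)), independent of the initial state.

    V, Vdot: Lyapunov values and their time derivatives at sampled states."""
    violation = Vdot + alpha1 * V ** p + alpha2 * V ** q
    return np.maximum(violation, 0.0).mean()   # hinge on the FxTS inequality
```

训练时可将该罚项加入 Neural ODE 的损失:当轨迹处处满足不等式时罚项为零,否则按违反量线性惩罚。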

[LG-49] Minimax Optimal Two-Sample Testing under Local Differential Privacy

链接: https://arxiv.org/abs/2411.09064
作者: Jongmin Mun,Seungwoo Kwak,Ilmun Kim
关键词-EN: local differential privacy, statistical utility, local differential, private two-sample testing, Google RAPPOR
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 59 pages, 5 figures

点击查看摘要

Abstract:We explore the trade-off between privacy and statistical utility in private two-sample testing under local differential privacy (LDP) for both multinomial and continuous data. We begin by addressing the multinomial case, where we introduce private permutation tests using practical privacy mechanisms such as Laplace, discrete Laplace, and Google’s RAPPOR. We then extend our multinomial approach to continuous data via binning and study its uniform separation rates under LDP over Hölder and Besov smoothness classes. The proposed tests for both discrete and continuous cases rigorously control the type I error for any finite sample size, strictly adhere to LDP constraints, and achieve minimax separation rates under LDP. The attained minimax rates reveal inherent privacy-utility trade-offs that are unavoidable in private testing. To address scenarios with unknown smoothness parameters in density testing, we propose an adaptive test based on a Bonferroni-type approach that ensures robust performance without prior knowledge of the smoothness parameters. We validate our theoretical findings with extensive numerical experiments and demonstrate the practical relevance and effectiveness of our proposed methods.
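
基于 Laplace 机制的本地差分隐私置换检验的最小版本大致如下(示意性草图,并非论文中的检验;one-hot 敏感度的处理与所有命名均为假设):

```python
import numpy as np

def ldp_permutation_test(x, y, n_cats, eps=1.0, n_perm=200, seed=0):
    """Private two-sample permutation test for multinomial data.

    Each user reports a one-hot vector perturbed with Laplace noise
    (L1 sensitivity 2 for one-hot encoding -> scale 2/eps per coordinate)."""
    rng = np.random.default_rng(seed)

    def privatize(samples):
        onehot = np.eye(n_cats)[samples]
        return onehot + rng.laplace(scale=2.0 / eps, size=onehot.shape)

    zx, zy = privatize(x), privatize(y)
    stat = ((zx.mean(0) - zy.mean(0)) ** 2).sum()
    z = np.vstack([zx, zy])
    n = len(zx)
    perm_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(z))
        a, b = z[idx[:n]], z[idx[n:]]
        perm_stats.append(((a.mean(0) - b.mean(0)) ** 2).sum())
    # Permutation p-values control the type I error at any finite sample size.
    return (1 + sum(s >= stat for s in perm_stats)) / (1 + n_perm)
```

置换是在已私有化的报告上进行的,因此检验只接触满足 LDP 约束的数据;这也是摘要中"对任意有限样本量严格控制一类错误"的来源。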

[LG-50] Microfoundation Inference for Strategic Prediction

链接: https://arxiv.org/abs/2411.08998
作者: Daniele Bracale,Subha Maity,Felipe Maia Polo,Seamus Somerstep,Moulinath Banerjee,Yuekai Sun
关键词-EN: phenomenon termed performative, termed performative prediction, target variable, phenomenon termed, prediction tasks
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Often in prediction tasks, the predictive model itself can influence the distribution of the target variable, a phenomenon termed performative prediction. Generally, this influence stems from strategic actions taken by stakeholders with a vested interest in predictive models. A key challenge that hinders the widespread adoption of performative prediction in machine learning is that practitioners are generally unaware of the social impacts of their predictions. To address this gap, we propose a methodology for learning the distribution map that encapsulates the long-term impacts of predictive models on the population. Specifically, we model agents’ responses as a cost-adjusted utility maximization problem and propose estimates for said cost. Our approach leverages optimal transport to align pre-model exposure (ex ante) and post-model exposure (ex post) distributions. We provide a rate of convergence for this proposed estimate and assess its quality through empirical demonstrations on a credit-scoring dataset.
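
在一维情形下,事前(ex ante)与事后(ex post)分布的最优传输对齐退化为分位数匹配,可以如下示意(简化的说明性草图,并非论文的估计器):

```python
import numpy as np

def estimate_distribution_map(ex_ante, ex_post):
    """1-D optimal-transport (quantile-coupling) estimate of how agents shift
    a feature in response to the model: maps each ex-ante value to the
    same quantile of the ex-post distribution."""
    a = np.sort(ex_ante)
    b = np.sort(ex_post)

    def transport(x):
        # Empirical quantile of x under ex-ante, read off in ex-post.
        ranks = np.searchsorted(a, x, side='right') / len(a)
        idx = np.clip((ranks * len(b)).astype(int), 0, len(b) - 1)
        return b[idx]

    return transport
```

例如,若所有个体都将某特征抬高一个固定量(信用评分中常见的策略行为),该映射就会近似恢复这一平移量。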

[LG-51] Parameter Inference via Differentiable Diffusion Bridge Importance Sampling

链接: https://arxiv.org/abs/2411.08993
作者: Nicklas Boserup,Gefan Yang,Michael Lind Severinsen,Christy Anna Hipsley,Stefan Sommer
关键词-EN: non-linear diffusion processes, introduce a methodology, methodology for performing, non-linear diffusion, diffusion processes
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a methodology for performing parameter inference in high-dimensional, non-linear diffusion processes. We illustrate its applicability for obtaining insights into the evolution of and relationships between species, including ancestral state reconstruction. Estimation is performed by utilising score matching to approximate diffusion bridges, which are subsequently used in an importance sampler to estimate log-likelihoods. The entire setup is differentiable, allowing gradient ascent on approximated log-likelihoods. This allows both parameter inference and diffusion mean estimation. This novel, numerically stable, score matching-based parameter inference framework is presented and demonstrated on biological two- and three-dimensional morphometry data.

[LG-52] Non-Euclidean High-Order Smooth Convex Optimization

链接: https://arxiv.org/abs/2411.08987
作者: Juan Pablo Contreras,Cristóbal Guzmán,David Martínez-Rubio
关键词-EN: Hölder continuous, derivatives with respect, order oracle, Hölder, convex objectives
类目: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We develop algorithms for the optimization of convex objectives that have Hölder continuous $q$-th derivatives with respect to a $p$-norm by using a $q$-th order oracle, for $p, q \geq 1$. We can also optimize other structured functions. We do this by developing a non-Euclidean inexact accelerated proximal point method that makes use of an inexact uniformly convex regularizer. We also provide nearly matching lower bounds for any deterministic algorithm that interacts with the function via a local oracle.

[LG-53] A Message Passing Neural Network Surrogate Model for Bond-Associated Peridynamic Material Correspondence Formulation

链接: https://arxiv.org/abs/2411.08911
作者: Xuan Hu,Qijun Chen,Nicholas H. Luo,Richy J. Zheng,Shaofan Li
关键词-EN: problems involving discontinuities, non-local continuum mechanics, continuum mechanics theory, offers unique advantages, modeling problems involving
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: arXiv admin note: substantial text overlap with arXiv:2410.00934

点击查看摘要

Abstract:Peridynamics is a non-local continuum mechanics theory that offers unique advantages for modeling problems involving discontinuities and complex deformations. Within the peridynamic framework, various formulations exist, among which the material correspondence formulation stands out for its ability to directly incorporate traditional continuum material models, making it highly applicable to a range of engineering challenges. A notable advancement in this area is the bond-associated correspondence model, which not only resolves issues of material instability but also achieves high computational accuracy. However, the bond-associated model typically requires higher computational costs than FEA, which can limit its practical application. To address this computational challenge, we propose a novel surrogate model based on a message-passing neural network (MPNN) specifically designed for the bond-associated peridynamic material correspondence formulation. Leveraging the similarities between graph structure and the neighborhood connectivity inherent to peridynamics, we construct an MPNN that can transfer domain knowledge from peridynamics into a computational graph and shorten the computation time via GPU acceleration. Unlike conventional graph neural networks that focus on node features, our model emphasizes edge-based features, capturing the essential material point interactions in the formulation. A key advantage of this neural network approach is its flexibility: it does not require fixed neighborhood connectivity, making it adaptable across diverse configurations and scalable for complex systems. Furthermore, the model inherently possesses translational and rotational invariance, enabling it to maintain physical objectivity: a critical requirement for accurate mechanical modeling.
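
文中强调"边特征"的单步消息传递可以用如下通用 MPNN 草图示意(维度与命名均为假设,并非论文的代理模型):

```python
import numpy as np

def edge_message_passing(node_x, edges, edge_x, W_msg, W_upd):
    """One message-passing step with edge (bond) features, sketching how
    bond-associated peridynamic interactions map onto a graph.

    node_x: (n, f) material-point states; edges: (e, 2) index pairs;
    edge_x: (e, g) bond features (e.g. stretch, direction)."""
    src, dst = edges[:, 0], edges[:, 1]
    # Message on each bond, from the source state plus the bond feature.
    msg_in = np.concatenate([node_x[src], edge_x], axis=1)
    messages = np.tanh(msg_in @ W_msg)
    # Aggregate messages per destination node (sum over incident bonds).
    agg = np.zeros((node_x.shape[0], messages.shape[1]))
    np.add.at(agg, dst, messages)
    # Update each node state from its aggregated bond messages.
    return np.tanh(np.concatenate([node_x, agg], axis=1) @ W_upd)
```

因为消息由边(键)特征驱动且按求和聚合,该结构对邻域大小不敏感,这与摘要中"不要求固定邻域连通性"的性质一致。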

[LG-54] Long-context Protein Language Model

链接: https://arxiv.org/abs/2411.08909
作者: Yingheng Wang,Zichen Wang,Gil Sadeh,Luca Zancato,Alessandro Achille,George Karypis,Huzefa Rangwala
关键词-EN: generative drug design, Self-supervised training, drug design, great success, generative drug
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: 32 pages, 17 figures, 11 tables

点击查看摘要

Abstract:Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design. Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths. Such protein LMs cannot extrapolate to longer proteins and protein complexes well. They also fail to account for the underlying biological mechanisms carried out by biomolecular interactions and dynamics, i.e., proteins often interact with other proteins, molecules, and pathways in complex biological systems. In this work, we propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built off selective structured state-space models, to learn high-quality universal protein representations at the amino acid token level using masked language modeling. We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction (PPI) graphs for a second stage of training. LC-PLM demonstrates favorable neural scaling laws, better length extrapolation capability, and a 7% to 34% improvement on protein downstream tasks over Transformer-based ESM-2. LC-PLM-G further trained within the context of PPI graphs shows promising results on protein structure and function prediction tasks. Our study demonstrates the benefit of increasing the context size with computationally efficient LM architecture (e.g. structured state space models) in learning universal protein representations and incorporating molecular interaction context contained in biological graphs.
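
氨基酸 token 级别的掩码语言建模可以如下示意(通用的 BERT 式掩码说明;掩码率与 token 约定均为假设,并非 LC-PLM 的精确配方):

```python
import numpy as np

def mask_protein_sequence(seq, mask_rate=0.15, seed=0):
    """BERT-style masking for protein MLM training: replace a fraction of
    residues with a [MASK] token and record targets only at masked sites."""
    rng = np.random.default_rng(seed)
    tokens = list(seq)
    targets = [None] * len(tokens)        # prediction targets (masked sites only)
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets[i] = tokens[i]
            tokens[i] = "[MASK]"
    return tokens, targets
```

训练目标是在每个 `[MASK]` 位置预测原氨基酸;长上下文架构的意义在于这种预测可以利用远距离残基乃至蛋白复合物级别的信息。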

[LG-55] Turkey's Earthquakes: Damage Prediction and Feature Significance Using A Multivariate Analysis

链接: https://arxiv.org/abs/2411.08903
作者: Shrey Shah,Alex Lin,Scott Lin,Josh Patel,Michael Lam,Kevin Zhu
关键词-EN: Accurate damage prediction, response strategies, Accurate damage, crucial for disaster, disaster preparedness
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate damage prediction is crucial for disaster preparedness and response strategies, particularly given the frequent earthquakes in Turkey. Utilizing datasets on earthquake data, infrastructural quality metrics, and contemporary socioeconomic factors, we tested various machine-learning architectures to forecast death tolls and fatalities per affected population. Our findings indicate that the Random Forest model provides the most reliable predictions. The model highlights earthquake magnitude and building stability as the primary determinants of damage. This research contributes to the reduction of fatalities in future seismic events in Turkey.
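A Random Forest pipeline of the kind the abstract describes can be approximated with scikit-learn. The features, the synthetic damage response, and all numbers below are fabricated for illustration; they merely mimic the reported finding that magnitude and building stability dominate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic stand-ins for the study's earthquake, infrastructure, and
# socioeconomic features (hypothetical, not the actual datasets).
n = 500
magnitude = rng.uniform(4.0, 8.0, n)   # moment magnitude
stability = rng.uniform(0.0, 1.0, n)   # building-stability index
density = rng.uniform(10, 5000, n)     # population density
X = np.column_stack([magnitude, stability, density])

# Hypothetical damage response dominated by magnitude and stability.
y = 10 ** (magnitude - 5) * (1.2 - stability) + rng.normal(0, 1, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importances = dict(zip(["magnitude", "stability", "density"],
                       model.feature_importances_))
```

Inspecting `importances` on such data shows magnitude far outweighing density, the same kind of feature-significance readout the study uses to identify primary determinants of damage.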

[LG-56] Demand-Aware Beam Hopping and Power Allocation for Load Balancing in Digital Twin empowered LEO Satellite Networks

链接: https://arxiv.org/abs/2411.08896
作者: Ruili Zhao,Jun Cai,Jiangtao Luo,Junpeng Gao,Yongyi Ran
关键词-EN: technology offer extensive, utilizing beam hopping, offer extensive coverage, Low-Earth orbit, low latency
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Low-Earth orbit (LEO) satellites utilizing beam hopping (BH) technology offer extensive coverage, low latency, high bandwidth, and significant flexibility. However, the uneven geographical distribution and temporal variability of ground traffic demands, combined with the high mobility of LEO satellites, present significant challenges for efficient beam resource utilization. Traditional BH methods based on GEO satellites fail to address issues such as satellite interference, overlapping coverage, and mobility. This paper explores a Digital Twin (DT)-based collaborative resource allocation network for multiple LEO satellites with overlapping coverage areas. A two-tier optimization problem, focusing on load balancing and cell service fairness, is proposed to maximize throughput and minimize inter-cell service delay. The DT layer optimizes the allocation of overlapping coverage cells by designing BH patterns for each satellite, while the LEO layer optimizes power allocation for each selected service cell. At the DT layer, an Actor-Critic network is deployed on each agent, with a global critic network in the cloud center. The A3C algorithm is employed to optimize the DT layer. Concurrently, the LEO layer optimization is performed using a Multi-Agent Reinforcement Learning algorithm, where each beam functions as an independent agent. The simulation results show that this method reduces satellite load disparity by about 72.5% and decreases the average delay to 12ms. Additionally, our approach outperforms other benchmarks in terms of throughput, ensuring a better alignment between offered and requested data.
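At the core of the A3C training used in the DT layer is the advantage signal: the discounted return minus the critic's value baseline. A minimal, framework-free sketch follows; the rewards, values, and discount factor are made up, and the paper's actual two-tier, multi-agent scheme is far richer than this.

```python
def advantages(rewards, values, gamma=0.99):
    """Discounted returns minus the critic's value baseline -- the
    advantage signal that drives the actor update in A3C."""
    ret, out = 0.0, []
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret   # accumulate the discounted return
        out.append(ret - v)     # advantage = return - value baseline
    return out[::-1]

# Toy 3-step rollout (illustrative numbers only).
adv = advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.6])
```

Each beam agent would scale its policy gradient by these advantages, with the global critic in the cloud center supplying the `values`.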

[LG-57] Network scaling and scale-driven loss balancing for intelligent poroelastography

链接: https://arxiv.org/abs/2411.08886
作者: Yang Xu,Fatemeh Pourahmadian
关键词-EN: deep learning framework, full waveform data, deep learning, learning framework, framework is developed
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A deep learning framework is developed for multiscale characterization of poroelastic media from full waveform data, which is known as poroelastography. Special attention is paid to heterogeneous environments whose multiphase properties may drastically change across several scales. Described in space-frequency, the data takes the form of focal solid displacement and pore pressure fields in various neighborhoods furnished either by reconstruction from remote data or direct measurements depending on the application. The objective is to simultaneously recover the six hydromechanical properties germane to Biot equations and their spatial distribution in a robust and efficient manner. Two major challenges impede direct application of existing state-of-the-art techniques for this purpose: (i) the sought-for properties belong to vastly different and potentially uncertain scales, and (ii) the loss function is multi-objective and multi-scale (both in terms of its individual components and the total loss). To help bridge the gap, we propose the idea of network scaling where the neural property maps are constructed by unit shape functions composed into a scaling layer. In this model, the unknown network parameters (weights and biases) remain of O(1) during training. This forms the basis for explicit scaling of the loss components and their derivatives with respect to the network parameters. Thereby, we propose the physics-based dynamic scaling approach for adaptive loss balancing. The idea is first presented in a generic form for multi-physics and multi-scale PDE systems, and then applied through a set of numerical experiments to poroelastography. The results are presented along with reconstructions by way of gradient normalization (GradNorm) and Softmax adaptive weights (SoftAdapt) for loss balancing. A comparative analysis of the methods and corresponding results is provided.
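The loss-balancing problem the abstract tackles, components spanning several orders of magnitude, can be illustrated with a deliberately simple inverse-magnitude weighting rule. This is a stand-in heuristic for intuition only, not the paper's physics-based dynamic scaling, nor GradNorm or SoftAdapt themselves.

```python
import numpy as np

def balance_losses(losses):
    """Weight each loss term by its inverse magnitude so every objective
    contributes equally to the total -- a toy adaptive balancing rule."""
    losses = np.asarray(losses, dtype=float)
    weights = 1.0 / np.maximum(losses, 1e-12)  # guard against zeros
    weights /= weights.sum()                   # convex combination
    return weights, float((weights * losses).sum())

# Loss components spanning several scales, as is typical for
# multi-physics PDE residuals (illustrative values).
w, total = balance_losses([1e4, 1.0, 1e-3])
```

With this rule every weighted term `w[i] * losses[i]` is identical, so no single scale dominates the gradient; real schemes like GradNorm instead adapt the weights from gradient statistics during training.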

信息检索

[IR-0] MARM: Unlocking the Future of Recommendation Systems through Memory Augmentation and Scalable Complexity

链接: https://arxiv.org/abs/2411.09425
作者: Xiao Lv,Jiangxia Cao,Shijie Guan,Xiaoyou Zhou,Zhiguang Qi,Yaqiang Zang,Ming Li,Ben Wang,Kun Gai,Guorui Zhou
关键词-EN: language model designing, laws of NLP, past years, model parameters, amount of training
类目: Information Retrieval (cs.IR)
*备注: Work in progress

点击查看摘要

Abstract:Scaling laws have guided language model design in recent years; however, it is worth noting that the scaling laws of NLP cannot be directly applied to RecSys due to the following reasons: (1) The amount of training samples and model parameters is typically not the bottleneck for the model. Our recommendation system can generate over 50 billion user samples daily, and such a massive amount of training data can easily allow our model parameters to exceed 200 billion, surpassing many LLMs (about 100B). (2) To ensure the stability and robustness of the recommendation system, it is essential to control computational complexity FLOPs carefully. Considering the above differences with LLMs, we can draw a conclusion: for a RecSys model, compared to model parameters, the computational complexity FLOPs is a more expensive factor that requires careful control. In this paper, we propose our milestone work, MARM (Memory Augmented Recommendation Model), which successfully explores new cache scaling laws.

[IR-1] LLM-assisted Explicit and Implicit Multi-interest Learning Framework for Sequential Recommendation

链接: https://arxiv.org/abs/2411.09410
作者: Shutong Qiao,Chen Gao,Yong Li,Hongzhi Yin
关键词-EN: current recommender systems, behavioral data, user behavioral data, behavioral, Behavioral Interest Module
类目: Information Retrieval (cs.IR)
*备注: 10 pages

点击查看摘要

Abstract:Multi-interest modeling in current recommender systems (RS) is mainly based on user behavioral data, capturing user interest preferences from multiple dimensions. However, since behavioral data is implicit and often highly sparse, it is challenging to understand users’ complex and diverse interests. Recent studies have shown that the rich semantic information in text can effectively supplement the deficiencies of behavioral data. Despite this, it is still difficult for small models to directly extract semantic features associated with users’ deep interests. That is, how to effectively align semantics with behavioral information to form a more comprehensive and accurate understanding of user interests has become a critical research problem. To address this, we propose an LLM-assisted explicit and implicit multi-interest learning framework (named EIMF) to model user interests on two levels: behavior and semantics. The framework consists of two parts: the Implicit Behavioral Interest Module (IBIM) and the Explicit Semantic Interest Module (ESIM). The traditional multi-interest RS model in IBIM can learn users’ implicit behavioral interests from interactions with items. In ESIM, we first adopt a clustering algorithm to select typical samples and design a prompting strategy on the LLM to obtain explicit semantic interests. Furthermore, in the training phase, the semantic interests of typical samples can enhance the representation learning of behavioral interests based on multi-task learning on semantic prediction and modality alignment. Therefore, in the inference stage, accurate recommendations can be achieved with only the user’s behavioral data. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed EIMF framework, which effectively and efficiently combines small models with LLMs to improve the accuracy of multi-interest modeling.
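ESIM's "cluster, pick typical samples, then prompt an LLM" step can be sketched as follows. The embeddings, the value of k, the bare-bones k-means, and the prompt wording are all hypothetical placeholders, not the paper's actual algorithm or prompts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy user-behavior embeddings (hypothetical; the paper derives them
# from interaction histories with a conventional multi-interest model).
embeddings = rng.normal(size=(100, 8))

def select_typical(x, k=3, iters=10, seed=0):
    """Run a bare-bones k-means and return the index of the sample
    nearest to each centroid -- a minimal sketch of 'typical sample'
    selection."""
    r = np.random.default_rng(seed)
    centroids = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(x[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    d = np.linalg.norm(x[:, None] - centroids[None], axis=2)
    return d.argmin(axis=0)  # nearest sample index per centroid

typical = select_typical(embeddings)
# A hypothetical prompt for the LLM step (wording is made up):
prompt = f"Describe the interests of the users behind samples {typical.tolist()}."
```

Only these few typical samples are sent to the LLM, which is what keeps the semantic-interest stage affordable relative to prompting for every user.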

[IR-2] Information Need in Metaverse Recordings - A Field Study

链接: https://arxiv.org/abs/2411.09053
作者: Patrick Steinert,Jan Mischkies,Stefan Wagenpfeil,Ingo Frommholz,Matthias L. Hemmje
关键词-EN: underexplored media type, Metaverse Recordings, Multimedia Information Retrieval, MVR retrieval, represent an emerging
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注: 12 pages, 3 Figures, 8 Tables

点击查看摘要

Abstract:Metaverse Recordings (MVRs) represent an emerging and underexplored media type within the field of Multimedia Information Retrieval (MMIR). This paper presents findings from a field study aimed at understanding users’ information needs and search behaviors specific to MVR retrieval. By conducting and analyzing expert interviews, the study identifies application scenarios and highlights challenges in retrieving multimedia content from the metaverse. The results reveal existing application scenarios of MVRs and confirm the relevance of capturing time-series data from the graphical rendering process and related input-output devices, which are also highly relevant to user needs. Furthermore, the study provides a foundation for developing retrieval systems tailored to MVRs by defining use cases, user stereotypes, and specific requirements for MVR retrieval systems. The findings contribute to a better understanding of information search behaviors in MVR retrieval and pave the way for future research and system design in this field.

附件下载

点击下载今日全部论文列表