This post presents the latest paper list retrieved from arXiv.org on 2024-09-19. The list is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a schedule, please leave your email address in the comments.

Note: Paper data is fetched from arXiv.org daily and updated automatically around 10:30 each morning.

Tip: If you would like to receive the daily paper list by email, leave your email address in the comments; emails are likewise sent automatically around 10:30 each day.

Table of Contents

Overview (2024-09-19)

A total of 432 papers were updated today, including:

  • Natural Language Processing: 54 (Computation and Language (cs.CL))
  • Artificial Intelligence: 104 (Artificial Intelligence (cs.AI))
  • Computer Vision: 99 (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 121 (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Gender Representation and Bias in Indian Civil Service Mock Interviews

Link: https://arxiv.org/abs/2409.12194
Authors: Somonnoy Banerjee, Sujan Dutta, Soumyajit Datta, Ashiqur R. KhudaBukhsh
Keywords: key contributions, paper makes, makes three key, Indian civil service, gender bias
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract: This paper makes three key contributions. First, via a substantial corpus of 51,278 interview questions sourced from 888 YouTube videos of mock interviews of Indian civil service candidates, we demonstrate stark gender bias in the broad nature of questions asked to male and female candidates. Second, our experiments with large language models show a strong presence of gender bias in explanations provided by the LLMs on the gender inference task. Finally, we present a novel dataset of 51,278 interview questions that can inform future social science studies.

[NLP-1] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Link: https://arxiv.org/abs/2409.12191
Authors: Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
Keywords: Naive Dynamic Resolution, conventional predetermined-resolution approach, previous Qwen-VL models, Dynamic Resolution mechanism, Rotary Position Embedding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code is available at this https URL. arXiv admin note: text overlap with arXiv:2408.15262 by other authors

Abstract: We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at this https URL.
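The Naive Dynamic Resolution idea above, where the number of visual tokens scales with image size rather than being fixed, can be sketched as follows. The patch size of 14 and the 2x2 token merge are illustrative assumptions, not necessarily the paper's exact configuration:

```python
import math

def visual_token_count(height, width, patch=14, merge=2):
    """Sketch of dynamic-resolution tokenization: an image of any size is
    split into fixed-size patches, and neighboring patch tokens are merged,
    so the token count grows with resolution instead of being constant."""
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    # merge each 2x2 neighborhood of patch tokens into one visual token
    return math.ceil(rows / merge) * math.ceil(cols / merge)
```

Under these assumptions a 224x224 image yields 64 tokens while a 448x224 image yields 128, so higher-resolution inputs simply consume proportionally more tokens.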

[NLP-2] Qwen2.5-Coder Technical Report

Link: https://arxiv.org/abs/2409.12186
Authors: Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, Junyang Lin
Keywords: significant upgrade, series, Abstract, report, predecessor
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility. The model has been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will not only push the boundaries of research in code intelligence but also, through its permissive licensing, encourage broader adoption by developers in real-world applications.

[NLP-3] To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Link: https://arxiv.org/abs/2409.12183
Authors: Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, Greg Durrett
Keywords: large language models, eliciting reasoning capabilities, facto method, method for eliciting, capabilities from large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
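The selective-application takeaway can be sketched as a simple router that spends CoT tokens only where they are likely to pay off. The equals-sign/digit heuristic is a loose stand-in for the paper's finding, and `direct_fn`/`cot_fn` are hypothetical model calls:

```python
def needs_cot(question: str) -> bool:
    """Heuristic inspired by the paper's observation: CoT mainly helps
    when symbolic operations are involved, signaled here by an equals
    sign or digits in the question."""
    return "=" in question or any(ch.isdigit() for ch in question)

def answer(question, direct_fn, cot_fn):
    """Route to the expensive CoT path only when the heuristic fires;
    otherwise generate the answer directly to save inference cost."""
    return cot_fn(question) if needs_cot(question) else direct_fn(question)
```

A production router would of course use a learned classifier or task metadata rather than a string heuristic; the point is that CoT can be gated per query.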

[NLP-4] A Controlled Study on Long Context Extension and Generalization in LLMs

Link: https://arxiv.org/abs/2409.12181
Authors: Yi Lu, Jing Nathan Yan, Songlin Yang, Justin T. Chiu, Siyu Ren, Fei Yuan, Wenting Zhao, Zhiyong Wu, Alexander M. Rush
Keywords: Broad textual understanding, in-context learning require, learning require language, utilize full document, full document contexts
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
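Perplexity, reaffirmed above as a general-purpose indicator even at long context lengths, is just the exponentiated average negative log-likelihood per token. A minimal sketch, assuming per-token log-probabilities are already available from the model:

```python
import math

def perplexity(token_logprobs):
    """Perplexity over a sequence: exp of the mean negative
    log-likelihood per token. Lower is better; a uniform 50% chance
    per token gives a perplexity of exactly 2."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)
```

For long-context evaluation the same quantity is typically computed over the suffix of a long document, so it reflects how well the model exploits the preceding context.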

[NLP-5] Finetuning Language Models to Emit Linguistic Expressions of Uncertainty

Link: https://arxiv.org/abs/2409.12180
Authors: Arslan Chaudhry, Sridhar Thiagarajan, Dilan Gorur
Keywords: Large language models, Large language, decision-making tasks, increasingly employed, employed in information-seeking
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Large language models (LLMs) are increasingly employed in information-seeking and decision-making tasks. Despite their broad utility, LLMs tend to generate information that conflicts with real-world facts, and their persuasive style can make these inaccuracies appear confident and convincing. As a result, end-users struggle to consistently align the confidence expressed by LLMs with the accuracy of their predictions, often leading to either blind trust in all outputs or a complete disregard for their reliability. In this work, we explore supervised finetuning on uncertainty-augmented predictions as a method to develop models that produce linguistic expressions of uncertainty. Specifically, we measure the calibration of pre-trained models and then fine-tune language models to generate calibrated linguistic expressions of uncertainty. Through experiments on various question-answering datasets, we demonstrate that LLMs are well-calibrated in assessing their predictions, and supervised finetuning based on the model's own confidence leads to well-calibrated expressions of uncertainty, particularly for single-claim answers.
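Measuring calibration, as done for the pre-trained models above, is commonly framed as expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's average confidence to its empirical accuracy. This sketch assumes that standard framing rather than the paper's exact metric:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE sketch: bin predictions by confidence, then take the
    sample-weighted average gap between mean confidence and accuracy
    within each bin. 0.0 means perfectly calibrated."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # clamp c == 1.0 into last bin
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece
```

A model that says "80% confident" and is right 80% of the time contributes zero error; the finetuning objective in the paper aims to make verbal hedges track this quantity.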

[NLP-6] You Only Read Once (YORO): Learning to Internalize Database Knowledge for Text-to-SQL

Link: https://arxiv.org/abs/2409.12172
Authors: Hideo Kobayashi, Wuwei Lan, Peng Shi, Shuaichen Chang, Jiang Guo, Henghui Zhu, Zhiguo Wang, Patrick Ng
Keywords: recent solutions repeatedly, solutions repeatedly encode, unnecessary high inference, high inference cost, overlooking crucial database
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: While significant progress has been made on the text-to-SQL task, recent solutions repeatedly encode the same database schema for every question, resulting in unnecessary high inference cost and often overlooking crucial database knowledge. To address these issues, we propose You Only Read Once (YORO), a novel paradigm that directly internalizes database knowledge into the parametric knowledge of a text-to-SQL model during training and eliminates the need for schema encoding during inference. YORO significantly reduces the input token length by 66%-98%. Despite its shorter inputs, our empirical results demonstrate YORO's competitive performances with traditional systems on three benchmarks as well as its significant outperformance on large databases. Furthermore, YORO excels in handling questions with challenging value retrievals such as abbreviation.

[NLP-7] MAgICoRe: Multi-Agent Iterative Coarse-to-Fine Refinement for Reasoning

Link: https://arxiv.org/abs/2409.12147
Authors: Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal
Keywords: Large Language Models', Large Language, Language Models', test-time aggregation strategies, generating multiple samples
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, code: this https URL

Abstract: Large Language Models' (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among generated samples. While these improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. However, refinement introduces 3 key challenges: (1) Excessive refinement: Uniformly refining all instances can over-correct and reduce the overall performance. (2) Inability to localize and address errors: LLMs have a limited ability to self-correct and struggle to identify and correct their own mistakes. (3) Insufficient refinement: Deciding how many iterations of refinement are needed is non-trivial, and stopping too soon could leave errors unaddressed. To tackle these issues, we propose MAgICoRe, which avoids excessive refinement by categorizing problem difficulty as easy or hard, solving easy problems with coarse-grained aggregation and hard ones with fine-grained and iterative multi-agent refinement. To improve error localization, we incorporate external step-wise reward model (RM) scores. Moreover, to ensure effective refinement, we employ a multi-agent loop with three agents: Solver, Reviewer (which generates targeted feedback based on step-wise RM scores), and the Refiner (which incorporates feedback). To ensure sufficient refinement, we re-evaluate updated solutions, iteratively initiating further rounds of refinement. We evaluate MAgICoRe on Llama-3-8B and GPT-3.5 and show its effectiveness across 5 math datasets. Even one iteration of MAgICoRe beats Self-Consistency by 3.4%, Best-of-k by 3.2%, and Self-Refine by 4.0% while using less than half the samples. Unlike iterative refinement with baselines, MAgICoRe continues to improve with more iterations. Finally, our ablations highlight the importance of MAgICoRe's RMs and multi-agent communication.
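The three-agent loop and the easy/hard split can be sketched abstractly. Here `solve`, `score`, `review`, and `refine` are injected stand-ins for the Solver, the step-wise reward model, the Reviewer, and the Refiner, and the threshold is an assumed difficulty cutoff, not the paper's actual routing rule:

```python
def magicore_style_refine(problem, solve, score, review, refine,
                          easy_threshold=0.8, max_iters=3):
    """Sketch of coarse-to-fine refinement: problems whose first solution
    already scores well are treated as easy and returned as-is; hard ones
    enter a Solver -> Reviewer -> Refiner loop guided by RM scores, with a
    cap on iterations to avoid excessive refinement."""
    solution = solve(problem)
    if score(solution) >= easy_threshold:   # easy: no refinement needed
        return solution
    for _ in range(max_iters):              # hard: iterate with feedback
        feedback = review(solution)         # targeted, step-wise feedback
        solution = refine(solution, feedback)
        if score(solution) >= easy_threshold:
            break
    return solution
```

The real system aggregates multiple samples per step and localizes errors at the step level; this skeleton only captures the control flow that addresses the three challenges (excessive, misdirected, and insufficient refinement).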

[NLP-8] GRIN: GRadient-INformed MoE

Link: https://arxiv.org/abs/2409.12136
Authors: Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
Keywords: selectively activating, expert routing, scale more effectively, small subset, sparse computation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 58 pages

Abstract: Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
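Top-2 expert routing of the kind GRIN trains through can be sketched in plain Python: softmax over router logits, keep the two highest-scoring experts, and renormalize their weights. The sparse-gradient estimation that makes this routing trainable is the paper's actual contribution and is not shown here:

```python
import math

def top2_route(logits):
    """Sketch of sparse expert routing: softmax the router logits, keep
    only the top-2 experts, and renormalize so their weights sum to 1.
    All other experts receive zero weight, i.e. are never activated."""
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top2 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
    z = sum(probs[i] for i in top2)
    return {i: probs[i] / z for i in top2}
```

With 16 experts and top-2 routing, each token activates only 2/16 of the expert parameters, which is how a 16×3.8B model runs with just 6.6B activated parameters.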

[NLP-9] BERT-VBD: Vietnamese Multi-Document Summarization Framework

Link: https://arxiv.org/abs/2409.12134
Authors: Tuan-Cuong Vuong, Trang Mai Xuan, Thien Van Luong
Keywords: abstractive summarization, tackling the challenge, challenge of Multi-Document, abstractive summarization methods, abstractive summarization techniques
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract: In tackling the challenge of Multi-Document Summarization (MDS), numerous methods have been proposed, spanning both extractive and abstractive summarization techniques. However, each approach has its own limitations, making it less effective to rely solely on either one. An emerging and promising strategy involves a synergistic fusion of extractive and abstractive summarization methods. Despite the plethora of studies in this domain, research on the combined methodology remains scarce, particularly in the context of Vietnamese language processing. This paper presents a novel Vietnamese MDS framework leveraging a two-component pipeline architecture that integrates extractive and abstractive techniques. The first component employs an extractive approach to identify key sentences within each document. This is achieved by a modification of the pre-trained BERT network, which derives semantically meaningful phrase embeddings using siamese and triplet network structures. The second component utilizes the VBD-LLaMA2-7B-50b model for abstractive summarization, ultimately generating the final summary document. Our proposed framework demonstrates a positive performance, attaining ROUGE-2 scores of 39.6% on the VN-MDS dataset and outperforming the state-of-the-art baselines.

[NLP-10] Linguini: A benchmark for language-agnostic linguistic reasoning

Link: https://arxiv.org/abs/2409.12126
Authors: Eduardo Sánchez, Belen Alastruey, Christophe Ropers, Pontus Stenetorp, Mikel Artetxe, Marta R. Costa-jussà
Keywords: linguistic reasoning skills, pre-existing language-specific knowledge, International Linguistic Olympiad, model linguistic reasoning, Linguistic Olympiad corpus
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.

[NLP-11] Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

Link: https://arxiv.org/abs/2409.12122
Authors: An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, Zhenru Zhang
Keywords: math-specific large language, large language models, math-specific large, SFT model, SFT
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it's possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in the Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.
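RM-guided inference as described in step (3) reduces, in its simplest form, to best-of-n sampling: draw several candidate solutions and keep the one the reward model scores highest. Here `generate` and `reward` are hypothetical stand-ins for the policy model and the reward model:

```python
def best_of_n(question, generate, reward, n=8):
    """Best-of-n sketch of RM-guided sampling: generate n candidates
    (varying the seed for diversity) and return the candidate with the
    highest reward-model score."""
    candidates = [generate(question, seed=i) for i in range(n)]
    return max(candidates, key=reward)
```

More sophisticated variants guide decoding step by step with the RM rather than re-ranking whole completions, but the re-ranking form is the common baseline.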

[NLP-12] Measuring Human and AI Values based on Generative Psychometrics with Large Language Models

Link: https://arxiv.org/abs/2409.12106
Authors: Haoran Ye, Yuhang Xie, Yuanyi Ren, Hanjun Fang, Xin Zhang, Guojie Song
Keywords: long-standing interdisciplinary inquiry, measurement, LLM, Human, GPV
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Human values and their measurement are long-standing interdisciplinary inquiry. Recent advances in AI have sparked renewed interest in this area, with large language models (LLMs) emerging as both tools and subjects of value measurement. This work introduces Generative Psychometrics for Values (GPV), an LLM-based, data-driven value measurement paradigm, theoretically grounded in text-revealed selective perceptions. We begin by fine-tuning an LLM for accurate perception-level value measurement and verifying the capability of LLMs to parse texts into perceptions, forming the core of the GPV pipeline. Applying GPV to human-authored blogs, we demonstrate its stability, validity, and superiority over prior psychological tools. Then, extending GPV to LLM value measurement, we advance the current art with 1) a psychometric methodology that measures LLM values based on their scalable and free-form outputs, enabling context-specific measurement; 2) a comparative analysis of measurement paradigms, indicating response biases of prior methods; and 3) an attempt to bridge LLM values and their safety, revealing the predictive power of different value systems and the impacts of various values on LLM safety. Through interdisciplinary efforts, we aim to leverage AI for next-generation psychometrics and psychometrics for value-aligned AI.

[NLP-13] Skill matching at scale: freelancer-project alignment for efficient multilingual candidate retrieval

Link: https://arxiv.org/abs/2409.12097
Authors: Warren Jouanneau, Marc Palyart, Emma Jouffroy
Keywords: Finding the perfect, perform at scale, perfect match, job proposal, easy task
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Abstract: Finding the perfect match between a job proposal and a set of freelancers is not an easy task to perform at scale, especially in multiple languages. In this paper, we propose a novel neural retriever architecture that tackles this problem in a multilingual setting. Our method encodes project descriptions and freelancer profiles by leveraging pre-trained multilingual language models. The latter are used as backbone for a custom transformer architecture that aims to keep the structure of the profiles and project. This model is trained with a contrastive loss on historical data. Thanks to several experiments, we show that this approach effectively captures skill matching similarity and facilitates efficient matching, outperforming traditional methods.
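The retrieval step implied above (encode the project and every profile, then match) can be sketched with cosine similarity over embeddings. The vectors below are toy values standing in for outputs of the paper's multilingual encoder:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_freelancers(project_vec, freelancer_vecs):
    """Rank candidate profiles by similarity to the project embedding,
    the standard inference step for a contrastively trained retriever."""
    scored = sorted(freelancer_vecs.items(),
                    key=lambda kv: cosine(project_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored]
```

The contrastive training objective pulls matching project/profile pairs together in this space and pushes non-matches apart, which is what makes the cosine ranking meaningful at inference time.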

[NLP-14] PARAPHRASUS : A Comprehensive Benchmark for Evaluating Paraphrase Detection Models

Link: https://arxiv.org/abs/2409.12060
Authors: Andrianos Michail, Simon Clematide, Juri Opitz
Keywords: challenge in NLP, task of determining, NLP, paraphrase, paraphrase detection models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: The task of determining whether two texts are paraphrases has long been a challenge in NLP. However, the prevailing notion of paraphrase is often quite simplistic, offering only a limited view of the vast spectrum of paraphrase phenomena. Indeed, we find that evaluating models in a paraphrase dataset can leave uncertainty about their true semantic understanding. To alleviate this, we release paraphrasus, a benchmark designed for multi-dimensional assessment of paraphrase detection models and finer model selection. We find that paraphrase detection models under a fine-grained evaluation lens exhibit trade-offs that cannot be captured through a single classification dataset.

[NLP-15] Dual-Layer Training and Decoding of Large Language Model with Simultaneously Thinking and Speaking

Link: https://arxiv.org/abs/2409.12059
Authors: Ningyuan Xi, Xiaoyu Wang, Yetao Wu, Teng Chen, Qingqing Gu, Jinxian Qu, Zhonglin Jiang, Yong Chen, Luo Ji
Keywords: Large Language Model, generate human expressions, Large Language, human expressions, Language Model
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures

点击查看摘要

Abstract:Large Language Model can reasonably understand and generate human expressions but may lack of thorough thinking and reasoning mechanisms. Recently there have been several studies which enhance the thinking ability of language models but most of them are not data-driven or training-based. In this paper, we are motivated by the cognitive mechanism in the natural world, and design a novel model architecture called TaS which allows it to first consider the thoughts and then express the response based upon the query. We design several pipelines to annotate or generate the thought contents from prompt-response samples, then add language heads in a middle layer which behaves as the thinking layer. We train the language model by the thoughts-augmented data and successfully let the thinking layer automatically generate reasonable thoughts and finally output more reasonable responses. Both qualitative examples and quantitative results validate the effectiveness and performance of TaS. Our code is available at https://anonymous.4open.science/r/TadE.
摘要:大语言模型能够较好地理解和生成人类的表达,但可能缺乏深入的思考和推理机制。近年来已有一些提高语言模型思维能力的研究,但大多数并非数据驱动或基于训练的。本文受自然界认知机制的启发,设计了一种名为TaS的新颖模型架构,使模型先进行思考,再根据查询表达响应。我们设计了若干流程,从提示-响应样本中标注或生成思维内容,然后在中间层添加语言头,使其充当思考层。我们用思维增强的数据训练语言模型,成功让思考层自动生成合理的思维,并最终输出更合理的回答。定性示例和定量结果均验证了TaS的有效性和性能。我们的代码可在 https://anonymous.4open.science/r/TadE 获取。

[NLP-16] Using Large Language Models to Generate Clinical Trial Tables and Figures
[NLP-16] 使用大型语言模型生成临床试验表格和图表

链接: https://arxiv.org/abs/2409.12046
作者: Yumeng Yang,Peter Krusche,Kristyn Pantoja,Cheng Shi,Ethan Ludmir,Kirk Roberts,Gen Zhu
关键词-EN: summarizing clinical trial, clinical trial data, clinical trial, essential tools, tools for summarizing
关键词-ZH: 总结临床试验、临床试验数据、临床试验、基本工具、总结工具
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tables, figures, and listings (TFLs) are essential tools for summarizing clinical trial data. Creation of TFLs for reporting activities is often a time-consuming task encountered routinely during the execution of clinical trials. This study explored the use of large language models (LLMs) to automate the generation of TFLs through prompt engineering and few-shot transfer learning. Using public clinical trial data in ADaM format, our results demonstrated that LLMs can efficiently generate TFLs with prompt instructions, showcasing their potential in this domain. Furthermore, we developed a conversational agent named Clinical Trial TFL Generation Agent: An app that matches user queries to predefined prompts that produce customized programs to generate specific predefined TFLs.
摘要:表格、图表和列表(TFL)是总结临床试验数据的重要工具。创建用于报告活动的TFL通常是临床试验执行过程中经常遇到的一项耗时任务。本研究探索了使用大型语言模型(LLM)通过提示工程和少样本迁移学习来自动生成TFL。使用ADaM格式的公开临床试验数据,我们的结果表明LLM可以根据提示指令高效生成TFL,展示了其在该领域的潜力。此外,我们开发了一个名为Clinical Trial TFL Generation Agent的对话代理:一个将用户查询匹配到预定义提示的应用程序,这些提示会生成定制程序以产生特定的预定义TFL。
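The query-to-prompt matching step described in the abstract can be sketched with simple fuzzy string matching from the standard library. The template names and prompt texts below are hypothetical placeholders, not the paper's actual prompt library:

```python
from difflib import get_close_matches

# Hypothetical predefined TFL templates (the paper's actual prompts are not public here).
PROMPTS = {
    "demographics table": "Generate Table 1: baseline demographics from the ADaM ADSL dataset ...",
    "adverse events table": "Generate the treatment-emergent adverse events summary table ...",
    "kaplan-meier figure": "Generate a Kaplan-Meier survival figure from the ADaM ADTTE dataset ...",
}

def route_query(query: str) -> str:
    """Match a free-text user query to the closest predefined TFL prompt."""
    hits = get_close_matches(query.lower(), PROMPTS.keys(), n=1, cutoff=0.3)
    if not hits:
        raise ValueError(f"no TFL template matches: {query!r}")
    return PROMPTS[hits[0]]

print(route_query("please make the demographics table"))
```

A production router would likely use embeddings or an LLM classifier rather than edit-distance similarity, but the control flow (query in, fixed template prompt out) is the same.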

[NLP-17] ASR Benchmarking: Need for a More Representative Conversational Dataset
[NLP-17] ASR基准测试:需要更具代表性的对话数据集

链接: https://arxiv.org/abs/2409.12042
作者: Gaurav Maheshwari,Dmitry Ivanov,Théo Johannet,Kevin El Haddad
关键词-EN: Automatic Speech Recognition, LibriSpeech and Fleurs, Automatic Speech, Speech Recognition, achieved remarkable performance
关键词-ZH: 自动语音识别,LibriSpeech和Fleurs,自动语音,语音识别,取得了非凡的性能
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversation between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
摘要:自动语音识别(ASR)系统在LibriSpeech和Fleurs等广泛使用的基准测试上取得了出色的性能。然而,这些基准并不能充分反映现实世界对话环境的复杂性,其中言语通常是非结构化的,并且包含停顿、打断和不同口音等不流利现象。在这项研究中,我们引入了一个源自TalkBank的多语言对话数据集,由成人之间的非结构化电话对话组成。我们的结果显示,在对话环境中进行测试时,各种最先进的ASR模型的性能均显著下降。此外,我们观察到词错误率与语音不流利现象之间的相关性,凸显了对更现实的对话式ASR基准的迫切需求。
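The Word Error Rate metric referenced in the abstract is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by reference length. A minimal self-contained sketch (whitespace tokenization is a simplifying assumption):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# A disfluency ("um") plus a dropped word yield 2 edits over 6 reference words.
print(wer("the cat sat on the mat", "the cat um sat on mat"))
```

Disfluencies inflate WER exactly this way, which is the correlation the study measures.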

[NLP-18] Sampling Latent Material-Property Information From LLM-Derived Embedding Representations
[NLP-18] 从LLM派生的嵌入表示中采样潜在材料属性信息

链接: https://arxiv.org/abs/2409.11971
作者: Luke P. J. Gilligan,Matteo Cobelli,Hasan M. Sayeed,Taylor D. Sparks,Stefano Sanvito
关键词-EN: large language models, capturing latent information, Vector embeddings derived, language models, show promise
关键词-ZH: 大型语言模型,捕获潜在信息,派生的载体嵌入,语言模型,显示前景
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Vector embeddings derived from large language models (LLMs) show promise in capturing latent information from the literature. Interestingly, these can be integrated into material embeddings, potentially useful for data-driven predictions of materials properties. We investigate the extent to which LLM-derived vectors capture the desired information and their potential to provide insights into material properties without additional training. Our findings indicate that, although LLMs can be used to generate representations reflecting certain property information, extracting the embeddings requires identifying the optimal contextual clues and appropriate comparators. Despite this restriction, it appears that LLMs still have the potential to be useful in generating meaningful materials-science representations.
摘要:源自大型语言模型(LLM)的向量嵌入在从文献中捕获潜在信息方面表现出了潜力。有趣的是,这些向量可以集成到材料嵌入中,对于材料性质的数据驱动预测可能有用。我们研究了LLM衍生的向量捕获所需信息的程度,以及它们在无需额外训练的情况下提供材料性质见解的潜力。我们的研究结果表明,尽管LLM可用于生成反映某些属性信息的表示,但提取嵌入需要识别最佳的上下文线索和适当的比较对象。尽管存在这种限制,LLM似乎仍有潜力用于生成有意义的材料科学表示。

[NLP-19] Efficacy of Synthetic Data as a Benchmark
[NLP-19] 合成数据作为基准的有效性

链接: https://arxiv.org/abs/2409.11968
作者: Gaurav Maheshwari,Dmitry Ivanov,Kevin El Haddad
关键词-EN: Large language models, few-shot learning settings, Large language, learning settings, including the generation
关键词-ZH: 大型语言模型,少数镜头学习设置,大型语言,学习设置,包括生成
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have enabled a range of applications in zero-shot and few-shot learning settings, including the generation of synthetic datasets for training and testing. However, to reliably use these synthetic datasets, it is essential to understand how representative they are of real-world data. We investigate this by assessing the effectiveness of generating synthetic data through LLM and using it as a benchmark for various NLP tasks. Our experiments across six datasets, and three different tasks, show that while synthetic data can effectively capture performance of various methods for simpler tasks, such as intent classification, it falls short for more complex tasks like named entity recognition. Additionally, we propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used to both generate benchmarking data and to perform the tasks. We find that smaller LLMs exhibit biases towards their own generated data, whereas larger models do not. Overall, our findings suggest that the effectiveness of synthetic data as a benchmark varies depending on the task, and that practitioners should rely on data generated from multiple larger models whenever possible.
摘要:大型语言模型(LLM)已在零样本和少样本学习场景中支撑了一系列应用,包括生成用于训练和测试的合成数据集。然而,要可靠地使用这些合成数据集,就必须了解它们对真实世界数据的代表性。我们通过评估用LLM生成合成数据并将其用作各种NLP任务基准的有效性来研究这一问题。我们在六个数据集和三个不同任务上的实验表明,虽然合成数据可以有效反映各种方法在较简单任务(如意图分类)上的性能,但在命名实体识别等更复杂的任务上则有所不足。此外,我们提出了一个名为"偏差因子"的新度量,用于评估当同一个LLM既用于生成基准数据又用于执行任务时所引入的偏差。我们发现,较小的LLM对自己生成的数据表现出偏向,而较大的模型则没有。总体而言,我们的发现表明,合成数据作为基准的有效性因任务而异,从业者应尽可能依赖由多个较大模型生成的数据。
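The abstract does not give the bias factor's exact formula; one plausible sketch, comparing a model's accuracy on its own synthetic data against data generated by other models, is:

```python
def bias_factor(scores: dict, model: str) -> float:
    """Hypothetical operationalization of a 'bias factor' (the paper's exact
    definition may differ). scores[generator][evaluator] holds the evaluator
    model's task accuracy on data produced by `generator`. A positive value
    means the model does disproportionately well on its own synthetic data."""
    own = scores[model][model]
    others = [gen_scores[model] for gen, gen_scores in scores.items() if gen != model]
    return own / (sum(others) / len(others)) - 1.0

# Toy numbers echoing the abstract's finding: the smaller model favors its own data.
scores = {
    "small-llm": {"small-llm": 0.90, "large-llm": 0.70},
    "large-llm": {"small-llm": 0.60, "large-llm": 0.72},
}
print(bias_factor(scores, "small-llm"))  # 0.90 / 0.60 - 1 ≈ 0.5
print(bias_factor(scores, "large-llm"))  # 0.72 / 0.70 - 1 ≈ 0.03
```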

[NLP-20] LLMs in Education: Novel Perspectives, Challenges, and Opportunities COLING2025
[NLP-20] 教育中的大语言模型:新视角、挑战与机遇

链接: https://arxiv.org/abs/2409.11917
作者: Bashar Alhafni,Sowmya Vajjala,Stefano Bannò,Kaushal Kumar Maurya,Ekaterina Kochmar
关键词-EN: large language models, language models, interest today, offer for teaching, educational applications
关键词-ZH: 大型语言模型、语言模型、今天的兴趣、教学、教育应用
类目: Computation and Language (cs.CL)
备注: COLING 2025 Tutorial

点击查看摘要

Abstract:The role of large language models (LLMs) in education is an increasing area of interest today, considering the new opportunities they offer for teaching, learning, and assessment. This cutting-edge tutorial provides an overview of the educational applications of NLP and the impact that the recent advances in LLMs have had on this field. We will discuss the key challenges and opportunities presented by LLMs, grounding them in the context of four major educational applications: reading, writing, and speaking skills, and intelligent tutoring systems (ITS). This COLING 2025 tutorial is designed for researchers and practitioners interested in the educational applications of NLP and the role LLMs have to play in this area. It is the first of its kind to address this timely topic.
摘要:考虑到大型语言模型(LLM)为教学、学习和评估提供的新机会,LLM在教育中的作用如今日益受到关注。本前沿教程概述了NLP的教育应用,以及LLM的最新进展对该领域的影响。我们将讨论LLM带来的关键挑战和机遇,并将其置于四类主要教育应用的背景下:阅读、写作和口语技能,以及智能辅导系统(ITS)。本COLING 2025教程面向对NLP的教育应用以及LLM在该领域所扮演角色感兴趣的研究人员和从业者。这是首个针对这一热点话题的同类教程。

[NLP-21] LLMs Persona-Plug = Personalized LLMs
[NLP-21] LLM Persona-Plug =个性化LLM

链接: https://arxiv.org/abs/2409.11901
作者: Jiongnan Liu,Yutao Zhu,Shuting Wang,Xiaochi Wei,Erxue Min,Yu Lu,Shuaiqiang Wang,Dawei Yin,Zhicheng Dou
关键词-EN: prefer diverse outputs, diverse outputs based, plays a critical, critical role, role in numerous
关键词-ZH: 更喜欢多样化的输出,基于多样化的输出,在许多方面发挥着至关重要的作用
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their individual interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user’s relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user’s overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, \ours. It constructs a user-specific embedding for each individual by modeling all her historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their own parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.
摘要:个性化在许多语言任务和应用中起着至关重要的作用,因为具有相同需求的用户可能会根据各自的兴趣偏好不同的输出。这催生了各种个性化方法,旨在调整大型语言模型(LLM)以生成符合用户偏好的定制输出。其中一些方法涉及为每个用户微调一个专属的个性化LLM,这对于大规模应用来说成本过高。替代方法通过检索用户相关的历史文本作为示例,以即插即用的方式引入个性化信息。然而,这种基于检索的策略可能会打破用户历史的连续性,无法捕获用户的整体风格和模式,从而导致次优的性能。为了应对这些挑战,我们提出了一种新颖的个性化LLM模型\ours。它通过一个轻量级的插件式用户嵌入模块对用户的全部历史上下文进行建模,为每个人构造特定于用户的嵌入。通过将该嵌入附加到任务输入,LLM可以更好地理解和捕获用户的习惯和偏好,从而在不调整自身参数的情况下生成更个性化的输出。在语言模型个性化(LaMP)基准的各类任务上的大量实验表明,该模型显著优于现有的个性化LLM方法。
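Conceptually, the plug-in described above pools a user's historical context into a single embedding and attaches it to the task input, leaving the LLM's own weights untouched. A dependency-free sketch (mean pooling is an assumption for illustration; the paper's embedder is a learned module):

```python
def personalize_input(token_embs, history_embs):
    """Prepend a pooled persona vector to the task's token embeddings.

    token_embs:   list of d-dimensional token embedding vectors (the task input)
    history_embs: list of d-dimensional embeddings of the user's historical texts
    """
    d = len(history_embs[0])
    n = len(history_embs)
    # Pool all historical contexts into one user-specific vector.
    persona = [sum(h[i] for h in history_embs) / n for i in range(d)]
    # The frozen LLM consumes [persona] + tokens; only the embedder would be trained.
    return [persona] + list(token_embs)

tokens = [[0.0, 1.0], [1.0, 0.0]]
history = [[2.0, 2.0], [0.0, 0.0]]
print(personalize_input(tokens, history))  # [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```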

[NLP-22] DocMamba: Efficient Document Pre-training with State Space Model
[NLP-22] DocMamba:使用状态空间模型的高效文档预训练

链接: https://arxiv.org/abs/2409.11887
作者: Pengfei Hu,Zhenrong Zhang,Jiefeng Ma,Shuhang Liu,Jun Du,Jianshu Zhang
关键词-EN: attracted increasing attention, visually-rich document understanding, recent years, increasing attention, understanding has attracted
关键词-ZH: 越来越受到关注,视觉丰富的文档理解,近年来,越来越受到关注,理解已经吸引
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism’s quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SORIE, while significantly improving speed and reducing memory usage. Notably, experiments on the HRDoc confirm DocMamba’s potential for length extrapolation. The code will be available online.
摘要:近年来,视觉丰富的文档理解受到越来越多的关注。基于Transformer的预训练模型已成为主流方法,在该领域带来了显著的性能提升。然而,自注意力机制的二次方计算复杂度阻碍了其处理长文档的效率和能力。本文提出了一种基于状态空间模型的新框架DocMamba。它旨在将计算复杂度降低到线性,同时保持全局建模能力。为了进一步提高其在文档处理中的有效性,我们引入了分段优先双向扫描(SFBS)来捕获连续的语义信息。实验结果表明,DocMamba在FUNSD、CORD和SORIE等下游数据集上取得了最新的结果,同时显著提高了速度并减少了内存使用。值得注意的是,在HRDoc上的实验证实了DocMamba进行长度外推的潜力。代码将在线提供。

[NLP-23] Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation
[NLP-23] 检索、注释、评估、重复:利用多模态LLM进行大规模产品检索评估

链接: https://arxiv.org/abs/2409.11860
作者: Kasra Hosseini,Thomas Kober,Josip Krapac,Roland Vollgraf,Weiwei Cheng,Ana Peleteiro Ramallo
关键词-EN: Evaluating production-level retrieval, well-trained human annotators, challenging task due, Large Language Models, production-level retrieval systems
关键词-ZH: 评估生产级检索、训练有素的人类注释者、具有挑战性的任务、大型语言模型、生产级检索系统
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: 13 pages, 5 figures, 4 Tables

点击查看摘要

Abstract:Evaluating production-level retrieval systems at scale is a crucial yet challenging task due to the limited availability of a large pool of well-trained human annotators. Large Language Models (LLMs) have the potential to address this scaling issue and offer a viable alternative to humans for the bulk of annotation tasks. In this paper, we propose a framework for assessing the product search engines in a large-scale e-commerce setting, leveraging Multimodal LLMs for (i) generating tailored annotation guidelines for individual queries, and (ii) conducting the subsequent annotation task. Our method, validated through deployment on a large e-commerce platform, demonstrates comparable quality to human annotations, significantly reduces time and cost, facilitates rapid problem discovery, and provides an effective solution for production-level quality control at scale.
摘要:由于大量训练有素的人类注释者的可用性有限,大规模评估生产级检索系统是一项至关重要但具有挑战性的任务。大型语言模型(LLM)有潜力解决这个扩展问题,并为人类提供大量注释任务的可行替代方案。在本文中,我们提出了一个用于评估大规模电子商务环境中的产品搜索引擎的框架,利用多模式LLM来(i)为单个查询生成量身定制的注释指南,以及(ii)执行后续的注释任务。我们的方法通过在大型电子商务平台上的部署进行验证,表现出与人类注释相当的质量,显着减少了时间和成本,促进了快速发现问题,并为大规模生产级质量控制提供了有效的解决方案。

[NLP-24] MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts
[NLP-24] MEOW:基于倒置事实的记忆监督LLM遗忘

链接: https://arxiv.org/abs/2409.11844
作者: Tianle Gu,Kexin Huang,Ruilin Luo,Yuanqi Yao,Yujiu Yang,Yan Teng,Yingchun Wang
关键词-EN: Large Language Models, Large Language, memorize sensitive information, raising concerns, potential misuse
关键词-ZH: 大型语言模型、大型语言、记忆敏感信息、引起担忧、潜在的滥用
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can memorize sensitive information, raising concerns about potential misuse. LLM Unlearning, a post-hoc approach to remove this information from trained LLMs, offers a promising solution to mitigate these risks. However, previous practices face three key challenges: 1. Utility: successful unlearning often causes catastrophic collapse on unrelated tasks. 2. Efficiency: many methods either involve adding similarly sized models, which slows down unlearning or inference, or require retain data that are difficult to obtain. 3. Robustness: even effective methods may still leak data via extraction techniques. To address these challenges, we propose MEOW, a simple yet effective gradient descent-based unlearning method. Specifically, we use an offline LLM to generate a set of inverted facts. Then, we design a new metric, MEMO, to quantify memorization in LLMs. Finally, based on the signals provided by MEMO, we select the most appropriate set of inverted facts and finetune the model based on them. We evaluate MEOW on the commonly used unlearn benchmark, ToFU, with Llama2-7B-Chat and Phi-1.5B, and test it on both NLU and NLG tasks. Results demonstrate significant improvement of MEOW in forget quality without substantial loss in model utility. Meanwhile, MEOW does not exhibit significant degradation in NLU or NLG capabilities, and there is even a slight improvement in NLU performance.
摘要:大型语言模型(LLM)可能记忆敏感信息,引发了对潜在滥用的担忧。LLM遗忘(LLM Unlearning)是一种从已训练的LLM中移除此类信息的事后方法,为缓解这些风险提供了一种有前景的解决方案。然而,以往的实践面临三个关键挑战:1.实用性:成功的遗忘往往会导致不相关任务的灾难性崩溃。2.效率:许多方法要么涉及添加规模相近的模型,从而减慢遗忘或推理速度,要么需要难以获得的保留数据。3.鲁棒性:即使是有效的方法,仍可能通过提取技术泄露数据。为了应对这些挑战,我们提出了MEOW,一种简单而有效的基于梯度下降的遗忘方法。具体而言,我们使用离线LLM生成一组倒置的事实,然后设计了一种新的度量MEMO来量化LLM中的记忆。最后,基于MEMO提供的信号,我们选择最合适的倒置事实集并据此微调模型。我们在常用的遗忘基准ToFU上使用Llama2-7B-Chat和Phi-1.5B评估MEOW,并在NLU和NLG任务上进行测试。结果表明,MEOW在遗忘质量上有显著提升,且模型实用性没有实质性损失。同时,MEOW在NLU或NLG能力上没有明显下降,NLU性能甚至略有提升。

[NLP-25] Extract-and-Abstract: Unifying Extractive and Abstractive Summarization within Single Encoder-Decoder Framework
[NLP-25] 提取和摘要:在单个编码器-解码器框架内统一提取和抽象摘要

链接: https://arxiv.org/abs/2409.11827
作者: Yuping Wu,Hao Li,Hongbo Zhu,Goran Nenadic,Xiao-Jun Zeng
关键词-EN: naturally coherent paradigm, salient information identified, conduct abstractive summarization, naturally coherent, information identified
关键词-ZH: 自然连贯的范式,识别突出信息,进行抽象总结,自然连贯,识别信息
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Extract-then-Abstract is a naturally coherent paradigm to conduct abstractive summarization with the help of salient information identified by the extractive model. Previous works that adopt this paradigm train the extractor and abstractor separately and introduce extra parameters to highlight the extracted salients to the abstractor, which results in error accumulation and additional training costs. In this paper, we first introduce a parameter-free highlight method into the encoder-decoder framework: replacing the encoder attention mask with a saliency mask in the cross-attention module to force the decoder to focus only on salient parts of the input. A preliminary analysis compares different highlight methods, demonstrating the effectiveness of our saliency mask. We further propose the novel extract-and-abstract paradigm, ExtAbs, which jointly and seamlessly performs Extractive and Abstractive summarization tasks within single encoder-decoder model to reduce error accumulation. In ExtAbs, the vanilla encoder is augmented to extract salients, and the vanilla decoder is modified with the proposed saliency mask to generate summaries. Built upon BART and PEGASUS, experiments on three datasets show that ExtAbs can achieve superior performance than baselines on the extractive task and performs comparable, or even better than the vanilla models on the abstractive task.
摘要:"先抽取后生成"(Extract-then-Abstract)是一种自然连贯的范式,借助抽取模型识别的显著信息进行生成式摘要。以往采用这种范式的工作分别训练抽取器和生成器,并引入额外参数向生成器突出显示抽取出的显著内容,这会导致误差累积和额外的训练成本。本文首先在编码器-解码器框架中引入一种无参数的突出显示方法:在交叉注意力模块中用显著性掩码替换编码器注意力掩码,迫使解码器只关注输入的显著部分。初步分析比较了不同的突出显示方法,证明了我们显著性掩码的有效性。我们进一步提出了新颖的抽取-生成范式ExtAbs,它在单个编码器-解码器模型中联合无缝地执行抽取式和生成式摘要任务,以减少误差累积。在ExtAbs中,原始编码器被增强以抽取显著内容,原始解码器则结合所提出的显著性掩码生成摘要。基于BART和PEGASUS,在三个数据集上的实验表明,ExtAbs在抽取任务上优于基线,在生成任务上与原始模型相当甚至更好。
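The core mechanism, forcing cross-attention onto extractor-selected tokens by masking the rest, can be sketched in a few lines. This is a single-head, unbatched conceptual illustration, not the paper's implementation:

```python
import numpy as np

def masked_cross_attention(q, k, v, saliency_mask):
    """Cross-attention where non-salient encoder positions are masked out.

    q: (T_dec, d); k, v: (T_enc, d); saliency_mask: (T_enc,) in {0, 1}.
    Assumes at least one salient position per input.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])                  # (T_dec, T_enc)
    # Non-salient positions get -inf, so softmax assigns them zero weight.
    scores = np.where(saliency_mask[None, :] == 1, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # softmax over salient tokens
    return weights @ v

q = np.array([[1.0, 1.0]])                                   # one decoder query
k = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
v = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
mask = np.array([1, 0, 1])                                   # token 1 marked non-salient
out = masked_cross_attention(q, k, v, mask)                  # token 1 contributes nothing
```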

[NLP-26] The Factuality of Large Language Models in the Legal Domain CIKM2024
[NLP-26] 法律领域大型语言模型的事实性

链接: https://arxiv.org/abs/2409.11798
作者: Rajaa El Hamdani,Thomas Bonald,Fragkiskos Malliaros,Nils Holzenberger,Fabian Suchanek
关键词-EN: large language models, realistic usage scenario, language models, model abstain, usage scenario
关键词-ZH: 大型语言模型、现实使用场景、语言模型、模型放弃、使用场景
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: CIKM 2024, short paper

点击查看摘要

Abstract:This paper investigates the factuality of large language models (LLMs) as knowledge bases in the legal domain, in a realistic usage scenario: we allow for acceptable variations in the answer, and let the model abstain from answering when uncertain. First, we design a dataset of diverse factual questions about case law and legislation. We then use the dataset to evaluate several LLMs under different evaluation methods, including exact, alias, and fuzzy matching. Our results show that the performance improves significantly under the alias and fuzzy matching methods. Further, we explore the impact of abstaining and in-context examples, finding that both strategies enhance precision. Finally, we demonstrate that additional pre-training on legal documents, as seen with SaulLM, further improves factual precision from 63% to 81%.
摘要:本文在现实的使用场景中考察了大型语言模型(LLM)作为法律领域知识库的事实准确性:我们允许答案存在可接受的变化,并允许模型在不确定时放弃回答。首先,我们设计了一个包含判例法和立法相关的多样化事实性问题的数据集。然后,我们使用该数据集在不同的评估方法下评估多个LLM,包括精确匹配、别名匹配和模糊匹配。结果表明,在别名和模糊匹配方法下,性能显著提高。此外,我们探讨了弃权和上下文示例的影响,发现这两种策略都能提高准确率。最后,我们证明,如SaulLM所示,在法律文件上进行额外的预训练可以将事实准确率从63%进一步提高到81%。
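The three evaluation regimes mentioned (exact, alias, and fuzzy matching) can be illustrated with the standard library. The 0.8 similarity threshold and the example aliases below are arbitrary choices for the sketch, not the paper's:

```python
from difflib import SequenceMatcher

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def alias_match(pred: str, gold_aliases: list) -> bool:
    """Accept the answer if it exactly matches any known alias of the gold entity."""
    return any(exact_match(pred, alias) for alias in gold_aliases)

def fuzzy_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    """Accept near-identical strings (e.g., minor punctuation differences)."""
    return SequenceMatcher(None, pred.strip().lower(), gold.strip().lower()).ratio() >= threshold

aliases = ["Court of Justice of the European Union", "CJEU", "European Court of Justice"]
print(exact_match("cjeu", "Court of Justice of the European Union"))             # False
print(alias_match("cjeu", aliases))                                              # True
print(fuzzy_match("Brown v Board of Education", "Brown v. Board of Education"))  # True
```

The progression from exact to alias to fuzzy matching is exactly why the reported scores rise under the looser criteria.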

[NLP-27] Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources
[NLP-27] 在合理低计算资源下开发日本医学大语言模型并进行双语评估

链接: https://arxiv.org/abs/2409.11783
作者: Issey Sukeda
关键词-EN: success of large, scaling law, law has led, widespread adoption, large language models
关键词-ZH: 大型、扩展定律的成功,定律引领了广泛采用的大型语言模型
类目: Computation and Language (cs.CL)
备注: 18 pages, 9 tables

点击查看摘要

Abstract:The recent success of large language models (LLMs) and the scaling law has led to a widespread adoption of larger models. Particularly in the healthcare industry, there is an increasing demand for locally operated LLMs due to security concerns. However, the majority of high quality open-source LLMs have a size of 70B parameters, imposing significant financial burdens on users for GPU preparation and operation. To overcome these issues, we present a medical adaptation based on the recent 7B models, which enables the operation in low computational resources. We compare the performance on medical question-answering benchmarks in two languages (Japanese and English), demonstrating that its scores reach parity with or surpass those of currently existing medical LLMs that are ten times larger. We find that fine-tuning an English-centric base model on Japanese medical dataset improves the score in both language, supporting the effect of cross-lingual knowledge transfer. We hope that this study will alleviate financial challenges, serving as a stepping stone for clinical institutions to practically utilize LLMs locally. Our evaluation code is available at this https URL.
摘要:近期大语言模型(LLM)的成功和缩放定律促使更大规模的模型被广泛采用。特别是在医疗行业,出于安全考虑,对本地运行的LLM的需求不断增加。然而,大多数高质量的开源LLM参数规模达70B,在GPU的准备和运行方面给用户带来沉重的经济负担。为了克服这些问题,我们基于最新的7B模型提出了一种医学领域适配方案,使其能够在较低的计算资源下运行。我们比较了其在两种语言(日语和英语)的医学问答基准上的表现,结果表明其得分达到甚至超过了规模十倍于它的现有医学LLM。我们发现,在日语医学数据集上微调以英语为中心的基础模型可以提高两种语言的得分,印证了跨语言知识迁移的效果。我们希望这项研究能够缓解经济上的挑战,为临床机构在本地实际使用LLM提供垫脚石。我们的评估代码可在此 https URL 获取。

[NLP-28] Human-like Affective Cognition in Foundation Models
[NLP-28] 基础模型中的类人情感认知

链接: https://arxiv.org/abs/2409.11733
作者: Kanishk Gandhi,Zoe Lynch,Jan-Philipp Fränken,Kayla Patterson,Sharon Wambu,Tobias Gerstenberg,Desmond C. Ong,Noah D. Goodman
关键词-EN: interaction and experience, models, emotions, foundation models, foundation
关键词-ZH: 互动和体验、模型、情感、基础模型、基础
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding emotions is fundamental to human interaction and experience. Humans easily infer emotions from situations or facial expressions, situations from emotions, and do a variety of other affective cognition. How adept is modern AI at these inferences? We introduce an evaluation framework for testing affective cognition in foundation models. Starting from psychological theory, we generate 1,280 diverse scenarios exploring relationships between appraisals, emotions, expressions, and outcomes. We evaluate the abilities of foundation models (GPT-4, Claude-3, Gemini-1.5-Pro) and humans (N = 567) across carefully selected conditions. Our results show foundation models tend to agree with human intuitions, matching or exceeding interparticipant agreement. In some conditions, models are "superhuman": they better predict modal human judgements than the average human. All models benefit from chain-of-thought reasoning. This suggests foundation models have acquired a human-like understanding of emotions and their influence on beliefs and behavior.
摘要:理解情绪是人类互动和体验的基础。人类可以轻松地从情境或面部表情中推断情绪,从情绪推断情境,并进行各种其他情感认知。现代人工智能在这些推断上有多熟练?我们提出了一个用于测试基础模型情感认知的评估框架。从心理学理论出发,我们生成了1280个不同的情景,探索评估、情绪、表情和结果之间的关系。我们在精心选择的条件下评估了基础模型(GPT-4、Claude-3、Gemini-1.5-Pro)和人类(N=567)的能力。结果表明,基础模型往往与人类直觉一致,达到或超过参与者之间的一致性。在某些条件下,模型是"超人的":它们比普通人更准确地预测人类判断的众数。所有模型都受益于思维链推理。这表明基础模型已经获得了类似人类的、对情绪及其对信念和行为影响的理解。

[NLP-29] Enabling Real-Time Conversations with Minimal Training Costs
[NLP-29] 以最低的培训成本实现实时对话

链接: https://arxiv.org/abs/2409.11727
作者: Wang Xu,Shuo Wang,Weilin Zhao,Xu Han,Yukun Yan,Yudi Zhang,Zhe Tao,Zhiyuan Liu,Wanxiang Che
关键词-EN: Large language models, improve human efficiency, Large language, improve human, human efficiency
关键词-ZH: 大型语言模型,提高人类效率,大型语言,提高人类,人类效率
类目: Computation and Language (cs.CL)
备注: 7pages, 6 figures, 1 table

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models. These models can dynamically adapt to user input, facilitating real-time interactive feedback. However, these methods typically require substantial computational resources to acquire the ability. To reduce overhead, this paper presents a new duplex decoding approach that enhances LLMs with duplex ability, requiring minimal additional training. Specifically, our method employs parallel decoding of queries and responses in conversations, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.
摘要:大型语言模型(LLM)已展现出通过对话交互提高人类效率的能力。传统的LLM对话系统采用基于话轮的范式,无法在生成响应的同时进行实时交互。为了解决这一局限,研究人员提出了双工模型。这些模型可以动态适应用户输入,支持实时交互反馈。然而,这些方法通常需要大量计算资源才能获得该能力。为了降低开销,本文提出了一种新的双工解码方法,以极少的额外训练为LLM赋予双工能力。具体而言,我们的方法对会话中的查询和响应进行并行解码,有效地实现了一种信道分复用的解码策略。实验结果表明,所提方法以极小的训练成本显著提升了用户与AI交互的自然度和拟人程度。

[NLP-30] Revealing the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing
[NLP-30] 揭示LLM角色扮演中检测角色知识错误的挑战

链接: https://arxiv.org/abs/2409.11726
作者: Wenyuan Zhang,Jiawei Sheng,Shuaiyi Nie,Zefeng Zhang,Xinghua Zhang,Yongquan He,Tingwen Liu
关键词-EN: Large language model, LLM role-playing agents, realistic LLM role-playing, constructing realistic LLM, Large language
关键词-ZH: 大型语言模型,LLM角色扮演代理,现实LLM角色扮演,构建现实LLM,大型语言
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 22 pages, 14 figures

点击查看摘要

Abstract:Large language model (LLM) role-playing has gained widespread attention, where the authentic character knowledge is crucial for constructing realistic LLM role-playing agents. However, existing works usually overlook the exploration of LLMs’ ability to detect characters’ known knowledge errors (KKE) and unknown knowledge errors (UKE) while playing roles, which would lead to low-quality automatic construction of character trainable corpus. In this paper, we propose a probing dataset to evaluate LLMs’ ability to detect errors in KKE and UKE. The results indicate that even the latest LLMs struggle to effectively detect these two types of errors, especially when it comes to familiar knowledge. We experimented with various reasoning strategies and propose an agent-based reasoning method, Self-Recollection and Self-Doubt (S2RD), to further explore the potential for improving error detection capabilities. Experiments show that our method effectively improves the LLMs’ ability to detect error character knowledge, but it remains an issue that requires ongoing attention.
摘要:大语言模型(LLM)角色扮演受到了广泛关注,其中真实的角色知识对于构建逼真的LLM角色扮演智能体至关重要。然而,现有工作通常忽视了对LLM在扮演角色时检测角色已知知识错误(KKE)和未知知识错误(UKE)能力的探索,这会导致自动构建的角色可训练语料库质量较低。本文提出了一个探测数据集,用于评估LLM检测KKE和UKE错误的能力。结果表明,即使是最新的LLM也难以有效检测这两类错误,尤其是在涉及熟悉的知识时。我们对多种推理策略进行了实验,并提出了一种基于智能体的推理方法:自我回忆与自我怀疑(S2RD),以进一步探索提升错误检测能力的潜力。实验表明,我们的方法有效提高了LLM检测错误角色知识的能力,但这仍是一个需要持续关注的问题。

[NLP-31] TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
[NLP-31] TART:一个用于可解释的基于表格推理的开源工具增强框架

链接: https://arxiv.org/abs/2409.11724
作者: Xinyuan Lu,Liangming Pan,Yubo Ma,Preslav Nakov,Min-Yen Kan
关键词-EN: Current Large Language, Large Language Models, Current Large, Language Models, Large Language
关键词-ZH: 当前大型语言、大型语言模型、当前大型、语言模型、大型语言
类目: Computation and Language (cs.CL)
备注: technical report

点击查看摘要

Abstract:Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-sourced LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All the code and data are available at this https URL.
摘要:当前的大型语言模型(LLM)在理解表格结构和进行精确数值推理方面能力有限,而这对于表格问答(TQA)和基于表格的事实验证(TFV)等任务至关重要。为了应对这些挑战,我们提出了工具增强的表格推理框架(TART),它将LLM与专门的工具集成在一起。TART包含三个关键组件:确保数据准确表示的表格格式化器、开发特定计算工具的工具构造器,以及保持可解释性的解释生成器。我们还提出了TOOLTAB数据集,这是一个专为训练LLM进行表格-工具集成而设计的新基准。实验表明,TART通过同时提高数据处理的精度和推理过程的清晰度,相比现有方法(如思维链)取得了实质性改进。值得注意的是,TART与CodeLlama结合可达到闭源LLM GPT-3.5-turbo准确率的90.0%,凸显了其在各种现实场景中的鲁棒性。所有代码和数据均可在此 https URL 获取。

[NLP-32] From Lists to Emojis: How Format Bias Affects Model Alignment
[NLP-32] 从列表到表情符号:格式偏差如何影响模型对齐

链接: https://arxiv.org/abs/2409.11704
作者: Xuanchang Zhang,Wei Xiong,Lichang Chen,Tianyi Zhou,Heng Huang,Tong Zhang
关键词-EN: LMSYS Chatbot Arena, RLHF, human feedback, biases, format
关键词-ZH: LMSEARCH Chatbot Arena、RL HF、人类反馈、偏见、格式
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Working in progress

点击查看摘要

Abstract:In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely-used preference models, including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark, exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is equal to or lower than shorter, competing responses. However, format biases beyond verbosity remain largely underexplored in the literature. In this work, we extend the study of biases in preference learning beyond the commonly recognized length bias, offering a comprehensive analysis of a wider range of format biases. Additionally, we show that with a small amount of biased data (less than 1%), we can inject significant bias into the reward model. Moreover, these format biases can also be easily exploited by downstream alignment algorithms, such as best-of-n sampling and online iterative DPO, as it is usually easier to manipulate the format than to improve the quality of responses. Our findings emphasize the need to disentangle format and content both for designing alignment algorithms and evaluating models.
摘要:本文研究了基于人类反馈的强化学习(RLHF)中的格式偏差问题。我们观察到,许多广泛使用的偏好模型,包括人类评估者、GPT-4以及RewardBench基准上排名靠前的模型,都对特定的格式模式(如列表、链接、粗体文本和表情符号)表现出强烈的偏好。此外,大型语言模型(LLM)可以利用这些偏差,在AlpacaEval和LMSYS Chatbot Arena等流行基准上获得更高的排名。一个值得注意的例子是冗长偏差:当前的偏好模型偏爱看起来更全面的较长回答,即使其质量等于甚至低于更简短的竞争回答。然而,除冗长之外的格式偏差在文献中仍基本未被探讨。在这项工作中,我们将偏好学习中的偏差研究从公认的长度偏差扩展开来,对更广泛的格式偏差进行了全面分析。此外,我们还表明,只需少量有偏数据(不到1%),就可以向奖励模型注入显著偏差。而且,这些格式偏差也很容易被下游对齐算法利用,例如best-of-n采样和在线迭代DPO,因为操纵格式通常比提高回答质量更容易。我们的研究结果强调,无论是设计对齐算法还是评估模型,都需要将格式与内容解耦。
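下面给出一段极简的示意代码(假设性示例,非论文原实现;其中的奖励函数与候选回答均为虚构),说明格式偏差如何被 best-of-n 采样利用:一个偏爱列表与表情符号的奖励模型会系统性地选中格式更重的回答,即使两者内容相当。

```python
# 假设性示例:一个带格式偏差的奖励模型与 best-of-n 采样
# (仅用于说明概念,并非论文中的实际模型)

def biased_reward(response: str) -> float:
    """简化的奖励:在粗糙的"内容"代理分之外,列表项与表情符号会额外加分。"""
    quality = len(set(response.split()))          # 以去重词数近似内容信息量
    list_bonus = 2.0 * response.count("\n- ")     # 列表格式偏差
    emoji_bonus = 1.5 * sum(ch in "🎉✅🚀" for ch in response)
    return quality + list_bonus + emoji_bonus

def best_of_n(candidates):
    """best-of-n 采样:返回奖励模型打分最高的候选回答。"""
    return max(candidates, key=biased_reward)

plain = "Paris is the capital of France."
formatted = "Answer: 🎉\n- Paris is the capital of France.\n- It is in Europe. ✅"
print(best_of_n([plain, formatted]))  # 格式更重的回答胜出
```

在真实的对齐流程中,候选回答由策略模型采样、由学习到的奖励模型打分;这里想说明的要点是,只要奖励模型带有格式偏差,best-of-n 就会把这种偏差放大到最终输出。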

[NLP-33] Harnessing LLMs for API Interactions: A Framework for Classification and Synthetic Data Generation
[NLP-33] 利用LLM进行API交互:分类和合成数据生成框架

链接: https://arxiv.org/abs/2409.11703
作者: Chunliang Tao,Xiaojing Fan,Yahe Yang
关键词-EN: Large Language Models, natural language processing, Large Language, classifying natural language, natural language
关键词-ZH: 大型语言模型、自然语言处理、大型语言、分类自然语言、自然语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) advance in natural language processing, there is growing interest in leveraging their capabilities to simplify software interactions. In this paper, we propose a novel system that integrates LLMs for both classifying natural language inputs into corresponding API calls and automating the creation of sample datasets tailored to specific API functions. By classifying natural language commands, our system allows users to invoke complex software functionalities through simple inputs, improving interaction efficiency and lowering the barrier to software utilization. Our dataset generation approach also enables the efficient and systematic evaluation of different LLMs in classifying API calls, offering a practical tool for developers or business owners to assess the suitability of LLMs for customized API management. We conduct experiments on several prominent LLMs using generated sample datasets for various API functions. The results show that GPT-4 achieves a high classification accuracy of 0.996, while LLaMA-3-8B performs much worse at 0.759. These findings highlight the potential of LLMs to transform API management and validate the effectiveness of our system in guiding model testing and selection across diverse applications.
摘要:随着大型语言模型(LLM)在自然语言处理方面的进步,人们对利用其能力来简化软件交互越来越感兴趣。在本文中,我们提出了一个新颖的系统,它集成LLM,既可将自然语言输入分类到相应的API调用,又能自动创建针对特定API功能定制的样本数据集。通过对自然语言命令进行分类,我们的系统允许用户通过简单的输入调用复杂的软件功能,提高了交互效率,降低了软件使用的门槛。我们的数据集生成方法还能对不同LLM在API调用分类上的表现进行高效、系统的评估,为开发者或企业主提供一个实用工具,以评估LLM是否适合定制化的API管理。我们使用为各种API功能生成的样本数据集,在多个知名LLM上进行了实验。结果表明,GPT-4取得了高达0.996的分类准确率,而LLaMA-3-8B的表现则差得多,仅为0.759。这些发现突显了LLM在变革API管理方面的潜力,并验证了我们的系统在指导不同应用的模型测试与选择方面的有效性。
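下面是一个极简的流程示意(假设性示例:API_SPECS、llm() 等名称均为演示用占位,并非论文的实际实现),展示如何用提示词让 LLM 把自然语言输入分类为 API 调用:

```python
# 假设性示意:用 LLM 将自然语言指令分类为 API 调用
# 其中 llm() 为占位函数,实际应替换为 GPT-4 等模型的调用

API_SPECS = {
    "get_weather": "查询指定城市的天气",
    "send_email": "向指定收件人发送邮件",
}

def build_prompt(user_input: str) -> str:
    """把 API 描述与用户输入拼成分类提示词。"""
    spec_lines = "\n".join(f"- {name}: {desc}" for name, desc in API_SPECS.items())
    return (f"可用 API:\n{spec_lines}\n"
            f"用户输入:{user_input}\n"
            f"只输出最匹配的 API 名称:")

def llm(prompt: str) -> str:
    """占位的 LLM:这里用关键词规则模拟,仅为演示整条流程。"""
    return "get_weather" if "天气" in prompt.split("用户输入:")[1] else "send_email"

def classify_api(user_input: str) -> str:
    return llm(build_prompt(user_input)).strip()

print(classify_api("明天上海天气怎么样?"))  # get_weather
```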

[NLP-34] FLARE: Fusing Language Models and Collaborative Architectures for Recommender Enhancement
[NLP-34] FLARE:融合语言模型和协作架构以增强推荐器

链接: https://arxiv.org/abs/2409.11699
作者: Liam Hebert,Marialena Kyriakidi,Hubert Pham,Krishna Sayana,James Pine,Sukhdeep Sodhi,Ambarish Jash
关键词-EN: Hybrid recommender systems, combining item IDs, Hybrid recommender, textual descriptions, offer potential
关键词-ZH: 混合推荐系统,结合了物品ID、混合推荐、文本描述,提供了潜力
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hybrid recommender systems, combining item IDs and textual descriptions, offer potential for improved accuracy. However, previous work has largely focused on smaller datasets and model architectures. This paper introduces Flare (Fusing Language models and collaborative Architectures for Recommender Enhancement), a novel hybrid recommender that integrates a language model (mT5) with a collaborative filtering model (Bert4Rec) using a Perceiver network. This architecture allows Flare to effectively combine collaborative and content information for enhanced recommendations. We conduct a two-stage evaluation, first assessing Flare’s performance against established baselines on smaller datasets, where it demonstrates competitive accuracy. Subsequently, we evaluate Flare on a larger, more realistic dataset with a significantly larger item vocabulary, introducing new baselines for this setting. Finally, we showcase Flare’s inherent ability to support critiquing, enabling users to provide feedback and refine recommendations. We further leverage critiquing as an evaluation method to assess the model’s language understanding and its transferability to the recommendation task.
摘要:混合式推荐系统结合了条目ID和文本描述,具有提高准确率的潜力。然而,以前的工作主要集中在较小的数据集和模型架构上。本文介绍了Flare(Fusing Language models and collaborative Architectures for Recommender Enhancement),这是一种新的混合式推荐系统,它利用Perceiver网络将语言模型(mT5)与协同过滤模型(Bert4Rec)结合在一起。这一架构使Flare能够有效融合协同信息和内容信息,从而增强推荐效果。我们进行了两阶段评估:首先在较小的数据集上将Flare与既有基线进行比较,结果显示其准确率具有竞争力;随后,我们在一个更大、更贴近真实场景、条目词表显著更大的数据集上评估Flare,并为此设定引入了新的基线。最后,我们展示了Flare固有的支持用户批评(critiquing)的能力,使用户能够提供反馈并完善推荐。我们进一步利用批评作为一种评估方法,来衡量模型的语言理解能力及其向推荐任务的可迁移性。

[NLP-35] Enhancing Complex Formula Recognition with Hierarchical Detail-Focused Network ICASSP2025
[NLP-35] 利用以分层细节为中心的网络增强复杂公式识别

链接: https://arxiv.org/abs/2409.11677
作者: Jiale Wang,Junhui Yu,Huanyong Liu,Chenanran Kong
关键词-EN: Mathematical Expression Recognition, complex Mathematical Expression, Mathematical Expression, Expression Recognition, Hierarchical Detail-Focused Recognition
关键词-ZH: 数学公式识别、复杂数学公式、数学公式、公式识别、分层细节识别
类目: Computation and Language (cs.CL)
备注: Submitted to the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)

点击查看摘要

Abstract:Hierarchical and complex Mathematical Expression Recognition (MER) is challenging due to multiple possible interpretations of a formula, complicating both parsing and evaluation. In this paper, we introduce the Hierarchical Detail-Focused Recognition dataset (HDR), the first dataset specifically designed to address these issues. It consists of a large-scale training set, HDR-100M, offering an unprecedented scale and diversity with one hundred million training instances. And the test set, HDR-Test, includes multiple interpretations of complex hierarchical formulas for comprehensive model performance evaluation. Additionally, the parsing of complex formulas often suffers from errors in fine-grained details. To address this, we propose the Hierarchical Detail-Focused Recognition Network (HDNet), an innovative framework that incorporates a hierarchical sub-formula module, focusing on the precise handling of formula details, thereby significantly enhancing MER performance. Experimental results demonstrate that HDNet outperforms existing MER models across various datasets.
摘要:由于公式有多种可能的解释,使得解析和求值变得复杂,因此层次化和复杂化的数学表达式识别(MER)具有挑战性。在本文中,我们介绍了分层细节聚焦识别数据集(HDR),这是第一个专门为解决这些问题而设计的数据集。它由一个大型培训集HDR-100M组成,提供了前所未有的规模和多样性,拥有1亿个培训实例。测试集HDR-Test包括对复杂层次公式的多种解释,用于全面的模型性能评估。此外,复杂公式的解析经常会在细粒度细节方面出现错误。为了解决这一问题,我们提出了分层细节聚焦识别网络(HDNet),这是一个创新的框架,它结合了一个分层子公式模块,专注于对公式细节的精确处理,从而显著提高了MER的性能。实验结果表明,在不同的数据集上,HDNet的性能都优于现有的MER模型。

[NLP-36] RUIE: Retrieval-based Unified Information Extraction using Large Language Model
[NLP-36] RUIE:使用大型语言模型的基于检索的统一信息提取

链接: https://arxiv.org/abs/2409.11673
作者: Xincheng Liao,Junwen Duan,Yixi Huang,Jianxin Wang
关键词-EN: Unified information extraction, Retrieval-based Unified Information, information extraction, information extraction tasks, Unified information
关键词-ZH: 统一信息提取,基于检索的统一信息,信息提取,信息提取任务,统一信息
类目: Computation and Language (cs.CL)
备注: 14 pages, 3 figures

点击查看摘要

Abstract:Unified information extraction (UIE) aims to complete all information extraction tasks using a single model or framework. While previous work has primarily focused on instruction-tuning large language models (LLMs) with constructed datasets, these methods require significant computational resources and struggle to generalize to unseen tasks. To address these limitations, we propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning to enable rapid generalization while reducing computational costs. The key challenge in RUIE is selecting the most beneficial demonstrations for LLMs to effectively handle diverse IE tasks. To achieve this, we integrate LLM preferences for ranking candidate demonstrations and design a keyword-enhanced reward model to capture fine-grained relationships between queries and demonstrations. We then train a bi-encoder retriever for UIE through contrastive learning and knowledge distillation. To the best of our knowledge, RUIE is the first trainable retrieval framework for UIE. Experimental results on 8 held-out datasets demonstrate RUIE’s effectiveness in generalizing to unseen tasks, with average F1-score improvements of 19.22 and 3.13 compared to instruction-tuning methods and other retrievers, respectively. Further analysis confirms RUIE’s adaptability to LLMs of varying sizes and the importance of its key components.
摘要:统一信息抽取(UIE)的目标是使用单一的模型或框架完成所有信息抽取任务。虽然以前的工作主要集中在使用构建的数据集对大型语言模型(LLM)进行指令微调,但这些方法需要大量计算资源,并且难以泛化到未见过的任务。为了解决这些局限,我们提出了RUIE(基于检索的统一信息抽取),这是一个利用上下文学习实现快速泛化、同时降低计算成本的框架。RUIE的关键挑战是为LLM选择最有益的示范样例,以有效处理多样的信息抽取任务。为此,我们整合LLM偏好来对候选示范进行排序,并设计了一个关键词增强的奖励模型来捕捉查询与示范之间的细粒度关系。随后,我们通过对比学习和知识蒸馏为UIE训练了一个双编码器检索器。据我们所知,RUIE是首个面向UIE的可训练检索框架。在8个保留(held-out)数据集上的实验结果表明,RUIE能够有效泛化到未见过的任务,与指令微调方法和其他检索器相比,F1分数平均分别提高了19.22和3.13。进一步的分析证实了RUIE对不同规模LLM的适应性及其关键组件的重要性。
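下面用一段简化代码示意"为上下文学习检索示范样例"的核心思路(假设性示例:用词袋余弦相似度代替论文中经对比学习训练的双编码器嵌入,示范池内容亦为虚构):

```python
# 假设性示意:为上下文学习(in-context learning)检索最相关的示范样例
# 用词袋向量的余弦相似度代替论文中的双编码器(bi-encoder)嵌入

from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_demos(query, demo_pool, k=2):
    """按与查询的相似度降序排序,取前 k 条示范拼入提示词。"""
    q = embed(query)
    ranked = sorted(demo_pool, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

demos = [
    "NER: 'Apple hired Tim' -> (Apple, ORG), (Tim, PER)",
    "Relation: 'Paris is in France' -> (Paris, located_in, France)",
    "Event: 'The company was acquired' -> acquisition event",
]
print(retrieve_demos("NER: 'Google hired Sundar'", demos, k=1))
```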

[NLP-37] BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla
[NLP-37] BanStereoSet:衡量孟加拉国LLM刻板印象社会偏见的数据集

链接: https://arxiv.org/abs/2409.11638
作者: Mahammed Kamruzzaman,Abdullah Al Monsur,Shrabon Das,Enamul Hassan,Gene Louis Kim
关键词-EN: study presents BanStereoSet, Bangla language, study presents, designed to evaluate, evaluate stereotypical social
关键词-ZH: 研究提出BanStereoSet,孟加拉语,研究提出,旨在评估,评估刻板印象社会
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study presents BanStereoSet, a dataset designed to evaluate stereotypical social biases in multilingual LLMs for the Bangla language. In an effort to extend the focus of bias research beyond English-centric datasets, we have localized the content from the StereoSet, IndiBias, and Kamruzzaman et. al.'s datasets, producing a resource tailored to capture biases prevalent within the Bangla-speaking community. Our BanStereoSet dataset consists of 1,194 sentences spanning 9 categories of bias: race, profession, gender, ageism, beauty, beauty in profession, region, caste, and religion. This dataset not only serves as a crucial tool for measuring bias in multilingual LLMs but also facilitates the exploration of stereotypical bias across different social categories, potentially guiding the development of more equitable language technologies in Bangladeshi contexts. Our analysis of several language models using this dataset indicates significant biases, reinforcing the necessity for culturally and linguistically adapted datasets to develop more equitable language technologies.
摘要:本研究提出了BanStereoSet数据集,旨在评估多语言LLM中针对孟加拉语的刻板印象社会偏见。为了将偏见研究的焦点扩展到以英语为中心的数据集之外,我们对StereoSet、IndiBias以及Kamruzzaman等人的数据集内容进行了本地化,构建出一个专门用于捕捉孟加拉语社区中普遍存在的偏见的资源。我们的BanStereoSet数据集由1,194个句子组成,涵盖9类偏见:种族、职业、性别、年龄歧视、美貌、职业中的美貌、地区、种姓和宗教。这个数据集不仅是衡量多语言LLM中偏见的重要工具,而且有助于探索不同社会类别中的刻板印象偏见,有望指导在孟加拉国语境下开发更公平的语言技术。我们使用该数据集对若干语言模型的分析表明存在显著偏见,进一步说明需要在文化和语言上适配的数据集来开发更公平的语言技术。

[NLP-38] “A Woman is More Culturally Knowledgeable than A Man?”: The Effect of Personas on Cultural Norm Interpretation in LLMs
[NLP-38] “女人比男人更有文化知识?”:人物角色对法学硕士中文化规范解释的影响

链接: https://arxiv.org/abs/2409.11636
作者: Mahammed Kamruzzaman,Hieu Nguyen,Nazmul Hassan,Gene Louis Kim
关键词-EN: large language models, deployment of large, large language, increasing demand, demand for personalized
关键词-ZH: 大型语言模型、大型语言的部署、需求不断增加、个性化需求
类目: Computation and Language (cs.CL)
备注: Preprint, Under Review

点击查看摘要

Abstract:As the deployment of large language models (LLMs) expands, there is an increasing demand for personalized LLMs. One method to personalize and guide the outputs of these models is by assigning a persona – a role that describes the expected behavior of the LLM (e.g., a man, a woman, an engineer). This study investigates whether an LLM’s understanding of social norms varies across assigned personas. Ideally, the perception of a social norm should remain consistent regardless of the persona, since acceptability of a social norm should be determined by the region the norm originates from, rather than by individual characteristics such as gender, body size, or race. A norm is universal within its cultural context. In our research, we tested 36 distinct personas from 12 sociodemographic categories (e.g., age, gender, beauty) across four different LLMs. We find that LLMs’ cultural norm interpretation varies based on the persona used and the norm interpretation also varies within a sociodemographic category (e.g., a fat person and a thin person as in physical appearance group) where an LLM with the more socially desirable persona (e.g., a thin person) interprets social norms more accurately than with the less socially desirable persona (e.g., a fat person). We also discuss how different types of social biases may contribute to the results that we observe.
摘要:随着大型语言模型的广泛应用,人们对个性化语言模型的需求越来越大。个性化和指导这些模型的输出的一种方法是分配一个角色–一个描述LLM的预期行为的角色(例如,一个男人、一个女人、一个工程师)。这项研究调查了LLM对社会规范的理解是否因指定的人物角色而异。理想情况下,人们对一种社会规范的认知应该保持一致,而不管是什么角色,因为一种社会规范的可接受性应该由该规范起源的地区决定,而不是由性别、体型或种族等个人特征决定。一种规范在其文化背景下是普遍的。在我们的研究中,我们测试了来自12个社会人口统计类别(例如,年龄、性别、美貌)的四个不同LLM中的36个不同的人物角色。我们发现,LLMS的文化规范解释因所使用的人物角色而异,并且在不同的社会人口学类别(例如,在外表组中,胖人和瘦人)中,具有更多社会期望人物(例如,瘦人)的LLM对社会规范的解释比具有较不社会期望人物(例如,胖子)的LLM更准确。我们还讨论了不同类型的社会偏见如何对我们观察到的结果做出贡献。

[NLP-39] Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation
[NLP-39] 迈向公平RAG:公平排名对检索增强生成的影响

链接: https://arxiv.org/abs/2409.11598
作者: To Eun Kim,Fernando Diaz
关键词-EN: RAG systems, RAG, language models, models now enhance, enhance their responses
关键词-ZH: RAG系统,RAG,语言模型,模型现在增强,增强他们的响应
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Many language models now enhance their responses with retrieval capabilities, leading to the widespread adoption of retrieval-augmented generation (RAG) systems. However, despite retrieval being a core component of RAG, much of the research in this area overlooks the extensive body of work on fair ranking, neglecting the importance of considering all stakeholders involved. This paper presents the first systematic evaluation of RAG systems integrated with fair rankings. We focus specifically on measuring the fair exposure of each relevant item across the rankings utilized by RAG systems (i.e., item-side fairness), aiming to promote equitable growth for relevant item providers. To gain a deep understanding of the relationship between item-fairness, ranking quality, and generation quality in the context of RAG, we analyze nine different RAG systems that incorporate fair rankings across seven distinct datasets. Our findings indicate that RAG systems with fair rankings can maintain a high level of generation quality and, in many cases, even outperform traditional RAG systems, despite the general trend of a tradeoff between ensuring fairness and maintaining system-effectiveness. We believe our insights lay the groundwork for responsible and equitable RAG systems and open new avenues for future research. We publicly release our codebase and dataset at this https URL.
摘要:许多语言模型现在通过检索能力来增强它们的响应能力,导致检索增强生成(RAG)系统的广泛采用。然而,尽管检索是RAG的核心组成部分,但这一领域的许多研究忽略了关于公平排名的广泛工作,忽视了考虑所有利益相关者的重要性。本文首次提出了RAG系统与公平排名相结合的系统评估。我们特别侧重于衡量RAG系统使用的排名中每个相关项目的公平曝光率(即项目方公平性),旨在促进相关项目提供商的公平增长。为了深入理解RAG环境下项目公平性、排名质量和生成质量之间的关系,我们分析了九个不同的RAG系统,这些系统包含了七个不同数据集的公平排名。我们的研究结果表明,具有公平排名的RAG系统可以保持较高的生成质量,并且在许多情况下甚至优于传统的RAG系统,尽管总体趋势是在确保公平性和保持系统有效性之间进行权衡。我们相信,我们的见解为负责任和公平的RAG系统奠定了基础,并为未来的研究开辟了新的途径。我们在这个HTTPS URL上公开发布我们的代码库和数据集。
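下面以一段简化代码示意条目侧曝光(item-side exposure)的度量思路(假设性示例:位置折扣取 1/log2(rank+1),并非论文的精确公式;文档名亦为虚构):

```python
# 假设性示意:度量检索排名中各条目获得的平均曝光(item-side fairness)
# 用位置折扣 1/log2(rank+1) 作为曝光权重,对多次排名取平均

import math
from collections import defaultdict

def exposure(rankings):
    """rankings: 若干次检索产生的条目排序列表;返回各条目的平均曝光。"""
    exp = defaultdict(float)
    for ranking in rankings:
        for pos, item in enumerate(ranking, start=1):
            exp[item] += 1.0 / math.log2(pos + 1)
    n = len(rankings)
    return {item: v / n for item, v in exp.items()}

# 确定性排名:doc_a 永远排第一,曝光严重不均
fixed = [["doc_a", "doc_b"]] * 2
# 轮换式(随机化)排名:两份文档轮流排第一,曝光被均摊
rotated = [["doc_a", "doc_b"], ["doc_b", "doc_a"]]
print(exposure(fixed))
print(exposure(rotated))
```

确定性排名会让排首位的相关条目长期独占曝光,而随机化排名能在同样相关的条目之间均摊曝光,这正是条目侧公平性关注的量。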

[NLP-40] ProSLM : A Prolog Synergized Language Model for explainable Domain Specific Knowledge Based Question Answering
[NLP-40] ProSLM:一种用于基于领域特定知识的可解释问答的Prolog协同语言模型

链接: https://arxiv.org/abs/2409.11589
作者: Priyesh Vakharia,Abigail Kufeldt,Max Meyers,Ian Lane,Leilani Gilpin
关键词-EN: explainable symbolic representations, opaque neural systems, incorporating explainable symbolic, symbolic representations, opaque neural
关键词-ZH: 可解释的符号表示、不透明的神经系统、合并可解释的符号、符号表示、不透明的神经
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at NeSy 2024

点击查看摘要

Abstract:Neurosymbolic approaches can add robustness to opaque neural systems by incorporating explainable symbolic representations. However, previous approaches have not used formal logic to contextualize queries to and validate outputs of large language models (LLMs). We propose ProSLM, a novel neurosymbolic framework, to improve the robustness and reliability of LLMs in question-answering tasks. We provide ProSLM with a domain-specific knowledge base, a logical reasoning system, and an integration to an existing LLM. This framework has two capabilities (1) context gathering: generating explainable and relevant context for a given query, and (2) validation: confirming and validating the factual accuracy of a statement in accordance with a knowledge base (KB). Our work opens a new area of neurosymbolic generative AI text validation and user personalization.
摘要:神经符号方法可以通过引入可解释的符号表示来增强不透明神经系统的鲁棒性。然而,以前的方法尚未使用形式逻辑来为大型语言模型(LLM)的查询提供上下文并验证其输出。我们提出了ProSLM,这是一种新型的神经符号框架,用于提高LLM在问答任务中的稳健性和可靠性。我们为ProSLM提供了特定领域的知识库、逻辑推理系统以及与现有LLM的集成。该框架具有两种能力:(1)上下文收集:为给定查询生成可解释且相关的上下文;(2)验证:根据知识库(KB)确认和验证陈述的事实准确性。我们的工作开辟了神经符号生成式AI文本验证和用户个性化的新领域。
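下面用一段简化代码示意"对照知识库验证陈述"这一能力(假设性示例:用 Python 元组模拟 Prolog 事实与一条规则,KB 内容为虚构,并非 ProSLM 的实际实现):

```python
# 假设性示意:对照知识库(KB)验证陈述的事实准确性
# 用 Python 元组模拟 Prolog 事实与单步规则推理,仅为演示验证流程

KB = {
    ("capital_of", "paris", "france"),
    ("located_in", "france", "europe"),
}

RULES = [
    # capital_of(X, Y) 且 located_in(Y, Z) => located_in_region(X, Z)
    lambda kb: {("located_in_region", x, z)
                for (p1, x, y) in kb if p1 == "capital_of"
                for (p2, y2, z) in kb if p2 == "located_in" and y2 == y},
]

def validate(statement) -> bool:
    """陈述可由 KB 直接给出或经规则推出时判定为真。"""
    derived = set(KB)
    for rule in RULES:
        derived |= rule(derived)
    return statement in derived

print(validate(("capital_of", "paris", "france")))         # True
print(validate(("located_in_region", "paris", "europe")))  # True
print(validate(("capital_of", "paris", "germany")))        # False
```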

[NLP-41] HEARTS: A Holistic Framework for Explainable Sustainable and Robust Text Stereotype Detection NEURIPS2024
[NLP-41] HEARTS:可解释可持续且稳健的文本刻板印象检测的整体框架

链接: https://arxiv.org/abs/2409.11579
作者: Theo King,Zekun Wu,Adriano Koshiyama,Emre Kazim,Philip Treleaven
关键词-EN: in-context learning struggle, identify them accurately, generalised assumptions, assumptions about societal, in-context learning
关键词-ZH: 上下文学习难以应对,准确识别它们,概括性假设,关于社会群体的假设,上下文学习
类目: Computation and Language (cs.CL)
备注: Submitted to NeurIPS 2024 SoLaR Workshop

点击查看摘要

Abstract:Stereotypes are generalised assumptions about societal groups, and even state-of-the-art LLMs using in-context learning struggle to identify them accurately. Due to the subjective nature of stereotypes, where what constitutes a stereotype can vary widely depending on cultural, social, and individual perspectives, robust explainability is crucial. Explainable models ensure that these nuanced judgments can be understood and validated by human users, promoting trust and accountability. We address these challenges by introducing HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection), a framework that enhances model performance, minimises carbon footprint, and provides transparent, interpretable explanations. We establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising 57,201 labeled texts across six groups, including under-represented demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm that BERT models fine-tuned on EMGSD outperform those trained on individual components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model using SHAP to generate token-level importance values, ensuring alignment with human understanding, and calculate explainability confidence scores by comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias over time within model families.
摘要:刻板印象是关于社会群体的概括性假设,即使是最先进的LLM在使用上下文学习时也难以准确识别它们。由于刻板印象具有主观性,即什么构成刻板印象会因文化、社会和个人视角的不同而差异很大,因此强大的可解释性至关重要。可解释的模型确保这些细微的判断能够被人类用户理解和验证,从而促进信任与问责。我们通过引入HEARTS(可解释、可持续且稳健的文本刻板印象检测整体框架)来应对这些挑战,该框架可提升模型性能、最小化碳足迹,并提供透明、可解释的解释。我们构建了扩展的多粒度刻板印象数据集(EMGSD),包含六个组别的57,201条带标签文本,涵盖LGBTQ+等代表性不足的人群和地区刻板印象。消融研究证实,在EMGSD上微调的BERT模型优于在单个组成部分上训练的模型。然后,我们使用SHAP分析一个经过微调的、碳效率高的ALBERT-V2模型,以生成词元级重要性值,确保与人类理解一致,并通过比较SHAP和LIME的输出来计算可解释性置信分数。最后,HEARTS被应用于评估12个LLM输出中的刻板印象偏见,揭示出模型家族内的偏见随时间逐渐减少。
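下面以一段简化代码示意"比较 SHAP 与 LIME 输出以计算可解释性置信分"的思路(假设性示例:归因数值为虚构,且用余弦相似度代替论文的具体计算方式):

```python
# 假设性示意:通过比较 SHAP 与 LIME 的词元归因计算可解释性置信分
# 这里用两个已算好的归因向量和余弦相似度演示思路,并非论文的精确公式

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# 对同一句话各词元的归因值(正值 = 推动"刻板印象"判定)
shap_attr = [0.50, 0.10, -0.05, 0.40]
lime_attr = [0.45, 0.12, -0.02, 0.38]

confidence = cosine(shap_attr, lime_attr)
print(f"explainability confidence ≈ {confidence:.3f}")  # 两种方法高度一致时接近 1
```

两种独立的归因方法在词元层面越一致,对该条解释的信心就越高;反之,分歧大的样本值得人工复核。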

[NLP-42] Preference Tuning with Human Feedback on Language Speech and Vision Tasks: A Survey
[NLP-42] 通过人类反馈调整语言言语和视觉任务的偏好:一项调查

链接: https://arxiv.org/abs/2409.11564
作者: Genta Indra Winata,Hanyang Zhao,Anirban Das,Wenpin Tang,David D. Yao,Shi-Xiong Zhang,Sambit Sahu
关键词-EN: aligning deep generative, Preference tuning, deep generative models, preference tuning tasks, Preference
关键词-ZH: 对齐深度生成、偏好调整、深度生成模型、偏好调整任务、偏好
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Survey paper

点击查看摘要

Abstract:Preference tuning is a crucial process for aligning deep generative models with human preferences. This survey offers a thorough overview of recent advancements in preference tuning and the integration of human feedback. The paper is organized into three main sections: 1) introduction and preliminaries: an introduction to reinforcement learning frameworks, preference tuning tasks, models, and datasets across various modalities: language, speech, and vision, as well as different policy approaches, 2) in-depth examination of each preference tuning approach: a detailed analysis of the methods used in preference tuning, and 3) applications, discussion, and future directions: an exploration of the applications of preference tuning in downstream tasks, including evaluation methods for different modalities, and an outlook on future research directions. Our objective is to present the latest methodologies in preference tuning and model alignment, enhancing the understanding of this field for researchers and practitioners. We hope to encourage further engagement and innovation in this area.
摘要:偏好调整是使深层生成模型与人类偏好保持一致的关键过程。这项调查对偏好调整和人类反馈整合的最新进展进行了全面的概述。本文分为三个主要部分:1)引言和前言:介绍强化学习框架、偏好调整任务、模型和各种不同模式的数据集:语言、言语和视觉,以及不同的政策方法;2)深入考察每种偏好调整方法:详细分析偏好调整中使用的方法;3)应用、讨论和未来方向:探索偏好调整在下游任务中的应用,包括不同模式的评估方法,以及对未来研究方向的展望。我们的目标是展示偏好调整和模型匹配的最新方法,增进研究人员和实践者对该领域的理解。我们希望鼓励这一领域的进一步参与和创新。

[NLP-43] Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs
[NLP-43] 小型语言模型在简短创意写作中可以胜过人类:一项比较SLM与人类和LLM的研究

链接: https://arxiv.org/abs/2409.11547
作者: Guillermo Marco,Luz Rello,Julio Gonzalo
关键词-EN: fine-tuned small language, small language model, large language models, fiction writing abilities, small language
关键词-ZH: 微调小语言、小语言模型、大语言模型、小说写作能力、小语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART Large, and compare its performance to humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human evaluation where readers assess the stories generated by the SLM compared to human-written stories, and (ii) a qualitative linguistic analysis comparing the textual characteristics of the stories generated by the different models. In the first experiment, we asked 68 participants to rate short stories generated by the models and humans along dimensions such as grammaticality, relevance, creativity, and attractiveness. BART Large outperformed human writers in most aspects, except creativity, with an overall score of 2.11 compared to 1.85 for human-written texts – a 14% improvement. In the second experiment, the qualitative analysis revealed that, while GPT-4o exhibited near-perfect internal and external coherence, it tended to produce more predictable narratives, with only 3% of its stories seen as novel. In contrast, 15% of BART’s stories were considered novel, indicating a higher degree of creativity despite its smaller model size. This study provides both quantitative and qualitative insights into how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks.
摘要:在本文中,我们评估了经过微调的小语言模型(SLM)BART Large的虚构创意写作能力,并将其表现与人类以及两个大语言模型(LLM)GPT-3.5和GPT-4o进行了比较。我们的评估包括两个实验:(i)人类评估:读者对SLM生成的故事与人类撰写的故事进行比较评定;(ii)定性语言分析:比较不同模型生成的故事的文本特征。在第一个实验中,我们让68名参与者从语法性、相关性、创造力和吸引力等维度对模型和人类创作的短篇小说进行评分。BART Large在除创造力以外的大多数方面都优于人类作家,总体得分为2.11,而人类撰写的文本为1.85,提高了14%。在第二个实验中,定性分析表明,虽然GPT-4o表现出近乎完美的内部和外部连贯性,但它倾向于产生更可预测的叙事,只有3%的故事被视为新颖。相比之下,BART有15%的故事被认为是新颖的,表明尽管其模型规模较小,创造性程度却更高。本研究就模型规模和微调如何影响创意写作任务中创造力、流畅性与连贯性之间的平衡,提供了定量和定性两方面的见解。

[NLP-44] Chain-of-Thought Prompting for Speech Translation
[NLP-44] 用于语音翻译的思维链提示

链接: https://arxiv.org/abs/2409.11538
作者: Ke Hu,Zhehuai Chen,Chao-Han Huck Yang,Piotr Żelasko,Oleksii Hrinchuk,Vitaly Lavrukhin,Jagadeesh Balam,Boris Ginsburg
关键词-EN: Large language models, Large language, demonstrated remarkable advancements, understanding and generation, language understanding
关键词-ZH: 大型语言模型,大型语言,展示了显着的进步、理解和生成、语言理解
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation. Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance in automatic speech recognition (ASR) and automatic speech translation (AST). In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM model consists of a speech encoder and an encoder-decoder structure Megatron-T5. By first decoding speech to generate ASR transcripts and subsequently using these transcripts along with encoded speech for prompting, we guide the speech translation in a two-step process like chain-of-thought (CoT) prompting. Low-rank adaptation (LoRA) is used for the T5 LLM for model adaptation and shows superior performance to full model fine-tuning. Experimental results show that the proposed CoT prompting significantly improves AST performance, achieving an average increase of 2.4 BLEU points across 6 En-X or X-En AST tasks compared to speech prompting alone. Additionally, compared to a related CoT prediction method that predicts a concatenated sequence of ASR and AST transcripts, our method performs better by an average of 2 BLEU points.
摘要:大语言模型(LLM)在语言理解和生成方面取得了显著的进步。在基于文本的LLM成功的基础上,最近的研究将这些模型改造为使用语音嵌入进行提示,从而产生了在自动语音识别(ASR)和自动语音翻译(AST)中表现出强大性能的语音LLM模型。在这项工作中,我们提出了一种新方法,在建立于编码器-解码器文本LLM之上的语音LLM中,利用ASR转录作为AST的提示。该语音LLM模型由语音编码器和编码器-解码器结构的Megatron-T5组成。通过首先解码语音以生成ASR转录,然后将这些转录与编码后的语音一起用于提示,我们以类似思维链(CoT)提示的两步过程引导语音翻译。T5 LLM采用低秩自适应(LoRA)进行模型适配,其表现优于全模型微调。实验结果表明,在6个En-X或X-En AST任务中,所提出的CoT提示显著提高了AST性能,与单独使用语音提示相比,BLEU分数平均提高了2.4分。此外,与预测ASR和AST转录串联序列的相关CoT预测方法相比,我们的方法平均高出2个BLEU分。
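下面用一段简化代码示意这种两步式 CoT 提示的组织方式(假设性示例:asr() 与提示模板均为演示用占位;实际系统中第一步由语音编码器加 LLM 解码完成,第二步连同编码语音一起送入 Megatron-T5):

```python
# 假设性示意:类思维链(CoT)的两步语音翻译提示
# asr() 与 build_cot_prompt() 为演示用占位,并非论文的实际接口

def asr(speech) -> str:
    """第一步占位:将语音解码为 ASR 转录文本(演示时直接读取预存转录)。"""
    return speech["transcript"]

def build_cot_prompt(transcript: str, target_lang: str) -> str:
    """第二步:把 ASR 转录作为中间推理结果拼入翻译提示。"""
    return (f"Transcript: {transcript}\n"
            f"Translate the transcript into {target_lang}:")

def translate(speech, target_lang="German"):
    transcript = asr(speech)             # step 1: speech -> text
    prompt = build_cot_prompt(transcript, target_lang)
    return prompt                        # 实际系统中这里交给 LLM 生成译文

speech = {"transcript": "Good morning everyone"}
print(translate(speech))
```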

[NLP-45] Egalitarian Language Representation in Language Models: It All Begins with Tokenizers
[NLP-45] 语言模型中的平等语言表示:一切都始于分词器

链接: https://arxiv.org/abs/2409.11501
作者: Menan Velayuthan,Kengatharaiyer Sarveswaran
关键词-EN: Large Language Models, language models, Byte Pair Encoding, complex script languages, bridge between human
关键词-ZH: 大型语言模型、语言模型、字节对编码、复杂脚本语言、人类之间的桥梁
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Content - 8 pages, References - 3 pages

点击查看摘要

Abstract:Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-Centric Large Language Models (LLMs), efforts are being made to adapt them for other languages. However, we demonstrate that, from a tokenization standpoint, not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi, primarily due to the choice of pre-tokenization methods. We go further to show that pre-tokenization plays a more critical role than the tokenization algorithm itself in achieving an egalitarian representation of these complex script languages. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi.
摘要:分词器(tokenizer)在人类语言和语言模型的潜在空间之间起着桥梁作用,影响着语言在这些模型中的表示方式。由于以英语为中心的大型语言模型(LLM)极为流行,人们正努力使其适配其他语言。然而,我们从分词的角度证明,并非所有分词器都能公平地表示泰米尔语、僧伽罗语和印地语等复杂文字语言,这主要源于预分词(pre-tokenization)方法的选择。我们进一步证明,在实现这些复杂文字语言的平等表示方面,预分词比分词算法本身起着更关键的作用。为了解决这个问题,我们通过引入字素(grapheme)改进了字节对编码(BPE)算法,并称之为字素对编码(GPE)。我们的实验表明,对于复杂文字,基于字素的字符提取优于字节级分词器。我们通过在泰米尔语、僧伽罗语和印地语上的实验验证了这种方法。
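下面用一段简化代码说明"按字素而非码位或字节切分"的差别(假设性示例:用 Unicode 组合标记类别近似字素簇,完整的字素切分应遵循 UAX #29):

```python
# 假设性示意:按字素(grapheme)而非码位/字节切分复杂文字
# 用 Unicode 组合标记类别(Mn/Mc/Me)把附标并入前一个基字符,近似字素簇

import unicodedata

def graphemes(text: str):
    """把组合标记(类别以 'M' 开头)依附到前一个基字符上。"""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 泰米尔语 "தமிழ்" 共 5 个码位,但只有 3 个字素
print(graphemes("தமிழ்"))     # ['த', 'மி', 'ழ்']
print(len("தமிழ்".encode()))  # 字节级分词器则会看到 15 个字节
```

同一个词在字节层面是 15 个单元、码位层面是 5 个单元,而对母语者有意义的书写单位只有 3 个;这正是预分词粒度影响复杂文字表示公平性的直观原因。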

[NLP-46] Multi-Document Grounded Multi-Turn Synthetic Dialog Generation
[NLP-46] 基于多文档的多轮合成对话生成

链接: https://arxiv.org/abs/2409.11500
作者: Young-Suk Lee,Chulaka Gunasekara,Danish Contractor,Ramón Fernandez Astudillo,Radu Florian
关键词-EN: main ideas, introduce a technique, incorporates three main, multi-document grounded, dialog
关键词-ZH: 主要思想,介绍一种技术,包含三个主要的、多文档基础的对话框
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user-turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently out-perform those fine-tuned on existing human generated training data across four publicly available multi-turn document grounded benchmark test sets.
摘要:我们介绍了一种基于多文档的多轮合成对话生成技术,它结合了三个主要思想。首先,我们使用由思维链(CoT)提示生成的、由分类体系驱动的用户查询来控制整体对话流。其次,我们通过模仿现实世界中检索器的使用方式,在对话中每次用户发言后更新作为依据的文档,来支持生成基于多文档的对话。第三,我们应用LLM-as-a-Judge来过滤掉答案不正确的查询。对合成对话数据的人工评估表明,数据多样、连贯,且大多数答案正确。对可回答查询的人工和自动评估均表明,在四个公开的多轮文档问答基准测试集上,基于合成对话微调的模型始终优于基于现有人工生成训练数据微调的模型。

[NLP-47] Augment Drop Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning
[NLP-47] 增强丢弃交换:改善LLM字幕的多样性,以实现高效的音乐文本表示学习

链接: https://arxiv.org/abs/2409.11498
作者: Ilaria Manco,Justin Salamon,Oriol Nieto
关键词-EN: music representation learning, Audio-text contrastive models, Audio-text contrastive, powerful approach, approach in music
关键词-ZH: 音乐表示学习,音频文本对比模型,音频文本对比,强大的方法,音乐方法
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: To appear in the Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

点击查看摘要

Abstract:Audio-text contrastive models have become a powerful approach in music representation learning. Despite their empirical success, however, little is known about the influence of key design choices on the quality of music-text representations learnt through this framework. In this work, we expose these design choices within the constraints of limited data and computation budgets, and establish a more solid understanding of their impact grounded in empirical observations along three axes: the choice of base encoders, the level of curation in training data, and the use of text augmentation. We find that data curation is the single most important factor for music-text contrastive training in resource-constrained scenarios. Motivated by this insight, we introduce two novel techniques, Augmented View Dropout and TextSwap, which increase the diversity and descriptiveness of text inputs seen in training. Through our experiments we demonstrate that these are effective at boosting performance across different pre-training regimes, model architectures, and downstream data distributions, without incurring higher computational costs or requiring additional training data.
摘要:音频-文本对比模型已成为音乐表示学习中一种强大的方法。然而,尽管其在实证上取得了成功,人们对关键设计选择如何影响通过该框架学到的音乐-文本表示的质量仍知之甚少。在这项工作中,我们在数据和计算预算受限的条件下考察这些设计选择,并基于三个维度的实证观察建立对其影响的更扎实的理解:基础编码器的选择、训练数据的整理程度以及文本增强的使用。我们发现,在资源受限的场景中,数据整理是音乐-文本对比训练中最重要的单一因素。受此启发,我们引入了两种新技术:Augmented View Dropout和TextSwap,二者提高了训练中所见文本输入的多样性与描述性。实验表明,这些技术能在不同的预训练方案、模型架构和下游数据分布上有效提升性能,且不会带来更高的计算开销,也不需要额外的训练数据。
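TextSwap与Augmented View Dropout的思路可以用如下假设性草图说明:前者随机替换描述词以增加文本多样性,后者随机丢弃部分增强视图。词表、替换概率等均为示意取值,并非论文设定:

```python
import random

def text_swap(caption, vocab, rng):
    """TextSwap思路的示意:将出现在词表中的描述词随机替换为同类词。"""
    words = caption.split()
    for i, w in enumerate(words):
        if w in vocab and rng.random() < 0.5:
            words[i] = rng.choice([v for v in vocab if v != w])
    return " ".join(words)

def view_dropout(views, rng, p=0.5):
    """Augmented View Dropout思路的示意:随机丢弃增强视图,至少保留一个。"""
    kept = [v for v in views if rng.random() >= p]
    return kept if kept else [views[0]]

rng = random.Random(0)
vocab = ["calm", "upbeat", "mellow"]
print(text_swap("a calm piano piece", vocab, rng))
print(view_dropout(["orig", "aug1", "aug2"], rng))
```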

[NLP-48] Enriching Datasets with Demographics through Large Language Models: What's in a Name?
[NLP-48] 通过大型语言模型用人口统计数据丰富数据集:名字中有什么?

链接: https://arxiv.org/abs/2409.11491
作者: Khaled AlNuaimi,Gautier Marti,Mathieu Ravaut,Abdulla AlKetbi,Andreas Henschel,Raed Jaradat
关键词-EN: public policy, fields like healthcare, social sciences, Enriching datasets, critical task
关键词-ZH: 公共政策、医疗保健、社会科学等领域、丰富数据集、关键任务
类目: Computation and Language (cs.CL)
备注: 8 pages, 7 Tables, 5 Figures

点击查看摘要

Abstract:Enriching datasets with demographic information, such as gender, race, and age from names, is a critical task in fields like healthcare, public policy, and social sciences. Such demographic insights allow for more precise and effective engagement with target populations. Despite previous efforts employing hidden Markov models and recurrent neural networks to predict demographics from names, significant limitations persist: the lack of large-scale, well-curated, unbiased, publicly available datasets, and the lack of an approach robust across datasets. This scarcity has hindered the development of traditional supervised learning approaches. In this paper, we demonstrate that the zero-shot capabilities of Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data. We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong, and critically assess the inherent demographic biases in these models. Our work not only advances the state-of-the-art in demographic enrichment but also opens avenues for future research in mitigating biases in LLMs.
摘要:在医疗保健、公共政策和社会科学等领域,用从姓名推断出的性别、种族和年龄等人口统计信息来丰富数据集是一项关键任务。这类人口统计洞察有助于更精准、更有效地触达目标人群。尽管此前已有研究使用隐马尔可夫模型和循环神经网络根据姓名预测人口统计信息,但仍存在重大限制:缺乏大规模、精心整理、无偏见且公开可用的数据集,也缺乏能在不同数据集之间保持稳健的方法。这种稀缺性阻碍了传统监督学习方法的发展。在本文中,我们证明了大语言模型(LLM)的零样本能力可以媲美甚至优于在专门数据上训练的定制模型。我们将这些LLM应用于多种数据集,包括香港持牌金融专业人员的真实未标注数据集,并严格评估了这些模型固有的人口统计偏见。我们的工作不仅推进了人口统计信息补全的最新水平,也为未来减轻LLM偏见的研究开辟了道路。
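零样本的姓名人口统计推断通常通过受约束的提示词实现。下面是一段假设性草图,演示提示构造与对LLM自由文本输出的稳健解析;提示模板与标签集合均为示例,并非论文原文:

```python
def build_prompt(name):
    """构造约束输出格式的零样本提示(示意模板)。"""
    return (
        f"仅根据姓名[{name}]推测最可能的性别,"
        "只回答'male'、'female'或'unknown'。"
    )

def parse_response(text):
    """稳健解析LLM的自由文本输出;无法识别时回退为unknown。
    注意先匹配'female',因为'male'是它的子串。"""
    t = text.strip().lower()
    for label in ("female", "male"):
        if label in t:
            return label
    return "unknown"

print(build_prompt("Alice Chan"))
print(parse_response("  Female."))  # female
```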

[NLP-49] Jailbreaking Large Language Models with Symbolic Mathematics
[NLP-49] 用符号数学破解大型语言模型

链接: https://arxiv.org/abs/2409.11445
作者: Emet Bethany,Mazal Bethany,Juan Arturo Nolazco Flores,Sumit Kumar Jha,Peyman Najafirad
关键词-EN: unsafe content generation, mitigate unsafe content, Recent advancements, large language models, content generation
关键词-ZH: 不安全内容生成、缓解不安全内容、最新进展、大型语言模型、内容生成
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs’ advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack’s success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.
摘要:近年来AI安全方面的进展促使人们加大了对大语言模型(LLM)的训练与红队测试力度,以减少不安全内容的生成。然而,这些安全机制可能并不全面,留有尚未被发现的潜在漏洞。本文介绍了MathPrompt,一种新的越狱技术,它利用LLM在符号数学上的高级能力绕过其安全机制。通过将有害的自然语言提示编码为数学问题,我们展示了当前AI安全措施中的一个严重漏洞。我们在13个最先进的LLM上进行的实验显示,平均攻击成功率为73.6%,突显了现有安全训练机制无法泛化到数学编码的输入。对嵌入向量的分析显示,原始提示与编码后提示之间存在显著的语义偏移,这有助于解释攻击成功的原因。这项工作强调了对AI安全采取整体方法的重要性,呼吁扩大红队测试工作,针对所有潜在输入类型及其相关风险建立稳健的防护措施。

[NLP-50] AIvril: AI-Driven RTL Generation With Verification In-The-Loop
[NLP-50] AIvril:具有环内验证的人工智能驱动RTL生成

链接: https://arxiv.org/abs/2409.11411
作者: Mubashir ul Islam,Humza Sami,Pierre-Emmanuel Gaillardon,Valerio Tenace
关键词-EN: Large Language Models, computational models capable, performing complex natural, complex natural language, natural language processing
关键词-ZH: 大型语言模型,具有计算能力,执行复杂的自然、复杂的自然语言、自然语言处理
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are computational models capable of performing complex natural language processing tasks. Leveraging these capabilities, LLMs hold the potential to transform the entire hardware design stack, with predictions suggesting that front-end and back-end tasks could be fully automated in the near future. Currently, LLMs show great promise in streamlining Register Transfer Level (RTL) generation, enhancing efficiency, and accelerating innovation. However, their probabilistic nature makes them prone to inaccuracies - a significant drawback in RTL design, where reliability and precision are essential. To address these challenges, this paper introduces AIvril, an advanced framework designed to enhance the accuracy and reliability of RTL-aware LLMs. AIvril employs a multi-agent, LLM-agnostic system for automatic syntax correction and functional verification, significantly reducing - and in many cases, completely eliminating - instances of erroneous code generation. Experimental results conducted on the VerilogEval-Human dataset show that our framework improves code quality by nearly 2x when compared to previous works, while achieving an 88.46% success rate in meeting verification objectives. This represents a critical step toward automating and optimizing hardware design workflows, offering a more dependable methodology for AI-driven RTL design.
摘要:大语言模型(LLM)是能够执行复杂自然语言处理任务的计算模型。利用这些能力,LLM有望改变整个硬件设计栈,有预测认为前端和后端任务在不久的将来可以完全自动化。目前,LLM在简化寄存器传输级(RTL)代码生成、提高效率和加速创新方面显示出巨大前景。然而,它们的概率性质使其容易产生不准确的结果,这在对可靠性和精确度要求极高的RTL设计中是一个重大缺陷。为应对这些挑战,本文提出了AIvril,一个旨在提高RTL感知LLM的准确性与可靠性的先进框架。AIvril采用与具体LLM无关的多智能体系统进行自动语法纠错和功能验证,显著减少(在许多情况下甚至完全消除)错误代码生成的情况。在VerilogEval-Human数据集上的实验结果表明,与以往工作相比,该框架将代码质量提高了近2倍,同时在满足验证目标方面达到了88.46%的成功率。这是朝着自动化和优化硬件设计工作流迈出的关键一步,为AI驱动的RTL设计提供了更可靠的方法。

[NLP-51] Optimizing Performance: How Compact Models Match or Exceed GPT's Classification Capabilities through Fine-Tuning
[NLP-51] 优化性能:紧凑型模型如何通过微调匹配或超越GPT分类能力

链接: https://arxiv.org/abs/2409.11408
作者: Baptiste Lefort,Eric Benhamou,Jean-Jacques Ohana,David Saltiel,Beatrice Guez
关键词-EN: zero-shot learning settings, demonstrate that non-generative, FinBERT and FinDRoBERTa, zero-shot learning, learning settings
关键词-ZH: 零射击学习设置,证明非生成性、FinBERT和FinDRoBERTa、零射击学习、学习设置
类目: Computation and Language (cs.CL); Statistical Finance (q-fin.ST)
备注:

点击查看摘要

Abstract:In this paper, we demonstrate that non-generative, small-sized models such as FinBERT and FinDRoBERTa, when fine-tuned, can outperform GPT-3.5 and GPT-4 models in zero-shot learning settings in sentiment analysis for financial news. These fine-tuned models show comparable results to GPT-3.5 when it is fine-tuned on the task of determining market sentiment from daily financial news summaries sourced from Bloomberg. To fine-tune and compare these models, we created a novel database, which assigns a market score to each piece of news without human interpretation bias, systematically identifying the mentioned companies and analyzing whether their stocks have gone up, down, or remained neutral. Furthermore, the paper shows that the assumptions of Condorcet’s Jury Theorem do not hold suggesting that fine-tuned small models are not independent of the fine-tuned GPT models, indicating behavioural similarities. Lastly, the resulted fine-tuned models are made publicly available on HuggingFace, providing a resource for further research in financial sentiment analysis and text classification.
摘要:在本文中,我们证明了像FinBERT和FinDRoBERTa这样的非生成式小型模型经过微调后,在金融新闻情感分析任务上可以优于零样本学习设置下的GPT-3.5和GPT-4模型。当GPT-3.5针对"从彭博社每日金融新闻摘要判断市场情绪"这一任务进行微调后,这些微调的小模型仍能取得与之相当的结果。为了微调和比较这些模型,我们创建了一个新的数据库,在不引入人为解释偏差的情况下为每条新闻分配市场分数,系统地识别新闻中提及的公司,并分析其股价上涨、下跌还是保持中性。此外,本文表明孔多塞陪审团定理的假设并不成立,说明微调的小模型并不独立于微调的GPT模型,暗示二者存在行为相似性。最后,所得微调模型已在HuggingFace上公开,为金融情感分析和文本分类的进一步研究提供了资源。
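摘要中"根据股价上涨、下跌或保持中性为每条新闻分配市场分数"的标注思路,可以用一个简单的阈值规则示意(阈值为示意取值,并非论文设定):

```python
def market_label(ret, threshold=0.005):
    """根据新闻发布后的股票收益率打标签:涨/跌幅超过阈值记为up/down,
    否则记为neutral。"""
    if ret > threshold:
        return "up"
    if ret < -threshold:
        return "down"
    return "neutral"

print(market_label(0.02))   # up
print(market_label(-0.01))  # down
print(market_label(0.001))  # neutral
```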

[NLP-52] Towards Signal Processing In Large Language Models
[NLP-52] 迈向大型语言模型中的信号处理

链接: https://arxiv.org/abs/2406.10254
作者: Prateek Verma,Mert Pilanci
关键词-EN: Large Language Model, Large Language, Language Model, applying signal processing, signal processing inside
关键词-ZH: 大语言模型,大语言,语言模型,应用信号处理,信号处理内部
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 12 pages, 3 figures

点击查看摘要

Abstract:This paper introduces the idea of applying signal processing inside a Large Language Model (LLM). With the recent explosion of generative AI, our work can help bridge two fields together, namely the field of signal processing and large language models. We draw parallels between classical Fourier-Transforms and Fourier Transform-like learnable time-frequency representations for every intermediate activation signal of an LLM. Once we decompose every activation signal across tokens into a time-frequency representation, we learn how to filter and reconstruct them, with all components learned from scratch, to predict the next token given the previous context. We show that for GPT-like architectures, our work achieves faster convergence and significantly increases performance by adding a minuscule number of extra parameters when trained for the same epochs. We hope this work paves the way for algorithms exploring signal processing inside the signals found in neural architectures like LLMs and beyond.
摘要:本文提出了在大语言模型(LLM)内部应用信号处理的思想。随着生成式AI的迅猛发展,我们的工作有助于将信号处理与大语言模型这两个领域联系起来。对于LLM的每个中间激活信号,我们将经典傅里叶变换与类傅里叶变换的可学习时频表示进行了类比。将每个token上的激活信号分解为时频表示后,我们学习如何对其进行滤波和重建(所有组件均从零开始学习),以便在给定前文的条件下预测下一个token。我们证明,对于类GPT架构,在相同训练轮数下,我们的方法只需增加极少量额外参数即可实现更快的收敛并显著提升性能。我们希望这项工作能为探索LLM等神经架构内部信号的信号处理算法铺平道路。
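论文中"变换-滤波-重建"的处理思路,可以用固定的FFT做一个粗略示意(论文学习的是可学习的时频表示,此处仅以numpy的FFT低通滤波近似说明流程):

```python
import numpy as np

def lowpass_activations(acts, keep=4):
    """把一段token序列上的激活视作一维信号:先做实FFT,
    仅保留前keep个低频分量,再逆变换重建。"""
    spec = np.fft.rfft(acts)
    spec[keep:] = 0  # 滤除高频分量
    return np.fft.irfft(spec, n=len(acts))

clean = np.sin(2 * np.pi * 2 * np.arange(32) / 32)       # 32点内2个周期的正弦
noisy = clean + 0.1 * np.random.RandomState(0).randn(32)  # 加入少量噪声
recon = lowpass_activations(noisy, keep=4)
print(recon.shape)  # (32,)
```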

[NLP-53] Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference ICASSP2025
[NLP-53] 低帧率语音编解码器:专为快速高质量语音LLM训练和推理而设计的编解码器

链接: https://arxiv.org/abs/2409.12117
作者: Edresson Casanova,Ryan Langman,Paarth Neekhara,Shehzeen Hussain,Jason Li,Subhankar Ghosh,Ante Jukić,Sang-gil Lee
关键词-EN: language modeling techniques, significantly advanced audio, advanced audio processing, discrete tokens, enabling the application
关键词-ZH: 语言建模技术、非常先进的音频、先进的音频处理、离散令牌,支持应用程序
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
摘要:大语言模型(LLM)借助把音频转换为离散token的音频编解码器,显著推动了音频处理的发展,使语言建模技术得以应用于音频数据。然而,音频编解码器通常以较高的帧率运行,导致训练和推理缓慢,对自回归模型尤其如此。为应对这一挑战,我们提出了低帧率语音编解码器(LFSC):一种神经音频编解码器,利用有限标量量化以及与大型语音语言模型的对抗训练,在1.89 kbps的比特率和每秒21.5帧的条件下实现高质量音频压缩。我们证明,这一新型编解码器可将基于LLM的文本转语音模型的推理速度提升约三倍,同时提高可懂度,并产生与此前模型相当的质量。
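按摘要给出的数字可以做一个简单换算:1.89 kbps的码率除以每秒21.5帧,相当于每帧约88比特的预算,这解释了为何低帧率能显著加快基于LLM的推理:

```python
def bits_per_frame(bitrate_kbps, fps):
    """由码率(kbps)与帧率(帧/秒)估算每帧所占比特数。"""
    return bitrate_kbps * 1000 / fps

print(bits_per_frame(1.89, 21.5))  # 约87.9比特/帧
```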

人工智能

[AI-0] Vista3D: Unravel the 3D Darkside of a Single Image ECCV’2024

链接: https://arxiv.org/abs/2409.12193
作者: Qiuhong Shen,Xingyi Yang,Michael Bi Mi,Xinchao Wang
关键词-EN: age-old quest, unveiling the hidden, hidden dimensions, Gaussian Splatting, mere glimpses
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multimedia (cs.MM)
*备注: ECCV’2024

点击查看摘要

Abstract:We embark on the age-old quest: unveiling the hidden dimensions of objects from mere glimpses of their visible parts. To address this, we present Vista3D, a framework that realizes swift and consistent 3D generation within a mere 5 minutes. At the heart of Vista3D lies a two-phase approach: the coarse phase and the fine phase. In the coarse phase, we rapidly generate initial geometry with Gaussian Splatting from a single image. In the fine phase, we extract a Signed Distance Function (SDF) directly from learned Gaussian Splatting, optimizing it with a differentiable isosurface representation. Furthermore, it elevates the quality of generation by using a disentangled representation with two independent implicit functions to capture both visible and obscured aspects of objects. Additionally, it harmonizes gradients from 2D diffusion prior with 3D-aware diffusion priors by angular diffusion prior composition. Through extensive evaluation, we demonstrate that Vista3D effectively sustains a balance between the consistency and diversity of the generated 3D objects. Demos and code will be available at this https URL.

[AI-1] DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

链接: https://arxiv.org/abs/2409.12192
作者: Zichen Jeff Cui,Hengkai Pan,Aadhithya Iyer,Siddhant Haldar,Lerrel Pinto
关键词-EN: complex visuomotor policies, training complex visuomotor, visuomotor policies, powerful tool, tool for training
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives, and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure its impact on downstream policy performance. Robot videos are best viewed at this https URL

[AI-2] Qwen2-VL: Enhancing Vision-Language Models Perception of the World at Any Resolution

链接: https://arxiv.org/abs/2409.12191
作者: Peng Wang,Shuai Bai,Sinan Tan,Shijie Wang,Zhihao Fan,Jinze Bai,Keqin Chen,Xuejing Liu,Jialin Wang,Wenbin Ge,Yang Fan,Kai Dang,Mengfei Du,Xuancheng Ren,Rui Men,Dayiheng Liu,Chang Zhou,Jingren Zhou,Junyang Lin
关键词-EN: Naive Dynamic Resolution, conventional predetermined-resolution approach, previous Qwen-VL models, Dynamic Resolution mechanism, Rotary Position Embedding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Code is available at this https URL . arXiv admin note: text overlap with arXiv:2408.15262 by other authors

点击查看摘要

Abstract:We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model’s visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at this https URL.
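动态分辨率机制的核心是视觉token数量随输入分辨率变化。下面是一个示意性估算:图像按patch切分后再做相邻合并;其中patch大小与合并因子均为假设取值,未必与Qwen2-VL的实际配置一致:

```python
import math

def visual_token_count(height, width, patch=14, merge=2):
    """估算一张图像产生的视觉token数:
    先按patch切分,再把相邻merge×merge个patch合并为一个token。"""
    h_p = math.ceil(height / patch)
    w_p = math.ceil(width / patch)
    return math.ceil(h_p / merge) * math.ceil(w_p / merge)

print(visual_token_count(448, 448))  # 256
print(visual_token_count(280, 420))  # 150:不同分辨率产生不同token数
```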

[AI-3] Democratizing MLLMs in Healthcare: TinyLLaVA-Med for Efficient Healthcare Diagnostics in Resource-Constrained Settings

链接: https://arxiv.org/abs/2409.12184
作者: Aya El Mir,Lukelo Thadei Luoga,Boyuan Chen,Muhammad Abdullah Hanif,Muhammad Shafique
关键词-EN: Nvidia Jetson Xavier, Deploying Multi-Modal Large, Multi-Modal Large Language, Large Language Models, Jetson Xavier
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deploying Multi-Modal Large Language Models (MLLMs) in healthcare is hindered by their high computational demands and significant memory requirements, which are particularly challenging for resource-constrained devices like the Nvidia Jetson Xavier. This problem is particularly evident in remote medical settings where advanced diagnostics are needed but resources are limited. In this paper, we introduce an optimization method for the general-purpose MLLM, TinyLLaVA, which we have adapted and renamed TinyLLaVA-Med. This adaptation involves instruction-tuning and fine-tuning TinyLLaVA on a medical dataset by drawing inspiration from the LLaVA-Med training pipeline. Our approach successfully minimizes computational complexity and power consumption, with TinyLLaVA-Med operating at 18.9W and using 11.9GB of memory, while achieving accuracies of 64.54% on VQA-RAD and 70.70% on SLAKE for closed-ended questions. Therefore, TinyLLaVA-Med achieves deployment viability in hardware-constrained environments with low computational resources, maintaining essential functionalities and delivering accuracies close to state-of-the-art models.

[AI-4] To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

链接: https://arxiv.org/abs/2409.12183
作者: Zayne Sprague,Fangcong Yin,Juan Diego Rodriguez,Dongwei Jiang,Manya Wadhwa,Prasann Singhal,Xinyu Zhao,Xi Ye,Kyle Mahowald,Greg Durrett
关键词-EN: large language models, eliciting reasoning capabilities, facto method, method for eliciting, capabilities from large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model’s response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT’s gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
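直接作答与思维链提示的差别,以及自动评测中常见的答案抽取步骤,可以用如下草图说明(提示模板与抽取规则是常见做法的示例,并非论文所用的确切设置):

```python
import re

def direct_prompt(q):
    """直接作答的提示:不要求展示中间推理。"""
    return f"{q}\n直接给出最终答案。"

def cot_prompt(q):
    """思维链提示的常见模板:要求模型先推理再给出固定格式的答案。"""
    return f"{q}\n请一步一步思考,最后以'答案是X'的形式给出答案。"

def extract_answer(text):
    """从模型输出中抽取'答案是'之后的内容,便于自动评测。"""
    m = re.search(r"答案是\s*([^\s。']+)", text)
    return m.group(1) if m else None

print(cot_prompt("12 * 7 = ?"))
print(extract_answer("先算10*7=70,再加2*7=14,所以答案是84。"))  # 84
```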

[AI-5] LifeGPT: Topology-Agnostic Generative Pretrained Transformer Model for Cellular Automata

链接: https://arxiv.org/abs/2409.12182
作者: Jaime A. Berkovich,Markus J. Buehler
关键词-EN: exhibits complex emergent, complex emergent dynamics, cellular automata, exhibits complex, emergent dynamics
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Statistical Mechanics (cond-mat.stat-mech); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:The Game of Life (Life), a well known algorithm within the broader class of cellular automata (CA), exhibits complex emergent dynamics, with extreme sensitivity to initial conditions. Modeling and predicting such intricate behavior without explicit knowledge of the system’s underlying topology presents a significant challenge, motivating the development of algorithms that can generalize across various grid configurations and boundary conditions. We develop a decoder-only generative pretrained transformer model to solve this problem, showing that our model can simulate Life on a toroidal grid with no prior knowledge on the size of the grid, or its periodic boundary conditions (LifeGPT). LifeGPT is topology-agnostic with respect to its training data and our results show that a GPT model is capable of capturing the deterministic rules of a Turing-complete system with near-perfect accuracy, given sufficiently diverse training data. We also introduce the idea of an "autoregressive autoregressor" to recursively implement Life using LifeGPT. Our results pave the path towards true universal computation within a large language model (LLM) framework, synthesizing mathematical analysis with natural language processing, and probing AI systems for situational awareness about the evolution of such algorithms without ever having to compute them. Similar GPTs could potentially solve inverse problems in multicellular self-assembly by extracting CA-compatible rulesets from real-world biological systems to create new predictive models, which would have significant consequences for the fields of bioinspired materials, tissue engineering, and architected materials design.
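作为参照,环面(周期边界)网格上的生命游戏单步演化可以用几行Python实现,这也是为LifeGPT这类模型构造"当前状态到下一状态"训练对的基础(标准Life规则,非论文代码):

```python
def life_step(grid):
    """周期边界条件下的生命游戏单步演化(标准B3/S23规则)。"""
    n, m = len(grid), len(grid[0])
    nxt = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            live = sum(
                grid[(i + di) % n][(j + dj) % m]
                for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)
            )
            nxt[i][j] = 1 if live == 3 or (grid[i][j] and live == 2) else 0
    return nxt

# 5x5网格上的"blinker"振荡器:周期为2,两步后回到初始状态
g = [[0] * 5 for _ in range(5)]
g[2][1] = g[2][2] = g[2][3] = 1
print(life_step(life_step(g)) == g)  # True
```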

[AI-6] Computational Dynamical Systems

链接: https://arxiv.org/abs/2409.12179
作者: Jordan Cotler,Semon Rezchikov
关键词-EN: dynamical systems, finite-dimensional dynamical systems, dynamical, Turing machine, smooth dynamical system
类目: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Dynamical Systems (math.DS)
*备注: 46+14 pages, 6 figures; accepted to FOCS 2024

点击查看摘要

Abstract:We study the computational complexity theory of smooth, finite-dimensional dynamical systems. Building off of previous work, we give definitions for what it means for a smooth dynamical system to simulate a Turing machine. We then show that ‘chaotic’ dynamical systems (more precisely, Axiom A systems) and ‘integrable’ dynamical systems (more generally, measure-preserving systems) cannot robustly simulate universal Turing machines, although such machines can be robustly simulated by other kinds of dynamical systems. Subsequently, we show that any Turing machine that can be encoded into a structurally stable one-dimensional dynamical system must have a decidable halting problem, and moreover an explicit time complexity bound in instances where it does halt. More broadly, our work elucidates what it means for one ‘machine’ to simulate another, and emphasizes the necessity of defining low-complexity ‘encoders’ and ‘decoders’ to translate between the dynamics of the simulation and the system being simulated. We highlight how the notion of a computational dynamical system leads to questions at the intersection of computational complexity theory, dynamical systems theory, and real algebraic geometry.

[AI-7] Expanding Expressivity in Transformer Models with MöbiusAttention

链接: https://arxiv.org/abs/2409.12175
作者: Anna-Maria Halacheva,Mojtaba Nayyeri,Steffen Staab
关键词-EN: Natural Language Processing, revolutionized Natural Language, Language Processing, Natural Language, enabling exceptional modeling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Attention mechanisms and Transformer architectures have revolutionized Natural Language Processing (NLP) by enabling exceptional modeling of long-range dependencies and capturing intricate linguistic patterns. However, their inherent reliance on linear operations in the form of matrix multiplications limits their ability to fully capture inter-token relationships on their own. We propose MöbiusAttention, a novel approach that integrates Möbius transformations within the attention mechanism of Transformer-based models. Möbius transformations are non-linear operations in spaces over complex numbers with the ability to map between various geometries. By incorporating these properties, MöbiusAttention empowers models to learn more intricate geometric relationships between tokens and capture a wider range of information through complex-valued weight vectors. We build and pre-train a BERT and a RoFormer version enhanced with MöbiusAttention, which we then finetune on the GLUE benchmark. We evaluate empirically our approach against the baseline BERT and RoFormer models on a range of downstream tasks. Our approach compares favorably against the baseline models, even with smaller number of parameters suggesting the enhanced expressivity of MöbiusAttention. This research paves the way for exploring the potential of Möbius transformations in the complex projective space to enhance the expressivity and performance of foundation models.
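莫比乌斯变换是复平面上形如 f(z) = (az + b) / (cz + d)(要求 ad - bc ≠ 0)的映射,也就是摘要中注意力机制引入的复数域非线性运算的基本形式。下面的最小示例演示其定义与两个特例:

```python
def mobius(z, a, b, c, d):
    """莫比乌斯变换 f(z) = (az + b) / (cz + d),要求行列式 ad - bc 非零。"""
    assert a * d - b * c != 0
    return (a * z + b) / (c * z + d)

z = complex(1, 2)
print(mobius(z, 1, 0, 0, 1))  # (1+2j):a=d=1、b=c=0时为恒等变换
print(mobius(z, 0, 1, 1, 0))  # 1/z:反演
```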

[AI-8] Semantic Interoperability on Blockchain by Generating Smart Contracts Based on Knowledge Graphs

链接: https://arxiv.org/abs/2409.12171
作者: William Van Woensel,Oshani Seneviratne
关键词-EN: code, smart contract
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Background: Health 3.0 allows decision making to be based on longitudinal data from multiple institutions, from across the patient’s healthcare journey. In such a distributed setting, blockchain smart contracts can act as neutral intermediaries to implement trustworthy decision making. Objective: In a distributed setting, transmitted data will be structured using standards (such as HL7 FHIR) for semantic interoperability. In turn, the smart contract will require interoperability with this standard, implement a complex communication setup (e.g., using oracles), and be developed using blockchain languages (e.g., Solidity). We propose the encoding of smart contract logic using a high-level semantic Knowledge Graph, using concepts from the domain standard. We then deploy this semantic KG on blockchain. Methods: Off-chain, a code generation pipeline compiles the KG into a concrete smart contract, which is then deployed on-chain. Our pipeline targets an intermediary bridge representation, which can be transpiled into a specific blockchain language. Our choice avoids on-chain rule engines, with unpredictable and likely higher computational cost; it is thus in line with the economic rules of blockchain. Results: We applied our code generation approach to generate smart contracts for 3 health insurance cases from Medicare. We discuss the suitability of our approach - the need for a neutral intermediary - for a number of healthcare use cases. Our evaluation finds that the generated contracts perform well in terms of correctness and execution cost (“gas”) on blockchain. Conclusions: We showed that it is feasible to automatically generate smart contract code based on a semantic KG, in a way that respects the economic rules of blockchain. Future work includes studying the use of Large Language Models (LLM) in our approach, and evaluations on other blockchains. 
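摘要描述的"知识图谱到中间桥接表示、再到链上语言"的代码生成流水线,可以用一个极简的模板式生成器示意;规则字段与生成的Solidity片段均为假设示例,并非论文的实际桥接表示:

```python
def generate_contract(rule):
    """由一条结构化规则生成Solidity合约源码字符串(模板式代码生成示意)。"""
    return (
        "// SPDX-License-Identifier: MIT\n"
        "pragma solidity ^0.8.0;\n"
        f"contract {rule['contract']} {{\n"
        f"    function {rule['name']}(uint256 amount) public pure returns (bool) {{\n"
        f"        return amount <= {rule['limit']};\n"
        "    }\n"
        "}\n"
    )

rule = {"contract": "Claim", "name": "withinCoverage", "limit": 5000}
print(generate_contract(rule))
```

离线完成代码生成、仅把生成结果部署上链,避免了链上规则引擎不可预测的计算开销,这与摘要中的设计动机一致。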

[AI-9] The Unreliability of Acoustic Systems in Alzheimer's Speech Datasets with Heterogeneous Recording Conditions

链接: https://arxiv.org/abs/2409.12170
作者: Lara Gauder,Pablo Riera,Andrea Slachevsky,Gonzalo Forno,Adolfo M. Garcia,Luciana Ferrer
关键词-EN: Automated speech analysis, detect early markers, Alzheimer disease, markers of Alzheimer, Automated speech
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 1 figure, 1 table

点击查看摘要

Abstract:Automated speech analysis is a thriving approach to detect early markers of Alzheimer’s disease (AD). Yet, recording conditions in most AD datasets are heterogeneous, with patients and controls often evaluated in different acoustic settings. While this is not a problem for analyses based on speech transcription or features obtained from manual alignment, it does cast serious doubts on the validity of acoustic features, which are strongly influenced by acquisition conditions. We examined this issue in the ADreSSo dataset, derived from the widely used Pitt corpus. We show that systems based on two acoustic features, MFCCs and Wav2vec 2.0 embeddings, can discriminate AD patients from controls with above-chance performance when using only the non-speech part of the audio signals. We replicated this finding in a separate dataset of Spanish speakers. Thus, in these datasets, the class can be partly predicted by recording conditions. Our results are a warning against the use of acoustic systems for identifying patients based on non-standardized recordings. We propose that acoustically heterogeneous datasets for dementia studies should be either (a) analyzed using only transcripts or other features derived from manual annotations, or (b) replaced by datasets collected with strictly controlled acoustic conditions.

[AI-10] NSSR-DIL: Null-Shot Image Super-Resolution Using Deep Identity Learning

链接: https://arxiv.org/abs/2409.12165
作者: Sree Rama Vamsidhar S,Rama Krishna Gorthi
关键词-EN: Deep Identity Learning, employ Deep Learning, existing SotA ISR, ISR, ISR task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The present State-of-the-Art (SotA) Image Super-Resolution (ISR) methods employ Deep Learning (DL) techniques using a large amount of image data. The primary limitation to extending the existing SotA ISR works for real-world instances is their computational and time complexities. In this paper, contrary to the existing methods, we present a novel and computationally efficient ISR algorithm that is independent of the image dataset to learn the ISR task. The proposed algorithm reformulates the ISR task from generating the Super-Resolved (SR) images to computing the inverse of the kernels that span the degradation space. We introduce Deep Identity Learning, exploiting the identity relation between the degradation and inverse degradation models. The proposed approach neither relies on the ISR dataset nor on a single input low-resolution (LR) image (like the self-supervised method i.e. ZSSR) to model the ISR task. Hence we term our model as Null-Shot Super-Resolution Using Deep Identity Learning (NSSR-DIL). The proposed NSSR-DIL model requires fewer computational resources, at least by an order of 10, and demonstrates a competitive performance on benchmark ISR datasets. Another salient aspect of our proposition is that the NSSR-DIL framework detours retraining the model and remains the same for varying scale factors like X2, X3, and X4. This makes our highly efficient ISR model more suitable for real-world applications.
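The kernel-inversion idea at the heart of NSSR-DIL can be illustrated with a toy 1-D deconvolution; the degradation kernel, inverse-kernel length, and least-squares solver below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

# Hypothetical 1-D degradation kernel (a small minimum-phase blur); the
# paper's kernels and identity-learning objective are more general.
k = np.array([0.7, 0.2, 0.1])

# Convolution matrix A such that A @ g == np.convolve(k, g)
g_len = 7
out_len = len(k) + g_len - 1
A = np.zeros((out_len, g_len))
for i in range(g_len):
    A[i:i + len(k), i] = k

# Target: the identity, i.e. a discrete delta at the center of the output.
delta = np.zeros(out_len)
delta[out_len // 2] = 1.0

# Least-squares inverse kernel: conv(k, g) ~= delta.
g, *_ = np.linalg.lstsq(A, delta, rcond=None)
residual = np.linalg.norm(np.convolve(k, g) - delta)
```

Running this, `residual` is small, i.e. the learned kernel undoes the degradation up to truncation error, and no image data was needed, which is the paper's central point.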

[AI-11] Abductive explanations of classifiers under constraints: Complexity and properties ECAI2023

链接: https://arxiv.org/abs/2409.12154
作者: Martin Cooper,Leila Amgoud
关键词-EN: Abductive explanations, decisions of classifiers, understanding decisions, Abductive, AXp
类目: Artificial Intelligence (cs.AI)
*备注: Full version with proofs of Martin C. Cooper and Leila Amgoud, Abductive explanations of classifiers under constraints: Complexity and properties, ECAI 2023, 469-476

点击查看摘要

Abstract:Abductive explanations (AXp’s) are widely used for understanding decisions of classifiers. Existing definitions are suitable when features are independent. However, we show that ignoring constraints when they exist between features may lead to an explosion in the number of redundant or superfluous AXp’s. We propose three new types of explanations that take into account constraints and that can be generated from the whole feature space or from a sample (such as a dataset). They are based on a key notion of coverage of an explanation, the set of instances it explains. We show that coverage is powerful enough to discard redundant and superfluous AXp’s. For each type, we analyse the complexity of finding an explanation and investigate its formal properties. The final result is a catalogue of different forms of AXp’s with different complexities and different formal guarantees.
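A minimal sketch of the paper's key notion of coverage: for a toy Boolean classifier, enumerate the minimal sufficient subsets (classic AXp's) and compute the set of instances each one explains. The classifier and feature space are invented for illustration:

```python
from itertools import product, combinations

def f(x):
    # Toy black-box classifier over three Boolean features (an invented
    # example, not from the paper): positive iff x0 and (x1 or x2).
    return x[0] and (x[1] or x[2])

FEATURES = range(3)
SPACE = list(product([0, 1], repeat=3))

def sufficient(x, subset):
    # Fixing x's values on `subset` must force the same prediction for
    # every completion of the remaining features.
    free = [i for i in FEATURES if i not in subset]
    for vals in product([0, 1], repeat=len(free)):
        y = list(x)
        for i, v in zip(free, vals):
            y[i] = v
        if f(tuple(y)) != f(x):
            return False
    return True

def axps(x):
    # Minimal sufficient subsets: the classic abductive explanations.
    out = []
    for r in range(len(FEATURES) + 1):
        for s in combinations(FEATURES, r):
            if sufficient(x, s) and not any(set(t) <= set(s) for t in out):
                out.append(s)
    return out

def coverage(x, subset):
    # Instances the explanation applies to: those agreeing with x on
    # `subset` (all share x's prediction when the subset is sufficient).
    return {z for z in SPACE if all(z[i] == x[i] for i in subset)}
```

For x = (1, 1, 0) the only AXp is {x0, x1}, and its coverage is the two instances agreeing with x on those two features.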

[AI-12] Decoding Style: Efficient Fine-Tuning of LLMs for Image-Guided Outfit Recommendation with Preference CIKM2024

链接: https://arxiv.org/abs/2409.12150
作者: Najmeh Forouzandehmehr,Nima Farrokhsiar,Ramin Giahi,Evren Korpeoglu,Kannan Achan
关键词-EN: large language models, fashion compatibility understanding, Multimodal Large Language, large language, complex challenge
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: CIKM 2024

点击查看摘要

Abstract:Personalized outfit recommendation remains a complex challenge, demanding both fashion compatibility understanding and trend awareness. This paper presents a novel framework that harnesses the expressive power of large language models (LLMs) for this task, mitigating their “black box” and static nature through fine-tuning and direct feedback integration. We bridge the visual-textual gap in item descriptions by employing image captioning with a Multimodal Large Language Model (MLLM). This enables the LLM to extract style and color characteristics from human-curated fashion images, forming the basis for personalized recommendations. The LLM is efficiently fine-tuned on the open-source Polyvore dataset of curated fashion images, optimizing its ability to recommend stylish outfits. A direct preference mechanism using negative examples is employed to enhance the LLM’s decision-making process. This creates a self-enhancing AI feedback loop that continuously refines recommendations in line with seasonal fashion trends. Our framework is evaluated on the Polyvore dataset, demonstrating its effectiveness in two key tasks: fill-in-the-blank, and complementary item retrieval. These evaluations underline the framework’s ability to generate stylish, trend-aligned outfit suggestions, continuously improving through direct feedback. The evaluation results demonstrated that our proposed framework significantly outperforms the base LLM, creating more cohesive outfits. The improved performance in these tasks underscores the proposed framework’s potential to enhance the shopping experience with accurate suggestions, proving its effectiveness over the vanilla LLM based outfit generation.

[AI-13] Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

链接: https://arxiv.org/abs/2409.12139
作者: EverestAI:Sijin Chen,Yuan Feng,Laipeng He,Tianwei He,Wendi He,Yanni Hu,Bin Lin,Yiting Lin,Pengfei Tan,Chengwei Tian,Chen Wang,Zhicheng Wang,Ruoye Xie,Jingjing Yin,Jianhao Ye,Jixun Yao,Quanlei Yan,Yuguang Yang
关键词-EN: personalized rapid customization, zero-shot personalized rapid, introduce Takin TTS, Takin TTS, including Takin TTS
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:With the advent of the big data and large language model era, zero-shot personalized rapid customization has emerged as a significant trend. In this report, we introduce Takin AudioLLM, a series of techniques and models, mainly including Takin TTS, Takin VC, and Takin Morphing, specifically designed for audiobook production. These models are capable of zero-shot speech production, generating high-quality speech that is nearly indistinguishable from real human speech and facilitating individuals to customize the speech content according to their own needs. Specifically, we first introduce Takin TTS, a neural codec language model that builds upon an enhanced neural speech codec and a multi-task training framework, capable of generating high-fidelity natural speech in a zero-shot way. For Takin VC, we advocate an effective content and timbre joint modeling approach to improve the speaker similarity, while advocating for a conditional flow matching based decoder to further enhance its naturalness and expressiveness. Last, we propose the Takin Morphing system with highly decoupled and advanced timbre and prosody modeling approaches, which enables individuals to customize speech production with their preferred timbre and prosody in a precise and controllable manner. Extensive experiments validate the effectiveness and robustness of our Takin AudioLLM series models. For detailed demos, please refer to this https URL.

[AI-14] GRIN: GRadient-INformed MoE

链接: https://arxiv.org/abs/2409.12136
作者: Liyuan Liu,Young Jin Kim,Shuohang Wang,Chen Liang,Yelong Shen,Hao Cheng,Xiaodong Liu,Masahiro Tanaka,Xiaoxia Wu,Wenxiang Hu,Vishrav Chaudhary,Zeqi Lin,Chenruidong Zhang,Jilong Xue,Hany Awadalla,Jianfeng Gao,Weizhu Chen
关键词-EN: selectively activating, expert routing, scale more effectively, small subset, sparse computation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 58 pages

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16 \times 3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
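Top-2 expert routing, the sparse computation GRIN is designed to train through, can be sketched in a few lines. The dense-math gating below is only illustrative; GRIN's actual contribution, sparse gradient estimation for this discrete choice, is not reproduced here:

```python
import math

def top2_route(logits, expert_fns, x):
    # Pick the two highest-scoring experts, renormalize their gate
    # scores with a softmax, and combine the two expert outputs.
    pairs = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:2]
    zs = [math.exp(score) for _, score in pairs]
    total = sum(zs)
    return sum(
        (z / total) * expert_fns[i](x) for (i, _), z in zip(pairs, zs)
    )

experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: -v]
y = top2_route([0.1, 2.0, 1.0], experts, 1.0)
```

Here experts 1 and 2 are selected with weights softmax(2.0, 1.0), and expert 0 is never evaluated — skipping the unselected experts is the source of MoE's compute savings.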

[AI-15] Almost Sure Convergence of Linear Temporal Difference Learning with Arbitrary Features

链接: https://arxiv.org/abs/2409.12135
作者: Jiuqi Wang,Shangtong Zhang
关键词-EN: Temporal difference, powerful prediction algorithm, reinforcement learning, linear function approximation, classic and powerful
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 30 pages, 0 figures

点击查看摘要

Abstract:Temporal difference (TD) learning with linear function approximation, abbreviated as linear TD, is a classic and powerful prediction algorithm in reinforcement learning. While it is well understood that linear TD converges almost surely to a unique point, this convergence traditionally requires the assumption that the features used by the approximator are linearly independent. However, this linear independence assumption does not hold in many practical scenarios. This work is the first to establish the almost sure convergence of linear TD without requiring linearly independent features. In fact, we do not make any assumptions on the features. We prove that the approximated value function converges to a unique point and the weight iterates converge to a set. We also establish a notion of local stability of the weight iterates. Importantly, we do not need to introduce any other additional assumptions and do not need to make any modification to the linear TD algorithm. Key to our analysis is a novel characterization of bounded invariant sets of the mean ODE of linear TD.
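With one-hot features, linear TD(0) reduces to the familiar tabular update, which makes its fixed point easy to check on a toy two-state chain (the chain, step size, and episode count are illustrative choices; the paper's contribution concerns features that are not linearly independent):

```python
# Two-state chain: s0 -> s1 (reward 0), s1 -> terminal (reward 1).
# With one-hot features, linear TD(0) reduces to the tabular update below.
GAMMA, ALPHA = 0.9, 0.1
w = [0.0, 0.0]  # one weight per one-hot feature

for _ in range(2000):
    # update at s0: TD target is r + gamma * v(s1)
    w[0] += ALPHA * (0.0 + GAMMA * w[1] - w[0])
    # update at s1: the terminal state has value 0
    w[1] += ALPHA * (1.0 + GAMMA * 0.0 - w[1])
```

The iterates converge to the true values v(s1) = 1 and v(s0) = 0.9, the unique TD fixed point for this chain.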

[AI-16] BERT-VBD: Vietnamese Multi-Document Summarization Framework

链接: https://arxiv.org/abs/2409.12134
作者: Tuan-Cuong Vuong,Trang Mai Xuan,Thien Van Luong
关键词-EN: abstractive summarization, tackling the challenge, challenge of Multi-Document, abstractive summarization methods, abstractive summarization techniques
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:In tackling the challenge of Multi-Document Summarization (MDS), numerous methods have been proposed, spanning both extractive and abstractive summarization techniques. However, each approach has its own limitations, making it less effective to rely solely on either one. An emerging and promising strategy involves a synergistic fusion of extractive and abstractive summarization methods. Despite the plethora of studies in this domain, research on the combined methodology remains scarce, particularly in the context of Vietnamese language processing. This paper presents a novel Vietnamese MDS framework leveraging a two-component pipeline architecture that integrates extractive and abstractive techniques. The first component employs an extractive approach to identify key sentences within each document. This is achieved by a modification of the pre-trained BERT network, which derives semantically meaningful phrase embeddings using siamese and triplet network structures. The second component utilizes the VBD-LLaMA2-7B-50b model for abstractive summarization, ultimately generating the final summary document. Our proposed framework demonstrates a positive performance, attaining ROUGE-2 scores of 39.6% on the VN-MDS dataset and outperforming the state-of-the-art baselines.

[AI-17] Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

链接: https://arxiv.org/abs/2409.12122
作者: An Yang,Beichen Zhang,Binyuan Hui,Bofei Gao,Bowen Yu,Chengpeng Li,Dayiheng Liu,Jianhong Tu,Jingren Zhou,Junyang Lin,Keming Lu,Mingfeng Xue,Runji Lin,Tianyu Liu,Xingzhang Ren,Zhenru Zhang
关键词-EN: math-specific large language, large language models, math-specific large, SFT model, SFT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it’s possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in the Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model’s performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.
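The inference-stage use of the reward model can be sketched as best-of-n sampling; the candidate pool and toy reward rule below are assumptions for illustration, not Qwen2.5-Math's learned RM:

```python
def best_of_n(candidates, reward_model):
    # Score every sampled solution with the reward model, keep the best.
    return max(candidates, key=reward_model)

def toy_rm(solution):
    # Stand-in reward rule (an assumption for illustration): prefer
    # solutions ending with the correct final answer.
    return 1.0 if solution.endswith("= 4") else 0.0

samples = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
best = best_of_n(samples, toy_rm)
```

With the real RM, the same pattern reranks n chain-of-thought samples and returns the highest-reward one.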

[AI-18] Pareto Data Framework: Steps Towards Resource-Efficient Decision Making Using Minimum Viable Data (MVD)

链接: https://arxiv.org/abs/2409.12112
作者: Tashfain Ahmed,Josh Siegel
关键词-EN: Minimum Viable Data, Pareto Data Framework, Internet of Things, enabling machine learning, machine learning applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper introduces the Pareto Data Framework, an approach for identifying and selecting the Minimum Viable Data (MVD) required for enabling machine learning applications on constrained platforms such as embedded systems, mobile devices, and Internet of Things (IoT) devices. We demonstrate that strategic data reduction can maintain high performance while significantly reducing bandwidth, energy, computation, and storage costs. The framework identifies Minimum Viable Data (MVD) to optimize efficiency across resource-constrained environments without sacrificing performance. It addresses common inefficient practices in an IoT application such as overprovisioning of sensors and overprecision, and oversampling of signals, proposing scalable solutions for optimal sensor selection, signal extraction and transmission, and data representation. An experimental methodology demonstrates effective acoustic data characterization after downsampling, quantization, and truncation to simulate reduced-fidelity sensors and network and storage constraints; results shows that performance can be maintained up to 95% with sample rates reduced by 75% and bit depths and clip length reduced by 50% which translates into substantial cost and resource reduction. These findings have implications on the design and development of constrained systems. The paper also discusses broader implications of the framework, including the potential to democratize advanced AI technologies across IoT applications and sectors such as agriculture, transportation, and manufacturing to improve access and multiply the benefits of data-driven insights.
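The reduced-fidelity simulation described above (truncation, downsampling, re-quantization) can be sketched directly; the parameter values mirror the reported reductions but are otherwise illustrative:

```python
def reduce_fidelity(signal, keep_every=4, bits=8, clip_frac=0.5):
    # Truncate the clip length, downsample (keep_every=4 cuts the sample
    # rate by 75%), then re-quantize to a lower bit depth. Assumes a mono
    # signal scaled to [-1, 1].
    truncated = signal[:int(len(signal) * clip_frac)]
    downsampled = truncated[::keep_every]
    levels = 2 ** (bits - 1)
    return [round(x * levels) / levels for x in downsampled]

sig = [i / 100 for i in range(-100, 100)]  # 200 toy samples
reduced = reduce_fidelity(sig)
```

The 200-sample toy signal shrinks to 25 samples snapped to an 8-bit grid, the kind of Minimum Viable Data the framework searches for.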

[AI-19] Measuring Human and AI Values based on Generative Psychometrics with Large Language Models

链接: https://arxiv.org/abs/2409.12106
作者: Haoran Ye,Yuhang Xie,Yuanyi Ren,Hanjun Fang,Xin Zhang,Guojie Song
关键词-EN: long-standing interdisciplinary inquiry, measurement, LLM, Human, GPV
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Human values and their measurement are long-standing interdisciplinary inquiry. Recent advances in AI have sparked renewed interest in this area, with large language models (LLMs) emerging as both tools and subjects of value measurement. This work introduces Generative Psychometrics for Values (GPV), an LLM-based, data-driven value measurement paradigm, theoretically grounded in text-revealed selective perceptions. We begin by fine-tuning an LLM for accurate perception-level value measurement and verifying the capability of LLMs to parse texts into perceptions, forming the core of the GPV pipeline. Applying GPV to human-authored blogs, we demonstrate its stability, validity, and superiority over prior psychological tools. Then, extending GPV to LLM value measurement, we advance the current art with 1) a psychometric methodology that measures LLM values based on their scalable and free-form outputs, enabling context-specific measurement; 2) a comparative analysis of measurement paradigms, indicating response biases of prior methods; and 3) an attempt to bridge LLM values and their safety, revealing the predictive power of different value systems and the impacts of various values on LLM safety. Through interdisciplinary efforts, we aim to leverage AI for next-generation psychometrics and psychometrics for value-aligned AI.

[AI-20] IMRL: Integrating Visual, Physical, Temporal, and Geometric Representations for Enhanced Food Acquisition

链接: https://arxiv.org/abs/2409.12092
作者: Rui Liu,Zahiruddin Mahammad,Amisha Bhaskar,Pratap Tokekar
关键词-EN: Robotic assistive feeding, assistive feeding holds, feeding holds significant, holds significant promise, Robotic assistive
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Robotic assistive feeding holds significant promise for improving the quality of life for individuals with eating disabilities. However, acquiring diverse food items under varying conditions and generalizing to unseen food presents unique challenges. Existing methods that rely on surface-level geometric information (e.g., bounding box and pose) derived from visual cues (e.g., color, shape, and texture) often lacks adaptability and robustness, especially when foods share similar physical properties but differ in visual appearance. We employ imitation learning (IL) to learn a policy for food acquisition. Existing methods employ IL or Reinforcement Learning (RL) to learn a policy based on off-the-shelf image encoders such as ResNet-50. However, such representations are not robust and struggle to generalize across diverse acquisition scenarios. To address these limitations, we propose a novel approach, IMRL (Integrated Multi-Dimensional Representation Learning), which integrates visual, physical, temporal, and geometric representations to enhance the robustness and generalizability of IL for food acquisition. Our approach captures food types and physical properties (e.g., solid, semi-solid, granular, liquid, and mixture), models temporal dynamics of acquisition actions, and introduces geometric information to determine optimal scooping points and assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies based on context, improving the robot’s capability to handle diverse food acquisition scenarios. Experiments on a real robot demonstrate our approach’s robustness and adaptability across various foods and bowl configurations, including zero-shot generalization to unseen settings. Our approach achieves improvement up to 35% in success rate compared with the best-performing baseline.

[AI-21] Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques

链接: https://arxiv.org/abs/2409.12087
作者: Yubo Li,Saba Al-Sayouri,Rema Padman
关键词-EN: Chronic Kidney Disease, End-Stage Renal Disease, Kidney Disease, Renal Disease, Chronic Kidney
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10pages, 4 figures, AMIA 2024

点击查看摘要

Abstract:This study explores the potential of utilizing administrative claims data, combined with advanced machine learning and deep learning techniques, to predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major health insurance organization to develop prediction models for multiple observation windows using traditional machine learning methods such as Random Forest and XGBoost as well as deep learning approaches such as Long Short-Term Memory (LSTM) networks. Our findings demonstrate that the LSTM model, particularly with a 24-month observation window, exhibits superior performance in predicting ESRD progression, outperforming existing models in the literature. We further apply SHapley Additive exPlanations (SHAP) analysis to enhance interpretability, providing insights into the impact of individual features on predictions at the individual patient level. This study underscores the value of leveraging administrative claims data for CKD management and predicting ESRD progression.

[AI-22] PAD-FT: A Lightweight Defense for Backdoor Attacks via Data Purification and Fine-Tuning

链接: https://arxiv.org/abs/2409.12072
作者: Yukai Xu,Yujie Gu,Kouichi Sakurai
关键词-EN: deep neural networks, increasingly subtle implantation, neural networks, subtle implantation, pose a significant
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Backdoor attacks pose a significant threat to deep neural networks, particularly as recent advancements have led to increasingly subtle implantation, making the defense more challenging. Existing defense mechanisms typically rely on an additional clean dataset as a standard reference and involve retraining an auxiliary model or fine-tuning the entire victim model. However, these approaches are often computationally expensive and not always feasible in practical applications. In this paper, we propose a novel and lightweight defense mechanism, termed PAD-FT, that does not require an additional clean dataset and fine-tunes only a very small part of the model to disinfect the victim model. To achieve this, our approach first introduces a simple data purification process to identify and select the most-likely clean data from the poisoned training dataset. The self-purified clean dataset is then used for activation clipping and fine-tuning only the last classification layer of the victim model. By integrating data purification, activation clipping, and classifier fine-tuning, our mechanism PAD-FT demonstrates superior effectiveness across multiple backdoor attack methods and datasets, as confirmed through extensive experimental evaluation.
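The activation-clipping step can be sketched as percentile-based capping against the self-purified clean data; the percentile and the per-neuron granularity here are assumptions, not PAD-FT's exact recipe:

```python
def activation_clip(activations, clean_reference, pct=0.95):
    # Cap each activation at a high percentile of the clean data's
    # activations, limiting the abnormally large responses a backdoor
    # trigger tends to produce.
    ranked = sorted(clean_reference)
    threshold = ranked[min(int(pct * len(ranked)), len(ranked) - 1)]
    return [min(a, threshold) for a in activations]

clean = [0.1 * i for i in range(100)]  # clean-data activations: 0.0 .. 9.9
poisoned = [1.0, 5.0, 42.0]            # 42.0 mimics a trigger-like outlier
clipped = activation_clip(poisoned, clean)
```

Normal activations pass through unchanged while the outlier is capped at the clean 95th percentile.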

[AI-23] Generalized Robot Learning Framework

链接: https://arxiv.org/abs/2409.12061
作者: Jiahuan Yan,Zhouyang Hong,Yu Zhao,Yu Tian,Yunxin Liu,Travis Davies,Luhui Hu
关键词-EN: recently gained significant, gained significant attention, robotics field due, Imitation based robot, based robot learning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 6 pages, 2 figures. cs.RO

点击查看摘要

Abstract:Imitation based robot learning has recently gained significant attention in the robotics field due to its theoretical potential for transferability and generalizability. However, it remains notoriously costly, both in terms of hardware and data collection, and deploying it in real-world environments demands meticulous setup of robots and precise experimental conditions. In this paper, we present a low-cost robot learning framework that is both easily reproducible and transferable to various robots and environments. We demonstrate that deployable imitation learning can be successfully applied even to industrial-grade robots, not just expensive collaborative robotic arms. Furthermore, our results show that multi-task robot learning is achievable with simple network architectures and fewer demonstrations than previously thought necessary. As the current evaluating method is almost subjective when it comes to real-world manipulation tasks, we propose Voting Positive Rate (VPR) - a novel evaluation strategy that provides a more objective assessment of performance. We conduct an extensive comparison of success rates across various self-designed tasks to validate our approach. To foster collaboration and support the robot learning community, we have open-sourced all relevant datasets and model checkpoints, available at this http URL.
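One plausible reading of the proposed Voting Positive Rate (VPR), sketched under the assumption of a strict-majority rule over per-trial evaluator votes (the paper defines the exact aggregation):

```python
def voting_positive_rate(trial_votes):
    # A trial counts as positive when a strict majority of evaluators
    # votes yes; VPR is the fraction of positive trials.
    positives = sum(1 for votes in trial_votes if 2 * sum(votes) > len(votes))
    return positives / len(trial_votes)

# Three trials, three evaluators each: yes-votes 2/3, 1/3, 3/3.
vpr = voting_positive_rate([[1, 1, 0], [1, 0, 0], [1, 1, 1]])
```

Aggregating several votes per trial is what makes the score less subjective than a single experimenter's judgment of a manipulation task.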

[AI-24] PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models

链接: https://arxiv.org/abs/2409.12060
作者: Andrianos Michail,Simon Clematide,Juri Opitz
关键词-EN: challenge in NLP, task of determining, NLP, paraphrase, paraphrase detection models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The task of determining whether two texts are paraphrases has long been a challenge in NLP. However, the prevailing notion of paraphrase is often quite simplistic, offering only a limited view of the vast spectrum of paraphrase phenomena. Indeed, we find that evaluating models in a paraphrase dataset can leave uncertainty about their true semantic understanding. To alleviate this, we release paraphrasus, a benchmark designed for multi-dimensional assessment of paraphrase detection models and finer model selection. We find that paraphrase detection models under a fine-grained evaluation lens exhibit trade-offs that cannot be captured through a single classification dataset.

[AI-25] Dual-Layer Training and Decoding of Large Language Model with Simultaneously Thinking and Speaking

链接: https://arxiv.org/abs/2409.12059
作者: Ningyuan Xi,Xiaoyu Wang,Yetao Wu,Teng Chen,Qingqing Gu,Jinxian Qu,Zhonglin Jiang,Yong Chen,Luo Ji
关键词-EN: Large Language Model, generate human expressions, Large Language, human expressions, Language Model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Large Language Model can reasonably understand and generate human expressions but may lack of thorough thinking and reasoning mechanisms. Recently there have been several studies which enhance the thinking ability of language models but most of them are not data-driven or training-based. In this paper, we are motivated by the cognitive mechanism in the natural world, and design a novel model architecture called TaS which allows it to first consider the thoughts and then express the response based upon the query. We design several pipelines to annotate or generate the thought contents from prompt-response samples, then add language heads in a middle layer which behaves as the thinking layer. We train the language model by the thoughts-augmented data and successfully let the thinking layer automatically generate reasonable thoughts and finally output more reasonable responses. Both qualitative examples and quantitative results validate the effectiveness and performance of TaS. Our code is available at https://anonymous.4open.science/r/TadE.

[AI-26] A Unified Framework for Neural Computation and Learning Over Time

链接: https://arxiv.org/abs/2409.12038
作者: Stefano Melacci,Alessandro Betti,Michele Casoni,Tommaso Guidi,Matteo Tiezzi,Marco Gori
关键词-EN: proposes Hamiltonian Learning, paper proposes Hamiltonian, Learning, possibly infinite stream, future information
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper proposes Hamiltonian Learning, a novel unified framework for learning with neural networks “over time”, i.e., from a possibly infinite stream of data, in an online manner, without having access to future information. Existing works focus on the simplified setting in which the stream has a known finite length or is segmented into smaller sequences, leveraging well-established learning strategies from statistical machine learning. In this paper, the problem of learning over time is rethought from scratch, leveraging tools from optimal control theory, which yield a unifying view of the temporal dynamics of neural computations and learning. Hamiltonian Learning is based on differential equations that: (i) can be integrated without the need of external software solvers; (ii) generalize the well-established notion of gradient-based learning in feed-forward and recurrent networks; (iii) open to novel perspectives. The proposed framework is showcased by experimentally proving how it can recover gradient-based learning, comparing it to out-of-the box optimizers, and describing how it is flexible enough to switch from fully-local to partially/non-local computational schemes, possibly distributed over multiple devices, and BackPropagation without storing activations. Hamiltonian Learning is easy to implement and can help researchers approach in a principled and innovative manner the problem of learning over time.
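The claim that the ODE view recovers gradient-based learning can be checked on a toy example: explicit Euler integration of the gradient-flow ODE w' = -∇L(w) is exactly gradient descent (the quadratic loss below is an illustrative stand-in for a network, not the paper's Hamiltonian formulation):

```python
def euler_gradient_flow(grad, w0, step=0.01, steps=1000):
    # Explicit Euler integration of the gradient-flow ODE w' = -grad L(w);
    # each step is exactly one gradient-descent update with rate `step`.
    w = w0
    for _ in range(steps):
        w = w - step * grad(w)
    return w

# Toy quadratic loss L(w) = (w - 3)^2, so grad L(w) = 2 * (w - 3);
# the flow converges to the minimizer w = 3.
w_final = euler_gradient_flow(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

No external ODE solver is needed, mirroring property (i) of the framework.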

[AI-27] Topological Deep Learning with State-Space Models: A Mamba Approach for Simplicial Complexes

链接: https://arxiv.org/abs/2409.12033
作者: Marco Montagna,Simone Scardapane,Lev Telyatnikov
关键词-EN: Graph Neural Networks, Neural Networks based, Neural Networks, handling graph-structured data, Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks based on the message-passing (MP) mechanism are a dominant approach for handling graph-structured data. However, they are inherently limited to modeling only pairwise interactions, making it difficult to explicitly capture the complexity of systems with n-body relations. To address this, topological deep learning has emerged as a promising field for studying and modeling higher-order interactions using various topological domains, such as simplicial and cellular complexes. While these new domains provide powerful representations, they introduce new challenges, such as effectively modeling the interactions among higher-order structures through higher-order MP. Meanwhile, structured state-space sequence models have proven to be effective for sequence modeling and have recently been adapted for graph data by encoding the neighborhood of a node as a sequence, thereby avoiding the MP mechanism. In this work, we propose a novel architecture designed to operate with simplicial complexes, utilizing the Mamba state-space model as its backbone. Our approach generates sequences for the nodes based on the neighboring cells, enabling direct communication between all higher-order structures, regardless of their rank. We extensively validate our model, demonstrating that it achieves competitive performance compared to state-of-the-art models developed for simplicial complexes.
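
The key mechanism, encoding a node's higher-order neighborhood as a sequence so a state-space model can consume it without message passing, can be sketched as follows. The cell ordering (lower-rank first) and the triangle-only input are assumptions for illustration; the paper's actual sequence construction may differ.

```python
from itertools import combinations

def node_sequences(triangles):
    """For each node, build a sequence of incident cells (edges, then triangles),
    so a sequence model such as Mamba can process them without message passing.
    Illustrative only; not the paper's exact encoding."""
    edges = sorted({tuple(sorted(e)) for t in triangles for e in combinations(t, 2)})
    nodes = sorted({v for t in triangles for v in t})
    seqs = {}
    for v in nodes:
        inc_edges = [e for e in edges if v in e]
        inc_tris = [tuple(sorted(t)) for t in triangles if v in t]
        seqs[v] = inc_edges + inc_tris  # lower-rank cells first
    return seqs

seqs = node_sequences([(0, 1, 2), (1, 2, 3)])
print(seqs[1])  # edges containing node 1, then its two incident triangles
```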

[AI-28] Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization

链接: https://arxiv.org/abs/2409.12020
作者: Zhi Chen,Lingxiao Jiang
关键词-EN: rapidly evolving field, organizations presents significant, presents significant challenges, significant challenges due, training
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper accepted to the ASE 2024 Conference Research Track

点击查看摘要

Abstract:In the rapidly evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. The exploration of effective collaborative training settings capable of leveraging valuable knowledge from distributed and isolated datasets is increasingly crucial. This study investigates key factors that impact the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code, demonstrating the promise of such methods. Additionally, we evaluate the memorization of different participant training data across various collaborative training settings, including centralized, federated, and incremental training, highlighting their potential risks in leaking data. Our findings indicate that the size and diversity of code datasets are pivotal factors influencing the success of collaboratively trained code models. We show that federated learning achieves competitive performance compared to centralized training while offering better data protection, as evidenced by lower memorization ratios in the generated code. However, federated learning can still produce verbatim code snippets from hidden training data, potentially violating privacy or copyright. Our study further explores effectiveness and memorization patterns in incremental learning, emphasizing the sequence in which individual participant datasets are introduced. We also identify cross-organizational clones as a prevalent challenge in both centralized and federated learning scenarios. Our findings highlight the persistent risk of data leakage during inference, even when training data remains unseen. We conclude with recommendations for practitioners and researchers to optimize multisource datasets, propelling cross-organizational collaboration forward.
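
A crude version of the verbatim-memorization measurement discussed above can be sketched as an n-gram overlap ratio between generated code and the training corpus. The whitespace tokenization and choice of n are arbitrary assumptions, not the paper's protocol.

```python
def ngram_set(tokens, n=6):
    """All contiguous n-grams of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def memorization_ratio(generated, training_corpus, n=6):
    """Fraction of n-grams in generated code that appear verbatim in the
    training data: a rough proxy for verbatim-memorization risk."""
    gen = ngram_set(generated.split(), n)
    train = set()
    for doc in training_corpus:
        train |= ngram_set(doc.split(), n)
    return len(gen & train) / max(len(gen), 1)

train = ["def add ( a , b ) : return a + b"]
leaked = "def add ( a , b ) : return a + b"   # verbatim copy of training data
novel = "def mul ( a , b ) : return a * b"    # structurally similar, not a copy
print(memorization_ratio(leaked, train), memorization_ratio(novel, train))
```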

[AI-29] Representing Positional Information in Generative World Models for Object Manipulation

链接: https://arxiv.org/abs/2409.12005
作者: Stefano Ferraro,Pietro Mazzaglia,Tim Verbelen,Bart Dhoedt,Sai Rajeswar
关键词-EN: embodied agents engaging, realm of robotics, essential skills, skills that set, set apart embodied
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Object manipulation capabilities are essential skills that set apart embodied agents engaging with the world, especially in the realm of robotics. The ability to predict outcomes of interactions with objects is paramount in this setting. While model-based control methods have started to be employed for tackling manipulation tasks, they have faced challenges in accurately manipulating objects. As we analyze the causes of this limitation, we identify the cause of underperformance in the way current world models represent crucial positional information, especially about the target’s goal specification for object positioning tasks. We introduce a general approach that empowers world model-based agents to effectively solve object-positioning tasks. We propose two declinations of this approach for generative world models: position-conditioned (PCP) and latent-conditioned (LCP) policy learning. In particular, LCP employs object-centric latent representations that explicitly capture object positional information for goal specification. This naturally leads to the emergence of multimodal capabilities, enabling the specification of goals through spatial coordinates or a visual goal. Our methods are rigorously evaluated across several manipulation environments, showing favorable performance compared to current model-based control approaches.

[AI-30] Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2409.12001
作者: Claude Formanek,Louise Beyers,Callum Rhys Tilbury,Jonathan P. Shock,Arnu Pretorius
关键词-EN: multi-agent reinforcement learning, find optimal control, optimal control policies, Offline multi-agent reinforcement, multi-agent systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems. Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results. We first substantiate this claim by surveying the literature, showing how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets. We then show why neglecting the nature of the data is problematic, through salient examples of how tightly algorithmic performance is coupled to the dataset used, necessitating a common foundation for experiments in the field. In response, we take a big step towards improving data usage and data awareness in offline MARL, with three key contributions: (1) a clear guideline for generating novel datasets; (2) a standardisation of over 80 existing datasets, hosted in a publicly available repository, using a consistent storage format and easy-to-use API; and (3) a suite of analysis tools that allow us to understand these datasets better, aiding further development.

[AI-31] AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

链接: https://arxiv.org/abs/2409.11905
作者: Zhaxizhuoma,Pengan Chen,Ziniu Wu,Jiawei Sun,Dong Wang,Peng Zhou,Nieqing Cao,Yan Ding,Bin Zhao,Xuelong Li
关键词-EN: paper presents AlignBot, paper presents, framework designed, designed to optimize, robots by effectively
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders-such as personalized preferences, corrective guidance, and contextual assistance-into structured instruction-formatted cues that prompt GPT-4o in generating customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in real-world household environments, which are constructed within the laboratory to replicate typical household settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders, achieving 86.8% success rate compared to the vanilla GPT-4o baseline at 21.6%, reflecting a 65% improvement and over four times greater effectiveness. Supplementary materials are available at: this https URL

[AI-32] Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation

链接: https://arxiv.org/abs/2409.11904
作者: Dimitrios Christodoulou,Mads Kuhlmann-Jørgensen
关键词-EN: inherently requires subjective, requires subjective judgment, Efficiently evaluating, making it hard, inherently requires
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata’s technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.
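
Aggregating millions of pairwise preference votes into a model ranking can be done, at its simplest, by pairwise win rate. This toy sketch ignores annotator weighting and the study's separate criteria (style preference, coherence, text-to-image alignment); model names are placeholders.

```python
from collections import defaultdict

def rank_models(votes):
    """Rank models by pairwise win rate from annotator votes.
    votes: list of (winner, loser) pairs."""
    wins, total = defaultdict(int), defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return sorted(total, key=lambda m: wins[m] / total[m], reverse=True)

votes = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
print(rank_models(votes))  # highest win rate first
```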

[AI-33] DocMamba: Efficient Document Pre-training with State Space Model

链接: https://arxiv.org/abs/2409.11887
作者: Pengfei Hu,Zhenrong Zhang,Jiefeng Ma,Shuhang Liu,Jun Du,Jianshu Zhang
关键词-EN: attracted increasing attention, visually-rich document understanding, recent years, increasing attention, understanding has attracted
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism’s quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on the HRDoc confirm DocMamba’s potential for length extrapolation. The code will be available online.
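
The linear-complexity claim rests on the state-space recurrence, which processes a sequence in a single scan: each step costs the same regardless of sequence length, unlike quadratic self-attention. A toy scalar-state version, far simpler than DocMamba's selective SSM, looks like this:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time state-space scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    One pass over the sequence; cost is O(length)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t      # update hidden state
        ys.append(float(C @ h))  # read out
    return ys

A = np.array([[0.5]])  # state decay
B = np.array([1.0])
C = np.array([1.0])
print(ssm_scan([1.0, 0.0, 0.0], A, B, C))  # impulse response: [1.0, 0.5, 0.25]
```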

[AI-34] Learning Task Planning from Multi-Modal Demonstration for Multi-Stage Contact-Rich Manipulation

链接: https://arxiv.org/abs/2409.11863
作者: Kejia Chen,Zheng Shen,Yue Zhang,Lingyun Chen,Fan Wu,Zhenshan Bing,Sami Haddadin,Alois Knoll
关键词-EN: Large Language Models, Large Language, Language Models, gained popularity, long-horizon manipulation tasks
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots. In this paper, we introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs’ ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs’ understanding of multi-modal demonstrations and enhancing the overall planning performance.

[AI-35] Retrieve Annotate Evaluate Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation

链接: https://arxiv.org/abs/2409.11860
作者: Kasra Hosseini,Thomas Kober,Josip Krapac,Roland Vollgraf,Weiwei Cheng,Ana Peleteiro Ramallo
关键词-EN: Evaluating production-level retrieval, well-trained human annotators, challenging task due, Large Language Models, production-level retrieval systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
*备注: 13 pages, 5 figures, 4 Tables

点击查看摘要

Abstract:Evaluating production-level retrieval systems at scale is a crucial yet challenging task due to the limited availability of a large pool of well-trained human annotators. Large Language Models (LLMs) have the potential to address this scaling issue and offer a viable alternative to humans for the bulk of annotation tasks. In this paper, we propose a framework for assessing the product search engines in a large-scale e-commerce setting, leveraging Multimodal LLMs for (i) generating tailored annotation guidelines for individual queries, and (ii) conducting the subsequent annotation task. Our method, validated through deployment on a large e-commerce platform, demonstrates comparable quality to human annotations, significantly reduces time and cost, facilitates rapid problem discovery, and provides an effective solution for production-level quality control at scale.

[AI-36] MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts

链接: https://arxiv.org/abs/2409.11844
作者: Tianle Gu,Kexin Huang,Ruilin Luo,Yuanqi Yao,Yujiu Yang,Yan Teng,Yingchun Wang
关键词-EN: Large Language Models, Large Language, memorize sensitive information, raising concerns, potential misuse
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) can memorize sensitive information, raising concerns about potential misuse. LLM Unlearning, a post-hoc approach to remove this information from trained LLMs, offers a promising solution to mitigate these risks. However, previous practices face three key challenges: 1. Utility: successful unlearning often causes catastrophic collapse on unrelated tasks. 2. Efficiency: many methods either involve adding similarly sized models, which slows down unlearning or inference, or require retain data that are difficult to obtain. 3. Robustness: even effective methods may still leak data via extraction techniques. To address these challenges, we propose MEOW, a simple yet effective gradient descent-based unlearning method. Specifically, we use an offline LLM to generate a set of inverted facts. Then, we design a new metric, MEMO, to quantify memorization in LLMs. Finally, based on the signals provided by MEMO, we select the most appropriate set of inverted facts and finetune the model based on them. We evaluate MEOW on the commonly used unlearn benchmark, ToFU, with Llama2-7B-Chat and Phi-1.5B, and test it on both NLU and NLG tasks. Results demonstrate significant improvement of MEOW in forget quality without substantial loss in model utility. Meanwhile, MEOW does not exhibit significant degradation in NLU or NLG capabilities, and there is even a slight improvement in NLU performance.

[AI-37] DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech ICASSP2025

链接: https://arxiv.org/abs/2409.11835
作者: Xin Qi,Ruibo Fu,Zhengqi Wen,Tao Wang,Chunyu Qiang,Jianhua Tao,Chenxing Li,Yi Lu,Shuchen Shi,Zhiyong Wang,Xiaopeng Wang,Yuankun Xie,Yukun Liu,Xuefei Liu,Guanjun Li
关键词-EN: recent years, advanced rapidly, Diffusion Transformer, Directional Patch Interaction, speech diffusion models
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Submitted to ICASSP2025

点击查看摘要

Abstract:In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.

[AI-38] Optimizing Job Shop Scheduling in the Furniture Industry: A Reinforcement Learning Approach Considering Machine Setup Batch Variability and Intralogistics

链接: https://arxiv.org/abs/2409.11820
作者: Malte Schneevogt,Karsten Binninger,Noah Klarmann
关键词-EN: Deep Reinforcement Learning, Deep Reinforcement, Reinforcement Learning, application of Deep, Shop Scheduling Problem
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:This paper explores the potential application of Deep Reinforcement Learning in the furniture industry. To offer a broad product portfolio, most furniture manufacturers are organized as a job shop, which ultimately results in the Job Shop Scheduling Problem (JSSP). The JSSP is addressed with a focus on extending traditional models to better represent the complexities of real-world production environments. Existing approaches frequently fail to consider critical factors such as machine setup times or varying batch sizes. A concept for a model is proposed that provides a higher level of information detail to enhance scheduling accuracy and efficiency. The concept introduces the integration of DRL for production planning, particularly suited to batch production industries such as the furniture industry. The model extends traditional approaches to JSSPs by including job volumes, buffer management, transportation times, and machine setup times. This enables more precise forecasting and analysis of production flows and processes, accommodating the variability and complexity inherent in real-world manufacturing processes. The RL agent learns to optimize scheduling decisions. It operates within a discrete action space, making decisions based on detailed observations. A reward function guides the agent’s decision-making process, thereby promoting efficient scheduling and meeting production deadlines. Two integration strategies for implementing the RL agent are discussed: episodic planning, which is suitable for low-automation environments, and continuous planning, which is ideal for highly automated plants. While episodic planning can be employed as a standalone solution, the continuous planning approach necessitates the integration of the agent with ERP and Manufacturing Execution Systems. This integration enables real-time adjustments to production schedules based on dynamic changes.
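
The extended JSSP ingredients mentioned above (machine setup times, per-job operation sequences, makespan as the quantity a scheduler tries to minimize) can be made concrete with a tiny simulator over the discrete job-pick action space a DRL agent would act in. Everything here, including the instance and the flat setup time, is illustrative; the paper's model also covers buffers, batch volumes, and transport times.

```python
def schedule(jobs, order, setup_time=1):
    """Simulate a tiny job-shop instance with machine setup times.
    jobs: {job: [(machine, duration), ...]}; order: sequence of job picks
    (the discrete actions). Returns the resulting makespan."""
    machine_free = {}               # machine -> time it becomes free
    job_free = {j: 0 for j in jobs} # job -> time its last op finished
    next_op = {j: 0 for j in jobs}
    last_job = {}                   # machine -> last job processed (setup on change)
    makespan = 0
    for j in order:
        m, d = jobs[j][next_op[j]]
        start = max(machine_free.get(m, 0), job_free[j])
        if last_job.get(m) not in (None, j):
            start += setup_time     # changeover penalty
        end = start + d
        machine_free[m], job_free[j] = end, end
        last_job[m] = j
        next_op[j] += 1
        makespan = max(makespan, end)
    return makespan

jobs = {"J1": [("M1", 3), ("M2", 2)], "J2": [("M1", 2), ("M2", 4)]}
print(schedule(jobs, ["J1", "J2", "J1", "J2"]))  # makespan 11
```

A reward function for an RL agent could then be the negative makespan (or negative incremental delay) of the schedule this simulator produces.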

[AI-39] EFCM: Efficient Fine-tuning on Compressed Models for deployment of large models in medical image analysis

链接: https://arxiv.org/abs/2409.11817
作者: Shaojie Li,Zhaoshuo Diao
关键词-EN: medicine shows remarkable, shows remarkable performance, deep learning large, learning large models, recent development
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The recent development of deep learning large models in medicine shows remarkable performance in medical image analysis and diagnosis, but their large number of parameters causes memory and inference latency challenges. Knowledge distillation offers a solution, but the slide-level gradients cannot be backpropagated for student model updates due to high-resolution pathological images and slide-level labels. This study presents an Efficient Fine-tuning on Compressed Models (EFCM) framework with two stages: unsupervised feature distillation and fine-tuning. In the distillation stage, Feature Projection Distillation (FPD) is proposed with a TransScan module for adaptive receptive field adjustment to enhance the knowledge absorption capability of the student model. In the slide-level fine-tuning stage, three strategies (Reuse CLAM, Retrain CLAM, and End2end Train CLAM (ETC)) are compared. Experiments are conducted on 11 downstream datasets related to three large medical models: RETFound for retina, MRM for chest X-ray, and BROW for histopathology. The experimental results demonstrate that the EFCM framework significantly improves accuracy and efficiency in handling slide-level pathological image problems, effectively addressing the challenges of deploying large medical models. Specifically, it achieves a 4.33% increase in ACC and a 5.2% increase in AUC compared to the large model BROW on the TCGA-NSCLC and TCGA-BRCA datasets. The analysis of model inference efficiency highlights the high efficiency of the distillation fine-tuning method.
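
The feature-distillation stage can be sketched as projecting student features into the teacher's feature space and penalizing the gap. This hypothetical FPD-style loss omits the TransScan receptive-field adaptation described in the paper; the dimensions are arbitrary.

```python
import numpy as np

def fpd_loss(student_feat, teacher_feat, W):
    """Feature-projection distillation: map student features into the
    teacher's space via W and take the mean squared gap."""
    projected = student_feat @ W          # (n, d_s) @ (d_s, d_t) -> (n, d_t)
    return float(np.mean((projected - teacher_feat) ** 2))

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 8))         # 4 samples, 8-dim student features
W = rng.normal(size=(8, 16))              # learned projection to 16-dim teacher space
teacher = student @ W                     # a perfectly matched teacher, by construction
print(fpd_loss(student, teacher, W))      # 0.0 by construction
```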

[AI-40] EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning

链接: https://arxiv.org/abs/2409.11813
作者: Yukun Tian,Hao Chen,Yongjian Deng,Feihong Shen,Kepan Liu,Wei You,Ziyang Zhang
关键词-EN: high dynamic range, low time latency, demonstrated significant success, wide range, dynamic range
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The event camera has demonstrated significant success across a wide range of areas due to its low time latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in over-fitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work aims to address this gap by introducing a systematic augmentation scheme named EventAug to enrich spatial-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. Our EventAug can facilitate models learning with richer motion patterns, object variants and local spatio-temporal relations, thus improving model robustness to varied moving speeds, occlusions, and action disruptions. Experiment results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be publicly available for this community.
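
One of the augmentation ideas, diversifying object motion speed along the temporal axis, can be sketched by rescaling event timestamps. The event layout (x, y, t, polarity) and the scaling rule are assumptions, not EventAug's exact MSTI procedure.

```python
import numpy as np

def scale_event_speed(events, factor):
    """Rescale event timestamps relative to the first event to simulate
    faster (factor > 1) or slower (factor < 1) object motion."""
    out = events.copy()
    t = out[:, 2]
    out[:, 2] = t[0] + (t - t[0]) / factor
    return out

# events: rows of (x, y, t, polarity)
ev = np.array([[5.0, 7.0, 0.0, 1.0],
               [5.0, 8.0, 10.0, 1.0],
               [6.0, 8.0, 20.0, -1.0]])
fast = scale_event_speed(ev, 2.0)
print(fast[:, 2])  # timestamps compressed: [0., 5., 10.]
```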

[AI-41] Latent fingerprint enhancement for accurate minutiae detection

链接: https://arxiv.org/abs/2409.11802
作者: Abdul Wahab,Tariq Mahmood Khan,Shahzaib Iqbal,Bandar AlShammari,Bandar Alhaqbani,Imran Razzak
关键词-EN: latent fingerprints, commonly referred, suspects based, based on partial, partial and smudged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Identification of suspects based on partial and smudged fingerprints, commonly referred to as fingermarks or latent fingerprints, presents a significant challenge in the field of fingerprint recognition. Although fixed-length embeddings have shown effectiveness in recognising rolled and slap fingerprints, the methods for matching latent fingerprints have primarily centred around local minutiae-based embeddings, failing to fully exploit global representations for matching purposes. Consequently, enhancing latent fingerprints becomes critical to ensuring robust identification for forensic investigations. Current approaches often prioritise restoring ridge patterns, overlooking the finer minutiae details crucial for accurate fingerprint recognition. To address this, we propose a novel approach that uses generative adversarial networks (GANs) to redefine Latent Fingerprint Enhancement (LFE) through a structured approach to fingerprint generation. By directly optimising the minutiae information during the generation process, the model produces enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth instances. This leads to a significant improvement in identification performance. Our framework integrates minutiae locations and orientation fields, ensuring the preservation of both local and structural fingerprint features. Extensive evaluations conducted on two publicly available datasets demonstrate our method’s dominance over existing state-of-the-art techniques, highlighting its potential to significantly enhance latent fingerprint recognition accuracy in forensic applications.

[AI-42] The Factuality of Large Language Models in the Legal Domain CIKM2024

链接: https://arxiv.org/abs/2409.11798
作者: Rajaa El Hamdani,Thomas Bonald,Fragkiskos Malliaros,Nils Holzenberger,Fabian Suchanek
关键词-EN: large language models, realistic usage scenario, language models, model abstain, usage scenario
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: CIKM 2024, short paper

点击查看摘要

Abstract:This paper investigates the factuality of large language models (LLMs) as knowledge bases in the legal domain, in a realistic usage scenario: we allow for acceptable variations in the answer, and let the model abstain from answering when uncertain. First, we design a dataset of diverse factual questions about case law and legislation. We then use the dataset to evaluate several LLMs under different evaluation methods, including exact, alias, and fuzzy matching. Our results show that the performance improves significantly under the alias and fuzzy matching methods. Further, we explore the impact of abstaining and in-context examples, finding that both strategies enhance precision. Finally, we demonstrate that additional pre-training on legal documents, as seen with SaulLM, further improves factual precision from 63% to 81%.
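
The three evaluation regimes (exact, alias, and fuzzy matching) can be sketched with Python's difflib. The normalization and similarity threshold below are assumptions; the paper does not specify these values here, and the case names are only examples.

```python
from difflib import SequenceMatcher

def match(prediction, gold, aliases=(), threshold=0.85):
    """Grade an answer under three regimes: exact match, alias match,
    then fuzzy string similarity; otherwise a miss."""
    norm = lambda s: s.strip().lower()
    p = norm(prediction)
    if p == norm(gold):
        return "exact"
    if any(p == norm(a) for a in aliases):
        return "alias"
    if SequenceMatcher(None, p, norm(gold)).ratio() >= threshold:
        return "fuzzy"
    return "miss"

print(match("Roe v. Wade", "Roe v. Wade"))                           # exact
print(match("Roe vs Wade", "Roe v. Wade", aliases=["Roe vs Wade"]))  # alias
print(match("Roe v Wade", "Roe v. Wade"))                            # fuzzy
```

Relaxing from exact to alias/fuzzy matching is precisely what lifts measured precision in the study.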

[AI-43] Efficient Low-Resolution Face Recognition via Bridge Distillation

链接: https://arxiv.org/abs/2409.11786
作者: Shiming Ge,Shengwei Zhao,Chenyu Li,Yu Zhang,Jia Li
关键词-EN: fast inference speed, fast inference, faces, private high-resolution faces, Face recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: This paper is published in IEEE TIP 2020

点击查看摘要

Abstract:Face recognition in the wild is now advancing towards light-weight models, fast inference speed and resolution-adapted capability. In this paper, we propose a bridge distillation approach to turn a complex face model pretrained on private high-resolution faces into a light-weight one for low-resolution face recognition. In our approach, such a cross-dataset resolution-adapted knowledge transfer problem is solved via two-step distillation. In the first step, we conduct cross-dataset distillation to transfer the prior knowledge from private high-resolution faces to public high-resolution faces and generate compact and discriminative features. In the second step, the resolution-adapted distillation is conducted to further transfer the prior knowledge to synthetic low-resolution faces via multi-task learning. By learning low-resolution face representations and mimicking the adapted high-resolution knowledge, a light-weight student model can be constructed with high efficiency and promising accuracy in recognizing low-resolution faces. Experimental results show that the student model performs impressively in recognizing low-resolution faces with only 0.21M parameters and 0.057MB memory. Meanwhile, its speed reaches up to 14,705, ~934 and 763 faces per second on GPU, CPU and mobile phone, respectively.

[AI-44] Distilling Channels for Efficient Deep Tracking

链接: https://arxiv.org/abs/2409.11785
作者: Shiming Ge,Zhao Luo,Chunhui Zhang,Yingying Hua,Dacheng Tao
关键词-EN: proven success, success in visual, Deep, Deep trackers, deep networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Published by IEEE TIP 2020

点击查看摘要

Abstract:Deep trackers have proven success in visual tracking. Typically, these trackers employ optimally pre-trained deep networks to represent all diverse objects with multi-channel features from some fixed layers. The deep networks employed are usually trained to extract rich knowledge from massive data used in object classification and so they are capable of representing generic objects very well. However, these networks are too complex to represent a specific moving object, leading to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework termed channel distillation to facilitate deep trackers. To validate the effectiveness of channel distillation, we take the discriminative correlation filter (DCF) and ECO as examples. We demonstrate that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem to adaptively select informative feature channels that improve the efficacy of tracking moving objects on the fly. Channel distillation can accurately extract good channels, alleviating the influence of noisy channels and generally reducing the number of channels, as well as adaptively generalizing to different channels and networks. The resulting deep tracker is accurate, fast, and has low memory requirements. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.

[AI-45] Explaining Non-monotonic Normative Reasoning using Argumentation Theory with Deontic Logic

链接: https://arxiv.org/abs/2409.11780
作者: Zhe Yu,Yiwei Lu
关键词-EN: provide legal support, provided a reasoning, design process, theory to provide, previous research
类目: Artificial Intelligence (cs.AI)
*备注: 13 pages

点击查看摘要

Abstract:In our previous research, we provided a reasoning system (called LeSAC) based on argumentation theory to provide legal support to designers during the design process. Building on this, this paper explores how to provide designers with effective explanations for their legally relevant design decisions. We extend the previous system for providing explanations by specifying norms and the key legal or ethical principles for justifying actions in normative contexts. Considering that first-order logic has strong expressive power, in the current paper we adopt a first-order deontic logic system with deontic operators and preferences. We illustrate the advantages and necessity of introducing deontic logic and designing explanations under LeSAC by modelling two cases in the context of autonomous driving. In particular, this paper also discusses the requirements of the updated LeSAC to guarantee rationality, and proves that a well-defined LeSAC can satisfy the rationality postulate for rule-based argumentation frameworks. This ensures the system’s ability to provide coherent, legally valid explanations for complex design decisions.

[AI-46] Knowledge Adaptation Network for Few-Shot Class-Incremental Learning

链接: https://arxiv.org/abs/2409.11770
作者: Ye Wang,Yaxiong Wang,Guoshuai Zhao,Xueming Qian
关键词-EN: Few-shot class-incremental learning, Few-shot class-incremental, aims to incrementally, incrementally recognize, samples while maintaining
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages; 6 figures

点击查看摘要

Abstract:Few-shot class-incremental learning (FSCIL) aims to incrementally recognize new classes using a few samples while maintaining the performance on previously learned classes. One of the effective methods to solve this challenge is to construct prototypical evolution classifiers. Despite the advancement achieved by most existing methods, the classifier weights are simply initialized using mean features. Because representations for new classes are weak and biased, we argue such a strategy is suboptimal. In this paper, we tackle this issue from two aspects. Firstly, thanks to the development of foundation models, we employ a foundation model, the CLIP, as the network pedestal to provide a general representation for each class. Secondly, to generate a more reliable and comprehensive instance representation, we propose a Knowledge Adapter (KA) module that summarizes the data-specific knowledge from training data and fuses it into the general representation. Additionally, to tune the knowledge learned from the base classes to the upcoming classes, we propose a mechanism of Incremental Pseudo Episode Learning (IPEL) by simulating the actual FSCIL. Taken together, our proposed method, dubbed Knowledge Adaptation Network (KANet), achieves competitive performance on a wide range of datasets, including CIFAR100, CUB200, and ImageNet-R.
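The mean-feature initialization baseline the abstract critiques can be sketched in a few lines: each new class's classifier weight is the mean of its few support embeddings, and prediction picks the prototype with the largest dot product. This is the simple strategy the paper argues is suboptimal, not the proposed KANet; names are ours.

```python
def mean_prototype(embeddings):
    """Classifier weight for a new class: the mean of its support embeddings."""
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]

def classify(x, prototypes):
    """Predict the class whose prototype scores highest by dot product."""
    scores = {c: sum(a * b for a, b in zip(x, p)) for c, p in prototypes.items()}
    return max(scores, key=scores.get)
```

KANet instead fuses a general CLIP representation with data-specific knowledge before forming the class representation.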

[AI-47] One Map to Find Them All: Real-time Open-Vocabulary Mapping for Zero-shot Multi-Object Navigation

链接: https://arxiv.org/abs/2409.11764
作者: Finn Lukas Busch,Timon Homberger,Jesús Ortega-Peimbert,Quantao Yang,Olov Andersson
关键词-EN: real-world robot applications, robot applications, navigation, Jetson Orin AGX, complex environments
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The capability to efficiently search for objects in complex environments is fundamental for many real-world robot applications. Recent advances in open-vocabulary vision models have resulted in semantically-informed object navigation methods that allow a robot to search for an arbitrary object without prior training. However, these zero-shot methods have so far treated the environment as unknown for each consecutive query. In this paper we introduce a new benchmark for zero-shot multi-object navigation, allowing the robot to leverage information gathered from previous searches to more efficiently find new objects. To address this problem we build a reusable open-vocabulary feature map tailored for real-time object search. We further propose a probabilistic-semantic map update that mitigates common sources of errors in semantic feature extraction and leverage this semantic uncertainty for informed multi-object exploration. We evaluate our method on a set of object navigation tasks in both simulation as well as with a real robot, running in real-time on a Jetson Orin AGX. We demonstrate that it outperforms existing state-of-the-art approaches both on single and multi-object navigation tasks. Additional videos, code and the multi-object navigation benchmark will be available on this https URL.

[AI-48] Synthesizing Evolving Symbolic Representations for Autonomous Systems

链接: https://arxiv.org/abs/2409.11756
作者: Gabriele Sartor,Angelo Oddi,Riccardo Rasconi,Vieri Giuliano Santucci,Rosa Meo
关键词-EN: made remarkable progress, made remarkable, remarkable progress, Deep Reinforcement Learning, Recently
类目: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:Recently, AI systems have made remarkable progress in various tasks. Deep Reinforcement Learning (DRL) is an effective tool for agents to learn policies in low-level state spaces to solve highly complex tasks. Researchers have introduced Intrinsic Motivation (IM) to the RL mechanism, which simulates the agent’s curiosity, encouraging agents to explore interesting areas of the environment. This new feature has proved vital in enabling agents to learn policies without being given specific goals. However, even though DRL intelligence emerges through a sub-symbolic model, there is still a need for a sort of abstraction to understand the knowledge collected by the agent. To this end, the classical planning formalism has been used in recent research to explicitly represent the knowledge an autonomous agent acquires and effectively reach extrinsic goals. Although classical planning usually presents limited expressive capabilities, PPDDL has demonstrated usefulness in reviewing the knowledge gathered by an autonomous system, making explicit causal correlations, and can be exploited to find a plan to reach any state the agent faces during its experience. This work presents a new architecture implementing an open-ended learning system able to synthesize from scratch its experience into a PPDDL representation and update it over time. Without a predefined set of goals and tasks, the system integrates intrinsic motivations to explore the environment in a self-directed way, exploiting the high-level knowledge acquired during its experience. The system explores the environment and iteratively: (a) discover options, (b) explore the environment using options, (c) abstract the knowledge collected, and (d) plan. This paper proposes an alternative approach to implementing open-ended learning architectures exploiting low-level and high-level representations to extend its knowledge in a virtuous loop.

[AI-49] NPAT Null-Space Projected Adversarial Training Towards Zero Deterioration

链接: https://arxiv.org/abs/2409.11754
作者: Hanyi Hu,Qiao Han,Kui Chen,Yao Yang
关键词-EN: effective defense strategy, Projected Data Augmentation, Projected Gradient Descent, Null-space Projected Data, defense strategy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:To mitigate the susceptibility of neural networks to adversarial attacks, adversarial training has emerged as a prevalent and effective defense strategy. Intrinsically, this countermeasure incurs a trade-off, as it sacrifices the model’s accuracy in processing normal samples. To reconcile the trade-off, we pioneer the incorporation of null-space projection into adversarial training and propose two innovative Null-space Projection based Adversarial Training (NPAT) algorithms tackling sample generation and gradient optimization, named Null-space Projected Data Augmentation (NPDA) and Null-space Projected Gradient Descent (NPGD), to search for overarching optimal solutions that enhance robustness with almost zero deterioration in generalization performance. Adversarial samples and perturbations are constrained within the null-space of the decision boundary utilizing a closed-form null-space projector, effectively mitigating the threat of attacks stemming from unreliable features. Subsequently, we conducted experiments on the CIFAR10 and SVHN datasets, which reveal that our methodology can seamlessly combine with adversarial training methods and obtain comparable robustness while keeping generalization close to a high-accuracy model.
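For the simplest case of a single decision-boundary normal vector w, a closed-form null-space projector is P = I - w wᵀ / (wᵀ w): a perturbation projected with P has no component along w, so it cannot push a point across that boundary. The sketch below illustrates only this one-vector special case of the idea; it is not the paper's NPDA/NPGD algorithms, and the names are ours.

```python
def project_to_null_space(delta, w):
    """Remove from perturbation delta its component along the normal w,
    i.e. apply P = I - w w^T / (w^T w) without forming the matrix."""
    ww = sum(x * x for x in w)
    coef = sum(d * x for d, x in zip(delta, w)) / ww
    return [d - coef * x for d, x in zip(delta, w)]
```

After projection the perturbation is orthogonal to w by construction, which is the constraint NPAT imposes on adversarial samples and gradients.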

[AI-50] Exploring Gaze Pattern in Autistic Children: Clustering Visualization and Prediction

链接: https://arxiv.org/abs/2409.11744
作者: Weiyan Shi,Haihong Zhang,Jin Yang,Ruiqing Ding,YongWei Zhu,Kenny Tsu Wei Choo
关键词-EN: Autism Spectrum Disorder, Autism Spectrum, Spectrum Disorder, ASD, gaze
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Autism Spectrum Disorder (ASD) significantly affects the social and communication abilities of children, and eye-tracking is commonly used as a diagnostic tool by identifying associated atypical gaze patterns. Traditional methods demand manual identification of Areas of Interest in gaze patterns, lowering the performance of gaze behavior analysis in ASD subjects. To tackle this limitation, we propose a novel method to automatically analyze gaze behaviors in ASD children with superior accuracy. To be specific, we first apply and optimize seven clustering algorithms to automatically group gaze points to compare ASD subjects with typically developing peers. Subsequently, we extract 63 significant features to fully describe the patterns. These features can describe correlations between ASD diagnosis and gaze patterns. Lastly, using these features as prior knowledge, we train multiple predictive machine learning models to predict and diagnose ASD based on their gaze behaviors. To evaluate our method, we apply our method to three ASD datasets. The experimental and visualization results demonstrate the improvements of clustering algorithms in the analysis of unique gaze patterns in ASD children. Additionally, these predictive machine learning models achieved state-of-the-art prediction performance (81% AUC) in the field of automatically constructed gaze point features for ASD diagnosis. Our code is available at this https URL.
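The first stage of the pipeline, automatically grouping gaze points, can be illustrated with a bare-bones k-means (one classic member of the clustering-algorithm family the abstract mentions). This is our minimal sketch with deterministic first-k initialization, not the authors' optimized code.

```python
def kmeans(points, k, iters=20):
    """Tiny k-means over 2D gaze points: assign each point to its nearest
    center, then recompute centers as cluster means, for a fixed number
    of iterations. Initializes centers from the first k points."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        centers = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

In the paper, per-cluster statistics over such groupings feed the 63 features used for the downstream ASD classifiers.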

[AI-51] HARP: Human-Assisted Regrouping with Permutation Invariant Critic for Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2409.11741
作者: Huawen Hu,Enze Shi,Chenxi Yue,Shuocun Yang,Zihao Wu,Yiwei Li,Tianyang Zhong,Tuo Zhang,Tianming Liu,Shu Zhang
关键词-EN: provide critical guidance, complex fields, Permutation Invariant Critic, expertise to accelerate, provide critical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Human-in-the-loop reinforcement learning integrates human expertise to accelerate agent learning and provide critical guidance and feedback in complex fields. However, many existing approaches focus on single-agent tasks and require continuous human involvement during the training process, significantly increasing the human workload and limiting scalability. In this paper, we propose HARP (Human-Assisted Regrouping with Permutation Invariant Critic), a multi-agent reinforcement learning framework designed for group-oriented tasks. HARP integrates automatic agent regrouping with strategic human assistance during deployment, allowing non-experts to offer effective guidance with minimal intervention. During training, agents dynamically adjust their groupings to optimize collaborative task completion. When deployed, they actively seek human assistance and utilize the Permutation Invariant Group Critic to evaluate and refine human-proposed groupings, allowing non-expert users to contribute valuable suggestions. In multiple collaboration scenarios, our approach is able to leverage limited guidance from non-experts and enhance performance. The project can be found at this https URL.

[AI-52] InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models

链接: https://arxiv.org/abs/2409.11734
作者: Yan Zheng,Lemeng Wu
关键词-EN: customized user requirements, short for GEO, exceptionally versatile image, global scales, exceptionally versatile
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:In this paper, we introduce Geometry-Inverse-Meet-Pixel-Insert, short for GEO, an exceptionally versatile image editing technique designed to cater to customized user requirements at both local and global scales. Our approach seamlessly integrates text prompts and image prompts to yield diverse and precise editing outcomes. Notably, our method operates without the need for training and is driven by two key contributions: (i) a novel geometric accumulation loss that enhances DDIM inversion to faithfully preserve pixel space geometry and layout, and (ii) an innovative boosted image prompt technique that combines pixel-level editing for text-only inversion with latent space geometry guidance for standard classifier-free reversion. Leveraging the publicly available Stable Diffusion model, our approach undergoes extensive evaluation across various image types and challenging prompt editing scenarios, consistently delivering high-fidelity editing results for real images.

[AI-53] GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation

链接: https://arxiv.org/abs/2409.11689
作者: Shuowen Liang,Sisi Li,Qingyun Wang,Cen Zhang,Kaiquan Zhu,Tian Yang
关键词-EN: pose-controllable image generation, important reference, reference in pose-controllable, Pose skeleton, Pose skeleton images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pose skeleton images are an important reference in pose-controllable image generation. In order to enrich the source of skeleton images, recent works have investigated the generation of pose skeletons based on natural language. These methods are based on GANs. However, it remains challenging to perform diverse, structurally correct and aesthetically pleasing human pose skeleton generation with various textual inputs. To address this problem, we propose a framework with GUNet as the main model, PoseDiffusion. It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates several desired properties that outperform existing methods. 1) Correct Skeletons. GUNet, a denoising model of PoseDiffusion, is designed to incorporate graphical convolutional neural networks. It is able to learn the spatial relationships of the human skeleton by introducing skeletal information during the training process. 2) Diversity. We decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.

[AI-54] Detecting Underdiagnosed Medical Conditions with Deep Learning-Based Opportunistic CT Imaging

链接: https://arxiv.org/abs/2409.11686
作者: Asad Aali,Andrew Johnston,Louis Blankemeier,Dave Van Veen,Laura T Derry,David Svec,Jason Hom,Robert D. Boutin,Akshay S. Chaudhari
关键词-EN: Abdominal computed tomography, Abdominal computed, computed tomography, frequently performed, clinical settings
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Abdominal computed tomography (CT) scans are frequently performed in clinical settings. Opportunistic CT involves repurposing routine CT images to extract diagnostic information and is an emerging tool for detecting underdiagnosed conditions such as sarcopenia, hepatic steatosis, and ascites. This study utilizes deep learning methods to promote accurate diagnosis and clinical documentation. We analyze 2,674 inpatient CT scans to identify discrepancies between imaging phenotypes (characteristics derived from opportunistic CT scans) and their corresponding documentation in radiology reports and ICD coding. Through our analysis, we find that only 0.5%, 3.2%, and 30.7% of scans diagnosed with sarcopenia, hepatic steatosis, and ascites (respectively) through either opportunistic imaging or radiology reports were ICD-coded. Our findings demonstrate opportunistic CT’s potential to enhance diagnostic precision and accuracy of risk adjustment models, offering advancements in precision medicine.

[AI-55] Hypergraph-based Motion Generation with Multi-modal Interaction Relational Reasoning

链接: https://arxiv.org/abs/2409.11676
作者: Keshu Wu,Yang Zhou,Haotian Shi,Dominique Lord,Bin Ran,Xinyue Ye
关键词-EN: presents considerable challenges, presents considerable, future states, intricate nature, accurately predicting
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:The intricate nature of real-world driving environments, characterized by dynamic and diverse interactions among multiple vehicles and their possible future states, presents considerable challenges in accurately predicting the motion states of vehicles and handling the uncertainty inherent in the predictions. Addressing these challenges requires comprehensive modeling and reasoning to capture the implicit relations among vehicles and the corresponding diverse behaviors. This research introduces an integrated framework for autonomous vehicles (AVs) motion prediction to address these complexities, utilizing a novel Relational Hypergraph Interaction-informed Neural mOtion generator (RHINO). RHINO leverages hypergraph-based relational reasoning by integrating a multi-scale hypergraph neural network to model group-wise interactions among multiple vehicles and their multi-modal driving behaviors, thereby enhancing motion prediction accuracy and reliability. Experimental validation using real-world datasets demonstrates the superior performance of this framework in improving predictive accuracy and fostering socially aware automated driving in dynamic traffic scenarios.

[AI-56] Towards Explainable Goal Recognition Using Weight of Evidence (WoE): A Human-Centered Approach

链接: https://arxiv.org/abs/2409.11675
作者: Abeer Alshehri,Amal Abdulrahman,Hajar Alamri,Tim Miller,Mor Vered
关键词-EN: agent unobserved goal, Goal recognition, involves inferring, sequence of observations, unobserved goal
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Goal recognition (GR) involves inferring an agent’s unobserved goal from a sequence of observations. This is a critical problem in AI with diverse applications. Traditionally, GR has been addressed using ‘inference to the best explanation’ or abduction, where hypotheses about the agent’s goals are generated as the most plausible explanations for observed behavior. Alternatively, some approaches enhance interpretability by ensuring that an agent’s behavior aligns with an observer’s expectations or by making the reasoning behind decisions more transparent. In this work, we tackle a different challenge: explaining the GR process in a way that is comprehensible to humans. We introduce and evaluate an explainable model for goal recognition (GR) agents, grounded in the theoretical framework and cognitive processes underlying human behavior explanation. Drawing on insights from two human-agent studies, we propose a conceptual framework for human-centered explanations of GR. Using this framework, we develop the eXplainable Goal Recognition (XGR) model, which generates explanations for both why and why not questions. We evaluate the model computationally across eight GR benchmarks and through three user studies. The first study assesses the efficiency of generating human-like explanations within the Sokoban game domain, the second examines perceived explainability in the same domain, and the third evaluates the model’s effectiveness in aiding decision-making in illegal fishing detection. Results demonstrate that the XGR model significantly enhances user understanding, trust, and decision-making compared to baseline models, underscoring its potential to improve human-agent collaboration.
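The "weight of evidence" in the title is, in its classical form, the log-likelihood ratio of an observation under a goal hypothesis versus its complement: positive values support the goal, negative values count against it. The sketch below shows only this general statistical concept, which the XGR model builds on; the paper's full explanation framework is much richer.

```python
import math

def weight_of_evidence(p_obs_given_goal, p_obs_given_not_goal):
    """WoE for a goal hypothesis given one observation:
    log( P(obs | goal) / P(obs | not goal) )."""
    return math.log(p_obs_given_goal / p_obs_given_not_goal)
```

Summing WoE over a sequence of observations gives the total evidential support for a goal, which is one natural basis for answering both "why" and "why not" questions.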

[AI-57] Anticipating Oblivious Opponents in Stochastic Games

链接: https://arxiv.org/abs/2409.11671
作者: Shadi Tasdighi Kalat,Sriram Sankaranarayanan,Ashutosh Trivedi
关键词-EN: concurrent stochastic games, stochastic games, systematically anticipating, concurrent stochastic, information state machine
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We present an approach for systematically anticipating the actions and policies employed by oblivious environments in concurrent stochastic games, while maximizing a reward function. Our main contribution lies in the synthesis of a finite information state machine whose alphabet ranges over the actions of the environment. Each state of the automaton is mapped to a belief state about the policy used by the environment. We introduce a notion of consistency that guarantees that the belief states tracked by our automaton stays within a fixed distance of the precise belief state obtained by knowledge of the full history. We provide methods for checking consistency of an automaton and a synthesis approach which upon successful termination yields such a machine. We show how the information state machine yields an MDP that serves as the starting point for computing optimal policies for maximizing a reward function defined over plays. We present an experimental evaluation over benchmark examples including human activity data for tasks such as cataract surgery and furniture assembly, wherein our approach successfully anticipates the policies and actions of the environment in order to maximize the reward.
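The belief states the automaton tracks are distributions over which policy the oblivious environment is following, updated after each observed action. A single Bayesian update (posterior proportional to prior times likelihood) can be sketched as follows; this illustrates only the belief-tracking primitive, not the paper's automaton-synthesis or consistency-checking algorithms, and the policy names are invented.

```python
def update_belief(belief, likelihoods):
    """One Bayesian update of a belief over opponent policies after
    observing an action: posterior(p) ∝ belief(p) * P(action | p)."""
    post = {p: belief[p] * likelihoods[p] for p in belief}
    z = sum(post.values())
    return {p: v / z for p, v in post.items()}
```

The finite information state machine approximates this exact update with finitely many belief states, staying within a fixed distance of it when the consistency condition holds.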

[AI-58] Agent Aggregator with Mask Denoise Mechanism for Histopathology Whole Slide Image Analysis

链接: https://arxiv.org/abs/2409.11664
作者: Xitong Ling,Minxi Ouyang,Yizhi Wang,Xinrui Chen,Renao Yan,Hongbo Chu,Junru Cheng,Tian Guan,Sufang Tian,Xiaoping Liu,Yonghong He
关键词-EN: Histopathology analysis, gold standard, standard for medical, medical diagnosis, Histopathology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Histopathology analysis is the gold standard for medical diagnosis. Accurate classification of whole slide images (WSIs) and regions of interest (ROIs) localization can assist pathologists in diagnosis. The gigapixel resolution of WSI and the absence of fine-grained annotations make direct classification and analysis challenging. In weakly supervised learning, multiple instance learning (MIL) presents a promising approach for WSI classification. The prevailing strategy is to use attention mechanisms to measure instance importance for classification. However, attention mechanisms fail to capture inter-instance information, and self-attention causes quadratic computational complexity. To address these challenges, we propose AMD-MIL, an agent aggregator with a mask denoise mechanism. The agent token acts as an intermediate variable between the query and key for computing instance importance. Mask and denoising matrices, mapped from agents-aggregated value, dynamically mask low-contribution representations and eliminate noise. AMD-MIL achieves better attention allocation by adjusting feature representations, capturing micro-metastases in cancer, and improving interpretability. Extensive experiments on CAMELYON-16, CAMELYON-17, TCGA-KIDNEY, and TCGA-LUNG show AMD-MIL’s superiority over state-of-the-art methods.

[AI-59] GReDP: A More Robust Approach for Differential Privacy Training with Gradient-Preserving Noise Reduction

链接: https://arxiv.org/abs/2409.11663
作者: Haodi Wang,Tangyu Jiang,Yu Guo,Xiaohua Jia,Chengjun Cai
关键词-EN: represent hierarchical features, Deep learning, deep learning training, Deep learning models, deep learning algorithms
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deep learning models have been extensively adopted in various domains due to their ability to represent hierarchical features, which highly rely on the training set and procedures. Thus, protecting the training process and deep learning algorithms is paramount in privacy preservation. Although Differential Privacy (DP) as a powerful cryptographic primitive has achieved satisfying results in deep learning training, the existing schemes still fall short in preserving model utility, i.e., they either invoke a high noise scale or inevitably harm the original gradients. To address the above issues, in this paper, we present a more robust approach for DP training called GReDP. Specifically, we compute the model gradients in the frequency domain and adopt a new approach to reduce the noise level. Unlike the previous work, our GReDP only requires half of the noise scale compared to DPSGD [1] while keeping all the gradient information intact. We present a detailed analysis of our method both theoretically and empirically. The experimental results show that our GReDP works consistently better than the baselines on all models and training settings.
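The core idea of perturbing gradients in the frequency domain can be illustrated with a toy round trip: transform the gradient with a DFT, add Gaussian noise to the coefficients, and transform back. This is our illustration of the general mechanism only, using a naive O(n²) DFT; it is not GReDP's actual noise-reduction scheme, and the function names are ours.

```python
import cmath
import random

def dft(x):
    """Naive discrete Fourier transform of a real-valued list."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real parts."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def noisy_gradient_freq(grad, sigma, seed=0):
    """Add Gaussian noise to the gradient's DFT coefficients, then invert."""
    rng = random.Random(seed)
    G = dft(grad)
    G = [g + complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for g in G]
    return idft(G)
```

With sigma = 0 the round trip recovers the gradient exactly, showing the transform itself loses no gradient information.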

[AI-60] Few-Shot Class-Incremental Learning with Non-IID Decentralized Data

链接: https://arxiv.org/abs/2409.11657
作者: Cuiwei Liu,Siang Xu,Huaijun Qiu,Jing Zhang,Zhi Liu,Liang Zhao
关键词-EN: adaptive intelligent systems, Few-shot class-incremental learning, minimal annotated data, previously accumulated knowledge, data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Few-shot class-incremental learning is crucial for developing scalable and adaptive intelligent systems, as it enables models to acquire new classes with minimal annotated data while safeguarding the previously accumulated knowledge. Nonetheless, existing methods deal with continuous data streams in a centralized manner, limiting their applicability in scenarios that prioritize data privacy and security. To this end, this paper introduces federated few-shot class-incremental learning, a decentralized machine learning paradigm tailored to progressively learn new classes from scarce data distributed across multiple clients. In this learning paradigm, clients locally update their models with new classes while preserving data privacy, and then transmit the model updates to a central server where they are aggregated globally. However, this paradigm faces several issues, such as difficulties in few-shot learning, catastrophic forgetting, and data heterogeneity. To address these challenges, we present a synthetic data-driven framework that leverages replay buffer data to maintain existing knowledge and facilitate the acquisition of new knowledge. Within this framework, a noise-aware generative replay module is developed to fine-tune local models with a balance of new and replay data, while generating synthetic data of new classes to further expand the replay buffer for future tasks. Furthermore, a class-specific weighted aggregation strategy is designed to tackle data heterogeneity by adaptively aggregating class-specific parameters based on local models’ performance on synthetic data. This enables effective global model optimization without direct access to client data. Comprehensive experiments across three widely-used datasets underscore the effectiveness and preeminence of the introduced framework.
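The class-specific weighted aggregation step can be sketched as follows: for each class, the server averages the clients' per-class parameter vectors weighted by each client's score (e.g. accuracy on synthetic data) for that class. The data layout and names below are our assumption for illustration, not the paper's implementation.

```python
def aggregate_class_params(client_params, client_scores):
    """Server-side aggregation: per-class weighted mean of client
    parameter vectors, with weights given by per-class client scores.

    client_params: list of {class: [param, ...]} dicts, one per client.
    client_scores: list of {class: score} dicts, one per client.
    """
    agg = {}
    for c in client_params[0]:
        w = [s[c] for s in client_scores]
        z = sum(w)
        dim = len(client_params[0][c])
        agg[c] = [sum(wi * p[c][d] for wi, p in zip(w, client_params)) / z
                  for d in range(dim)]
    return agg
```

Clients with better performance on a class thus contribute more to that class's global parameters, which is how the strategy counteracts data heterogeneity.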

[AI-61] Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview

链接: https://arxiv.org/abs/2409.11650
作者: Yanshu Wang,Tong Yang,Xiyan Liang,Guoan Wang,Hanning Lu,Xu Zhe,Yaoming Li,Li Weitao
关键词-EN: quantizing large-scale neural, neural network models, large-scale neural network, comprehensive overview, neural network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper provides a comprehensive overview of the principles, challenges, and methodologies associated with quantizing large-scale neural network models. As neural networks have evolved towards larger and more complex architectures to address increasingly sophisticated tasks, the computational and energy costs have escalated significantly. We explore the necessity and impact of model size growth, highlighting the performance benefits as well as the computational challenges and environmental considerations. The core focus is on model quantization as a fundamental approach to mitigate these challenges by reducing model size and improving efficiency without substantially compromising accuracy. We delve into various quantization techniques, including both post-training quantization (PTQ) and quantization-aware training (QAT), and analyze several state-of-the-art algorithms such as LLM-QAT, PEQA(L4Q), ZeroQuant, SmoothQuant, and others. Through comparative analysis, we examine how these methods address issues like outliers, importance weighting, and activation quantization, ultimately contributing to more sustainable and accessible deployment of large-scale models.
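A minimal concrete instance of the PTQ family the survey covers is uniform affine quantization: map floats to integers in [0, 2^bits - 1] using a scale and zero offset, then dequantize. This is a textbook sketch for orientation, not any of the surveyed algorithms (LLM-QAT, ZeroQuant, SmoothQuant, etc.), which add outlier handling, importance weighting, and activation-specific treatment on top.

```python
def quantize(x, bits=8):
    """Uniform affine post-training quantization of a list of floats.
    Returns (integer codes, dequantized floats, scale)."""
    lo, hi = min(x), max(x)
    qmax = (1 << bits) - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [round((v - lo) / scale) for v in x]     # integer codes in [0, qmax]
    dq = [qi * scale + lo for qi in q]           # reconstruction
    return q, dq, scale
```

The reconstruction error per value is bounded by half the scale, which is why outliers (which inflate hi - lo) degrade accuracy and motivate the more sophisticated schemes the survey analyzes.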

[AI-62] Combating Phone Scams with LLM-based Detection: Where Do We Stand?

链接: https://arxiv.org/abs/2409.11643
作者: Zitong Shen,Kangzhong Wang,Youqian Zhang,Grace Ngai,Eugene Y. Fu
关键词-EN: causing substantial financial, substantial financial losses, Phone scams pose, individuals and communities, causing substantial
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 2 pages, 1 figure

点击查看摘要

Abstract:Phone scams pose a significant threat to individuals and communities, causing substantial financial losses and emotional distress. Despite ongoing efforts to combat these scams, scammers continue to adapt and refine their tactics, making it imperative to explore innovative countermeasures. This research explores the potential of large language models (LLMs) to detect fraudulent phone calls. By analyzing the conversational dynamics between scammers and victims, LLM-based detectors can identify potential scams as they occur, offering immediate protection to users. While such approaches demonstrate promising results, we also acknowledge the challenges of biased datasets, relatively low recall, and hallucinations that must be addressed for further advancement in this field.

[AI-63] A Metric Hybrid Planning Approach to Solving Pandemic Planning Problems with Simple SIR Models

链接: https://arxiv.org/abs/2409.11631
作者: Ari Gestetner,Buser Say
关键词-EN: Susceptible Infected Removed, economic and social, large regions, terms of health, disease across large
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A pandemic is the spread of a disease across large regions, and can impose devastating health, economic, and social costs on society. As such, the study of effective pandemic mitigation strategies can yield significant positive impact on the society. A pandemic can be mathematically described using a compartmental model, such as the Susceptible Infected Removed (SIR) model. In this paper, we extend the solution equations of the SIR model to a state transition model with lockdowns. We formalize a metric hybrid planning problem based on this state transition model, and solve it using a metric hybrid planner. We improve the runtime effectiveness of the metric hybrid planner with the addition of valid inequalities, and demonstrate the success of our approach both theoretically and experimentally under various challenging settings.
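The underlying dynamics can be illustrated with a forward-Euler simulation of the standard SIR equations (dS/dt = -βSI/N, dI/dt = βSI/N - γI, dR/dt = γI), with a reduced transmission rate β on lockdown days. This is a toy version of the state-transition model the paper plans over; the lockdown handling and parameter names below are our simplification, not the paper's solution equations.

```python
def simulate_sir(s, i, r, beta, gamma, days,
                 lockdown_days=(), lockdown_beta=None, dt=1.0):
    """Euler-step SIR simulation; on days in lockdown_days, transmission
    uses lockdown_beta instead of beta. Returns the (S, I, R) trajectory."""
    n = s + i + r
    history = [(s, i, r)]
    for day in range(days):
        b = lockdown_beta if (day in lockdown_days and lockdown_beta is not None) else beta
        new_inf = b * s * i / n * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        history.append((s, i, r))
    return history
```

Note that the total population S + I + R is conserved at every step, and lockdowns lower the infection peak, which is exactly the trade-off the planner optimizes over.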

[AI-64] HRA: A Multi-Criteria Framework for Ranking Metaheuristic Optimization Algorithms

Link: https://arxiv.org/abs/2409.11617
Authors: Evgenia-Maria K. Goula, Dimitris G. Sotiropoulos
Keywords: solving complex optimization, complex optimization problems, Metaheuristic algorithms, essential for solving, solving complex
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 13 pages, 1 figure

Abstract:Metaheuristic algorithms are essential for solving complex optimization problems in different fields. However, the difficulty in comparing and rating these algorithms remains due to the wide range of performance metrics and problem dimensions usually involved. On the other hand, nonparametric statistical methods and post hoc tests are time-consuming, especially when we only need to identify the top performers among many algorithms. The Hierarchical Rank Aggregation (HRA) algorithm aims to efficiently rank metaheuristic algorithms based on their performance across many criteria and dimensions. The HRA employs a hierarchical framework that begins with collecting performance metrics on various benchmark functions and dimensions. Rank-based normalization is employed for each performance measure to ensure comparability and the robust TOPSIS aggregation is applied to combine these rankings at several hierarchical levels, resulting in a comprehensive ranking of the algorithms. Our study uses data from the CEC 2017 competition to demonstrate the robustness and efficacy of the HRA framework. It examines 30 benchmark functions and evaluates the performance of 13 metaheuristic algorithms across five performance indicators in four distinct dimensions. This presentation highlights the potential of the HRA to enhance the interpretation of the comparative advantages and disadvantages of various algorithms by simplifying practitioners’ choices of the most appropriate algorithm for certain optimization problems.
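
The TOPSIS aggregation step at the core of HRA can be sketched as below. This is plain vector-normalized TOPSIS over a decision matrix of (algorithm × criterion) scores, not the "robust TOPSIS" variant the paper uses, and the weights in the usage example are hypothetical:

```python
import math

def topsis(matrix, weights, benefit):
    # matrix: rows = algorithms, cols = criteria (e.g. rank-normalized scores);
    # benefit[j] is True when larger is better for criterion j (False for ranks).
    ncols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(ncols)]
    v = [[weights[j] * row[j] / norms[j] for j in range(ncols)] for row in matrix]
    # Ideal / anti-ideal points per criterion.
    ideal = [max(col) if benefit[j] else min(col) for j, col in enumerate(zip(*v))]
    anti = [min(col) if benefit[j] else max(col) for j, col in enumerate(zip(*v))]
    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)  # distance to the ideal solution
        d_neg = math.dist(row, anti)   # distance to the anti-ideal solution
        scores.append(d_neg / (d_pos + d_neg))
    return scores  # higher = closer to the ideal, i.e. better ranked
```

An algorithm that is best on every criterion gets a closeness score of exactly 1.0; hierarchical aggregation then repeats this across benchmark functions and dimensions.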

[AI-65] No Saved Kaleidosope: an 100% Jitted Neural Network Coding Language with Pythonic Syntax

Link: https://arxiv.org/abs/2409.11600
Authors: Augusto Seben da Rosa, Marlon Daniel Angeli, Jorge Aikes Junior, Alef Iury Ferreira, Lucas Rafael Gris, Anderson da Silva Soares, Arnaldo Candido Junior, Frederico Santos de Oliveira, Gabriel Trevisan Damke, Rafael Teixeira Sousa
Keywords: training Artificial Neural, LLVM and Cuda, Artificial Neural Networks, training Artificial, Artificial Neural
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: 12 pages, 3 figures and 3 tables

Abstract:We developed a jitted compiler for training Artificial Neural Networks using C++, LLVM and Cuda. It features object-oriented characteristics, strong typing, parallel workers for data pre-processing, pythonic syntax for expressions, PyTorch like model declaration and Automatic Differentiation. We implement the mechanisms of cache and pooling in order to manage VRAM, cuBLAS for high performance matrix multiplication and cuDNN for convolutional layers. Our experiments with Residual Convolutional Neural Networks on ImageNet, we reach similar speed but degraded performance. Also, the GRU network experiments show similar accuracy, but our compiler have degraded speed in that task. However, our compiler demonstrates promising results at the CIFAR-10 benchmark, in which we reach the same performance and about the same speed as PyTorch. We make the code publicly available at: this https URL

[AI-66] Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2409.11598
Authors: To Eun Kim, Fernando Diaz
Keywords: RAG systems, RAG, language models, models now enhance, enhance their responses
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Many language models now enhance their responses with retrieval capabilities, leading to the widespread adoption of retrieval-augmented generation (RAG) systems. However, despite retrieval being a core component of RAG, much of the research in this area overlooks the extensive body of work on fair ranking, neglecting the importance of considering all stakeholders involved. This paper presents the first systematic evaluation of RAG systems integrated with fair rankings. We focus specifically on measuring the fair exposure of each relevant item across the rankings utilized by RAG systems (i.e., item-side fairness), aiming to promote equitable growth for relevant item providers. To gain a deep understanding of the relationship between item-fairness, ranking quality, and generation quality in the context of RAG, we analyze nine different RAG systems that incorporate fair rankings across seven distinct datasets. Our findings indicate that RAG systems with fair rankings can maintain a high level of generation quality and, in many cases, even outperform traditional RAG systems, despite the general trend of a tradeoff between ensuring fairness and maintaining system-effectiveness. We believe our insights lay the groundwork for responsible and equitable RAG systems and open new avenues for future research. We publicly release our codebase and dataset at this https URL.
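
Item-side exposure of the kind measured here can be sketched by averaging a position-discounted credit over the rankings a RAG system consumes. The logarithmic discount below is a common DCG-style assumption, not necessarily the browsing model used in the paper:

```python
import math
from collections import defaultdict

def expected_exposure(rankings):
    # rankings: list of ranked item-id lists, one per query/turn.
    # Each item accrues 1/log2(pos+1) exposure at rank `pos` (position bias),
    # averaged over all rankings.
    exposure = defaultdict(float)
    for ranking in rankings:
        for pos, item in enumerate(ranking, start=1):
            exposure[item] += 1.0 / math.log2(pos + 1)
    n = len(rankings)
    return {item: e / n for item, e in exposure.items()}
```

Two equally relevant items that swap positions across rankings end up with equal expected exposure, which is the item-side fairness condition a fair ranker aims for.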

[AI-67] Self-Contrastive Forward-Forward Algorithm

Link: https://arxiv.org/abs/2409.11593
Authors: Xing Chen, Dongshu Liu, Jeremie Laydevant, Julie Grollier
Keywords: updates weights locally, purely forward-mode learning, purely forward-mode, updates weights, weights locally
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:The Forward-Forward (FF) algorithm is a recent, purely forward-mode learning method, that updates weights locally and layer-wise and supports supervised as well as unsupervised learning. These features make it ideal for applications such as brain-inspired learning, low-power hardware neural networks, and distributed learning in large models. However, while FF has shown promise on written digit recognition tasks, its performance on natural images and time-series remains a challenge. A key limitation is the need to generate high-quality negative examples for contrastive learning, especially in unsupervised tasks, where versatile solutions are currently lacking. To address this, we introduce the Self-Contrastive Forward-Forward (SCFF) method, inspired by self-supervised contrastive learning. SCFF generates positive and negative examples applicable across different datasets, surpassing existing local forward algorithms for unsupervised classification accuracy on MNIST (MLP: 98.7%), CIFAR-10 (CNN: 80.75%), and STL-10 (CNN: 77.3%). Additionally, SCFF is the first to enable FF training of recurrent neural networks, opening the door to more complex tasks and continuous-time video and text processing.

[AI-68] ProSLM : A Prolog Synergized Language Model for explainable Domain Specific Knowledge Based Question Answering

Link: https://arxiv.org/abs/2409.11589
Authors: Priyesh Vakharia, Abigail Kufeldt, Max Meyers, Ian Lane, Leilani Gilpin
Keywords: explainable symbolic representations, opaque neural systems, incorporating explainable symbolic, symbolic representations, opaque neural
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at NeSy 2024

Abstract:Neurosymbolic approaches can add robustness to opaque neural systems by incorporating explainable symbolic representations. However, previous approaches have not used formal logic to contextualize queries to and validate outputs of large language models (LLMs). We propose ProSLM, a novel neurosymbolic framework, to improve the robustness and reliability of LLMs in question-answering tasks. We provide ProSLM with a domain-specific knowledge base, a logical reasoning system, and an integration to an existing LLM. This framework has two capabilities (1) context gathering: generating explainable and relevant context for a given query, and (2) validation: confirming and validating the factual accuracy of a statement in accordance with a knowledge base (KB). Our work opens a new area of neurosymbolic generative AI text validation and user personalization.

[AI-69] Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

Link: https://arxiv.org/abs/2409.11564
Authors: Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D. Yao, Shi-Xiong Zhang, Sambit Sahu
Keywords: aligning deep generative, Preference tuning, deep generative models, preference tuning tasks, Preference
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Survey paper

Abstract:Preference tuning is a crucial process for aligning deep generative models with human preferences. This survey offers a thorough overview of recent advancements in preference tuning and the integration of human feedback. The paper is organized into three main sections: 1) introduction and preliminaries: an introduction to reinforcement learning frameworks, preference tuning tasks, models, and datasets across various modalities: language, speech, and vision, as well as different policy approaches, 2) in-depth examination of each preference tuning approach: a detailed analysis of the methods used in preference tuning, and 3) applications, discussion, and future directions: an exploration of the applications of preference tuning in downstream tasks, including evaluation methods for different modalities, and an outlook on future research directions. Our objective is to present the latest methodologies in preference tuning and model alignment, enhancing the understanding of this field for researchers and practitioners. We hope to encourage further engagement and innovation in this area.

[AI-70] Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs

Link: https://arxiv.org/abs/2409.11547
Authors: Guillermo Marco, Luz Rello, Julio Gonzalo
Keywords: fine-tuned small language, small language model, large language models, fiction writing abilities, small language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART Large, and compare its performance to humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human evaluation where readers assess the stories generated by the SLM compared to human-written stories, and (ii) a qualitative linguistic analysis comparing the textual characteristics of the stories generated by the different models. In the first experiment, we asked 68 participants to rate short stories generated by the models and humans along dimensions such as grammaticality, relevance, creativity, and attractiveness. BART Large outperformed human writers in most aspects, except creativity, with an overall score of 2.11 compared to 1.85 for human-written texts – a 14% improvement. In the second experiment, the qualitative analysis revealed that, while GPT-4o exhibited near-perfect internal and external coherence, it tended to produce more predictable narratives, with only 3% of its stories seen as novel. In contrast, 15% of BART’s stories were considered novel, indicating a higher degree of creativity despite its smaller model size. This study provides both quantitative and qualitative insights into how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks.

[AI-71] Improving LLM Reasoning with Multi-Agent Tree-of-Thought Validator Agent

Link: https://arxiv.org/abs/2409.11527
Authors: Fatemeh Haji, Mazal Bethany, Maryam Tabar, Jason Chiang, Anthony Rios, Peyman Najafirad
Keywords: Large Language Models, Language Models, Large Language, assigning specialized roles, abilities of Large
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent strategies have emerged as a promising approach to enhance the reasoning abilities of Large Language Models (LLMs) by assigning specialized roles in the problem-solving process. Concurrently, Tree of Thoughts (ToT) methods have shown potential in improving reasoning for complex question-answering tasks by exploring diverse reasoning paths. A critical limitation in multi-agent reasoning is the ‘Reasoner’ agent’s shallow exploration of reasoning paths. While ToT strategies could help mitigate this problem, they may generate flawed reasoning branches, which could harm the trustworthiness of the final answer. To leverage the strengths of both multi-agent reasoning and ToT strategies, we introduce a novel approach combining ToT-based Reasoner agents with a Thought Validator agent. Multiple Reasoner agents operate in parallel, employing ToT to explore diverse reasoning paths. The Thought Validator then scrutinizes these paths, considering a Reasoner’s conclusion only if its reasoning is valid. This method enables a more robust voting strategy by discarding faulty reasoning paths, enhancing the system’s ability to tackle tasks requiring systematic and trustworthy reasoning. Our method demonstrates superior performance compared to existing techniques when evaluated on the GSM8K dataset, outperforming the standard ToT strategy by an average 5.6% across four LLMs.
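
The validator-filtered voting idea can be sketched as a majority vote that first discards answers whose reasoning path fails validation. The predicate below is a placeholder for the Thought Validator agent, and the candidate format is an assumption:

```python
from collections import Counter

def validated_vote(candidates, is_valid):
    # candidates: list of (reasoning_path, answer) pairs from Reasoner agents.
    # Answers whose reasoning path the validator rejects are dropped before
    # the majority vote, so faulty branches cannot sway the result.
    kept = [answer for path, answer in candidates if is_valid(path)]
    if not kept:
        return None  # every path was rejected
    return Counter(kept).most_common(1)[0][0]
```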

[AI-72] Mamba Fusion: Learning Actions Through Questioning

Link: https://arxiv.org/abs/2409.11513
Authors: Zhikang Dong, Apoorva Beedu, Jason Sheinkopf, Irfan Essa
Keywords: Video Language Models, Video Language, enhance learning, crucial for generalizing, generalizing across diverse
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto in vision-language training, they face challenges like quadratic computational complexity, high GPU memory usage, and difficulty with long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation.

[AI-73] FedNE: Surrogate-Assisted Federated Neighbor Embedding for Dimensionality Reduction

Link: https://arxiv.org/abs/2409.11509
Authors: Ziwei Li, Xiaoqi Wang, Hong-You Chen, Han-Wei Shen, Wei-Lun Chao
Keywords: enables collaborative model, collaborative model training, Federated learning, rapidly evolved, promising paradigm
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated learning (FL) has rapidly evolved as a promising paradigm that enables collaborative model training across distributed participants without exchanging their local data. Despite its broad applications in fields such as computer vision, graph learning, and natural language processing, the development of a data projection model that can be effectively used to visualize data in the context of FL is crucial yet remains heavily under-explored. Neighbor embedding (NE) is an essential technique for visualizing complex high-dimensional data, but collaboratively learning a joint NE model is difficult. The key challenge lies in the objective function, as effective visualization algorithms like NE require computing loss functions among pairs of data. In this paper, we introduce FedNE, a novel approach that integrates the FedAvg framework with the contrastive NE technique, without any requirements of shareable data. To address the lack of inter-client repulsion which is crucial for the alignment in the global embedding space, we develop a surrogate loss function that each client learns and shares with each other. Additionally, we propose a data-mixing strategy to augment the local data, aiming to relax the problems of invisible neighbors and false neighbors constructed by the local k-NN graphs. We conduct comprehensive experiments on both synthetic and real-world datasets. The results demonstrate that our FedNE can effectively preserve the neighborhood data structures and enhance the alignment in the global embedding space compared to several baseline methods.

[AI-74] Egalitarian Language Representation in Language Models: It All Begins with Tokenizers

Link: https://arxiv.org/abs/2409.11501
Authors: Menan Velayuthan, Kengatharaiyer Sarveswaran
Keywords: Large Language Models, language models, Byte Pair Encoding, complex script languages, bridge between human
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Content - 8 pages, References - 3 pages

Abstract:Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-Centric Large Language Models (LLMs), efforts are being made to adapt them for other languages. However, we demonstrate that, from a tokenization standpoint, not all tokenizers offer fair representation for complex script languages such as Tamil, Sinhala, and Hindi, primarily due to the choice of pre-tokenization methods. We go further to show that pre-tokenization plays a more critical role than the tokenization algorithm itself in achieving an egalitarian representation of these complex script languages. To address this, we introduce an improvement to the Byte Pair Encoding (BPE) algorithm by incorporating graphemes, which we term Grapheme Pair Encoding (GPE). Our experiments show that grapheme-based character extraction outperforms byte-level tokenizers for complex scripts. We validate this approach through experiments on Tamil, Sinhala, and Hindi.
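
Why grapheme-aware handling matters for such scripts can be illustrated with a rough grapheme clustering pass that attaches Unicode combining marks to their base character. This is only an approximation of full extended grapheme clusters (UAX #29), and it is not the authors' GPE algorithm, which builds BPE merges over graphemes:

```python
import unicodedata

def graphemes(text):
    # Approximate grapheme clustering: attach combining marks
    # (categories Mn/Mc/Me) to the preceding base character.
    # Byte- or codepoint-level tokenizers would split these units apart.
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

For the Tamil word "தமிழ்" this yields three grapheme clusters, even though the string is five code points long, which is the mismatch that byte-level pre-tokenization amplifies for complex scripts.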

[AI-75] Multi-Document Grounded Multi-Turn Synthetic Dialog Generation

Link: https://arxiv.org/abs/2409.11500
Authors: Young-Suk Lee, Chulaka Gunasekara, Danish Contractor, Ramón Fernandez Astudillo, Radu Florian
Keywords: main ideas, introduce a technique, incorporates three main, multi-document grounded, dialog
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce a technique for multi-document grounded multi-turn synthetic dialog generation that incorporates three main ideas. First, we control the overall dialog flow using taxonomy-driven user queries that are generated with Chain-of-Thought (CoT) prompting. Second, we support the generation of multi-document grounded dialogs by mimicking real-world use of retrievers to update the grounding documents after every user-turn in the dialog. Third, we apply LLM-as-a-Judge to filter out queries with incorrect answers. Human evaluation of the synthetic dialog data suggests that the data is diverse, coherent, and includes mostly correct answers. Both human and automatic evaluations of answerable queries indicate that models fine-tuned on synthetic dialogs consistently out-perform those fine-tuned on existing human generated training data across four publicly available multi-turn document grounded benchmark test sets.

[AI-76] Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning

Link: https://arxiv.org/abs/2409.11498
Authors: Ilaria Manco, Justin Salamon, Oriol Nieto
Keywords: music representation learning, Audio-text contrastive models, Audio-text contrastive, powerful approach, approach in music
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: To appear in the Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

Abstract:Audio-text contrastive models have become a powerful approach in music representation learning. Despite their empirical success, however, little is known about the influence of key design choices on the quality of music-text representations learnt through this framework. In this work, we expose these design choices within the constraints of limited data and computation budgets, and establish a more solid understanding of their impact grounded in empirical observations along three axes: the choice of base encoders, the level of curation in training data, and the use of text augmentation. We find that data curation is the single most important factor for music-text contrastive training in resource-constrained scenarios. Motivated by this insight, we introduce two novel techniques, Augmented View Dropout and TextSwap, which increase the diversity and descriptiveness of text inputs seen in training. Through our experiments we demonstrate that these are effective at boosting performance across different pre-training regimes, model architectures, and downstream data distributions, without incurring higher computational costs or requiring additional training data.

[AI-77] Beyond Algorithmic Fairness: A Guide to Develop and Deploy Ethical AI-Enabled Decision-Support Tools

Link: https://arxiv.org/abs/2409.11489
Authors: Rosemarie Santa Gonzalez, Ryan Piansky, Sue M Bae, Justin Biddle, Daniel Molzahn
Keywords: hold substantial promise, optimization hold substantial, artificial intelligence, improving the efficiency, integration of artificial
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:The integration of artificial intelligence (AI) and optimization hold substantial promise for improving the efficiency, reliability, and resilience of engineered systems. Due to the networked nature of many engineered systems, ethically deploying methodologies at this intersection poses challenges that are distinct from other AI settings, thus motivating the development of ethical guidelines tailored to AI-enabled optimization. This paper highlights the need to go beyond fairness-driven algorithms to systematically address ethical decisions spanning the stages of modeling, data curation, results analysis, and implementation of optimization-based decision support tools. Accordingly, this paper identifies ethical considerations required when deploying algorithms at the intersection of AI and optimization via case studies in power systems as well as supply chain and logistics. Rather than providing a prescriptive set of rules, this paper aims to foster reflection and awareness among researchers and encourage consideration of ethical implications at every step of the decision-making process.

[AI-78] Two Stage Segmentation of Cervical Tumors using PocketNet

Link: https://arxiv.org/abs/2409.11456
Authors: Awj Twam, Megan Jacobsen, Rachel Glenn, Ann Klopp, Aradhana M. Venkatesan, David Fuentes
Keywords: includes external beam, external beam radiation, locally advanced cervical, definitive treatment regimen, mainstay definitive treatment
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Cervical cancer remains the fourth most common malignancy amongst women worldwide.1 Concurrent chemoradiotherapy (CRT) serves as the mainstay definitive treatment regimen for locally advanced cervical cancers and includes external beam radiation followed by brachytherapy.2 Integral to radiotherapy treatment planning is the routine contouring of both the target tumor at the level of the cervix, associated gynecologic anatomy and the adjacent organs at risk (OARs). However, manual contouring of these structures is both time and labor intensive and associated with known interobserver variability that can impact treatment outcomes. While multiple tools have been developed to automatically segment OARs and the high-risk clinical tumor volume (HR-CTV) using computed tomography (CT) images,3,4,5,6 the development of deep learning-based tumor segmentation tools using routine T2-weighted (T2w) magnetic resonance imaging (MRI) addresses an unmet clinical need to improve the routine contouring of both anatomical structures and cervical cancers, thereby increasing quality and consistency of radiotherapy planning. This work applied a novel deep-learning model (PocketNet) to segment the cervix, vagina, uterus, and tumor(s) on T2w MRI. The performance of the PocketNet architecture was evaluated, when trained on data via 5-fold cross validation. PocketNet achieved a mean Dice-Sorensen similarity coefficient (DSC) exceeding 70% for tumor segmentation and 80% for organ segmentation. These results suggest that PocketNet is robust to variations in contrast protocols, providing reliable segmentation of the ROIs.

[AI-79] Evaluation of pretrained language models on music understanding

Link: https://arxiv.org/abs/2409.11449
Authors: Yannis Vasilakis, Rachel Bittner, Johan Pauwels
Keywords: Music Information Research, Music-text multimodal systems, Information Research, text-based song generation, Music-text multimodal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Music-text multimodal systems have enabled new approaches to Music Information Research (MIR) applications such as audio-to-text and text-to-audio retrieval, text-based song generation, and music captioning. Despite the reported success, little effort has been put into evaluating the musical knowledge of Large Language Models (LLM). In this paper, we demonstrate that LLMs suffer from 1) prompt sensitivity, 2) inability to model negation (e.g. ‘rock song without guitar’), and 3) sensitivity towards the presence of specific words. We quantified these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. We leveraged the Audioset ontology to generate triplets consisting of an anchor, a positive (relevant) label, and a negative (less relevant) label for the genre and instruments sub-tree. We evaluated the triplet-based musical knowledge for six general-purpose Transformer-based models. The triplets obtained through this methodology required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
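
The triplet-based accuracy described here can be sketched as follows, using a toy embedding table and cosine similarity (the label names and vectors are illustrative, not from the Audioset ontology):

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def triplet_accuracy(emb, triplets):
    # A triplet (anchor, positive, negative) counts as correct when the
    # anchor is more similar to the positive (relevant) label than to the
    # negative (less relevant) one.
    correct = sum(
        1 for a, p, n in triplets
        if cosine(emb[a], emb[p]) > cosine(emb[a], emb[n])
    )
    return correct / len(triplets)
```

Averaging this over anchor/positive/negative triples drawn from a hierarchical ontology gives the single accuracy number used to compare models.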

[AI-80] Volvo Discovery Challenge at ECML-PKDD 2024

Link: https://arxiv.org/abs/2409.11446
Authors: Mahmoud Rahat, Peyman Sheikholharam Mashhadi, Sławomir Nowaczyk, Shamik Choudhury, Leo Petrin, Thorsteinn Rognvaldsson, Andreas Voskou, Carlo Metta, Claudio Savelli
Keywords: Volvo Discovery Challenge, Volvo Discovery, Discovery Challenge, paper presents, presents an overview
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ECML/PKDD 2024, Discovery Challenge

Abstract:This paper presents an overview of the Volvo Discovery Challenge, held during the ECML-PKDD 2024 conference. The challenge’s goal was to predict the failure risk of an anonymized component in Volvo trucks using a newly published dataset. The test data included observations from two generations (gen1 and gen2) of the component, while the training data was provided only for gen1. The challenge attracted 52 data scientists from around the world who submitted a total of 791 entries. We provide a brief description of the problem definition, challenge setup, and statistics about the submissions. In the section on winning methodologies, the first, second, and third-place winners of the competition briefly describe their proposed methods and provide GitHub links to their implemented code. The shared code can be interesting as an advanced methodology for researchers in the predictive maintenance domain. The competition was hosted on the Codabench platform.

[AI-81] Jailbreaking Large Language Models with Symbolic Mathematics

Link: https://arxiv.org/abs/2409.11445
Authors: Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, Peyman Najafirad
Keywords: unsafe content generation, mitigate unsafe content, Recent advancements, large language models, content generation
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs’ advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack’s success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.

[AI-82] A Green Multi-Attribute Client Selection for Over-The-Air Federated Learning: A Grey-Wolf-Optimizer Approach

Link: https://arxiv.org/abs/2409.11442
Authors: Maryam Ben Driss, Essaid Sabir, Halima Elbiaze, Abdoulaye Baniré Diallo, Mohamed Sadik
Keywords: train machine learning, centralizing sensitive data, Federated Learning, machine learning models, machine learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Federated Learning (FL) has gained attention across various industries for its capability to train machine learning models without centralizing sensitive data. While this approach offers significant benefits such as privacy preservation and decreased communication overhead, it presents several challenges, including deployment complexity and interoperability issues, particularly in heterogeneous scenarios or resource-constrained environments. Over-the-air (OTA) FL was introduced to tackle these challenges by disseminating model updates without necessitating direct device-to-device connections or centralized servers. However, OTA-FL brought forth limitations associated with heightened energy consumption and network latency. In this paper, we propose a multi-attribute client selection framework employing the grey wolf optimizer (GWO) to strategically control the number of participants in each round and optimize the OTA-FL process while considering accuracy, energy, delay, reliability, and fairness constraints of participating devices. We evaluate the performance of our multi-attribute client selection approach in terms of model loss minimization, convergence time reduction, and energy efficiency. In our experimental evaluation, we assessed and compared the performance of our approach against the existing state-of-the-art methods. Our results demonstrate that the proposed GWO-based client selection outperforms these baselines across various metrics. Specifically, our approach achieves a notable reduction in model loss, accelerates convergence time, and enhances energy efficiency while maintaining high fairness and reliability indicators.
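
A minimal grey wolf optimizer of the kind used here for client selection can be sketched as below. It optimizes a plain continuous objective, whereas the paper applies GWO to a multi-attribute selection objective over accuracy, energy, delay, reliability, and fairness; all hyperparameters below are illustrative:

```python
import random

def gwo(objective, dim, bounds, n_wolves=10, iters=60, seed=0):
    # Grey wolf optimizer sketch: each wolf moves toward the three best
    # solutions (alpha, beta, delta); the coefficient `a` decays 2 -> 0,
    # shifting the swarm from exploration to exploitation.
    rng = random.Random(seed)
    lo, hi = bounds
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=objective)  # minimize: best wolves first
        alpha, beta, delta = wolves[0], wolves[1], wolves[2]
        a = 2.0 - 2.0 * t / iters
        for i in range(n_wolves):
            new_pos = []
            for d in range(dim):
                x = 0.0
                for leader in (alpha, beta, delta):
                    A = a * (2.0 * rng.random() - 1.0)
                    C = 2.0 * rng.random()
                    dist = abs(C * leader[d] - wolves[i][d])
                    x += leader[d] - A * dist
                new_pos.append(min(hi, max(lo, x / 3.0)))
            wolves[i] = new_pos
    return min(wolves, key=objective)
```

On a simple sphere objective the swarm collapses toward the optimum within a few dozen iterations; a client-selection variant would instead score binary participation vectors against the multi-attribute constraints.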

[AI-83] MARCA: Mamba Accelerator with ReConfigurable Architecture

链接: https://arxiv.org/abs/2409.11440
作者: Jinhao Li,Shan Huang,Jiaming Xu,Jun Liu,Li Ding,Ningyi Xu,Guohao Dai
关键词-EN: element-wise operations, operations, Reduction, element-wise, Mamba accelerator
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 9 pages, 10 figures, accepted by ICCAD 2024. arXiv admin note: text overlap with arXiv:2001.02514 by other authors

点击查看摘要

Abstract:We propose a Mamba accelerator with reconfigurable architecture, MARCA. We propose three novel approaches in this paper. (1) Reduction alternative PE array architecture for both linear and element-wise operations. For linear operations, the reduction tree connected to PE arrays is enabled and executes the reduction operation. For element-wise operations, the reduction tree is disabled and the output bypasses it. (2) Reusable nonlinear function unit based on the reconfigurable PE. We decompose the exponential function into element-wise operations and a shift operation by a fast biased exponential algorithm, and the activation function (SiLU) into a range detection and element-wise operations by a piecewise approximation algorithm. Thus, the reconfigurable PEs are reused to execute nonlinear functions with negligible accuracy loss. (3) Intra-operation and inter-operation buffer management strategy. We propose an intra-operation buffer management strategy to maximize input data sharing for linear operations within operations, and an inter-operation strategy for element-wise operations between operations. We conduct extensive experiments on Mamba model families with different sizes. MARCA achieves up to 463.22\times/11.66\times speedup and up to 9761.42\times/242.52\times energy efficiency compared to Intel Xeon 8358P CPU and NVIDIA Tesla A100 GPU implementations, respectively.
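The "fast biased exponential" decomposition in (2) rests on the identity exp(x) = 2^(x*log2(e)) = 2^k * 2^f, where the integer part k maps to a hardware bit shift and only the cheap element-wise 2^f term remains. A floating-point sketch of the idea (the accelerator itself works in fixed point, which is not reproduced here):

```python
import numpy as np

def shift_exp(x):
    """exp(x) split into a shift plus an element-wise term:
    exp(x) = 2**(x*log2(e)) = 2**k * 2**f, with integer k and f in [0, 1)."""
    y = x * np.log2(np.e)
    k = np.floor(y).astype(int)   # integer exponent: a bit shift in hardware
    f = y - k                     # fractional part, handled element-wise
    return np.ldexp(2.0 ** f, k)  # ldexp(m, k) computes m * 2**k

vals = shift_exp(np.array([-3.0, -0.5, 0.0, 1.0, 4.2]))
```

The decomposition is exact up to floating-point rounding, which is why the reconfigurable PEs can reuse it with negligible accuracy loss.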

[AI-84] Machine listening in a neonatal intensive care unit

链接: https://arxiv.org/abs/2409.11439
作者: Modan Tailleur(LS2N, Nantes Univ - ECN, LS2N - équipe SIMS),Vincent Lostanlen(LS2N, LS2N - équipe SIMS, Nantes Univ - ECN),Jean-Philippe Rivière(Nantes Univ, Nantes Univ - UFR FLCE, LS2N, LS2N - équipe PACCE),Pierre Aumond
关键词-EN: common sound sources, alarm devices, common sound, sound sources, Oxygenators
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Oxygenators, alarm devices, and footsteps are some of the most common sound sources in a hospital. Detecting them has scientific value for environmental psychology but comes with challenges of its own: namely, privacy preservation and limited labeled data. In this paper, we address these two challenges via a combination of edge computing and cloud computing. For privacy preservation, we have designed an acoustic sensor which computes third-octave spectrograms on the fly instead of recording audio waveforms. For sample-efficient machine learning, we have repurposed a pretrained audio neural network (PANN) via spectral transcoding and label space adaptation. A small-scale study in a neonatological intensive care unit (NICU) confirms that the time series of detected events align with another modality of measurement: i.e., electronic badges for parents and healthcare professionals. Hence, this paper demonstrates the feasibility of polyphonic machine listening in a hospital ward while guaranteeing privacy by design.
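The third-octave representation such a privacy-preserving sensor computes can be pinned down numerically: band centers follow the base-2 series f_c = 1000 * 2^(n/3) Hz. The index range below is an illustrative choice, not a claim about the paper's sensor:

```python
import numpy as np

def third_octave_centers(n_bands=29, ref_hz=1000.0, lowest_index=-16):
    """Base-2 third-octave band center frequencies around a 1 kHz reference,
    the kind of coarse, non-invertible summary a sensor can compute on the
    fly instead of recording raw (privacy-sensitive) audio waveforms."""
    idx = np.arange(lowest_index, lowest_index + n_bands)
    return ref_hz * 2.0 ** (idx / 3.0)

centers = third_octave_centers()                          # ~25 Hz ... 16 kHz
edges = centers[:, None] * 2.0 ** (np.array([-1.0, 1.0]) / 6.0)  # band edges
```

Each band spans a sixth of an octave on either side of its center, so consecutive centers are a factor of 2^(1/3) apart.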

[AI-85] Analysis of flexible traffic control method in SDN

链接: https://arxiv.org/abs/2409.11436
作者: Marta Szymczyk
关键词-EN: SDN controller performance, enable intelligent adaptation, SDN controller, SDN networks, adaptation of SDN
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The aim of this paper is to analyze methods of flexible control in SDN networks and to propose a self-developed solution that will enable intelligent adaptation of SDN controller performance. This work aims not only to review existing solutions, but also to develop an approach that will increase the efficiency and adaptability of network management. The project uses a modern type of machine learning, Reinforcement Learning, which allows autonomous decisions of a network that learns based on its choices in a dynamically changing environment, which is most similar to the way humans learn. The solution aims not only to improve the network’s performance, but also its flexibility and real-time adaptability - flexible traffic control.
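The reinforcement-learning idea can be illustrated with a minimal tabular Q-learning loop. The state/action encoding and the toy reward below are assumptions for illustration, not the paper's SDN environment:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2      # load level: low/mid/high; action: keep/boost
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(state, action):
    # Toy dynamics: boosting the controller pays off only under high load,
    # otherwise it costs a little energy; the next load level is random.
    reward = 1.0 if (state == 2 and action == 1) else (-0.1 if action == 1 else 0.0)
    return int(rng.integers(n_states)), reward

s = 0
for _ in range(5000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
    s2, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # Bellman update
    s = s2
```

After training, the learned table prefers boosting only under high load, which is the "learning from its own choices in a changing environment" behavior the abstract describes.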

[AI-86] A hybrid solution for 2-UAV RAN slicing

链接: https://arxiv.org/abs/2409.11432
作者: Nathan Boyer
关键词-EN: network slicing, massive Machine Type Communication, Ultra-Reliable Low-latency Communication, drone placement, bandwidth
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: 9 pages, 11 figures

点击查看摘要

Abstract:It’s possible to distribute the Internet to users via drones. However, it is then necessary to place the drones according to the positions of the users. Moreover, the 5th Generation (5G) New Radio (NR) technology is designed to accommodate a wide range of applications and industries. The NGMN 5G White Paper groups these vertical use cases into three categories: - enhanced Mobile Broadband (eMBB) - massive Machine Type Communication (mMTC) - Ultra-Reliable Low-latency Communication (URLLC). Partitioning the physical network into multiple virtual networks appears to be the best way to provide a customised service for each application and limit operational costs. This design is well known as network slicing. Each drone must thus slice its bandwidth between each of the 3 user classes. This whole problem (placement + bandwidth) can be defined as an optimization problem, but since it is very hard to solve efficiently, it is almost always addressed by AI in the literature. In my internship, I wanted to prove that viewing the problem as an optimization problem can still be useful, by building a hybrid solution involving AI on one hand and optimization on the other. I use it to achieve better results than approaches that use only AI, although at the cost of slightly larger (but still reasonable) computation times.
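The per-drone bandwidth-slicing subproblem can be sketched as a tiny allocation routine: grant each service class a minimum, then share the remainder proportionally to demand. The weights, minimum guarantees, and 100 MHz budget are hypothetical numbers, not taken from the paper:

```python
def slice_bandwidth(total, weights, minima):
    """Split a drone's bandwidth across slices: each class gets its minimum
    guarantee, and the remainder is shared proportionally to demand weights."""
    rest = total - sum(minima.values())
    assert rest >= 0, "minimum guarantees exceed capacity"
    total_w = sum(weights.values())
    return {k: minima[k] + rest * weights[k] / total_w for k in weights}

alloc = slice_bandwidth(
    100.0,                                     # MHz per drone (assumed)
    {"eMBB": 5.0, "mMTC": 1.0, "URLLC": 2.0},  # illustrative demand weights
    {"eMBB": 10.0, "mMTC": 5.0, "URLLC": 15.0},# illustrative guarantees
)
```

A hybrid solver of the kind the abstract describes would feed allocations like this into an outer optimization/AI loop that also moves the drones.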

[AI-87] Towards Opinion Shaping: A Deep Reinforcement Learning Approach in Bot-User Interactions

链接: https://arxiv.org/abs/2409.11426
作者: Farbod Siahkali,Saba Samadi,Hamed Kebriaei
关键词-EN: Bounded Confidence Model, Stochastic Bounded Confidence, Confidence Model, Stochastic Bounded, Bounded Confidence
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 5 pages, 3 figures, 2 tables

点击查看摘要

Abstract:This paper aims to investigate the impact of interference in social network algorithms via user-bot interactions, focusing on the Stochastic Bounded Confidence Model (SBCM). This paper explores two approaches: positioning bots controlled by agents into the network and targeted advertising under various circumstances, operating with an advertising budget. This study integrates the Deep Deterministic Policy Gradient (DDPG) algorithm and its variants to experiment with different Deep Reinforcement Learning (DRL). Finally, experimental results demonstrate that this approach can result in efficient opinion shaping, indicating its potential in deploying advertising resources on social platforms.
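The bounded-confidence dynamics underlying the SBCM can be illustrated with the deterministic (Hegselmann-Krause-style) core update: each agent averages the opinions of agents within a confidence radius. The stochastic interaction probabilities and bot placement of the actual model are omitted here:

```python
import numpy as np

def bounded_confidence_step(opinions, eps):
    """One synchronous update: each agent moves to the mean opinion of all
    agents (including itself) within confidence distance eps."""
    new = np.empty_like(opinions)
    for i, x in enumerate(opinions):
        neighbors = opinions[np.abs(opinions - x) <= eps]
        new[i] = neighbors.mean()
    return new

x = np.array([0.0, 0.1, 0.2, 0.8, 0.9, 1.0])
for _ in range(20):
    x = bounded_confidence_step(x, eps=0.25)
```

With this initial spread, opinions collapse into two clusters, which is exactly the kind of equilibrium a DRL-controlled bot would try to shift by injecting its own opinion into one cluster's confidence range.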

[AI-88] The Unseen AI Disruptions for Power Grids: LLM-Induced Transients

链接: https://arxiv.org/abs/2409.11416
作者: Yuzhuo Li,Mariam Mughees,Yize Chen,Yunwei Ryan Li
关键词-EN: exhibited superior capability, AI-centric data centers, Recent breakthroughs, large language models, industries and stimulated
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Performance (cs.PF); Systems and Control (eess.SY)
*备注: 21 pages, 18 figures

点击查看摘要

Abstract:Recent breakthroughs of large language models (LLMs) have exhibited superior capability across major industries and stimulated multi-hundred-billion-dollar investment in AI-centric data centers in the next 3-5 years. This, in turn, brings increasing concerns about sustainability and AI-related energy usage. However, there is a largely overlooked issue as challenging and critical as AI model and infrastructure efficiency: the disruptive dynamic power consumption behaviour. With fast, transient dynamics, AI infrastructure features ultra-low inertia, sharp power surges and dips, and a significant peak-idle power ratio. The power scale covers from several hundred watts to megawatts, even to gigawatts. These never-seen-before characteristics make AI a unique load and pose threats to power grid reliability and resilience. To reveal this hidden problem, this paper examines the scale of AI power consumption, analyzes AI transient behaviour in various scenarios, develops high-level mathematical models to depict AI workload behaviour and discusses the multifaceted challenges and opportunities they potentially bring to existing power grids. Observing the rapidly evolving machine learning (ML) and AI technologies, this work emphasizes the critical need for interdisciplinary approaches to ensure reliable and sustainable AI infrastructure development, and provides a starting point for researchers and practitioners to tackle such challenges.

[AI-89] RTLRewriter: Methodologies for Large Models aided RTL Code Optimization

链接: https://arxiv.org/abs/2409.11414
作者: Xufeng Yao,Yiwen Wang,Xing Li,Yingzhao Lian,Ran Chen,Lei Chen,Mingxuan Yuan,Hong Xu,Bei Yu
关键词-EN: Register Transfer Level, Register Transfer, Transfer Level, crucial for enhancing, enhancing the efficiency
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: ICCAD2024

点击查看摘要

Abstract:Register Transfer Level (RTL) code optimization is crucial for enhancing the efficiency and performance of digital circuits during early synthesis stages. Currently, optimization relies heavily on manual efforts by skilled engineers, often requiring multiple iterations based on synthesis feedback. In contrast, existing compiler-based methods fall short in addressing complex designs. This paper introduces RTLRewriter, an innovative framework that leverages large models to optimize RTL code. A circuit partition pipeline is utilized for fast synthesis and efficient rewriting. A multi-modal program analysis is proposed to incorporate vital visual diagram information as optimization cues. A specialized search engine is designed to identify useful optimization guides, algorithms, and code snippets that enhance the model ability to generate optimized RTL. Additionally, we introduce a Cost-aware Monte Carlo Tree Search (C-MCTS) algorithm for efficient rewriting, managing diverse retrieved contents and steering the rewriting results. Furthermore, a fast verification pipeline is proposed to reduce verification cost. To cater to the needs of both industry and academia, we propose two benchmarking suites: the Large Rewriter Benchmark, targeting complex scenarios with extensive circuit partitioning, optimization trade-offs, and verification challenges, and the Small Rewriter Benchmark, designed for a wider range of scenarios and patterns. Our comparative analysis with established compilers such as Yosys and E-graph demonstrates significant improvements, highlighting the benefits of integrating large models into the early stages of circuit design. We provide our benchmarks at this https URL.
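The cost-aware selection step of C-MCTS can be sketched as a UCT score with a cost penalty. The penalty form and constants below are assumptions for illustration; the paper's exact formulation is not reproduced here:

```python
import math

def cost_aware_uct(value_sum, visits, parent_visits, cost,
                   c_explore=1.4, c_cost=0.1):
    """UCT child-selection score extended with a cost term: exploit the
    average rewrite value, explore under-visited candidates, and penalize
    rewrites that are expensive to synthesize/verify."""
    if visits == 0:
        return float("inf")       # always try an unvisited candidate first
    exploit = value_sum / visits
    explore = c_explore * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore - c_cost * cost

fresh = cost_aware_uct(0.0, 0, 100, cost=1.0)
cheap = cost_aware_uct(5.0, 10, 100, cost=1.0)
pricey = cost_aware_uct(5.0, 10, 100, cost=5.0)
```

Between two equally promising rewrites, the cheaper one to verify gets searched first, which is the intuition behind making the tree search "cost-aware".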

[AI-90] AIvril: AI-Driven RTL Generation With Verification In-The-Loop

链接: https://arxiv.org/abs/2409.11411
作者: Mubashir ul Islam,Humza Sami,Pierre-Emmanuel Gaillardon,Valerio Tenace
关键词-EN: Large Language Models, computational models capable, performing complex natural, complex natural language, natural language processing
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are computational models capable of performing complex natural language processing tasks. Leveraging these capabilities, LLMs hold the potential to transform the entire hardware design stack, with predictions suggesting that front-end and back-end tasks could be fully automated in the near future. Currently, LLMs show great promise in streamlining Register Transfer Level (RTL) generation, enhancing efficiency, and accelerating innovation. However, their probabilistic nature makes them prone to inaccuracies - a significant drawback in RTL design, where reliability and precision are essential. To address these challenges, this paper introduces AIvril, an advanced framework designed to enhance the accuracy and reliability of RTL-aware LLMs. AIvril employs a multi-agent, LLM-agnostic system for automatic syntax correction and functional verification, significantly reducing - and in many cases, completely eliminating - instances of erroneous code generation. Experimental results conducted on the VerilogEval-Human dataset show that our framework improves code quality by nearly 2x when compared to previous works, while achieving an 88.46% success rate in meeting verification objectives. This represents a critical step toward automating and optimizing hardware design workflows, offering a more dependable methodology for AI-driven RTL design.

[AI-91] CyberNFTs: Conceptualizing a decentralized and reward-driven intrusion detection system with ML

链接: https://arxiv.org/abs/2409.11409
作者: Synim Selimi,Blerim Rexha,Kamer Vishi
关键词-EN: rapid evolution, people interact, interact and share, share data, Internet
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, 1 table, 1 algorithm, 1 listing, journal article

点击查看摘要

Abstract:The rapid evolution of the Internet, particularly the emergence of Web3, has transformed the ways people interact and share data. Web3, although still not well defined, is thought to be a return to the decentralization of corporations’ power over user data. Despite the obsolescence of the idea of building systems to detect and prevent cyber intrusions, this is still a topic of interest. This paper proposes a novel conceptual approach for implementing decentralized collaborative intrusion detection networks (CIDN) through a proof-of-concept. The study employs an analytical and comparative methodology, examining the synergy between cutting-edge Web3 technologies and information security. The proposed model incorporates blockchain concepts, cyber non-fungible token (cyberNFT) rewards, machine learning algorithms, and publish/subscribe architectures. Finally, the paper discusses the strengths and limitations of the proposed system, offering insights into the potential of decentralized cybersecurity models.

[AI-92] Towards Signal Processing In Large Language Models

链接: https://arxiv.org/abs/2406.10254
作者: Prateek Verma,Mert Pilanci
关键词-EN: Large Language Model, Large Language, Language Model, applying signal processing, signal processing inside
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 12 pages, 3 figures

点击查看摘要

Abstract:This paper introduces the idea of applying signal processing inside a Large Language Model (LLM). With the recent explosion of generative AI, our work can help bridge two fields together, namely the field of signal processing and large language models. We draw parallels between classical Fourier-Transforms and Fourier Transform-like learnable time-frequency representations for every intermediate activation signal of an LLM. Once we decompose every activation signal across tokens into a time-frequency representation, we learn how to filter and reconstruct them, with all components learned from scratch, to predict the next token given the previous context. We show that for GPT-like architectures, our work achieves faster convergence and significantly increases performance by adding a minuscule number of extra parameters when trained for the same epochs. We hope this work paves the way for algorithms exploring signal processing inside the signals found in neural architectures like LLMs and beyond.
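The decompose-filter-reconstruct idea can be sketched by treating each hidden dimension's activations across tokens as a 1-D signal, moving it to the frequency domain, masking, and inverting. In the paper the time-frequency representation and filters are learned end-to-end; the fixed low-pass mask here is purely illustrative:

```python
import numpy as np

def filter_activations(acts, keep_frac=0.5):
    """Filter each hidden dimension's token sequence in the frequency domain.
    acts: (n_tokens, d_model). A learnable filter would replace the mask."""
    spec = np.fft.rfft(acts, axis=0)              # spectrum over the token axis
    n_keep = max(1, int(keep_frac * spec.shape[0]))
    mask = np.zeros(spec.shape[0])
    mask[:n_keep] = 1.0                           # keep low frequencies only
    return np.fft.irfft(spec * mask[:, None], n=acts.shape[0], axis=0)

acts = np.random.default_rng(0).normal(size=(16, 4))
recon = filter_activations(acts, keep_frac=1.0)   # full spectrum: identity
```

With the full spectrum kept, the round trip reconstructs the activations exactly; shrinking `keep_frac` smooths them across tokens, the kind of operation the paper learns rather than fixes.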

[AI-93] Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

链接: https://arxiv.org/abs/2105.00335
作者: Prateek Verma,Jonathan Berger
关键词-EN: learning hierarchical organizations, produced compelling models, CNN architectures, perception and cognition, learning hierarchical
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 4 figures; Under review WASPAA 2021

点击查看摘要

Abstract:Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On a standard dataset of Free Sound 50K, comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant as, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set, with respect to mean average precision benchmarks, we show a significant improvement. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we also show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show that our model learns a non-linear, non-constant-bandwidth filter bank, which provides an adaptable time-frequency front-end representation for the task of audio understanding, different from other tasks, e.g. pitch estimation.
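Dropping convolutions means the front end reduces to slicing raw audio into fixed-length patches ("tokens") that feed the Transformer directly. A minimal sketch; the patch size is an assumed value for illustration:

```python
import numpy as np

def patchify(audio, patch=400):
    """Slice a raw waveform into non-overlapping fixed-length patches, the
    convolution-free tokenization step (400 samples = 25 ms at 16 kHz)."""
    n = len(audio) // patch
    return audio[: n * patch].reshape(n, patch)

tokens = patchify(np.zeros(16000), patch=400)   # 1 s of 16 kHz audio
```

Each row would then be linearly projected into the model dimension before the attention layers; the paper's learned front end effectively discovers its own filter bank over these raw samples.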

[AI-94] Additive-feature-attribution methods: a review on explainable artificial intelligence for fluid dynamics and heat transfer

链接: https://arxiv.org/abs/2409.11992
作者: Andrés Cremades,Sergio Hoyas,Ricardo Vinuesa
关键词-EN: recent years due, turbulent flows, experimental tests, mechanics has surged, surged dramatically
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The use of data-driven methods in fluid mechanics has surged dramatically in recent years due to their capacity to adapt to the complex and multi-scale nature of turbulent flows, as well as to detect patterns in large-scale simulations or experimental tests. In order to interpret the relationships generated in the models during the training process, numerical attributions need to be assigned to the input features. One important example is the family of additive-feature-attribution methods. These explainability methods link the input features with the model prediction, providing an interpretation based on a linear formulation of the models. The SHapley Additive exPlanations (SHAP values) are formulated as the only possible interpretation that offers a unique solution for understanding the model. In this manuscript, the additive-feature-attribution methods are presented, showing four common implementations in the literature: kernel SHAP, tree SHAP, gradient SHAP, and deep SHAP. Then, the main applications of the additive-feature-attribution methods are introduced, dividing them into three main groups: turbulence modeling, fluid-mechanics fundamentals, and applied problems in fluid dynamics and heat transfer. This review shows that explainability techniques, and in particular additive-feature-attribution methods, are crucial for implementing interpretable and physics-compliant deep-learning models in the fluid-mechanics field.
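The additive property these methods share is easy to verify in the one case with a closed form: a linear model with independent features, where the exact SHAP value of feature i is w_i * (x_i - E[x_i]) and the attributions plus the base value recover the prediction. The model and data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                 # synthetic "flow features"
w, b = np.array([2.0, -1.0, 0.5]), 0.3
predict = lambda X: X @ w + b                 # linear surrogate model

x = X[0]
base = predict(X).mean()                      # expected model output E[f(X)]
phi = w * (x - X.mean(axis=0))                # exact SHAP values (linear case)
```

The additivity check `base + phi.sum() == predict(x)` is exactly what kernel/tree/gradient/deep SHAP approximate for nonlinear models.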

[AI-95] Smart Data-Driven GRU Predictor for SnO_2 Thin films Characteristics

链接: https://arxiv.org/abs/2409.11782
作者: Faiza Bouamra,Mohamed Sayah,Labib Sadek Terrissa,Noureddine Zerhouni
关键词-EN: foremost crucial, crucial for obtaining, material physics, Gated Recurrent Unit, thin films
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注: 19 pages, 14 figures. Baltica Journal, Special Issues, September 2024

点击查看摘要

Abstract:In material physics, characterization techniques are crucial for obtaining materials data regarding physical properties as well as structural, electronic, magnetic, optic, dielectric, and spectroscopic characteristics. However, for many materials, ensuring availability and safe accessibility is not always easy and fully warranted. Moreover, the use of modeling and simulation techniques requires extensive theoretical knowledge, in addition to being associated with costly computation time and a great deal of complexity. Thus, analyzing materials with different techniques for multiple samples simultaneously remains very challenging for engineers and researchers. It is worth noting that, although risky, X-ray diffraction is the well-known and widely used characterization technique which gathers data on the structural properties of crystalline 1d, 2d, or 3d materials. We propose in this paper a smart GRU (Gated Recurrent Unit) model to forecast structural characteristics or properties of thin films of tin oxide SnO_2 (110). Indeed, thin film samples are elaborated and managed experimentally, and the collected data dictionary is then used to generate an AI (Artificial Intelligence) GRU model for the structural property characterization of SnO_2 (110) thin films.
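The recurrent unit behind such a predictor is the standard GRU cell. A single forward step (biases omitted for brevity; shapes and the zero-weight example are assumptions for illustration):

```python
import numpy as np

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step. Shapes: x (d_x,), h (d_h,), W*: (d_h, d_x), U*: (d_h, d_h)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_tilde        # interpolate old and candidate

# With all-zero weights the gates sit at 0.5 and the candidate is 0,
# so the new state is simply half the old one.
h0 = np.array([1.0, -2.0])
out = gru_cell(np.zeros(2), h0, *([np.zeros((2, 2))] * 6))
```

In the paper's setting, the input sequence would be the experimentally collected characterization data for the thin-film samples, and the hidden state carries the learned structural trend forward.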

[AI-96] How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities

链接: https://arxiv.org/abs/2409.11654
作者: Charlotte Bunne,Yusuf Roohani,Yanay Rosen,Ankit Gupta,Xikun Zhang,Marcel Roed,Theo Alexandrov,Mohammed AlQuraishi,Patricia Brennan,Daniel B. Burkhardt,Andrea Califano,Jonah Cool,Abby F. Dernburg,Kirsty Ewing,Emily B. Fox,Matthias Haury,Amy E. Herr,Eric Horvitz,Patrick D. Hsu,Viren Jain,Gregory R. Johnson,Thomas Kalil,David R. Kelley,Shana O. Kelley,Anna Kreshuk,Tim Mitchison,Stephani Otte,Jay Shendure,Nicholas J. Sofroniew,Fabian Theis,Christina V. Theodoris,Srigokul Upadhyayula,Marc Valer,Bo Wang,Eric Xing,Serena Yeung-Levy,Marinka Zitnik,Theofanis Karaletsos,Aviv Regev,Emma Lundberg,Jure Leskovec,Stephen R. Quake
关键词-EN: Virtual Cells, arguably the smallest, smallest unit, unit of life, cells
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:The cell is arguably the smallest unit of life and is central to understanding biology. Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease. Recent advances in artificial intelligence (AI), combined with the ability to generate large-scale experimental data, present novel opportunities to model cells. Here we propose a vision of AI-powered Virtual Cells, where robust representations of cells and cellular systems under different conditions are directly learned from growing biological data across measurements and scales. We discuss desired capabilities of AI Virtual Cells, including generating universal representations of biological entities across scales, and facilitating interpretable in silico experiments to predict and understand their behavior using Virtual Instruments. We further address the challenges, opportunities and requirements to realize this vision including data needs, evaluation strategies, and community standards and engagement to ensure biological accuracy and broad utility. We envision a future where AI Virtual Cells help identify new drug targets, predict cellular responses to perturbations, as well as scale hypothesis exploration. With open science collaborations across the biomedical ecosystem that includes academia, philanthropy, and the biopharma and AI industries, a comprehensive predictive understanding of cell mechanisms and interactions is within reach.

[AI-97] Few-Shot Learning Approach on Tuberculosis Classification Based on Chest X-Ray Images

链接: https://arxiv.org/abs/2409.11644
作者: A.A.G. Yogi Pramana,Faiz Ihza Permana,Muhammad Fazil Maulana,Dzikri Rahadian Fudholi
关键词-EN: bacterium Mycobacterium tuberculosis, Mycobacterium tuberculosis, bacterium Mycobacterium, primarily affecting, affecting the lungs
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages. Pre-print

点击查看摘要

Abstract:Tuberculosis (TB) is caused by the bacterium Mycobacterium tuberculosis, primarily affecting the lungs. Early detection is crucial for improving treatment effectiveness and reducing transmission risk. Artificial intelligence (AI), particularly through image classification of chest X-rays, can assist in TB detection. However, class imbalance in TB chest X-ray datasets presents a challenge for accurate classification. In this paper, we propose a few-shot learning (FSL) approach using the Prototypical Network algorithm to address this issue. We compare the performance of ResNet-18, ResNet-50, and VGG16 in feature extraction from the TBX11K Chest X-ray dataset. Experimental results demonstrate classification accuracies of 98.93% for ResNet-18, 98.60% for ResNet-50, and 33.33% for VGG16. These findings indicate that the proposed method outperforms others in mitigating data imbalance, which is particularly beneficial for disease classification applications.
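The Prototypical Network classifier reduces to class-mean prototypes and nearest-prototype assignment over backbone embeddings. A minimal sketch with toy 2-D features standing in for the ResNet/VGG features of the paper:

```python
import numpy as np

def proto_classify(support, support_labels, queries):
    """Prototypical Network decision rule: each class is the mean of its
    support embeddings; queries go to the nearest prototype (sq. Euclidean)."""
    classes = np.unique(support_labels)
    protos = np.stack([support[support_labels == c].mean(axis=0) for c in classes])
    d = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes[np.argmin(d, axis=1)]

support = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]])
labels = np.array([0, 0, 1, 1])                  # e.g. 0 = healthy, 1 = TB
pred = proto_classify(support, labels, np.array([[0.1, 0.1], [1.1, 0.9]]))
```

Because prototypes are class means over however few support examples exist, the rule is insensitive to class imbalance in the way the abstract highlights.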

[AI-98] Harnessing AI data-driven global weather models for climate attribution: An analysis of the 2017 Oroville Dam extreme atmospheric river

链接: https://arxiv.org/abs/2409.11605
作者: Jorge Baño-Medina,Agniv Sengupta,Allison Michaelis,Luca Delle Monache,Julie Kalansky,Duncan Watson-Parris
关键词-EN: short inference times, provide real time, Pangu Weather, real time attributions, inference times
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
*备注: This Work has been submitted to Artificial Intelligence for the Earth Systems

点击查看摘要

Abstract:AI data-driven models (Graphcast, Pangu Weather, Fourcastnet, and SFNO) are explored for storyline-based climate attribution due to their short inference times, which can accelerate the number of events studied, and provide real time attributions when public attention is heightened. The analysis is framed on the extreme atmospheric river episode of February 2017 that contributed to the Oroville dam spillway incident in Northern California. Past and future simulations are generated by perturbing the initial conditions with the pre-industrial and the late-21st century temperature climate change signals, respectively. The simulations are compared to results from a dynamical model which represents plausible pseudo-realities under both climate environments. Overall, the AI models show promising results, projecting a 5-6 % increase in the integrated water vapor over the Oroville dam in the present day compared to the pre-industrial, in agreement with the dynamical model. Different geopotential-moisture-temperature dependencies are unveiled for each of the AI-models tested, providing valuable information for understanding the physicality of the attribution response. However, the AI models tend to simulate weaker attribution values than the pseudo-reality imagined by the dynamical model, suggesting some reduced extrapolation skill, especially for the late-21st century regime. Large ensembles generated with an AI model (500 members) produced statistically significant present-day to pre-industrial attribution results, unlike the 20-member ensemble from the dynamical model. This analysis highlights the potential of AI models to conduct attribution analysis, while emphasizing future lines of work on explainable artificial intelligence to gain confidence in these tools, which can enable reliable attribution studies in real-time.

[AI-99] Uncertainty Decomposition and Error Margin Detection of Homodyned-K Distribution in Quantitative Ultrasound

链接: https://arxiv.org/abs/2409.11583
作者: Dorsa Ameri,Ali K. Z. Tehrani,Ivan M. Rosado-Mendez,Hassan Rivaz
关键词-EN: Bayesian Neural Networks, Homodyned K-distribution, Neural Networks, Bayesian Neural, quantitative ultrasound
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Medical Physics (physics.med-ph); Machine Learning (stat.ML)
*备注: 4 pages, 2 figures

点击查看摘要

Abstract:Homodyned K-distribution (HK-distribution) parameter estimation in quantitative ultrasound (QUS) has been recently addressed using Bayesian Neural Networks (BNNs). BNNs have been shown to significantly reduce computational time in speckle statistics-based QUS without compromising accuracy and precision. Additionally, they provide estimates of feature uncertainty, which can guide the clinician’s trust in the reported feature value. The total predictive uncertainty in Bayesian modeling can be decomposed into epistemic (uncertainty over the model parameters) and aleatoric (uncertainty inherent in the data) components. By decomposing the predictive uncertainty, we can gain insights into the factors contributing to the total uncertainty. In this study, we propose a method to compute epistemic and aleatoric uncertainties for HK-distribution parameters ( \alpha and k ) estimated by a BNN, in both simulation and experimental data. In addition, we investigate the relationship between the prediction error and both uncertainties, shedding light on the interplay between these uncertainties and HK parameters errors.
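The decomposition itself follows the law of total variance over Monte Carlo weight samples: aleatoric = E_theta[Var(y|theta)], epistemic = Var_theta(E[y|theta]). A sketch with synthetic per-sample predictive means and variances standing in for a BNN's outputs:

```python
import numpy as np

def decompose(mc_means, mc_vars):
    """Split total predictive variance from MC samples of a BNN.
    mc_means/mc_vars: (n_samples, n_params), one row per weight sample."""
    aleatoric = mc_vars.mean(axis=0)   # average data noise across samples
    epistemic = mc_means.var(axis=0)   # spread of predictions across samples
    return aleatoric, epistemic, aleatoric + epistemic

rng = np.random.default_rng(0)
mc_means = rng.normal(1.5, 0.2, size=(100, 1))   # e.g. alpha estimates
mc_vars = np.full((100, 1), 0.05)                # assumed constant data noise
alea, epis, total = decompose(mc_means, mc_vars)
```

A large epistemic share suggests the model itself is uncertain (more training data would help), while a large aleatoric share points to noise inherent in the speckle statistics.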

[AI-100] Automating proton PBS treatment planning for head and neck cancers using policy gradient-based deep reinforcement learning

Link: https://arxiv.org/abs/2409.11576
Authors: Qingqing Wang, Chang Chang
Keywords-EN: pencil beam scanning, Proton pencil beam, planning, planning objectives, beam scanning
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Proton pencil beam scanning (PBS) treatment planning for head and neck (HN) cancers is a time-consuming and experience-demanding task where a large number of planning objectives are involved. Deep reinforcement learning (DRL) has recently been introduced to the planning processes of intensity-modulated radiation therapy and brachytherapy for prostate, lung, and cervical cancers. However, existing approaches are built upon the Q-learning framework and weighted linear combinations of clinical metrics, suffering from poor scalability and flexibility and only capable of adjusting a limited number of planning objectives in discrete action spaces. We propose an automatic treatment planning model using the proximal policy optimization (PPO) algorithm and a dose distribution-based reward function for proton PBS treatment planning of HN cancers. Specifically, a set of empirical rules is used to create auxiliary planning structures from target volumes and organs-at-risk (OARs), along with their associated planning objectives. These planning objectives are fed into an in-house optimization engine to generate the spot monitor unit (MU) values. A decision-making policy network trained using PPO is developed to iteratively adjust the involved planning objective parameters in a continuous action space and refine the PBS treatment plans using a novel dose distribution-based reward function. Proton HN treatment plans generated by the model show improved OAR sparing with equal or superior target coverage when compared with human-generated plans. Moreover, additional experiments on liver cancer demonstrate that the proposed method can be successfully generalized to other treatment sites. To the best of our knowledge, this is the first DRL-based automatic treatment planning model capable of achieving human-level performance for HN cancers.
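
The paper trains a PPO policy; as a rough, self-contained illustration of the underlying idea (a policy-gradient update on a continuous planning-objective parameter), here is a toy REINFORCE loop with a made-up quadratic stand-in for the dose-distribution reward. Nothing below comes from the authors' planner.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, lr = 0.0, 0.5, 0.02  # Gaussian policy mean, fixed std, step size
baseline = 0.0
best_weight = 2.0  # imaginary optimum of one planning-objective weight

for _ in range(3000):
    a = rng.normal(theta, sigma)        # continuous action: propose a new weight
    reward = -(a - best_weight) ** 2    # toy stand-in for a dose-distribution reward
    baseline = 0.95 * baseline + 0.05 * reward          # variance-reduction baseline
    grad = (reward - baseline) * (a - theta) / sigma**2  # REINFORCE estimator
    theta += lr * grad                  # ascend toward higher expected reward
```

The policy mean drifts toward the reward's optimum; PPO replaces this raw estimator with a clipped surrogate objective for stability.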

[AI-101] Multi-Domain Data Aggregation for Axon and Myelin Segmentation in Histology Images

Link: https://arxiv.org/abs/2409.11552
Authors: Armand Collin, Arthur Boschet, Mathieu Boudreau, Julien Cohen-Adad
Keywords-EN: Quantifying axon, neurodegenerative diseases, provide useful information, information about microstructural, microstructural changes caused
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures

Click to view abstract

Abstract:Quantifying axon and myelin properties (e.g., axon diameter, myelin thickness, g-ratio) in histology images can provide useful information about microstructural changes caused by neurodegenerative diseases. Automatic tissue segmentation is an important tool for these datasets, as a single stained section can contain up to thousands of axons. Advances in deep learning have made this task quick and reliable with minimal overhead, but a deep learning model trained by one research group will hardly ever be usable by other groups due to differences in their histology training data. This is partly due to subject diversity (different body parts, species, genetics, pathologies) and also to the range of modern microscopy imaging techniques resulting in a wide variability of image features (i.e., contrast, resolution). There is a pressing need to make AI accessible to neuroscience researchers to facilitate and accelerate their workflow, but publicly available models are scarce and poorly maintained. Our approach is to aggregate data from multiple imaging modalities (bright field, electron microscopy, Raman spectroscopy) and species (mouse, rat, rabbit, human), to create an open-source, durable tool for axon and myelin segmentation. Our generalist model makes it easier for researchers to process their data and can be fine-tuned for better performance on specific domains. We study the benefits of different aggregation schemes. This multi-domain segmentation model performs better than single-modality dedicated learners (p=0.03077), generalizes better on out-of-distribution data and is easier to use and maintain. Importantly, we package the segmentation tool into a well-maintained open-source software ecosystem (see this https URL).

[AI-102] NCT-CRC-HE: Not All Histopathological Datasets Are Equally Useful

Link: https://arxiv.org/abs/2409.11546
Authors: Andrey Ignatov, Grigory Malivenko
Keywords-EN: Numerous deep learning-based, deep learning-based solutions, past years, deep learning-based, Numerous deep
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Numerous deep learning-based solutions have been proposed for histopathological image analysis over the past years. While they usually demonstrate exceptionally high accuracy, one key question is whether their precision might be affected by low-level image properties not related to histopathology but caused by microscopy image handling and pre-processing. In this paper, we analyze a popular NCT-CRC-HE-100K colorectal cancer dataset used in numerous prior works and show that both this dataset and the obtained results may be affected by data-specific biases. The most prominent revealed dataset issues are inappropriate color normalization, severe JPEG artifacts inconsistent between different classes, and completely corrupted tissue samples resulting from incorrect image dynamic range handling. We show that even the simplest model using only 3 features per image (red, green and blue color intensities) can demonstrate over 50% accuracy on this 9-class dataset, while using color histogram not explicitly capturing cell morphology features yields over 82% accuracy. Moreover, we show that a basic EfficientNet-B0 ImageNet pretrained model can achieve over 97.7% accuracy on this dataset, outperforming all previously proposed solutions developed for this task, including dedicated foundation histopathological models and large cell morphology-aware neural networks. The NCT-CRC-HE dataset is publicly available and can be freely used to replicate the presented results. The codes and pre-trained models used in this paper are available at this https URL
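
The paper's headline finding (3 color intensities alone separate classes) is easy to reproduce in spirit on synthetic data: when classes differ only in average stain color, a nearest-centroid classifier on per-image mean RGB already succeeds. The class colors below are invented for illustration, not taken from NCT-CRC-HE.

```python
import numpy as np

rng = np.random.default_rng(0)

def rgb_features(img):
    """3 features per image: mean red, green and blue intensity."""
    return img.reshape(-1, 3).mean(axis=0)

# Two hypothetical tissue classes that differ only in average stain color --
# exactly the kind of low-level bias the paper warns about.
class_means = {0: np.array([180.0, 120.0, 160.0]), 1: np.array([140.0, 100.0, 200.0])}
X, y = [], []
for label, mu in class_means.items():
    for _ in range(50):
        img = rng.normal(mu, 15.0, size=(32, 32, 3))
        X.append(rgb_features(img))
        y.append(label)
X, y = np.array(X), np.array(y)

# Nearest-centroid classifier trained on the 3 color features alone
centroids = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
accuracy = (pred == y).mean()
```

High accuracy here says nothing about cell morphology; that is the paper's point about dataset bias.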

[AI-103] Federated Learning with Quantum Computing and Fully Homomorphic Encryption: A Novel Computing Paradigm Shift in Privacy-Preserving ML

Link: https://arxiv.org/abs/2409.11430
Authors: Siddhant Dutta, Pavana P Karanth, Pedro Maciel Xavier, Iago Leal de Freitas, Nouhaila Innan, Sadok Ben Yahia, Muhammad Shafique, David E. Bernal Neira
Keywords-EN: information security worldwide, Fully Homomorphic Encryption, widespread deployment, deployment of products, products powered
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:The widespread deployment of products powered by machine learning models is raising concerns around data privacy and information security worldwide. To address this issue, Federated Learning was first proposed as a privacy-preserving alternative to conventional methods that allow multiple learning clients to share model knowledge without disclosing private data. A complementary approach known as Fully Homomorphic Encryption (FHE) is a quantum-safe cryptographic system that enables operations to be performed on encrypted weights. However, implementing mechanisms such as these in practice often comes with significant computational overhead and can expose potential security threats. Novel computing paradigms, such as analog, quantum, and specialized digital hardware, present opportunities for implementing privacy-preserving machine learning systems while enhancing security and mitigating performance loss. This work instantiates these ideas by applying the FHE scheme to a Federated Learning Neural Network architecture that integrates both classical and quantum layers.
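
A full FHE pipeline needs a real cryptosystem; as a lightweight stand-in that shows why the aggregation server never needs individual plaintext updates, here is mask-based secure aggregation (a related but distinct privacy technique, not the paper's FHE scheme): clients add random masks that cancel in the sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three clients with private model updates (weight vectors)
updates = [rng.normal(size=4) for _ in range(3)]

# Pairwise-cancelling masks: client i sends u_i + m_i with sum(m_i) == 0,
# so no single masked update reveals that client's true weights.
masks = [rng.normal(size=4) for _ in range(2)]
masks.append(-sum(masks))  # last mask cancels the others

masked = [u + m for u, m in zip(updates, masks)]
aggregate = sum(masked) / len(masked)   # server-side federated average
true_avg = sum(updates) / len(updates)  # what honest FedAvg would compute
```

The server recovers the exact average while each individual contribution stays hidden behind its mask; FHE achieves the same property cryptographically, with operations performed directly on ciphertexts.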

Computer Vision

[CV-0] Vista3D: Unravel the 3D Darkside of a Single Image ECCV’2024

Link: https://arxiv.org/abs/2409.12193
Authors: Qiuhong Shen, Xingyi Yang, Michael Bi Mi, Xinchao Wang
Keywords-EN: age-old quest, unveiling the hidden, hidden dimensions, Gaussian Splatting, mere glimpses
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multimedia (cs.MM)
Comments: ECCV’2024

Click to view abstract

Abstract:We embark on the age-old quest: unveiling the hidden dimensions of objects from mere glimpses of their visible parts. To address this, we present Vista3D, a framework that realizes swift and consistent 3D generation within a mere 5 minutes. At the heart of Vista3D lies a two-phase approach: the coarse phase and the fine phase. In the coarse phase, we rapidly generate initial geometry with Gaussian Splatting from a single image. In the fine phase, we extract a Signed Distance Function (SDF) directly from learned Gaussian Splatting, optimizing it with a differentiable isosurface representation. Furthermore, it elevates the quality of generation by using a disentangled representation with two independent implicit functions to capture both visible and obscured aspects of objects. Additionally, it harmonizes gradients from 2D diffusion prior with 3D-aware diffusion priors by angular diffusion prior composition. Through extensive evaluation, we demonstrate that Vista3D effectively sustains a balance between the consistency and diversity of the generated 3D objects. Demos and code will be available at this https URL.

[CV-1] DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

Link: https://arxiv.org/abs/2409.12192
Authors: Zichen Jeff Cui, Hengkai Pan, Aadhithya Iyer, Siddhant Haldar, Lerrel Pinto
Keywords-EN: complex visuomotor policies, training complex visuomotor, visuomotor policies, powerful tool, tool for training
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives, and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure its impact on downstream policy performance. Robot videos are best viewed at this https URL
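
The joint objective is concrete enough to sketch: an inverse model infers a latent action from consecutive embeddings, a forward model predicts the next embedding from the current one plus that action, and both are trained on next-frame prediction error with no ground-truth actions. The linear toy below (dimensions, data, and learning rate are ours, not DynaMo's architecture) illustrates the coupling:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2  # embedding dim, latent-action dim (both made up)

# Synthetic "image embedding" sequence driven by unobserved linear dynamics
A_true = 0.9 * np.eye(d) + 0.05 * rng.normal(size=(d, d))
z = [rng.normal(size=d)]
for _ in range(63):
    z.append(A_true @ z[-1] + 0.1 * rng.normal(size=d))
z = np.array(z)

W_inv = 0.1 * rng.normal(size=(k, 2 * d))  # inverse model: action from (z_t, z_{t+1})
W_fwd = 0.1 * rng.normal(size=(d, d + k))  # forward model: z_{t+1} from (z_t, action)

def loss_and_grads():
    g_inv, g_fwd, total = np.zeros_like(W_inv), np.zeros_like(W_fwd), 0.0
    for t in range(len(z) - 1):
        u = np.concatenate([z[t], z[t + 1]])
        a = W_inv @ u                       # latent action: no ground-truth actions used
        v = np.concatenate([z[t], a])
        e = W_fwd @ v - z[t + 1]            # next-embedding prediction error
        total += e @ e
        g_fwd += 2 * np.outer(e, v)                      # dL/dW_fwd
        g_inv += np.outer(2 * W_fwd[:, d:].T @ e, u)     # chain rule through the action
    n = len(z) - 1
    return total / n, g_inv / n, g_fwd / n

first_loss, _, _ = loss_and_grads()
for _ in range(300):
    _, g_inv, g_fwd = loss_and_grads()
    W_inv -= 0.05 * g_inv
    W_fwd -= 0.05 * g_fwd
final_loss, _, _ = loss_and_grads()
```

The single prediction loss drives both models at once, which is the self-supervised signal DynaMo relies on instead of action labels.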

[CV-2] Qwen2-VL: Enhancing Vision-Language Models Perception of the World at Any Resolution

Link: https://arxiv.org/abs/2409.12191
Authors: Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
Keywords-EN: Naive Dynamic Resolution, conventional predetermined-resolution approach, previous Qwen-VL models, Dynamic Resolution mechanism, Rotary Position Embedding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code is available at this https URL. arXiv admin note: text overlap with arXiv:2408.15262 by other authors

Click to view abstract

Abstract:We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model’s visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at this https URL.
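
Under dynamic resolution, the number of visual tokens follows directly from image size; the sketch below assumes 14-pixel patches with 2x2 patch merging, which are illustrative defaults rather than confirmed model settings.

```python
import math

def visual_token_count(height, width, patch=14, merge=2):
    """Token count under a hypothetical dynamic-resolution scheme:
    cut the image into patch x patch patches, then merge merge x merge
    neighbouring patches into one visual token."""
    h_patches = math.ceil(height / patch)
    w_patches = math.ceil(width / patch)
    return math.ceil(h_patches / merge) * math.ceil(w_patches / merge)
```

A 224x224 image then yields 8x8 = 64 tokens, while doubling one side doubles the token budget, instead of resizing every input to a single fixed resolution.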

[CV-3] Bundle Adjustment in the Eager Mode

Link: https://arxiv.org/abs/2409.12190
Authors: Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang
Keywords-EN: Bundle adjustment, augmented reality, robotic applications, localization and mapping, critical technique
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Bundle adjustment (BA) is a critical technique in various robotic applications, such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA frameworks, such as GTSAM, g2o, and Ceres, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, adaptability, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA framework seamlessly integrated with PyPose, providing PyTorch-compatible interfaces with high efficiency. Our approach includes GPU-accelerated, differentiable, and sparse operations designed for 2nd-order optimization, Lie group and Lie algebra operations, and linear solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving average speedups of 18.5×, 22×, and 23× compared to GTSAM, g2o, and Ceres, respectively.
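
At the core of any BA solver is 2nd-order least-squares optimization of reprojection error. The toy below refines a single 3D landmark with Gauss-Newton under two fixed, axis-aligned pinhole cameras (all values invented); a real BA jointly optimizes poses and many landmarks with sparse solvers.

```python
import numpy as np

f, cx, cy = 500.0, 320.0, 240.0                 # made-up pinhole intrinsics
cams = [np.array([0.0, 0.0, 0.0]),
        np.array([1.0, 0.0, 0.0])]              # two camera centres, identity rotation
X_true = np.array([0.5, 0.2, 4.0])              # ground-truth landmark

def project(X, c):
    x, y, z = X - c
    return np.array([f * x / z + cx, f * y / z + cy])

obs = [project(X_true, c) for c in cams]        # noise-free observations

X = np.array([0.0, 0.0, 3.0])                   # poor initial landmark estimate
for _ in range(10):
    r, J = [], []
    for c, uv in zip(cams, obs):
        x, y, z = X - c
        r.append(np.array([f * x / z + cx, f * y / z + cy]) - uv)
        # Analytic Jacobian of the reprojection residual w.r.t. (x, y, z)
        J.append(np.array([[f / z, 0.0, -f * x / z**2],
                           [0.0, f / z, -f * y / z**2]]))
    r, J = np.concatenate(r), np.vstack(J)
    X += np.linalg.solve(J.T @ J, -J.T @ r)     # Gauss-Newton step
```

The normal equations J^T J dx = -J^T r are exactly what the 2nd-order sparse operations in the paper accelerate at scale.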

[CV-4] Massively Multi-Person 3D Human Motion Forecasting with Scene Context

Link: https://arxiv.org/abs/2409.12189
Authors: Felix B Mueller, Julian Tanke, Juergen Gall
Keywords-EN: Forecasting long-term, human behavior makes, generate realistic human, behavior makes, makes it hard
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures

Click to view abstract

Abstract:Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion. Unlike previous models, our approach can model interactions between both widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at this https URL.

[CV-5] NSSR-DIL: Null-Shot Image Super-Resolution Using Deep Identity Learning

Link: https://arxiv.org/abs/2409.12165
Authors: Sree Rama Vamsidhar S, Rama Krishna Gorthi
Keywords-EN: Deep Identity Learning, employ Deep Learning, existing SotA ISR, ISR, ISR task
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The present State-of-the-Art (SotA) Image Super-Resolution (ISR) methods employ Deep Learning (DL) techniques using a large amount of image data. The primary limitation to extending the existing SotA ISR works for real-world instances is their computational and time complexities. In this paper, contrary to the existing methods, we present a novel and computationally efficient ISR algorithm that is independent of the image dataset to learn the ISR task. The proposed algorithm reformulates the ISR task from generating the Super-Resolved (SR) images to computing the inverse of the kernels that span the degradation space. We introduce Deep Identity Learning, exploiting the identity relation between the degradation and inverse degradation models. The proposed approach neither relies on the ISR dataset nor on a single input low-resolution (LR) image (like the self-supervised method ZSSR) to model the ISR task. Hence we term our model as Null-Shot Super-Resolution Using Deep Identity Learning (NSSR-DIL). The proposed NSSR-DIL model requires fewer computational resources, at least by an order of 10, and demonstrates a competitive performance on benchmark ISR datasets. Another salient aspect of our proposition is that the NSSR-DIL framework circumvents retraining: the model remains the same for varying scale factors like X2, X3, and X4. This makes our highly efficient ISR model more suitable for real-world applications.
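
The identity relation the method exploits (degradation kernel composed with its learned inverse should equal the identity) can be illustrated with a closed-form Wiener-style inverse in the frequency domain; this stand-in is ours, not the paper's learned model.

```python
import numpy as np

def approx_inverse_kernel(k, size, lam=1e-3):
    """Regularized frequency-domain inverse: K_inv = conj(K) / (|K|^2 + lam)."""
    K = np.fft.fft(k, size)
    K_inv = np.conj(K) / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft(K_inv))

size = 64
k = np.array([0.25, 0.5, 0.25])          # simple blur (degradation) kernel
k_inv = approx_inverse_kernel(k, size)

# Circular convolution of k with its inverse should approximate a delta,
# i.e. the identity operator -- the relation Deep Identity Learning enforces.
ident = np.real(np.fft.ifft(np.fft.fft(k, size) * np.fft.fft(k_inv)))
```

The regularizer lam keeps the inverse bounded where the blur kernel has near-zero frequency response, so the composition is only approximately a delta.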

[CV-6] Precise Forecasting of Sky Images Using Spatial Warping

Link: https://arxiv.org/abs/2409.12162
Authors: Leron Julian, Aswin C. Sankaranarayanan
Keywords-EN: key factors inhibiting, due to occlusion, residential settings, key factors, factors inhibiting
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The intermittency of solar power, due to occlusion from cloud cover, is one of the key factors inhibiting its widespread use in both commercial and residential settings. Hence, real-time forecasting of solar irradiance for grid-connected photovoltaic systems is necessary to schedule and allocate resources across the grid. Ground-based imagers that capture wide field-of-view images of the sky are commonly used to monitor cloud movement around a particular site in an effort to forecast solar irradiance. However, these wide-FOV imagers capture a distorted image of the sky, where regions near the horizon are heavily compressed. This hinders the ability to precisely predict cloud motion near the horizon, which especially affects prediction over longer time horizons. In this work, we combat the aforementioned constraint by introducing a deep learning method to predict a future sky image frame with higher resolution than previous methods. Our main contribution is to derive an optimal warping method to counter the adverse effects of clouds at the horizon, and learn a framework for future sky image prediction which better determines cloud evolution for longer time horizons.

[CV-7] JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation BMVC2024

Link: https://arxiv.org/abs/2409.12156
Authors: Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras
Keywords-EN: talking face generation, audio-guided talking face, face generation, joint expression, expression
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by BMVC 2024. Project Page: this https URL

Click to view abstract

Abstract:We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.
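
The contrastive alignment of audio features to lip motion typically uses an InfoNCE-style loss over matched pairs in a batch; the numpy sketch below is a generic version of that idea, not the paper's exact loss.

```python
import numpy as np

def info_nce(audio, lips, temperature=0.1):
    """Symmetric-batch InfoNCE over paired audio / lip-motion embeddings.

    audio, lips: (B, D) arrays where row i of each is a matched pair.
    """
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    l = lips / np.linalg.norm(lips, axis=1, keepdims=True)
    logits = a @ l.T / temperature                       # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # matched pairs on the diagonal
```

Minimizing this pulls each audio embedding toward its own lip-motion embedding and away from the rest of the batch, which is what disentangles lip-related audio content from the rest of the face motion.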

[CV-8] MoRAG – Multi-Fusion Retrieval Augmented Generation for Human Motion

Link: https://arxiv.org/abs/2409.12140
Authors: Kalakonda Sai Shashank, Shubh Maheshwari, Ravi Kiran Sarvadevabhatla
Keywords-EN: based retrieval-augmented generation, fusion based retrieval-augmented, retrieval-augmented generation strategy, human motion generation, multi-part fusion based
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:We introduce MoRAG, a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. The method enhances motion diffusion models by leveraging additional knowledge obtained through an improved motion retrieval process. By effectively prompting large language models (LLMs), we address spelling errors and rephrasing issues in motion retrieval. Our approach utilizes a multi-part retrieval strategy to improve the generalizability of motion retrieval across the language space. We create diverse samples through the spatial composition of the retrieved motions. Furthermore, by utilizing low-level, part-specific motion information, we can construct motion samples for unseen text descriptions. Our experiments demonstrate that our framework can serve as a plug-and-play module, improving the performance of motion diffusion models. Code, pretrained models and sample videos will be made available at: this https URL

[CV-9] Applications of Knowledge Distillation in Remote Sensing: A Survey

Link: https://arxiv.org/abs/2409.12111
Authors: Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad
Keywords-EN: balance model accuracy, remote sensing, ever-growing complexity, increasing demand, demand for solutions
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 50 pages, 11 figures and 9 tables

Click to view abstract

Abstract:With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review article provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. Initially, we introduce the fundamental concepts and historical progression of KD methods. The advantages of employing KD are highlighted, particularly in terms of model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, where each category is critically analyzed to demonstrate the breadth and depth of the alternative options, and illustrates specific case studies that showcase the practical implementation of KD methods in RS tasks, such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field of RS. Through this organization, the paper not only elucidates the current state of research in KD but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real-world applications.
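
The classic teacher-student formulation behind most surveyed KD methods is Hinton-style soft-target distillation: blend a temperature-softened cross-entropy against the teacher (scaled by T^2) with the usual hard-label cross-entropy. A minimal sketch; the temperature and blend weight below are illustrative defaults, not values from the survey.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hinton-style KD: soft-target cross-entropy (scaled by T^2) + hard CE."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * (T ** 2)
    log_p = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1 - alpha) * hard
```

A higher temperature exposes the teacher's "dark knowledge" (relative probabilities of wrong classes), which is what the compact student model learns from.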

[CV-10] SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba

Link: https://arxiv.org/abs/2409.12108
Authors: Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Chengfeng Zhou, Zhengjie Zhang, Xiaobo Li, Dahong Qian
Keywords-EN: Endoscopic Submucosal Dissection, Endoscopic Submucosal, Submucosal Dissection, early gastric cancer, surgical phase recognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially designed for the treatment of early gastric cancer but is now widely used for various gastrointestinal lesions. Computer-assisted Surgery systems have played a crucial role in improving the precision and safety of ESD procedures; however, their effectiveness is limited by the accurate recognition of surgical phases. The intricate nature of ESD, with different lesion characteristics and tissue structures, presents challenges for real-time surgical phase recognition algorithms. Existing surgical phase recognition algorithms struggle to efficiently capture temporal contexts in video-based scenarios, leading to insufficient performance. To address these issues, we propose SPRMamba, a novel Mamba-based framework for ESD surgical phase recognition. SPRMamba leverages the strengths of Mamba for long-term temporal modeling while introducing the Scaled Residual TranMamba block to enhance the capture of fine-grained details, overcoming the limitations of traditional temporal models like Temporal Convolutional Networks and Transformers. Moreover, a Temporal Sample Strategy is introduced to accelerate the processing, which is essential for real-time phase recognition in clinical settings. Extensive testing on the ESD385 dataset and the cholecystectomy Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art methods and exhibits greater robustness across various surgical phase recognition tasks.

[CV-11] Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance

Link: https://arxiv.org/abs/2409.12099
Authors: Jaehoon Joo, Taejin Jeong, Seongjae Hwang
Keywords-EN: Understanding how humans, humans process visual, Latent Diffusion Model, humans process, crucial steps
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit how modern LDMs effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generations. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.

[CV-12] Online Refractive Camera Model Calibration in Visual Inertial Odometry IROS2024

Link: https://arxiv.org/abs/2409.12074
Authors: Mohit Singh, Kostas Alexis
Keywords-EN: general refractive camera, refractive index, paper presents, presents a general, refractive camera model
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024), 8 pages

Click to view abstract

Abstract:This paper presents a general refractive camera model and online co-estimation of odometry and the refractive index of unknown media. This enables operation in diverse and varying refractive fluids, given only the camera calibration in air. The refractive index is estimated online as a state variable of a monocular visual-inertial odometry framework in an iterative formulation using the proposed camera model. The method was verified on data collected using an underwater robot traversing inside a pool. The evaluations demonstrate convergence to the ideal refractive index for water despite significant perturbations in the initialization. Simultaneously, the approach enables on-par visual-inertial odometry performance in refractive media without prior knowledge of the refractive index or requirement of medium-specific camera calibration.
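
The physical core of any refractive camera model is Snell's law applied at the media interface; a unit ray can be refracted in vector form as below (a generic textbook formula, not the paper's full model with online index estimation).

```python
import numpy as np

def refract(d, n, eta):
    """Refract unit direction d across a surface with unit normal n.

    eta = n1 / n2, the ratio of refractive indices (e.g. air to water).
    Returns the refracted unit ray, or None on total internal reflection.
    """
    cos_i = -np.dot(n, d)
    sin2_t = eta ** 2 * (1.0 - cos_i ** 2)   # Snell: sin(t) = eta * sin(i)
    if sin2_t > 1.0:
        return None                          # total internal reflection
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n
```

Estimating eta online, as the paper does, amounts to treating this refraction bend as a function of one extra state variable in the VIO optimization.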

[CV-13] PAD-FT: A Lightweight Defense for Backdoor Attacks via Data Purification and Fine-Tuning

Link: https://arxiv.org/abs/2409.12072
Authors: Yukai Xu, Yujie Gu, Kouichi Sakurai
Keywords-EN: deep neural networks, increasingly subtle implantation, neural networks, subtle implantation, pose a significant
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Backdoor attacks pose a significant threat to deep neural networks, particularly as recent advancements have led to increasingly subtle implantation, making the defense more challenging. Existing defense mechanisms typically rely on an additional clean dataset as a standard reference and involve retraining an auxiliary model or fine-tuning the entire victim model. However, these approaches are often computationally expensive and not always feasible in practical applications. In this paper, we propose a novel and lightweight defense mechanism, termed PAD-FT, that does not require an additional clean dataset and fine-tunes only a very small part of the model to disinfect the victim model. To achieve this, our approach first introduces a simple data purification process to identify and select the most-likely clean data from the poisoned training dataset. The self-purified clean dataset is then used for activation clipping and fine-tuning only the last classification layer of the victim model. By integrating data purification, activation clipping, and classifier fine-tuning, our mechanism PAD-FT demonstrates superior effectiveness across multiple backdoor attack methods and datasets, as confirmed through extensive experimental evaluation.
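
Of the three components, activation clipping is the easiest to make concrete: bound each unit's activation by a high percentile estimated on the self-purified clean set, suppressing the abnormally large activations a backdoor trigger relies on. The percentile choice below is ours, not the paper's.

```python
import numpy as np

def clip_activations(acts, upper_percentile=99.0):
    """Clip each channel at a high percentile of (purified) clean activations.

    acts: (N, C) activations of one layer on the self-purified clean set.
    Returns the clipped activations and the per-channel bounds, which would
    then be baked into the network before fine-tuning the last layer.
    """
    bounds = np.percentile(acts, upper_percentile, axis=0)
    return np.minimum(acts, bounds), bounds
```

At inference, triggered inputs that drive a few channels far above the clean range get flattened to the clean bound, weakening the backdoor while barely touching benign behavior.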

[CV-14] SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal Consistency

链接: https://arxiv.org/abs/2409.12040
作者: Yiping Xie,Zitong Yu,Bingjie Wu,Weicheng Xie,Linlin Shen
关键词-EN: Remote Photoplethysmography, Domain Adaptation, physiological metrics measurement, Source-free Domain Adaptation, source domain data
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Remote Photoplethysmography (rPPG) is a non-contact method that uses facial video to predict changes in blood volume, enabling physiological metrics measurement. Traditional rPPG models often struggle with poor generalization capacity in unseen domains. Current solutions to this problem improve generalization in the target domain through Domain Generalization (DG) or Domain Adaptation (DA). However, both approaches require access to source domain data as well as target domain data, which is infeasible in scenarios with limited access to the source data and also raises privacy concerns around the source domain data. In this paper, we propose the first Source-free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which overcomes these limitations by enabling effective domain adaptation without access to source domain data. Our framework incorporates a Three-Branch Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency across domains. Furthermore, we propose a new rPPG distribution alignment loss based on the Frequency-domain Wasserstein Distance (FWD), which leverages optimal transport to align power spectrum distributions across domains effectively and further enforces the alignment of the three branches. Extensive cross-domain experiments and ablation studies demonstrate the effectiveness of our proposed method in source-free domain adaptation settings. Our findings highlight the significant contribution of the proposed FWD loss for distributional alignment, providing a valuable reference for future research and applications. The source code is available at this https URL
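The core of the FWD loss is that in 1-D, the Wasserstein distance between two normalized power spectra has a closed form: the L1 distance between their cumulative distributions. A minimal numpy sketch of that idea (our simplification, not the paper's loss) follows:

```python
import numpy as np

def power_spectrum(signal):
    """Normalized power spectrum of a 1-D signal (a probability distribution)."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    return spec / spec.sum()

def freq_wasserstein(sig_a, sig_b):
    """1-D Wasserstein distance between two power spectra, via the
    closed form: sum of |CDF_a - CDF_b| over frequency bins."""
    pa, pb = power_spectrum(sig_a), power_spectrum(sig_b)
    return np.abs(np.cumsum(pa) - np.cumsum(pb)).sum()

t = np.linspace(0, 10, 512, endpoint=False)
same = freq_wasserstein(np.sin(2 * np.pi * 1.2 * t), np.sin(2 * np.pi * 1.2 * t))
shifted = freq_wasserstein(np.sin(2 * np.pi * 1.2 * t), np.sin(2 * np.pi * 3.0 * t))
```

Identical signals yield zero distance, while a frequency shift (as between heart rates in different domains) yields a large, smoothly varying distance, which is what makes the loss useful for distribution alignment.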

[CV-15] Multi-Sensor Deep Learning for Glacier Mapping

链接: https://arxiv.org/abs/2409.12034
作者: Codruţ-Andrei Diaconu,Konrad Heidler,Jonathan L. Bamber,Harry Zekollari
关键词-EN: water resource management, influencing sea-level rise, deep learning, multi-sensor earth observation, deep learning multi-sensor
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: This article will be a chapter of the book Deep Learning for Multi-Sensor Earth Observation, to be published by Elsevier

点击查看摘要

Abstract:The more than 200,000 glaciers outside the ice sheets play a crucial role in our society by influencing sea-level rise, water resource management, natural hazards, biodiversity, and tourism. However, only a fraction of these glaciers benefit from consistent and detailed in-situ observations that allow for assessing their status and changes over time. This limitation can, in part, be overcome by relying on satellite-based Earth Observation techniques. Satellite-based glacier mapping applications have historically mainly relied on manual and semi-automatic detection methods, while recently, a fast and notable transition to deep learning techniques has started. This chapter reviews how combining multi-sensor remote sensing data and deep learning allows us to better delineate (i.e. map) glaciers and detect their temporal changes. We explain how relying on deep learning multi-sensor frameworks to map glaciers benefits from the extensive availability of regional and global glacier inventories. We also analyse the rationale behind glacier mapping, the benefits of deep learning methodologies, and the inherent challenges in integrating multi-sensor earth observation data with deep learning algorithms. While our review aims to provide a broad overview of glacier mapping efforts, we highlight a few setups where deep learning multi-sensor remote sensing applications have a considerable potential added value. This includes applications for debris-covered and rock glaciers that are visually difficult to distinguish from surroundings and for calving glaciers that are in contact with the ocean. These specific cases are illustrated through a series of visual imageries, highlighting some significant advantages and challenges when detecting glacier changes, including dealing with seasonal snow cover, changing debris coverage, and distinguishing glacier fronts from the surrounding sea ice. 

[CV-16] PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba

链接: https://arxiv.org/abs/2409.12031
作者: Chaoqi Luo,Yiping Xie,Zitong Yu
关键词-EN: Facial-video based Remote, based Remote photoplethysmography, showing significant potential, monitoring heart activity, Remote photoplethysmography
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by CCBR 2024

点击查看摘要

Abstract:Facial-video based Remote photoplethysmography (rPPG) aims at measuring physiological signals and monitoring heart activity without any contact, showing significant potential in various applications. Previous deep learning based rPPG measurement methods are primarily based on CNNs and Transformers. However, the limited receptive fields of CNNs restrict their ability to capture long-range spatio-temporal dependencies, while Transformers also struggle with modeling long video sequences with high complexity. Recently, the state space models (SSMs) represented by Mamba are known for their impressive performance on capturing long-range dependencies from long sequences. In this paper, we propose PhysMamba, a Mamba-based framework, to efficiently represent long-range physiological dependencies from facial videos. Specifically, we introduce the Temporal Difference Mamba block to first enhance local dynamic differences and further model the long-range spatio-temporal context. Moreover, a dual-stream SlowFast architecture is utilized to fuse the multi-scale temporal features. Extensive experiments are conducted on three benchmark datasets to demonstrate the superiority and efficiency of PhysMamba. The codes are available at this https URL
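The "local dynamic differences" the Temporal Difference block starts from are, at their simplest, frame-to-frame differences of the video tensor. A toy numpy sketch of that input transform (our stand-in, not the actual Mamba block):

```python
import numpy as np

def temporal_difference(frames):
    """Frame-to-frame differences over a (T, H, W) clip, highlighting
    local dynamics such as subtle blood-volume-driven brightness changes."""
    return frames[1:] - frames[:-1]

# Toy clip whose brightness rises by exactly 1 per frame
video = np.cumsum(np.ones((8, 4, 4)), axis=0)   # shape (T=8, H=4, W=4)
diffs = temporal_difference(video)              # shape (7, 4, 4)
```

On real facial video these differences amplify the tiny periodic color changes that carry the pulse signal, before the long-range SSM models the temporal context.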

[CV-17] On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery

链接: https://arxiv.org/abs/2409.12026
作者: BW Sheffield,Jeffrey Ellen,Ben Whitmore
关键词-EN: presents unique challenges, Side-scan sonar, imagery presents unique, Convolutional Neural Networks, presents unique
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or ripple sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by utilizing a self-attention mechanism to capture global information in image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures, such as ResNet and ConvNext, for binary classification tasks in SSS imagery. The dataset encompasses diverse geographical seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across f1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments like underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.
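The comparison above is scored with f1-score, precision, recall, and accuracy. For readers less familiar with these binary-classification metrics, here is a self-contained numpy computation (standard definitions, not code from the paper):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for binary labels in {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = np.mean(y_true == y_pred)
    return precision, recall, f1, accuracy

# Toy predictions for "man-made object present" on 8 sonar patches
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])
p, r, f1, acc = binary_metrics(y_true, y_pred)   # each 0.75 here
```

Precision penalizes the false positives that diverse seafloor textures tend to inflate, which is exactly why it is a key metric for this task.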

[CV-18] LEMON: Localized Editing with Mesh Optimization and Neural Shaders

链接: https://arxiv.org/abs/2409.12024
作者: Furkan Mert Algan,Umut Yazgan,Driton Salihu,Cem Eteke,Eckehard Steinbach
关键词-EN: practical use cases, time-consuming for users, mesh, faster than generating, challenging and time-consuming
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In practical use cases, polygonal mesh editing can be faster than generating new ones, but it can still be challenging and time-consuming for users. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the most important vertices in the mesh for editing, utilizing a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. By using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline using the DTU dataset, demonstrating that it generates finely-edited meshes more rapidly than the current state-of-the-art methods. We include our code and additional results in the supplementary material.

[CV-19] Computational Imaging for Long-Term Prediction of Solar Irradiance

链接: https://arxiv.org/abs/2409.12016
作者: Leron Julian,Haejoon Lee,Soummya Kar,Aswin C. Sankaranarayanan
关键词-EN: primary energy source, solar power generation, solar power, primary sources, power generation
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:The occlusion of the sun by clouds is one of the primary sources of uncertainties in solar power generation, and is a factor that affects the wide-spread use of solar power as a primary energy source. Real-time forecasting of cloud movement and, as a result, solar irradiance is necessary to schedule and allocate energy across grid-connected photovoltaic systems. Previous works monitored cloud movement using wide-angle field of view imagery of the sky. However, such images have poor resolution for clouds that appear near the horizon, which reduces their effectiveness for long term prediction of solar occlusion. Specifically, to be able to predict occlusion of the sun over long time periods, clouds that are near the horizon need to be detected, and their velocities estimated precisely. To enable such a system, we design and deploy a catadioptric system that delivers wide-angle imagery with uniform spatial resolution of the sky over its field of view. To enable prediction over a longer time horizon, we design an algorithm that uses carefully selected spatio-temporal slices of the imagery using estimated wind direction and velocity as inputs. Using ray-tracing simulations as well as a real testbed deployed outdoors, we show that the system is capable of predicting solar occlusion as well as irradiance for tens of minutes in the future, which is an order of magnitude improvement over prior work.

[CV-20] BRDF-NeRF: Neural Radiance Fields with Optical Satellite Images and BRDF Modelling

链接: https://arxiv.org/abs/2409.12014
作者: Lulin Zhang,Ewelina Rupnik,Tri Dung Nguyen,Stéphane Jacquemoud,Yann Klinger
关键词-EN: Understanding the anisotropic, complex Earth surfaces, numerous applications, complex Earth, crucial for numerous
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Understanding the anisotropic reflectance of complex Earth surfaces from satellite imagery is crucial for numerous applications. Neural radiance fields (NeRF) have become popular as a machine learning technique capable of deducing the bidirectional reflectance distribution function (BRDF) of a scene from multiple images. However, prior research has largely concentrated on applying NeRF to close-range imagery, estimating basic Microfacet BRDF models, which fall short for many Earth surfaces. Moreover, high-quality NeRFs generally require several images captured simultaneously, a rare occurrence in satellite imaging. To address these limitations, we propose BRDF-NeRF, developed to explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical BRDF model commonly employed in remote sensing. We assess our approach using two datasets: (1) Djibouti, captured in a single epoch at varying viewing angles with a fixed Sun position, and (2) Lanzhou, captured over multiple epochs with different viewing angles and Sun positions. Our results, based on only three to four satellite images for training, demonstrate that BRDF-NeRF can effectively synthesize novel views from directions far removed from the training data and produce high-quality digital surface models (DSMs).

[CV-21] Mixture of Prompt Learning for Vision Language Models

链接: https://arxiv.org/abs/2409.12011
作者: Yu Du,Tong Niu,Rong Zhao
关键词-EN: CLIP gain prominence, pre-trained vision-language models, powerful pre-trained vision-language, CLIP gain, combine VLMs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to combine VLMs for downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks, requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture of soft prompt learning method incorporating a routing module. This module captures a dataset’s varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text features and the hard-prompt-encoded text features. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating evident improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at https://anonymous.4open.science/r/mocoop-6387
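A routing module that "selects prompts based on their similarity to hard prompt templates" can be sketched as cosine similarities turned into softmax weights. This is an illustrative simplification under our own assumptions (the names `route` and the temperature value are hypothetical), not the paper's gating mechanism:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(instance_feat, hard_prompt_feats, temperature=0.1):
    """Routing weights over prompts: cosine similarity between an instance
    feature and each hard-prompt template feature, passed through softmax."""
    a = instance_feat / np.linalg.norm(instance_feat)
    b = hard_prompt_feats / np.linalg.norm(hard_prompt_feats, axis=1, keepdims=True)
    sims = b @ a                       # cosine similarity per prompt
    return softmax(sims / temperature)

# Three toy hard-prompt template features in a 2-D embedding space
prompts = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
weights = route(np.array([0.9, 0.1]), prompts)   # favors the first prompt
```

A lower temperature sharpens the routing toward the single most similar template; a higher one mixes prompts more evenly.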

[CV-22] ChefFusion: Multimodal Foundation Model Integrating Recipe and Food Image Generation

链接: https://arxiv.org/abs/2409.12010
作者: Peiyu Li,Xiaobao Huang,Yijun Tian,Nitesh V. Chawla
关键词-EN: studies typically focus, Significant work, food image generation, food, titles and ingredients
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Significant work has been conducted in the domain of food computing, yet these studies typically focus on single tasks such as t2t (instruction generation from food titles and ingredients), i2t (recipe generation from food images), or t2i (food image generation from recipes). None of these approaches integrate all modalities simultaneously. To address this gap, we introduce a novel food computing foundation model that achieves true multimodality, encompassing tasks such as t2t, t2i, i2t, it2t, and t2ti. By leveraging large language models (LLMs) and pre-trained image encoder and decoder models, our model can perform a diverse array of food computing-related tasks, including food understanding, food recognition, recipe generation, and food image generation. Compared to previous models, our foundation model demonstrates a significantly broader range of capabilities and exhibits superior performance, particularly in food image generation and recipe generation tasks. We have open-sourced ChefFusion on GitHub.

[CV-23] Panoptic-Depth Forecasting

链接: https://arxiv.org/abs/2409.12008
作者: Juana Valeria Hurtado,Riya Mohan,Abhinav Valada
关键词-EN: plan actions safely, actions safely, panoptic scene forecasting, essential for robots, robots to navigate
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames, from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at this https URL.

[CV-24] Towards Global Localization using Multi-Modal Object-Instance Re-Identification ICRA2025

链接: https://arxiv.org/abs/2409.12002
作者: Aneesh Chavan,Vaibhav Agrawal,Vineeth Bhat,Sarthak Chittawar,Siddharth Srivastava,Chetan Arora,K Madhava Krishna
关键词-EN: computer vision, predominantly studied, pedestrians and vehicles, critical challenge, challenge in computer
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 5 figures, 3 tables. Submitted to ICRA 2025

点击查看摘要

Abstract:Re-identification (ReID) is a critical challenge in computer vision, predominantly studied in the context of pedestrians and vehicles. However, robust object-instance ReID, which has significant implications for tasks such as autonomous exploration, long-term perception, and scene understanding, remains underexplored. In this work, we address this gap by proposing a novel dual-path object-instance re-identification transformer architecture that integrates multimodal RGB and depth information. By leveraging depth data, we demonstrate improvements in ReID across scenes that are cluttered or have varying illumination conditions. Additionally, we develop a ReID-based localization framework that enables accurate camera localization and pose identification across different viewpoints. We validate our methods using two custom-built RGB-D datasets, as well as multiple sequences from the open-source TUM RGB-D datasets. Our approach demonstrates significant improvements in both object instance ReID (mAP of 75.18) and localization accuracy (success rate of 83% on TUM-RGBD), highlighting the essential role of object ReID in advancing robotic perception. Our models, frameworks, and datasets have been made publicly available.

[CV-25] Intraoperative Registration by Cross-Modal Inverse Neural Rendering MICCAI2024

链接: https://arxiv.org/abs/2409.11983
作者: Maximilian Fehrentz,Mohammad Farid Azampour,Reuben Dorent,Hassan Rasheed,Colin Galvin,Alexandra Golby,William M. Wells,Sarah Frisken,Nassir Navab,Nazim Haouchine
关键词-EN: cross-modal inverse neural, neurosurgery via cross-modal, cross-modal inverse, Neural Radiance Field, inverse neural rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at MICCAI 2024

点击查看摘要

Abstract:We present in this paper a novel approach for 3D/2D intraoperative registration during neurosurgery via cross-modal inverse neural rendering. Our approach separates implicit neural representation into two components, handling anatomical structure preoperatively and appearance intraoperatively. This disentanglement is achieved by controlling a Neural Radiance Field’s appearance with a multi-style hypernetwork. Once trained, the implicit neural representation serves as a differentiable rendering engine, which can be used to estimate the surgical camera pose by minimizing the dissimilarity between its rendered images and the target intraoperative image. We tested our method on retrospective patients’ data from clinical cases, showing that our method outperforms state-of-the-art while meeting current clinical standards for registration. Code and additional resources can be found at this https URL.

[CV-26] MitoSeg: Mitochondria Segmentation Tool

链接: https://arxiv.org/abs/2409.11974
作者: Faris Serdar Taşel,Efe Çiftci
关键词-EN: Recent studies suggest, Recent studies, studies suggest, suggest a potential, potential link
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent studies suggest a potential link between the physical structure of mitochondria and neurodegenerative diseases. With advances in Electron Microscopy techniques, it has become possible to visualize the boundary and internal membrane structures of mitochondria in detail. It is crucial to automatically segment mitochondria from these images to investigate the relationship between mitochondria and diseases. In this paper, we present a software solution for mitochondrial segmentation, highlighting mitochondria boundaries in electron microscopy tomography images and generating corresponding 3D meshes.

[CV-27] Unveiling the Black Box: Independent Functional Module Evaluation for Bird’s-Eye-View Perception Model

链接: https://arxiv.org/abs/2409.11969
作者: Ludan Zhang,Xiaokang Ding,Yuqi Dai,Lei He,Keqiang Li
关键词-EN: autonomous driving perception, mainstream in autonomous, autonomous driving, Functional Module Evaluation, Independent Functional Module
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms results in diminished development efficacy and impedes the establishment of trust. Addressing this issue, we present the Independent Functional Module Evaluation for Bird’s-Eye-View Perception Model (BEV-IFME), a novel framework that juxtaposes the module’s feature maps against Ground Truth within a unified semantic Representation Space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework lies in the process of feature map encoding and representation aligning, facilitated by our proposed two-stage Alignment AutoEncoder, which ensures the preservation of salient information and the consistency of feature structure. The metric for evaluating the training maturity of functional modules, the Similarity Score, demonstrates a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework’s reliability for assessment purposes.
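At its simplest, a similarity score between a module's (encoded) feature map and ground truth in a shared space can be a cosine similarity over the flattened tensors. The following numpy sketch is a loose stand-in for the paper's Similarity Score, not its actual definition (which involves the two-stage Alignment AutoEncoder):

```python
import numpy as np

def similarity_score(feat_map, gt_map):
    """Cosine similarity between flattened feature maps, used here as a
    simplified proxy for a training-maturity score."""
    a, b = feat_map.ravel(), gt_map.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gt = np.ones((4, 4))
mature = similarity_score(gt + 0.01, gt)   # nearly identical map -> ~1.0
immature = similarity_score(np.eye(4), gt) # structurally different map -> lower
```

A module whose features closely track the aligned ground truth scores near 1, mirroring the reported positive correlation with downstream BEV metrics.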

[CV-28] A Chinese Continuous Sign Language Dataset Based on Complex Environments

链接: https://arxiv.org/abs/2409.11960
作者: Qidan Zhu,Jing Li,Fei Yuan,Jiaojiao Fan,Quan Gan
关键词-EN: television program recordings, continuous sign language, sign language recognition, sign language, chinese sign language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:The current bottleneck in continuous sign language recognition (CSLR) research lies in the fact that most publicly available datasets are limited to laboratory environments or television program recordings, resulting in a single background environment with uniform lighting, which significantly deviates from the diversity and complexity found in real-life scenarios. To address this challenge, we have constructed a new, large-scale dataset for Chinese continuous sign language (CSL) based on complex environments, termed the complex environment - chinese sign language dataset (CE-CSL). This dataset encompasses 5,988 continuous CSL video clips collected from daily life scenes, featuring more than 70 different complex backgrounds to ensure representativeness and generalization capability. To tackle the impact of complex backgrounds on CSLR performance, we propose a time-frequency network (TFNet) model for continuous sign language recognition. This model extracts frame-level features and then utilizes both temporal and spectral information to separately derive sequence features before fusion, aiming to achieve efficient and accurate CSLR. Experimental results demonstrate that our approach achieves significant performance improvements on the CE-CSL, validating its effectiveness under complex background conditions. Additionally, our proposed method has also yielded highly competitive results when applied to three publicly available CSL datasets.

[CV-29] Tracking Any Point with Frame-Event Fusion Network at High Frame Rate

链接: https://arxiv.org/abs/2409.11953
作者: Jiaxiong Liu,Bo Wang,Zhen Tan,Jinpu Zhang,Hui Shen,Dewen Hu
关键词-EN: leading to instability, real-world applications, instability in high-speed, limited generalization, generalization in real-world
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Tracking any point based on image frames is constrained by frame rates, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information from image frames with the high temporal resolution of events, achieving high frame rate and robust point tracking under various challenging conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to model the image generation process guided by events. This module can effectively integrate valuable information from both modalities operating at different frequencies. To achieve smoother point trajectories, we employed a transformer-based refinement strategy that updates the point’s trajectories and features iteratively. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, particularly improving expected feature age by 24 % on EDS datasets. Finally, we qualitatively validated the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device. Our source code will be released at this https URL.

[CV-30] GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations SIGGRAPH

链接: https://arxiv.org/abs/2409.11951
作者: Kartik Teotia,Hyeongwoo Kim,Pablo Garrido,Marc Habermann,Mohamed Elgharib,Christian Theobalt
关键词-EN: computer graphics applications, augmented reality, head, computer graphics, facial
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: ACM Transaction on Graphics (SIGGRAPH Asia 2024); Project page: this https URL

点击查看摘要

Abstract:Real-time rendering of human head avatars is a cornerstone of many computer graphics applications, such as augmented reality, video games, and films, to name a few. Recent approaches address this challenge with computationally efficient geometry primitives in a carefully calibrated multi-view setup. Albeit producing photorealistic head renderings, it often fails to represent complex motion changes such as the mouth interior and strongly varying head poses. We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real-time. At the core of our method is a hierarchical representation of head models that allows to capture the complex dynamics of facial expressions and head movements. First, with rich facial features extracted from raw input frames, we learn to deform the coarse facial geometry of the template mesh. We then initialize 3D Gaussians on the deformed surface and refine their positions in a fine step. We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework. This enables not only controllable facial animation via video inputs, but also high-fidelity novel view synthesis of challenging facial expressions, such as tongue deformations and fine-grained teeth structure under large motion changes. Moreover, it encourages the learned head avatar to generalize towards new facial expressions and head poses at inference time. We demonstrate the performance of our method with comparisons against the related methods on different datasets, spanning challenging facial expression sequences across multiple identities. We also show the potential application of our approach by demonstrating a cross-identity facial performance transfer application.

[CV-31] Differentiable Collision-Supervised Tooth Arrangement Network with a Decoupling Perspective

链接: https://arxiv.org/abs/2409.11937
作者: Zhihui He,Chengyuan Wang,Shidong Yang,Li Chen,Yanheng Zhou,Shuo Wang
关键词-EN: orthodontic planning process, digital orthodontic planning, Tooth arrangement, planning process, essential step
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 13 figures

点击查看摘要

Abstract:Tooth arrangement is an essential step in the digital orthodontic planning process. Existing learning-based methods use hidden teeth features to directly regress teeth motions, which couples target pose perception and motion regression. It could lead to poor perceptions of three-dimensional transformation. They also ignore the possible overlaps or gaps between teeth of predicted dentition, which is generally unacceptable. Therefore, we propose DTAN, a differentiable collision-supervised tooth arrangement network, decoupling predicting tasks and feature modeling. DTAN decouples the tooth arrangement task by first predicting the hidden features of the final teeth poses and then using them to assist in regressing the motions between the beginning and target teeth. To learn the hidden features better, DTAN also decouples the teeth-hidden features into geometric and positional features, which are further supervised by feature consistency constraints. Furthermore, we propose a novel differentiable collision loss function for point cloud data to constrain the related gestures between teeth, which can be easily extended to other 3D point cloud tasks. We propose an arch-width guided tooth arrangement network, named C-DTAN, to make the results controllable. We construct three different tooth arrangement datasets and achieve drastically improved performance on accuracy and speed compared with existing methods.
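A differentiable collision penalty for point clouds can be built from a hinge on pairwise distances: zero when two clouds keep a safety margin, positive and smooth when they interpenetrate. This is a simplified illustration under our own assumptions, not the loss defined in the paper:

```python
import numpy as np

def collision_loss(pts_a, pts_b, margin=1.0):
    """Mean hinge penalty max(margin - d, 0) over all inter-cloud point
    pairs; differentiable w.r.t. point positions away from the hinge."""
    diff = pts_a[:, None, :] - pts_b[None, :, :]   # (Na, Nb, 3)
    dist = np.sqrt((diff ** 2).sum(-1))            # pairwise distances
    return float(np.maximum(margin - dist, 0.0).mean())

# Two toy "teeth" as point clouds: well separated vs. overlapping
tooth_a = np.zeros((5, 3))
far = collision_loss(tooth_a, tooth_a + np.array([5.0, 0, 0]))    # no penalty
overlap = collision_loss(tooth_a, tooth_a + np.array([0.2, 0, 0]))  # penalized
```

Because the penalty grows smoothly as clouds approach, gradient-based optimization of teeth poses can trade off collision avoidance against the arrangement objectives.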

[CV-32] Agglomerative Token Clustering ECCV2024 ATC

链接: https://arxiv.org/abs/2409.11923
作者: Joakim Bruslund Haurum,Sergio Escalera,Graham W. Taylor,Thomas B. Moeslund
关键词-EN: Agglomerative Token Clustering, present Agglomerative Token, consistently outperforms previous, object detection segmentation, present Agglomerative
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024. Project webpage at this https URL

点击查看摘要

Abstract:We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
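The bottom-up hierarchical merging described above can be sketched in a few lines. This is a toy illustration only, not the paper's implementation (the function name `agglomerative_merge` and the distance choice are ours): repeatedly merge the two closest token clusters by Euclidean distance of their mean embeddings until only a target number remains, with no learnable parameters involved.

```python
# Toy bottom-up agglomerative token merging: greedily merge the closest
# cluster pair (by squared Euclidean distance of cluster means) until
# only `keep` clusters remain. No learnable parameters are introduced.

def agglomerative_merge(tokens, keep):
    """tokens: list of embedding vectors; merge until `keep` clusters remain."""
    # Each cluster tracks (mean embedding, member count).
    clusters = [(list(t), 1) for t in tokens]
    while len(clusters) > keep:
        best = None  # (squared distance, i, j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum((a - b) ** 2
                        for a, b in zip(clusters[i][0], clusters[j][0]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # Weighted mean keeps the cluster centroid exact after merging.
        merged = [(a * ni + b * nj) / (ni + nj) for a, b in zip(ci, cj)]
        del clusters[j]            # j > i, so delete j first
        clusters[i] = (merged, ni + nj)
    return [mean for mean, _ in clusters]

toks = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(len(agglomerative_merge(toks, 2)))  # 2
```

The O(n^3) pairwise search is fine for a sketch; a practical token merger would use a nearest-neighbor structure or operate on batched tensors.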

[CV-33] Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models

链接: https://arxiv.org/abs/2409.11920
作者: Lorenzo Mandelli,Stefano Berretti
关键词-EN: address the challenge, challenge of generating, human motion, human motion contained, generating realistic
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPTs models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state-of-the-art.

[CV-34] LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models ECCV2024

链接: https://arxiv.org/abs/2409.11919
作者: Amaia Cardiel,Eloi Zablocki,Oriane Siméoni,Elias Ramzi,Matthieu Cord
关键词-EN: Vision Language Models, Vision Language, shown impressive performances, shown impressive, zero-shot capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: EVAL-FoMo workshop, ECCV 2024

点击查看摘要

Abstract:Vision Language Models (VLMs) have shown impressive performances on numerous tasks but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet, fine-tuning VLMs comes with limitations as it requires 'white-box' access to the model's architecture and weights as well as expertise to design the fine-tuning objectives and optimize the hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach to adapt VLMs in a 'black-box' manner by leveraging large language models (LLMs) so as to reason on their outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, resulting in competitive results when compared with classic fine-tuning.

[CV-35] Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation

链接: https://arxiv.org/abs/2409.11904
作者: Dimitrios Christodoulou,Mads Kuhlmann-Jørgensen
关键词-EN: inherently requires subjective, requires subjective judgment, Efficiently evaluating, making it hard, inherently requires
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata’s technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.

[CV-36] ABHINAW: A method for Automatic Evaluation of Typography within AI-Generated Images

链接: https://arxiv.org/abs/2409.11874
作者: Abhinaw Jagtap,Nachiket Tapas,R. G. Brajesh
关键词-EN: Stable Diffusion, Diffusion have transformed, field of Generative, platforms like MidJourney, fast-evolving field
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:In the fast-evolving field of Generative AI, platforms like MidJourney, DALL-E, and Stable Diffusion have transformed Text-to-Image (T2I) Generation. However, despite their impressive ability to create high-quality images, they often struggle to generate accurate text within these images. Theoretically, if we could achieve accurate text generation in AI images in a "zero-shot" manner, it would not only make AI-generated images more meaningful but also democratize the graphic design industry. The first step towards this goal is to create a robust scoring matrix for evaluating text accuracy in AI-generated images. Although there are existing benchmarking methods like CLIP SCORE and T2I-CompBench++, there is still a gap in systematically evaluating text and typography in AI-generated images, especially with diffusion-based methods. In this paper, we introduce a novel evaluation matrix designed explicitly for quantifying the performance of text and typography generation within AI-generated images. We have used a letter-by-letter matching strategy to compute exact matching scores between the reference text and the AI-generated text. Our approach to calculating the score handles multiple redundancies such as repetition of words, case sensitivity, mixing of words, and irregular incorporation of letters. Moreover, we have developed a novel method, named brevity adjustment, to handle excess text. In addition, we have also done a quantitative analysis of frequent errors arising from frequently used and less frequently used words. Project page is available at: this https URL.
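As a rough illustration of the letter-by-letter matching idea only (the paper's actual scoring matrix, redundancy handling, and brevity adjustment are more involved; `letter_match_score` is our own toy name): compare reference and generated strings position by position, case-insensitively, and normalise by the reference length.

```python
# Toy letter-by-letter matching score: the fraction of positions at which
# the generated text agrees with the reference, ignoring letter case.

def letter_match_score(reference: str, generated: str) -> float:
    ref = reference.lower()
    gen = generated.lower()
    hits = sum(1 for r, g in zip(ref, gen) if r == g)
    return hits / len(ref) if ref else 1.0

print(letter_match_score("Hello", "HELLO"))  # 1.0 (case-insensitive match)
print(letter_match_score("Hello", "Helso"))  # 0.8 (4 of 5 letters agree)
```

A real evaluator would additionally penalise excess text (the brevity adjustment mentioned above) and handle reordered or repeated words, which this positional comparison cannot.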

[CV-37] SpheriGait: Enriching Spatial Representation via Spherical Projection for LiDAR-based Gait Recognition

链接: https://arxiv.org/abs/2409.11869
作者: Yanxi Wang,Zhigang Chang,Chen Wu,Zihao Cheng,Hongmin Gao
关键词-EN: Lidar-based gait recognition, rapidly progressing technique, Gait recognition, identification of individuals, gait recognition methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Gait recognition is a rapidly progressing technique for the remote identification of individuals. Prior research predominantly employing 2D sensors to gather gait data has achieved notable advancements; nonetheless, it has unavoidably neglected the influence of 3D dynamic characteristics on recognition. Gait recognition utilizing LiDAR 3D point clouds not only directly captures 3D spatial features but also diminishes the impact of lighting conditions while ensuring privacy protection. The essence of the problem lies in how to effectively extract discriminative 3D dynamic representations from point clouds. In this paper, we propose a method named SpheriGait for extracting and enhancing dynamic features from point clouds for LiDAR-based gait recognition. Specifically, it substitutes the conventional point cloud plane projection method with spherical projection to augment the perception of dynamic features. Additionally, a network block named DAM-L is proposed to extract gait cues from the projected point cloud data. We conducted extensive experiments, and the results demonstrated that SpheriGait achieved state-of-the-art performance on the SUSTech1K dataset and verified that the spherical projection method can serve as a universal data preprocessing technique to enhance the performance of other LiDAR-based gait recognition methods, exhibiting exceptional flexibility and practicality.
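The spherical projection at the core of such a pipeline can be illustrated with a minimal sketch (grid rasterization and the DAM-L block are omitted, and `spherical_project` is a hypothetical helper name, not the paper's code): each LiDAR point is mapped to azimuth and elevation angles, with its range as the associated value.

```python
# Minimal spherical projection of a LiDAR point cloud: map each 3D point
# (x, y, z) to (azimuth, elevation, range). A range image would then be
# built by binning these angles onto a 2D grid.
import math

def spherical_project(points):
    """points: iterable of (x, y, z) -> list of (azimuth, elevation, range)."""
    out = []
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        azimuth = math.atan2(y, x)                   # angle in the horizontal plane
        elevation = math.asin(z / r) if r else 0.0   # angle above the plane
        out.append((azimuth, elevation, r))
    return out

proj = spherical_project([(1.0, 0.0, 0.0), (0.0, 0.0, 2.0)])
print(proj[0])  # (0.0, 0.0, 1.0): a point straight ahead at range 1
```

Compared with a plane projection, the angular coordinates preserve the sensor-centric geometry of the scan, which is the intuition behind using spherical projection as preprocessing.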

[CV-38] Distillation-free Scaling of Large SSMs for Images and Videos

链接: https://arxiv.org/abs/2409.11867
作者: Hamid Suleman,Syed Talal Wasim,Muzammal Naseer,Juergen Gall
关键词-EN: integrating state-space techniques, context modeling method, integrating state-space, context modeling, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to +1.7.

[CV-39] Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments IROS2024

链接: https://arxiv.org/abs/2409.11854
作者: Lei Cheng,Junpeng Hu,Haodong Yan,Mariia Gladkova,Tianyu Huang,Yun-Hui Liu,Daniel Cremers,Haoang Li
关键词-EN: Photometric bundle adjustment, Lambertian world, assuming a Lambertian, bundle adjustment, geometry by assuming
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

点击查看摘要

Abstract:Photometric bundle adjustment (PBA) is widely used in estimating the camera pose and 3D geometry by assuming a Lambertian world. However, the assumption of photometric consistency is often violated since the non-diffuse reflection is common in real-world environments. The photometric inconsistency significantly affects the reliability of existing PBA methods. To solve this problem, we propose a novel physically-based PBA method. Specifically, we introduce the physically-based weights regarding material, illumination, and light path. These weights distinguish the pixel pairs with different levels of photometric inconsistency. We also design corresponding models for material estimation based on sequential images and illumination estimation based on point clouds. In addition, we establish the first SLAM-related dataset of non-Lambertian scenes with complete ground truth of illumination and material. Extensive experiments demonstrated that our PBA method outperforms existing approaches in accuracy.

[CV-40] RaggeDi: Diffusion-based State Estimation of Disordered Rags, Sheets, Towels and Blankets

链接: https://arxiv.org/abs/2409.11831
作者: Jikai Ye,Wanze Li,Shiraz Khan,Gregory S. Chirikjian
关键词-EN: Cloth state estimation, Cloth state, estimating cloth state, Cloth, cloth state accurately
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cloth state estimation is an important problem in robotics. It is essential for the robot to know the accurate state to manipulate cloth and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, estimating cloth state accurately remains challenging due to its high flexibility and self-occlusion. This paper proposes a diffusion model-based pipeline that formulates the cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a pre-defined flattened mesh and the deformed mesh in a canonical space. Then we train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed.

[CV-41] End-to-End Probabilistic Geometry-Guided Regression for 6DoF Object Pose Estimation

链接: https://arxiv.org/abs/2409.11819
作者: Thomas Pöllabauer,Jiayin Li,Volker Knauthe,Sarah Berkei,Arjan Kuijper
关键词-EN: chosen coordinate system, object pose estimation, object pose, object, pose
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:6D object pose estimation is the problem of identifying the position and orientation of an object relative to a chosen coordinate system, which is a core technology for modern XR applications. State-of-the-art 6D object pose estimators directly predict an object pose given an object observation. Due to the ill-posed nature of the pose estimation problem, where multiple different poses can correspond to a single observation, generating additional plausible estimates per observation can be valuable. To address this, we reformulate the state-of-the-art algorithm GDRNPP and introduce EPRO-GDR (End-to-End Probabilistic Geometry-Guided Regression). Instead of predicting a single pose per detection, we estimate a probability density distribution of the pose. Using the evaluation procedure defined by the BOP (Benchmark for 6D Object Pose Estimation) Challenge, we test our approach on four of its core datasets and demonstrate superior quantitative results for EPRO-GDR on LM-O, YCB-V, and ITODD. Our probabilistic solution shows that predicting a pose distribution instead of a single pose can improve state-of-the-art single-view pose estimation while providing the additional benefit of being able to sample multiple meaningful pose candidates.

[CV-42] EFCM: Efficient Fine-tuning on Compressed Models for deployment of large models in medical image analysis

链接: https://arxiv.org/abs/2409.11817
作者: Shaojie Li,Zhaoshuo Diao
关键词-EN: medicine shows remarkable, shows remarkable performance, deep learning large, learning large models, recent development
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The recent development of deep learning large models in medicine shows remarkable performance in medical image analysis and diagnosis, but their large number of parameters causes memory and inference latency challenges. Knowledge distillation offers a solution, but the slide-level gradients cannot be backpropagated for student model updates due to high-resolution pathological images and slide-level labels. This study presents an Efficient Fine-tuning on Compressed Models (EFCM) framework with two stages: unsupervised feature distillation and fine-tuning. In the distillation stage, Feature Projection Distillation (FPD) is proposed with a TransScan module for adaptive receptive field adjustment to enhance the knowledge absorption capability of the student model. In the slide-level fine-tuning stage, three strategies (Reuse CLAM, Retrain CLAM, and End2end Train CLAM (ETC)) are compared. Experiments are conducted on 11 downstream datasets related to three large medical models: RETFound for retina, MRM for chest X-ray, and BROW for histopathology. The experimental results demonstrate that the EFCM framework significantly improves accuracy and efficiency in handling slide-level pathological image problems, effectively addressing the challenges of deploying large medical models. Specifically, it achieves a 4.33% increase in ACC and a 5.2% increase in AUC compared to the large model BROW on the TCGA-NSCLC and TCGA-BRCA datasets. The analysis of model inference efficiency highlights the high efficiency of the distillation fine-tuning method.

[CV-43] SymFace: Additional Facial Symmetry Loss for Deep Face Recognition WACV2025

链接: https://arxiv.org/abs/2409.11816
作者: Pritesh Prakash,Koteswar Rao Jerripothula,Ashish Jacob Sam,Prinsh Kumar Singh,S Umamaheswaran
关键词-EN: machine learning methods, recognition algorithms leveraging, algorithms leveraging advanced, leveraging advanced machine, advanced machine learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 Pages, 6 Figures, 5 Tables, Submitted for WACV 2025

点击查看摘要

Abstract:Over the past decade, there has been a steady advancement in enhancing face recognition algorithms leveraging advanced machine learning methods. The loss function plays a pivotal, game-changing role in addressing face verification problems. These loss functions have mainly explored variations among intra-class or inter-class separation. This research examines the natural phenomenon of facial symmetry in the face verification problem. The symmetry between the left and right hemi-faces has been widely used in many research areas in recent decades. This paper adopts this simple approach judiciously by splitting the face image vertically into two halves. With the assumption that the natural phenomenon of facial symmetry can enhance face verification methodology, we hypothesize that the two output embedding vectors of split faces must project close to each other in the output embedding space. Inspired by this concept, we penalize the network based on the disparity of embeddings of the symmetrical pair of split faces. Symmetrical loss has the potential to minimize minor asymmetric features due to facial expressions and lighting conditions, hence significantly increasing the inter-class variance among the classes and leading to more reliable face embeddings. This loss function propels any network to outperform its baseline performance across all existing network architectures and configurations, enabling us to achieve SoTA results.
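A hedged, pure-Python sketch of the symmetry idea (not the paper's network or exact loss; the mirroring step and the helper names are our own simplification): split a toy "image" vertically, mirror one half so the two align, and penalize the squared distance between the embeddings of the two halves.

```python
# Sketch of a facial-symmetry penalty: if the left and right half-faces are
# passed through the same encoder, their embeddings should lie close together,
# so their mean squared distance can serve as an auxiliary loss term.

def symmetry_loss(left_emb, right_emb):
    return sum((a - b) ** 2 for a, b in zip(left_emb, right_emb)) / len(left_emb)

def split_vertically(image_rows):
    """image_rows: list of pixel rows; returns (left half, mirrored right half)."""
    w = len(image_rows[0])
    left = [row[: w // 2] for row in image_rows]
    right = [row[w // 2:][::-1] for row in image_rows]  # mirror for alignment
    return left, right

img = [[1, 2, 2, 1], [3, 4, 4, 3]]            # perfectly symmetric toy "image"
left, right = split_vertically(img)
print(left == right)                          # True: halves align after mirroring
print(symmetry_loss([0.1, 0.2], [0.1, 0.2]))  # 0.0 for identical embeddings
```

In training, such a penalty would be added to the main verification loss, nudging the network to down-weight minor asymmetries like uneven lighting.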

[CV-44] EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning

链接: https://arxiv.org/abs/2409.11813
作者: Yukun Tian,Hao Chen,Yongjian Deng,Feihong Shen,Kepan Liu,Wei You,Ziyang Zhang
关键词-EN: high dynamic range, low time latency, demonstrated significant success, wide range, dynamic range
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The event camera has demonstrated significant success across a wide range of areas due to its low time latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in over-fitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work aims to address this gap by introducing a systematic augmentation scheme named EventAug to enrich spatial-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. Our EventAug can facilitate models learning with richer motion patterns, object variants and local spatio-temporal relations, thus improving model robustness to varied moving speeds, occlusions, and action disruptions. Experiment results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be publicly available for this community.

[CV-45] Latent fingerprint enhancement for accurate minutiae detection

链接: https://arxiv.org/abs/2409.11802
作者: Abdul Wahab,Tariq Mahmood Khan,Shahzaib Iqbal,Bandar AlShammari,Bandar Alhaqbani,Imran Razzak
关键词-EN: latent fingerprints, commonly referred, suspects based, based on partial, partial and smudged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Identification of suspects based on partial and smudged fingerprints, commonly referred to as fingermarks or latent fingerprints, presents a significant challenge in the field of fingerprint recognition. Although fixed-length embeddings have shown effectiveness in recognising rolled and slap fingerprints, the methods for matching latent fingerprints have primarily centred around local minutiae-based embeddings, failing to fully exploit global representations for matching purposes. Consequently, enhancing latent fingerprints becomes critical to ensuring robust identification for forensic investigations. Current approaches often prioritise restoring ridge patterns, overlooking the fine details crucial for accurate fingerprint recognition. To address this, we propose a novel approach that uses generative adversarial networks (GANs) to redefine Latent Fingerprint Enhancement (LFE) through a structured approach to fingerprint generation. By directly optimising the minutiae information during the generation process, the model produces enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth instances. This leads to a significant improvement in identification performance. Our framework integrates minutiae locations and orientation fields, ensuring the preservation of both local and structural fingerprint features. Extensive evaluations conducted on two publicly available datasets demonstrate our method's dominance over existing state-of-the-art techniques, highlighting its potential to significantly enhance latent fingerprint recognition accuracy in forensic applications.

[CV-46] Efficient Low-Resolution Face Recognition via Bridge Distillation

链接: https://arxiv.org/abs/2409.11786
作者: Shiming Ge,Shengwei Zhao,Chenyu Li,Yu Zhang,Jia Li
关键词-EN: fast inference speed, fast inference, faces, private high-resolution faces, Face recognition
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: This paper is published in IEEE TIP 2020

点击查看摘要

Abstract:Face recognition in the wild is now advancing towards light-weight models, fast inference speed and resolution-adapted capability. In this paper, we propose a bridge distillation approach to turn a complex face model pretrained on private high-resolution faces into a light-weight one for low-resolution face recognition. In our approach, such a cross-dataset resolution-adapted knowledge transfer problem is solved via two-step distillation. In the first step, we conduct cross-dataset distillation to transfer the prior knowledge from private high-resolution faces to public high-resolution faces and generate compact and discriminative features. In the second step, the resolution-adapted distillation is conducted to further transfer the prior knowledge to synthetic low-resolution faces via multi-task learning. By learning low-resolution face representations and mimicking the adapted high-resolution knowledge, a light-weight student model can be constructed with high efficiency and promising accuracy in recognizing low-resolution faces. Experimental results show that the student model performs impressively in recognizing low-resolution faces with only 0.21M parameters and 0.057MB memory. Meanwhile, its speed reaches up to 14,705, ~934 and 763 faces per second on GPU, CPU and mobile phone, respectively.

[CV-47] Distilling Channels for Efficient Deep Tracking

链接: https://arxiv.org/abs/2409.11785
作者: Shiming Ge,Zhao Luo,Chunhui Zhang,Yingying Hua,Dacheng Tao
关键词-EN: proven success, success in visual, Deep, Deep trackers, deep networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Published by IEEE TIP 2020

点击查看摘要

Abstract:Deep trackers have proven success in visual tracking. Typically, these trackers employ optimally pre-trained deep networks to represent all diverse objects with multi-channel features from some fixed layers. The deep networks employed are usually trained to extract rich knowledge from massive data used in object classification and so they are capable to represent generic objects very well. However, these networks are too complex to represent a specific moving object, leading to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework termed channel distillation to facilitate deep trackers. To validate the effectiveness of channel distillation, we take discriminative correlation filter (DCF) and ECO for example. We demonstrate that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem to adaptively select informative feature channels that improve the efficacy of tracking moving objects on the fly. Channel distillation can accurately extract good channels, alleviating the influence of noisy channels and generally reducing the number of channels, as well as adaptively generalizing to different channels and networks. The resulting deep tracker is accurate, fast, and has low memory requirements. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.
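As a toy illustration of the channel-selection intuition only (the paper formulates a unified energy minimization, not this heuristic, and `select_channels` is our own name): rank feature channels by their response energy and keep the top-k, discarding noisy low-energy channels.

```python
# Heuristic stand-in for channel distillation: score each feature channel by
# its response energy (sum of squared activations) and keep the k strongest,
# so downstream tracking operates on fewer, more informative channels.

def select_channels(feature_maps, k):
    """feature_maps: dict channel_id -> list of responses; keep top-k by energy."""
    energy = {c: sum(v * v for v in m) for c, m in feature_maps.items()}
    return sorted(energy, key=energy.get, reverse=True)[:k]

maps = {"c0": [0.0, 0.1], "c1": [2.0, 2.0], "c2": [1.0, 0.5]}
print(select_channels(maps, 2))  # ['c1', 'c2']: the weak channel c0 is dropped
```

The actual method selects channels adaptively per target and update step; a static energy ranking like this only conveys why pruning channels cuts computation and memory while retaining discriminative power.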

[CV-48] Knowledge Adaptation Network for Few-Shot Class-Incremental Learning

链接: https://arxiv.org/abs/2409.11770
作者: Ye Wang,Yaxiong Wang,Guoshuai Zhao,Xueming Qian
关键词-EN: Few-shot class-incremental learning, Few-shot class-incremental, aims to incrementally, incrementally recognize, samples while maintaining
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 13 pages;6 figures

点击查看摘要

Abstract:Few-shot class-incremental learning (FSCIL) aims to incrementally recognize new classes using a few samples while maintaining the performance on previously learned classes. One of the effective methods to solve this challenge is to construct prototypical evolution classifiers. Despite the advancement achieved by most existing methods, the classifier weights are simply initialized using mean features. Because representations for new classes are weak and biased, we argue such a strategy is suboptimal. In this paper, we tackle this issue from two aspects. Firstly, thanks to the development of foundation models, we employ a foundation model, the CLIP, as the network pedestal to provide a general representation for each class. Secondly, to generate a more reliable and comprehensive instance representation, we propose a Knowledge Adapter (KA) module that summarizes the data-specific knowledge from training data and fuses it into the general representation. Additionally, to tune the knowledge learned from the base classes to the upcoming classes, we propose a mechanism of Incremental Pseudo Episode Learning (IPEL) by simulating the actual FSCIL. Taken together, our proposed method, dubbed as Knowledge Adaptation Network (KANet), achieves competitive performance on a wide range of datasets, including CIFAR100, CUB200, and ImageNet-R.

[CV-49] Neural Encoding for Image Recall: Human-Like Memory

链接: https://arxiv.org/abs/2409.11750
作者: Virgile Foussereau,Robin Dumas
关键词-EN: Achieving human-like memory, Achieving human-like, computer vision, remains a challenging, challenging frontier
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 7 figures

点击查看摘要

Abstract:Achieving human-like memory recall in artificial systems remains a challenging frontier in computer vision. Humans demonstrate remarkable ability to recall images after a single exposure, even after being shown thousands of images. However, this capacity diminishes significantly when confronted with non-natural stimuli such as random textures. In this paper, we present a method inspired by human memory processes to bridge this gap between artificial and biological memory systems. Our approach focuses on encoding images to mimic the high-level information retained by the human brain, rather than storing raw pixel data. By adding noise to images before encoding, we introduce variability akin to the non-deterministic nature of human memory encoding. Leveraging pre-trained models’ embedding layers, we explore how different architectures encode images and their impact on memory recall. Our method achieves impressive results, with 97% accuracy on natural images and near-random performance (52%) on textures. We provide insights into the encoding process and its implications for machine learning memory systems, shedding light on the parallels between human and artificial intelligence memory mechanisms.
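The encode-with-noise-then-recall loop can be sketched in a few lines. Everything here is a hypothetical stand-in: the "encoder" is a noisy identity map rather than a pretrained embedding layer, and the names are ours, not the paper's.

```python
# Sketch of noisy encoding and recall: add noise before encoding (mimicking
# non-deterministic human memory encoding), store embeddings, then answer a
# recall query by nearest cosine similarity among the stored memories.
import random

random.seed(0)

def encode(image, noise=0.05):
    # Stand-in "embedding layer": a noisy copy of the input features.
    return [v + random.uniform(-noise, noise) for v in image]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

memory = [encode(img) for img in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]
query = encode([1.0, 0.0, 0.0])              # a second, independently noisy exposure
scores = [cosine(query, m) for m in memory]
print(scores.index(max(scores)))             # 0: the first image is recalled
```

With a real pretrained encoder, distinctive natural images survive this noise easily while near-random textures do not, which is the gap the accuracy numbers above quantify.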

[CV-50] RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework

链接: https://arxiv.org/abs/2409.11749
作者: Xiaoyu Li,Peidong Li,Lijun Zhao,Dedong Liu,Jinghan Gao,Xian Wu,Yitao Wu,Dixiao Cui
关键词-EN: cost-effective multi-camera setups, obtains significant performance, significant performance improvements, obtains significant, rapid advancements
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: RockTrack establishes a new state-of-the-art with 59.1% AMOTA on the nuScenes vision-only test leaderboard with ResNet50-level backbone

点击查看摘要

Abstract:3D Multi-Object Tracking (MOT) obtains significant performance improvements with the rapid advancements in 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique features of multi-camera detectors, i.e., the unreliability of motion observations and the feasibility of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the Tracking-By-Detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from distinct representation spaces from a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric for explicitly characterizing object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.

[CV-51] Exploring Gaze Pattern in Autistic Children: Clustering Visualization and Prediction

链接: https://arxiv.org/abs/2409.11744
作者: Weiyan Shi,Haihong Zhang,Jin Yang,Ruiqing Ding,YongWei Zhu,Kenny Tsu Wei Choo
关键词-EN: Autism Spectrum Disorder, Autism Spectrum, Spectrum Disorder, ASD, gaze
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

Abstract:Autism Spectrum Disorder (ASD) significantly affects the social and communication abilities of children, and eye-tracking is commonly used as a diagnostic tool by identifying associated atypical gaze patterns. Traditional methods demand manual identification of Areas of Interest in gaze patterns, lowering the performance of gaze behavior analysis in ASD subjects. To tackle this limitation, we propose a novel method to automatically analyze gaze behaviors in ASD children with superior accuracy. To be specific, we first apply and optimize seven clustering algorithms to automatically group gaze points to compare ASD subjects with typically developing peers. Subsequently, we extract 63 significant features to fully describe the patterns. These features can describe correlations between ASD diagnosis and gaze patterns. Lastly, using these features as prior knowledge, we train multiple predictive machine learning models to predict and diagnose ASD based on their gaze behaviors. To evaluate our method, we apply our method to three ASD datasets. The experimental and visualization results demonstrate the improvements of clustering algorithms in the analysis of unique gaze patterns in ASD children. Additionally, these predictive machine learning models achieved state-of-the-art prediction performance ( 81% AUC) in the field of automatically constructed gaze point features for ASD diagnosis. Our code is available at this https URL.
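
To make the first two stages of the pipeline concrete (clustering gaze points, then extracting per-cluster features), here is a toy sketch. The plain k-means implementation and the two example features are illustrative assumptions; the paper itself compares seven clustering algorithms and extracts 63 features:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means over 2D gaze coordinates."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each gaze point to its nearest center
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def cluster_features(points, labels, k):
    """Two toy features per cluster: fraction of gaze points and spread."""
    feats = []
    for j in range(k):
        pts = points[labels == j]
        frac = len(pts) / len(points)
        spread = float(pts.std(axis=0).mean()) if len(pts) else 0.0
        feats.extend([frac, spread])
    return np.array(feats)

rng = np.random.default_rng(1)
# two synthetic "fixation" clusters standing in for real gaze data
gaze = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
                  rng.normal([5.0, 5.0], 0.5, size=(50, 2))])
labels, centers = kmeans(gaze, k=2)
feats = cluster_features(gaze, labels, k=2)
```

Such per-cluster summaries would then feed a downstream classifier, as the abstract describes for the ASD prediction stage.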

[CV-52] InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models

链接: https://arxiv.org/abs/2409.11734
作者: Yan Zheng,Lemeng Wu
关键词-EN: customized user requirements, short for GEO, exceptionally versatile image, global scales, exceptionally versatile
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 6 figures

Abstract:In this paper, we introduce Geometry-Inverse-Meet-Pixel-Insert, short for GEO, an exceptionally versatile image editing technique designed to cater to customized user requirements at both local and global scales. Our approach seamlessly integrates text prompts and image prompts to yield diverse and precise editing outcomes. Notably, our method operates without the need for training and is driven by two key contributions: (i) a novel geometric accumulation loss that enhances DDIM inversion to faithfully preserve pixel space geometry and layout, and (ii) an innovative boosted image prompt technique that combines pixel-level editing for text-only inversion with latent space geometry guidance for standard classifier-free reversion. Leveraging the publicly available Stable Diffusion model, our approach undergoes extensive evaluation across various image types and challenging prompt editing scenarios, consistently delivering high-fidelity editing results for real images.

[CV-53] DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information

链接: https://arxiv.org/abs/2409.11729
作者: Shota Nakada,Taichi Nishimura,Hokuto Munakata,Masayoshi Kondo,Tatsuya Komatsu
关键词-EN: Current audio-visual representation, recognize fine-grained details, rough object categories, audio-visual representation learning, capture rough object
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: under review

Abstract:Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it lacks the ability to recognize fine-grained details, such as specific categories like "dogs" and "flutes" within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method of audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.

[CV-54] Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression ECCV2024

链接: https://arxiv.org/abs/2409.11718
作者: Yuan Tian,Guo Lu,Guangtao Zhai
关键词-EN: recently garnered attention, Unsupervised video semantic, Unsupervised video, garnered attention, support various analysis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV2024

Abstract:Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited, due to the single semantic learning objective, limited training data, etc. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from VFMs. Specifically, we introduce a VFMs-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually-enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on the historical content, and then traverses along the trajectory to predict the future semantics as the coding context. This reduces the overall bitcost of the system, further improving the compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.

[CV-55] RopeBEV: A Multi-Camera Roadside Perception Network in Birds-Eye-View

链接: https://arxiv.org/abs/2409.11706
作者: Jinrang Jia,Guangqi Yi,Yifeng Shi
关键词-EN: gained wide application, multi-camera BEV, multi-camera BEV perception, multi-camera BEV solution, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

Abstract:Multi-camera perception methods in Bird’s-Eye-View (BEV) have gained wide application in autonomous driving. However, due to the differences between roadside and vehicle-side scenarios, a multi-camera BEV solution for the roadside is still lacking. This paper systematically analyzes the key challenges in multi-camera BEV perception for roadside scenarios compared to vehicle-side. These challenges include the diversity in camera poses, the uncertainty in camera numbers, the sparsity in perception regions, and the ambiguity in orientation angles. In response, we introduce RopeBEV, the first dense multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the training balance issues caused by diverse camera poses. By incorporating CamMask and ROIMask (Region of Interest Mask), it supports variable camera numbers and sparse perception, respectively. Finally, camera rotation embedding is utilized to resolve orientation ambiguity. Our method ranks 1st on the real-world highway dataset RoScenes and demonstrates its practical value on a private urban dataset that covers more than 50 intersections and 600 cameras.

[CV-56] Discovering Conceptual Knowledge with Analytic Ontology Templates for Articulated Objects

链接: https://arxiv.org/abs/2409.11702
作者: Jianhua Sun,Yuxuan Li,Longfei Xu,Jiude Wei,Liang Chai,Cewu Lu
关键词-EN: leverage fundamental conceptual, Human cognition, articulated objects, fundamental conceptual knowledge, appropriately perceive
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

Abstract:Human cognition can leverage fundamental conceptual knowledge, like geometric and kinematic ones, to appropriately perceive, comprehend and interact with novel objects. Motivated by this finding, we aim to endow machine intelligence with an analogous capability through performing at the conceptual level, in order to understand and then interact with articulated objects, especially for those in novel categories, which is challenging due to the intricate geometric structures and diverse joint types of articulated objects. To achieve this goal, we propose Analytic Ontology Template (AOT), a parameterized and differentiable program description of generalized conceptual ontologies. A baseline approach called AOTNet driven by AOTs is designed accordingly to equip intelligent agents with these generalized concepts, and then empower the agents to effectively discover the conceptual knowledge on the structure and affordance of articulated objects. The AOT-driven approach yields benefits in three key perspectives: i) enabling concept-level understanding of articulated objects without relying on any real training data, ii) providing analytic structure information, and iii) introducing rich affordance information indicating proper ways of interaction. We conduct exhaustive experiments and the results demonstrate the superiority of our approach in understanding and then interacting with articulated objects.

[CV-57] ORB-SfMLearner: ORB-Guided Self-supervised Visual Odometry with Selective Online Adaptation

链接: https://arxiv.org/abs/2409.11692
作者: Yanlin Jin,Rui-Yang Ju,Haojun Liu,Yuzhong Zhong
关键词-EN: Deep visual odometry, visual odometry, extensive research, broader application, guided visual odometry
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

Abstract:Deep visual odometry, despite extensive research, still faces limitations in accuracy and generalizability that prevent its broader application. To address these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided visual odometry with selective online adaptation named ORB-SfMLearner. We present a novel use of ORB features for learning-based ego-motion estimation, leading to more robust and accurate results. We also introduce the cross-attention mechanism to enhance the explainability of PoseNet and have revealed that driving direction of the vehicle can be explained through attention weights, marking a novel exploration in this area. To improve generalizability, our selective online adaptation allows the network to rapidly and selectively adjust to the optimal parameters across different domains. Experimental results on KITTI and vKITTI datasets show that our method outperforms previous state-of-the-art deep visual odometry methods in terms of ego-motion accuracy and generalizability.

[CV-58] GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation

链接: https://arxiv.org/abs/2409.11689
作者: Shuowen Liang,Sisi Li,Qingyun Wang,Cen Zhang,Kaiquan Zhu,Tian Yang
关键词-EN: pose-controllable image generation, important reference, reference in pose-controllable, Pose skeleton, Pose skeleton images
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

Abstract:Pose skeleton images are an important reference in pose-controllable image generation. In order to enrich the source of skeleton images, recent works have investigated the generation of pose skeletons based on natural language. These methods are based on GANs. However, it remains challenging to perform diverse, structurally correct and aesthetically pleasing human pose skeleton generation with various textual inputs. To address this problem, we propose a framework with GUNet as the main model, PoseDiffusion. It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates several desired properties that outperform existing methods. 1) Correct Skeletons. GUNet, a denoising model of PoseDiffusion, is designed to incorporate graphical convolutional neural networks. It is able to learn the spatial relationships of the human skeleton by introducing skeletal information during the training process. 2) Diversity. We decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.

[CV-59] SLAM assisted 3D tracking system for laparoscopic surgery

链接: https://arxiv.org/abs/2409.11688
作者: Jingwei Song,Ray Zhang,Wenwei Zhang,Hao Zhou,Maani Ghaffari
关键词-EN: minimally invasive surgery, internal anatomical structures, feedback and transparency, major limitation, limitation of minimally
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Demo: this https URL

Abstract:A major limitation of minimally invasive surgery is the difficulty in accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the monocular SLAM. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking purposes, and the geometric prior of the 3D shape is incorporated as an additional constraint in the pose graph. Experiments from in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and “organ-background” relative motion.

[CV-60] Detecting Underdiagnosed Medical Conditions with Deep Learning-Based Opportunistic CT Imaging

链接: https://arxiv.org/abs/2409.11686
作者: Asad Aali,Andrew Johnston,Louis Blankemeier,Dave Van Veen,Laura T Derry,David Svec,Jason Hom,Robert D. Boutin,Akshay S. Chaudhari
关键词-EN: Abdominal computed tomography, Abdominal computed, computed tomography, frequently performed, clinical settings
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

Abstract:Abdominal computed tomography (CT) scans are frequently performed in clinical settings. Opportunistic CT involves repurposing routine CT images to extract diagnostic information and is an emerging tool for detecting underdiagnosed conditions such as sarcopenia, hepatic steatosis, and ascites. This study utilizes deep learning methods to promote accurate diagnosis and clinical documentation. We analyze 2,674 inpatient CT scans to identify discrepancies between imaging phenotypes (characteristics derived from opportunistic CT scans) and their corresponding documentation in radiology reports and ICD coding. Through our analysis, we find that only 0.5%, 3.2%, and 30.7% of scans diagnosed with sarcopenia, hepatic steatosis, and ascites (respectively) through either opportunistic imaging or radiology reports were ICD-coded. Our findings demonstrate opportunistic CT’s potential to enhance diagnostic precision and accuracy of risk adjustment models, offering advancements in precision medicine.

[CV-61] SRIF: Semantic Shape Registration Empowered by Diffusion-based Image Morphing and Flow Estimation

链接: https://arxiv.org/abs/2409.11682
作者: Mingze Sun,Chen Guo,Puhua Jiang,Shiwei Mao,Yurun Chen,Ruqi Huang
关键词-EN: diffusion-based Image morphing, Registration framework based, framework based, interpolation framework based, based on diffusion-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

Abstract:In this paper, we propose SRIF, a novel Semantic shape Registration framework based on diffusion-based Image morphing and Flow estimation. More concretely, given a pair of extrinsically aligned shapes, we first render them from multi-views, and then utilize an image interpolation framework based on diffusion models to generate sequences of intermediate images between them. The images are later fed into a dynamic 3D Gaussian splatting framework, with which we reconstruct and post-process for intermediate point clouds respecting the image morphing processing. In the end, tailored for the above, we propose a novel registration module to estimate continuous normalizing flow, which deforms source shape consistently towards the target, with intermediate point clouds as weak guidance. Our key insight is to leverage large vision models (LVMs) to associate shapes and therefore obtain much richer semantic information on the relationship between shapes than the ad-hoc feature extraction and alignment. As a consequence, SRIF not only achieves high-quality dense correspondences on challenging shape pairs, but also delivers smooth, semantically meaningful interpolation in between. Empirical evidence justifies the effectiveness and superiority of our method as well as specific design choices. The code is released at this https URL.

[CV-62] Gradient-Driven 3D Segmentation and Affordance Transfer in Gaussian Splatting Using 2D Masks ICRA2025

链接: https://arxiv.org/abs/2409.11681
作者: Joji Joseph,Bharadwaj Amrutur,Shalabh Bhatnagar
关键词-EN: scene representation technique, capturing fine details, Splatting has emerged, Gaussian Splatting, scene representation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint, Under review for ICRA 2025

Abstract:3D Gaussian Splatting has emerged as a powerful 3D scene representation technique, capturing fine details with high efficiency. In this paper, we introduce a novel voting-based method that extends 2D segmentation models to 3D Gaussian splats. Our approach leverages masked gradients, where gradients are filtered by input 2D masks, and these gradients are used as votes to achieve accurate segmentation. As a byproduct, we discovered that inference-time gradients can also be used to prune Gaussians, resulting in up to 21% compression. Additionally, we explore few-shot affordance transfer, allowing annotations from 2D images to be effectively transferred onto 3D Gaussian splats. The robust yet straightforward mathematical formulation underlying this approach makes it a highly effective tool for numerous downstream applications, such as augmented reality (AR), object editing, and robotics. The project code and additional resources are available at this https URL.
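
The core voting idea can be sketched in a few lines: each Gaussian accumulates gradient magnitude only from pixels that fall inside the 2D mask, and high-vote Gaussians are assigned to the object. A toy illustration (the contribution format and the relative-vote threshold are assumptions, not the paper's implementation):

```python
import numpy as np

def masked_votes(contribs, mask2d, n_gaussians):
    """Accumulate per-Gaussian votes from mask-filtered gradients.
    contribs: (gaussian_id, row, col, grad_magnitude) tuples, one per
    pixel-Gaussian rendering contribution; only gradients that land
    inside the 2D mask count as votes for that Gaussian."""
    votes = np.zeros(n_gaussians)
    for gid, r, c, g in contribs:
        if mask2d[r, c]:
            votes[gid] += g
    return votes

mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True  # the 2D segmentation covers the top-left quadrant

contribs = [
    (0, 0, 0, 1.0), (0, 1, 1, 0.5),  # Gaussian 0 renders inside the mask
    (1, 3, 3, 2.0),                  # Gaussian 1 renders outside it
]
votes = masked_votes(contribs, mask, n_gaussians=2)
segment = votes > 0.5 * votes.max()  # relative-vote threshold (assumed)
```

The same per-Gaussian gradient statistics are what the abstract reuses for pruning: Gaussians receiving negligible gradient everywhere contribute little and can be dropped.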

[CV-63] Agent Aggregator with Mask Denoise Mechanism for Histopathology Whole Slide Image Analysis

链接: https://arxiv.org/abs/2409.11664
作者: Xitong Ling,Minxi Ouyang,Yizhi Wang,Xinrui Chen,Renao Yan,Hongbo Chu,Junru Cheng,Tian Guan,Sufang Tian,Xiaoping Liu,Yonghong He
关键词-EN: Histopathology analysis, gold standard, standard for medical, medical diagnosis, Histopathology
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

Abstract:Histopathology analysis is the gold standard for medical diagnosis. Accurate classification of whole slide images (WSIs) and region-of-interests (ROIs) localization can assist pathologists in diagnosis. The gigapixel resolution of WSI and the absence of fine-grained annotations make direct classification and analysis challenging. In weakly supervised learning, multiple instance learning (MIL) presents a promising approach for WSI classification. The prevailing strategy is to use attention mechanisms to measure instance importance for classification. However, attention mechanisms fail to capture inter-instance information, and self-attention causes quadratic computational complexity. To address these challenges, we propose AMD-MIL, an agent aggregator with a mask denoise mechanism. The agent token acts as an intermediate variable between the query and key for computing instance importance. Mask and denoising matrices, mapped from agents-aggregated value, dynamically mask low-contribution representations and eliminate noise. AMD-MIL achieves better attention allocation by adjusting feature representations, capturing micro-metastases in cancer, and improving interpretability. Extensive experiments on CAMELYON-16, CAMELYON-17, TCGA-KIDNEY, and TCGA-LUNG show AMD-MIL’s superiority over state-of-the-art methods.

[CV-64] Bridging Domain Gap for Flight-Ready Spaceborne Vision CEC

链接: https://arxiv.org/abs/2409.11661
作者: Tae Ha Park,Simone D’Amico
关键词-EN: Spacecraft Pose Network, work presents Spacecraft, Neural Network, presents Spacecraft Pose, non-cooperative target spacecraft
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to Journal of Spacecraft and Rockets; Appeared as Chapter 4 of Tae Ha Park’s PhD thesis

Abstract:This work presents Spacecraft Pose Network v3 (SPNv3), a Neural Network (NN) for monocular pose estimation of a known, non-cooperative target spacecraft. As opposed to existing literature, SPNv3 is designed and trained to be computationally efficient while providing robustness to spaceborne images that have not been observed during offline training and validation on the ground. These characteristics are essential to deploying NNs on space-grade edge devices. They are achieved through careful NN design choices, and an extensive trade-off analysis reveals features such as data augmentation, transfer learning and vision transformer architecture as a few of those that contribute to simultaneously maximizing robustness and minimizing computational overhead. Experiments demonstrate that the final SPNv3 can achieve state-of-the-art pose accuracy on hardware-in-the-loop images from a robotic testbed while having trained exclusively on computer-generated synthetic images, effectively bridging the domain gap between synthetic and real imagery. At the same time, SPNv3 runs well above the update frequency of modern satellite navigation filters when tested on a representative graphical processing unit system with flight heritage. Overall, SPNv3 is an efficient, flight-ready NN model readily applicable to a wide range of close-range rendezvous and proximity operations with target resident space objects. The code implementation of SPNv3 will be made publicly available.

[CV-65] VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

链接: https://arxiv.org/abs/2409.11656
作者: Humen Zhong,Zhibo Yang,Zhaohai Li,Peng Wang,Jun Tang,Wenqing Cheng,Cong Yao
关键词-EN: vision and language, Text recognition, advanced text recognition, inherent integration, texture in stroke
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ACM-MM2024

Abstract:Text recognition is an inherent integration of vision and language, encompassing the visual texture in stroke patterns and the semantic context among the character sequences. Towards advanced text recognition, there are three key challenges: (1) an encoder capable of representing the visual and semantic distributions; (2) a decoder that ensures the alignment between vision and semantics; and (3) consistency in the framework during pre-training, if it exists, and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach, named VL-Reader. The novelty of the VL-Reader lies in the pervasive interplay between vision and language throughout the entire process. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims at simultaneously modeling visual and linguistic information. Then, we design a Masked Visual-Linguistic Decoder (MVLD) to further leverage masked vision-language context and achieve bi-modal feature interaction. The architecture of VL-Reader maintains consistency from pre-training to fine-tuning. In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network degrades to reconstruct all characters from an image without any masked regions. VL-Reader achieves an average accuracy of 97.1% on six typical datasets, surpassing the SOTA by 1.1%. The improvement was even more significant on challenging datasets. The results demonstrate that a vision and language reconstructor can serve as an effective scene text recognizer.

[CV-66] Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

链接: https://arxiv.org/abs/2409.11653
作者: Qian Shao,Jiangrui Kang,Qiyuan Chen,Zepeng Li,Hongxia Xu,Yiwen Cao,Jiajuan Liang,Jian Wu
关键词-EN: human labor, preferred paradigm, deep learning tasks, sample selection, Learning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review

Abstract:Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that how to select samples for labelling also significantly impacts performance, particularly under extremely low-budget settings. The sample selection task in SSL has been under-explored for a long time. To fill in this gap, we propose a Representative and Diverse Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm to minimise a novel criterion α-Maximum Mean Discrepancy (α-MMD), RDSS samples a representative and diverse subset for annotation from the unlabeled data. We demonstrate that minimizing α-MMD enhances the generalization ability of low-budget learning. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms the state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained annotation budgets.
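
For intuition, the MMD criterion and the selection step can be sketched with a naive greedy stand-in (this replaces the paper's modified Frank-Wolfe algorithm with plain greedy search and omits the α weighting; the RBF kernel, bandwidth, and synthetic data are all assumptions):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel matrix between row-vector sets a and b."""
    d2 = ((a[:, None] - b[None]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=0.5):
    """(Biased) squared Maximum Mean Discrepancy between samples x and y."""
    return rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean() \
        - 2.0 * rbf(x, y, gamma).mean()

def greedy_select(pool, budget, gamma=0.5):
    """Greedily grow the subset whose distribution best matches the pool."""
    chosen = []
    for _ in range(budget):
        best_i, best_val = None, np.inf
        for i in range(len(pool)):
            if i in chosen:
                continue
            val = mmd2(pool[chosen + [i]], pool, gamma)
            if val < best_val:
                best_i, best_val = i, val
        chosen.append(best_i)
    return chosen

rng = np.random.default_rng(0)
# an unlabeled pool with two modes; a good 2-sample budget covers both
pool = np.vstack([rng.normal(-2.0, 0.3, size=(20, 2)),
                  rng.normal(2.0, 0.3, size=(20, 2))])
picked = greedy_select(pool, budget=2)
```

Minimizing the MMD between the chosen subset and the full pool is what makes the subset representative; the subset self-similarity term in the MMD simultaneously penalizes redundant (non-diverse) picks.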

[CV-67] Relax DARTS: Relaxing the Constraints of Differentiable Architecture Search for Eye Movement Recognition

链接: https://arxiv.org/abs/2409.11652
作者: Hongyu Zhu,Xin Jin,Hongchao Liao,Yan Xiang,Mounim A. El-Yacoubi,Huafeng Qin
关键词-EN: innovative identification method, Eye movement biometrics, Relax DARTS, secure and innovative, innovative identification
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注: Accepted By CCBR 2024

Abstract:Eye movement biometrics is a secure and innovative identification method. Deep learning methods have shown good performance, but their network architecture relies on manual design and combined priori knowledge. To address these issues, we introduce automated network search (NAS) algorithms to the field of eye movement recognition and present Relax DARTS, which is an improvement of the Differentiable Architecture Search (DARTS) to realize more efficient network search and training. The key idea is to circumvent the issue of weight sharing by independently training the architecture parameters α to achieve a more precise target architecture. Moreover, the introduction of module input weights β allows cells the flexibility to select inputs, to alleviate the overfitting phenomenon and improve the model performance. Results on four public databases demonstrate that the Relax DARTS achieves state-of-the-art recognition performance. Notably, Relax DARTS exhibits adaptability to other multi-feature temporal classification tasks.

[CV-68] DAF-Net: A Dual-Branch Feature Decomposition Fusion Network with Domain Adaptive for Infrared and Visible Image Fusion

链接: https://arxiv.org/abs/2409.11642
作者: Jian Xu,Xin He
关键词-EN: comprehensive scene understanding, combine complementary information, visible image fusion, image fusion aims, scene understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5pages,4figures

Abstract:Infrared and visible image fusion aims to combine complementary information from both modalities to provide a more comprehensive scene understanding. However, due to the significant differences between the two modalities, preserving key features during the fusion process remains a challenge. To address this issue, we propose a dual-branch feature decomposition fusion network (DAF-Net) with domain adaptive, which introduces Multi-Kernel Maximum Mean Discrepancy (MK-MMD) into the base encoder and designs a hybrid kernel function suitable for infrared and visible image fusion. The base encoder built on the Restormer network captures global structural information while the detail encoder based on Invertible Neural Networks (INN) focuses on extracting detail texture information. By incorporating MK-MMD, the DAF-Net effectively aligns the latent feature spaces of visible and infrared images, thereby improving the quality of the fused images. Experimental results demonstrate that the proposed method outperforms existing techniques across multiple datasets, significantly enhancing both visual quality and fusion performance. The related Python code is available at this https URL.
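
The MK-MMD term used for aligning the two modalities' latent spaces can be sketched as a squared MMD under a convex combination of RBF kernels at several bandwidths (a hedged illustration with fixed equal weights and synthetic features; DAF-Net's actual hybrid kernel and weighting are not reproduced):

```python
import numpy as np

def rbf(a, b, gamma):
    """RBF kernel matrix between row-vector sets a and b."""
    d2 = ((a[:, None] - b[None]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mk_mmd2(x, y, gammas=(0.25, 0.5, 1.0, 2.0), weights=None):
    """Squared MMD under a convex combination of RBF kernels.
    Multiple bandwidths make the discrepancy sensitive to
    distribution differences at several scales (the MK-MMD idea)."""
    if weights is None:
        weights = [1.0 / len(gammas)] * len(gammas)
    total = 0.0
    for w, g in zip(weights, gammas):
        total += w * (rbf(x, x, g).mean() + rbf(y, y, g).mean()
                      - 2.0 * rbf(x, y, g).mean())
    return total

rng = np.random.default_rng(0)
feat_visible = rng.normal(0.0, 1.0, size=(64, 4))  # stand-in visible features
feat_ir_near = rng.normal(0.2, 1.0, size=(64, 4))  # nearly aligned infrared
feat_ir_far = rng.normal(2.0, 1.0, size=(64, 4))   # badly misaligned infrared

near = mk_mmd2(feat_visible, feat_ir_near)
far = mk_mmd2(feat_visible, feat_ir_far)
# an alignment loss built on MK-MMD penalizes the misaligned pair more
```

Used as a training loss, this term pulls the infrared and visible latent distributions together, which is what the abstract credits for the improved fusion quality.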

[CV-69] PainDiffusion: Can robot express pain?

链接: https://arxiv.org/abs/2409.11635
作者: Quang Tien Dam,Tri Tung Nguyen Nguyen,Dinh Tuan Tran,Joo-Ho Lee
关键词-EN: rehabilitation nurse training, communicating problems, intuitive and user-friendly, rehabilitation nurse, nurse training robots
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Under reviewing

Abstract:Pain is a more intuitive and user-friendly way of communicating problems, making it especially useful in rehabilitation nurse training robots. While most previous methods have focused on classifying or recognizing pain expressions, these approaches often result in unnatural, jiggling robot faces. We introduce PainDiffusion, a model that generates facial expressions in response to pain stimuli, with controllable pain expressiveness and emotion status. PainDiffusion leverages diffusion forcing to roll out predictions over arbitrary lengths using a conditioned temporal U-Net. It operates as a latent diffusion model within EMOCA’s facial expression latent space, ensuring a compact data representation and quick rendering time. For training data, we process the BioVid Heatpain Database, extracting expression codes and subject identity configurations. We also propose a novel set of metrics to evaluate pain expressions, focusing on expressiveness, diversity, and the appropriateness of model-generated outputs. Finally, we demonstrate that PainDiffusion outperforms the autoregressive method, both qualitatively and quantitatively. Code, videos, and further analysis are available at: this https URL.

[CV-70] Multimodal Generalized Category Discovery

链接: https://arxiv.org/abs/2409.11624
作者: Yuchang Su,Renping Zhou,Siyu Huang,Xingjian Li,Tianyang Wang,Ziyue Wang,Min Xu
关键词-EN: Generalized Category Discovery, Generalized Category, Category Discovery, open-world scientific discoveries, aims to classify
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

Abstract:Generalized Category Discovery (GCD) aims to classify inputs into both known and novel categories, a task crucial for open-world scientific discoveries. However, current GCD methods are limited to unimodal data, overlooking the inherently multimodal nature of most real-world data. In this work, we extend GCD to a multimodal setting, where inputs from different modalities provide richer and complementary information. Through theoretical analysis and empirical validation, we identify that the key challenge in multimodal GCD lies in effectively aligning heterogeneous information across modalities. To address this, we propose MM-GCD, a novel framework that aligns both the feature and output spaces of different modalities using contrastive learning and distillation techniques. MM-GCD achieves new state-of-the-art performance on the UPMC-Food101 and N24News datasets, surpassing previous methods by 11.5% and 4.7%, respectively.
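
The contrastive-alignment ingredient can be illustrated with a standard symmetric InfoNCE loss over paired embeddings from two modalities (a generic sketch, not MM-GCD's exact objective; all shapes, the temperature, and the synthetic data are assumptions):

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE: paired rows of z_a and z_b are positives,
    every other pairing in the batch is a negative."""
    # L2-normalize so logits are scaled cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature

    def xent_diag(l):
        # cross-entropy with the diagonal (the true pair) as the label
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(16, 32))                            # one modality
image_emb_paired = text_emb + 0.05 * rng.normal(size=(16, 32))  # aligned pairs
image_emb_random = rng.normal(size=(16, 32))                    # unrelated batch

loss_aligned = info_nce(text_emb, image_emb_paired)
loss_random = info_nce(text_emb, image_emb_random)
```

Minimizing such a loss drives the two modalities' feature spaces toward agreement, which is the "aligning heterogeneous information across modalities" step the abstract identifies as the key challenge.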

[CV-71] Self-Contrastive Forward-Forward Algorithm

链接: https://arxiv.org/abs/2409.11593
作者: Xing Chen,Dongshu Liu,Jeremie Laydevant,Julie Grollier
关键词-EN: updates weights locally, purely forward-mode learning, purely forward-mode, updates weights, weights locally
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:The Forward-Forward (FF) algorithm is a recent, purely forward-mode learning method, that updates weights locally and layer-wise and supports supervised as well as unsupervised learning. These features make it ideal for applications such as brain-inspired learning, low-power hardware neural networks, and distributed learning in large models. However, while FF has shown promise on written digit recognition tasks, its performance on natural images and time-series remains a challenge. A key limitation is the need to generate high-quality negative examples for contrastive learning, especially in unsupervised tasks, where versatile solutions are currently lacking. To address this, we introduce the Self-Contrastive Forward-Forward (SCFF) method, inspired by self-supervised contrastive learning. SCFF generates positive and negative examples applicable across different datasets, surpassing existing local forward algorithms for unsupervised classification accuracy on MNIST (MLP: 98.7%), CIFAR-10 (CNN: 80.75%), and STL-10 (CNN: 77.3%). Additionally, SCFF is the first to enable FF training of recurrent neural networks, opening the door to more complex tasks and continuous-time video and text processing.
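The local FF objective that SCFF builds on can be sketched in a few lines: each layer is trained, without backpropagation through other layers, to raise a "goodness" score (here the sum of squared activations, following Hinton's original FF formulation) on positive samples and lower it on negatives. The layer sizes, threshold, and learning rate below are illustrative choices, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class FFLayer:
    """One layer trained with the local Forward-Forward objective:
    push goodness (sum of squared ReLU activations) above a threshold
    for positive samples and below it for negative samples."""

    def __init__(self, d_in, d_out, threshold=2.0, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(d_out, d_in))
        self.threshold, self.lr = threshold, lr

    def forward(self, x):
        return np.maximum(0.0, self.W @ x)        # ReLU activations

    def goodness(self, x):
        y = self.forward(x)
        return float(np.sum(y * y))

    def train_step(self, x, positive):
        y = self.forward(x)
        g = np.sum(y * y)
        p = sigmoid(g - self.threshold)           # P(sample is positive)
        # d(logistic loss)/dg with label 1 (positive) or 0 (negative)
        dg = p - (1.0 if positive else 0.0)
        # chain rule: dg/dW = 2 * y * relu'(pre) * x, purely local to this layer
        grad_W = dg * 2.0 * np.outer(y * (y > 0), x)
        self.W -= self.lr * grad_W
```

A few alternating updates separate a positive from a negative input by goodness; SCFF's contribution is in how the positive/negative examples themselves are generated.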

[CV-72] Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

链接: https://arxiv.org/abs/2409.11564
作者: Genta Indra Winata,Hanyang Zhao,Anirban Das,Wenpin Tang,David D. Yao,Shi-Xiong Zhang,Sambit Sahu
关键词-EN: aligning deep generative, Preference tuning, deep generative models, preference tuning tasks, Preference
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Survey paper

点击查看摘要

Abstract:Preference tuning is a crucial process for aligning deep generative models with human preferences. This survey offers a thorough overview of recent advancements in preference tuning and the integration of human feedback. The paper is organized into three main sections: 1) introduction and preliminaries: an introduction to reinforcement learning frameworks, preference tuning tasks, models, and datasets across various modalities: language, speech, and vision, as well as different policy approaches, 2) in-depth examination of each preference tuning approach: a detailed analysis of the methods used in preference tuning, and 3) applications, discussion, and future directions: an exploration of the applications of preference tuning in downstream tasks, including evaluation methods for different modalities, and an outlook on future research directions. Our objective is to present the latest methodologies in preference tuning and model alignment, enhancing the understanding of this field for researchers and practitioners. We hope to encourage further engagement and innovation in this area.
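As one concrete example from this space, reward models for RLHF-style preference tuning are commonly fit with the Bradley-Terry pairwise loss on human comparisons. This is a minimal sketch of one canonical objective among the many methods the survey covers, not a summary of the survey itself.

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss for fitting a reward model:
    -log sigmoid(r_chosen - r_rejected). The loss is minimized when the
    model scores the human-preferred response higher than the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two rewards tie, the loss equals log 2; it shrinks as the chosen response is scored increasingly above the rejected one.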

[CV-73] Open-Set Semantic Uncertainty Aware Metric-Semantic Graph Matching

链接: https://arxiv.org/abs/2409.11555
作者: Kurran Singh,John J. Leonard
关键词-EN: mapping requires incorporating, visual foundation models, requires incorporating visual, incorporating visual foundation, previously unseen object
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Underwater object-level mapping requires incorporating visual foundation models to handle the uncommon and often previously unseen object classes encountered in marine scenarios. In this work, a metric of semantic uncertainty for open-set object detections produced by visual foundation models is calculated and then incorporated into an object-level uncertainty tracking framework. Object-level uncertainties and geometric relationships between objects are used to enable robust object-level loop closure detection for unknown object classes. The above loop closure detection problem is formulated as a graph-matching problem. While graph matching, in general, is NP-Complete, a solver for an equivalent formulation of the proposed graph matching problem as a graph editing problem is tested on multiple challenging underwater scenes. Results for this solver as well as three other solvers demonstrate that the proposed methods are feasible for real-time use in marine environments for the robust, open-set, multi-object, semantic-uncertainty-aware loop closure detection. Further experimental results on the KITTI dataset demonstrate that the method generalizes to large-scale terrestrial scenes.
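To make the graph-matching formulation concrete, the sketch below brute-forces the node correspondence between two small weighted object graphs. It is a naive stand-in for the paper's graph-editing solver (the problem is NP-complete in general) and is only tractable at the small object-level graph sizes mentioned here.

```python
import itertools
import numpy as np

def best_matching_cost(A, B):
    """Brute-force graph matching: find the node permutation of B that
    minimizes edge-weight disagreement with A (A, B are symmetric
    weighted adjacency matrices of equal size). Exponential in node
    count, so only usable for small object-level graphs."""
    n = A.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(n)):
        P = np.eye(n)[list(perm)]             # permutation matrix
        cost = np.abs(A - P @ B @ P.T).sum()  # total edge disagreement
        best = min(best, cost)
    return best
```

A zero cost indicates the two object graphs match under some relabeling, i.e. a loop-closure candidate; a large cost rejects it.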

[CV-74] VALO: A Versatile Anytime Framework for LiDAR-based Object Detection Deep Neural Networks

链接: https://arxiv.org/abs/2409.11542
作者: Ahmet Soyyigit,Shuochao Yao,Heechul Yun
关键词-EN: LiDAR object detection, LiDAR object, object detection, object detection deep, object detection DNNs
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work addresses the challenge of adapting dynamic deadline requirements for LiDAR object detection deep neural networks (DNNs). The computing latency of object detection is critically important to ensure safe and efficient navigation. However, state-of-the-art LiDAR object detection DNNs often exhibit significant latency, hindering their real-time performance on resource-constrained edge platforms. Therefore, a tradeoff between detection accuracy and latency should be dynamically managed at runtime to achieve optimum results. In this paper, we introduce VALO (Versatile Anytime algorithm for LiDAR Object detection), a novel data-centric approach that enables anytime computing of 3D LiDAR object detection DNNs. VALO employs a deadline-aware scheduler to selectively process input regions, making execution time and accuracy tradeoffs without architectural modifications. Additionally, it leverages efficient forecasting of past detection results to mitigate possible loss of accuracy due to partial processing of input. Finally, it utilizes a novel input reduction technique within its detection heads to significantly accelerate execution without sacrificing accuracy. We implement VALO on state-of-the-art 3D LiDAR object detection networks, namely CenterPoint and VoxelNext, and demonstrate its dynamic adaptability to a wide range of time constraints while achieving higher accuracy than the prior state-of-the-art. Code is available at https://github.com/CSL-KU/VALO.
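VALO's deadline-aware scheduling idea can be caricatured as a greedy, time-budgeted selection over input regions. The region fields, the priority ordering, and the per-region cost model below are all hypothetical illustrations; the actual scheduler also forecasts from past detections, which this sketch omits.

```python
def select_regions(regions, deadline_ms):
    """Hypothetical sketch of deadline-aware input scheduling: greedily
    process the highest-priority regions that still fit in the remaining
    time budget, dropping the rest for this frame.

    regions: list of dicts with hypothetical keys
             'id', 'priority', 'cost_ms' (estimated processing time).
    """
    chosen, remaining = [], deadline_ms
    for r in sorted(regions, key=lambda r: r["priority"], reverse=True):
        if r["cost_ms"] <= remaining:
            chosen.append(r["id"])
            remaining -= r["cost_ms"]
    return chosen
```

Tightening the deadline naturally trades accuracy (fewer regions processed) for latency, which is the anytime behavior the paper targets.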

[CV-75] Obfuscation Based Privacy Preserving Representations are Recoverable Using Neighborhood Information

链接: https://arxiv.org/abs/2409.11536
作者: Kunal Chelani,Assia Benbihi,Fredrik Kahl,Torsten Sattler,Zuzana Kukelova
关键词-EN: Rapid growth, Toggle, visual localization systems, cloud-based visual localization, Code
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Rapid growth in the popularity of AR/VR/MR applications and cloud-based visual localization systems has given rise to an increased focus on the privacy of user content in the localization process. This privacy concern has been further escalated by the ability of deep neural networks to recover detailed images of a scene from a sparse set of 3D or 2D points and their descriptors - the so-called inversion attacks. Research on privacy-preserving localization has therefore focused on preventing these inversion attacks on both the query image keypoints and the 3D points of the scene map. To this end, several geometry obfuscation techniques that lift points to higher-dimensional spaces, i.e., lines or planes, or that swap coordinates between points have been proposed. In this paper, we point to a common weakness of these obfuscations that allows to recover approximations of the original point positions under the assumption of known neighborhoods. We further show that these neighborhoods can be computed by learning to identify descriptors that co-occur in neighborhoods. Extensive experiments show that our approach for point recovery is practically applicable to all existing geometric obfuscation schemes. Our results show that these schemes should not be considered privacy-preserving, even though they are claimed to be privacy-preserving. Code will be available at this https URL.

[CV-76] Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints

链接: https://arxiv.org/abs/2409.11518
作者: Chen Jiang,Allie Luo,Martin Jagersand
关键词-EN: referring image segmentation, image segmentation model, image segmentation, referring image, compact referring image
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we perform robot manipulation activities in real-world environments with language contexts by integrating a compact referring image segmentation model into the robot’s perception module. First, we propose CLIPU^2Net, a lightweight referring image segmentation model designed for fine-grain boundary and structure segmentation from language expressions. Then, we deploy the model in an eye-in-hand visual servoing system to enact robot control in the real world. The key to our system is the representation of salient visual information as geometric constraints, linking the robot’s visual perception to actionable commands. Experimental results on 46 real-world robot manipulation tasks demonstrate that our method outperforms traditional visual servoing methods relying on labor-intensive feature annotations, excels in fine-grain referring image segmentation with a compact decoder size of 6.6 MB, and supports robot control across diverse contexts.

[CV-77] Mamba Fusion: Learning Actions Through Questioning

链接: https://arxiv.org/abs/2409.11513
作者: Zhikang Dong,Apoorva Beedu,Jason Sheinkopf,Irfan Essa
关键词-EN: Video Language Models, Video Language, enhance learning, crucial for generalizing, generalizing across diverse
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto in vision-language training, they face challenges like quadratic computational complexity, high GPU memory usage, and difficulty with long-term dependencies. To address these limitations, we introduce MambaVL, a novel model that leverages recent advancements in selective state space modality fusion to efficiently capture long-range dependencies and learn joint representations for vision and language data. MambaVL utilizes a shared state transition matrix across both modalities, allowing the model to capture information about actions from multiple perspectives within the scene. Furthermore, we propose a question-answering task that helps guide the model toward relevant cues. These questions provide critical information about actions, objects, and environmental context, leading to enhanced performance. As a result, MambaVL achieves state-of-the-art performance in action recognition on the Epic-Kitchens-100 dataset and outperforms baseline methods in action anticipation.
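The state-space recurrence underlying Mamba-style fusion can be sketched as a simple linear scan; in MambaVL the transition matrix A is shared across the vision and language streams. The sketch below shows only the basic recurrence y_t = C h_t with h_t = A h_{t-1} + B u_t; the selectivity of modern SSMs (input-dependent parameters) and the discretization details are omitted for brevity and are assumptions about the general technique, not the paper's exact formulation.

```python
import numpy as np

def ssm_scan(A, B, C, inputs):
    """Linear state-space recurrence applied step by step:
        h_t = A h_{t-1} + B u_t,   y_t = C h_t.
    Sharing A across two input streams (as in MambaVL's fusion idea)
    means both modalities evolve one joint hidden state."""
    h = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:
        h = A @ h + B @ u          # state update
        outputs.append(C @ h)      # readout
    return np.array(outputs)
```

With A = 0 the model is memoryless (output mirrors each input); with A = I the state accumulates history, which is the long-range-dependency behavior the abstract highlights.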

[CV-78] Good Grasps Only: A data engine for self-supervised fine-tuning of pose estimation using grasp poses for verification

链接: https://arxiv.org/abs/2409.11512
作者: Frederik Hagelskjær
关键词-EN: pose estimation, zero-shot pose estimation, self-supervised fine-tuning, in-hand pose estimation, pose
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 7 figures, 3 tables

点击查看摘要

Abstract:In this paper, we present a novel method for self-supervised fine-tuning of pose estimation for bin-picking. Leveraging zero-shot pose estimation, our approach enables the robot to automatically obtain training data without manual labeling. After pose estimation the object is grasped, and in-hand pose estimation is used for data validation. Our pipeline allows the system to fine-tune while the process is running, removing the need for a learning phase. The motivation behind our work lies in the need for rapid setup of pose estimation solutions. Specifically, we address the challenging task of bin picking, which plays a pivotal role in flexible robotic setups. Our method is implemented on a robotics work-cell, and tested with four different objects. For all objects, our method increases the performance and outperforms a state-of-the-art method trained on the CAD model of the objects.

[CV-79] Two Stage Segmentation of Cervical Tumors using PocketNet

链接: https://arxiv.org/abs/2409.11456
作者: Awj Twam,Megan Jacobsen,Rachel Glenn,Ann Klopp,Aradhana M. Venkatesan,David Fuentes
关键词-EN: includes external beam, external beam radiation, locally advanced cervical, definitive treatment regimen, mainstay definitive treatment
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cervical cancer remains the fourth most common malignancy amongst women worldwide. Concurrent chemoradiotherapy (CRT) serves as the mainstay definitive treatment regimen for locally advanced cervical cancers and includes external beam radiation followed by brachytherapy. Integral to radiotherapy treatment planning is the routine contouring of both the target tumor at the level of the cervix, associated gynecologic anatomy and the adjacent organs at risk (OARs). However, manual contouring of these structures is both time and labor intensive and associated with known interobserver variability that can impact treatment outcomes. While multiple tools have been developed to automatically segment OARs and the high-risk clinical tumor volume (HR-CTV) using computed tomography (CT) images, the development of deep learning-based tumor segmentation tools using routine T2-weighted (T2w) magnetic resonance imaging (MRI) addresses an unmet clinical need to improve the routine contouring of both anatomical structures and cervical cancers, thereby increasing quality and consistency of radiotherapy planning. This work applied a novel deep-learning model (PocketNet) to segment the cervix, vagina, uterus, and tumor(s) on T2w MRI. The performance of the PocketNet architecture was evaluated when trained via 5-fold cross validation. PocketNet achieved a mean Dice-Sorensen similarity coefficient (DSC) exceeding 70% for tumor segmentation and 80% for organ segmentation. These results suggest that PocketNet is robust to variations in contrast protocols, providing reliable segmentation of the ROIs.
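The Dice-Sorensen similarity coefficient used to score these segmentations is simple to state: 2|X∩Y| / (|X| + |Y|) for binary masks X and Y. A minimal reference implementation:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice-Sorensen similarity for binary masks: 2|X n Y| / (|X| + |Y|).
    eps guards against division by zero when both masks are empty."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

Identical masks score 1, disjoint masks score 0, and values above 0.7, as reported here for tumors, indicate substantial overlap with the reference contour.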

[CV-80] Continual Learning of Conjugated Visual Representations through Higher-order Motion Flows

链接: https://arxiv.org/abs/2409.11441
作者: Simone Marullo,Matteo Tiezzi,Marco Gori,Stefano Melacci
关键词-EN: visual information presents, presents several challenges, challenges due, visual information, information presents
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Currently under review

点击查看摘要

Abstract:Learning with neural networks from a continuous stream of visual information presents several challenges due to the non-i.i.d. nature of the data. However, it also offers novel opportunities to develop representations that are consistent with the information flow. In this paper we investigate the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, therefore named motion-conjugated feature representations. Differently from existing approaches, motion is not a given signal (either ground-truth or estimated by external modules), but is the outcome of a progressive and autonomous learning process, occurring at various levels of the feature hierarchy. Multiple motion flows are estimated with neural networks and characterized by different levels of abstractions, spanning from traditional optical flow to other latent signals originating from higher-level features, hence called higher-order motions. Continuously learning to develop consistent multi-order flows and representations is prone to trivial solutions, which we counteract by introducing a self-supervised contrastive loss, spatially-aware and based on flow-induced similarity. We assess our model on photorealistic synthetic streams and real-world videos, comparing to pre-trained state-of-the-art feature extractors (also based on Transformers) and to recent unsupervised learning models, significantly outperforming these alternatives.

[CV-81] Scale-covariant and scale-invariant Gaussian derivative networks

链接: https://arxiv.org/abs/2011.14759
作者: Tony Lindeberg
关键词-EN: deep learning architecture, deep learning, coupling parameterized scale-space, parameterized scale-space operations, multiple scale channels
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages, 10 figures

点击查看摘要

Abstract:This paper presents a hybrid approach between scale-space theory and deep learning, where a deep learning architecture is constructed by coupling parameterized scale-space operations in cascade. By sharing the learnt parameters between multiple scale channels, and by using the transformation properties of the scale-space primitives under scaling transformations, the resulting network becomes provably scale covariant. By in addition performing max pooling over the multiple scale channels, a resulting network architecture for image classification also becomes provably scale invariant. We investigate the performance of such networks on the MNISTLargeScale dataset, which contains rescaled images from original MNIST over a factor of 4 concerning training data and over a factor of 16 concerning testing data. It is demonstrated that the resulting approach allows for scale generalization, enabling good performance for classifying patterns at scales not present in the training data.
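The scale-channel construction can be illustrated in 1-D: compute scale-normalized Gaussian derivative responses in several scale channels, then max-pool over scale. The sketch below uses the classic variance-normalized second derivative from scale-space theory as the channel operator; the paper's networks additionally learn pointwise combinations of such primitives, which is omitted here.

```python
import numpy as np

def gaussian_second_derivative_kernel(sigma, radius=None):
    """Sampled second derivative of a 1-D Gaussian:
    g''(x) = (x^2/sigma^4 - 1/sigma^2) * g(x)."""
    radius = radius or int(4 * sigma + 1)
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return (x**2 / sigma**4 - 1 / sigma**2) * g

def scale_selected_response(signal, sigmas):
    """Scale-normalized responses |sigma^2 * (signal * g'')| over several
    scale channels, max-pooled over scale. Returns the pooled response
    and the scale selected at the center point (classic scale selection:
    the winning scale tracks the width of the local structure)."""
    responses = []
    for sigma in sigmas:
        kernel = gaussian_second_derivative_kernel(sigma)
        r = np.convolve(signal, kernel, mode="same")
        responses.append(sigma**2 * np.abs(r))     # gamma-normalization
    stack = np.stack(responses)                     # (n_scales, n_points)
    center = len(signal) // 2
    return stack.max(axis=0), sigmas[stack[:, center].argmax()]
```

Because the normalized responses transform predictably under rescaling, a wider blob wins in a proportionally coarser scale channel, which is the covariance property the paper builds on.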

[CV-82] Understanding when spatial transformer networks do not support invariance and what to do about it

链接: https://arxiv.org/abs/2004.11678
作者: Lukas Finnveden,Ylva Jansson,Tony Lindeberg
关键词-EN: enable convolutional neural, CNN feature maps, convolutional neural networks, CNN feature, convolutional neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Spatial transformer networks (STNs) were designed to enable convolutional neural networks (CNNs) to learn invariance to image transformations. STNs were originally proposed to transform CNN feature maps as well as input images. This enables the use of more complex features when predicting transformation parameters. However, since STNs perform a purely spatial transformation, they do not, in the general case, have the ability to align the feature maps of a transformed image with those of its original. STNs are therefore unable to support invariance when transforming CNN feature maps. We present a simple proof for this and study the practical implications, showing that this inability is coupled with decreased classification accuracy. We therefore investigate alternative STN architectures that make use of complex features. We find that while deeper localization networks are difficult to train, localization networks that share parameters with the classification network remain stable as they grow deeper, which allows for higher classification accuracy on difficult datasets. Finally, we explore the interaction between localization network complexity and iterative image alignment.
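The purely spatial transformation an STN applies can be sketched as an affine warp over normalized coordinates. This toy version uses nearest-neighbor sampling where real STNs use differentiable bilinear interpolation; it is an illustration of the operation the paper analyzes, not the paper's code.

```python
import numpy as np

def affine_warp(image, theta):
    """STN-style warp: for each output pixel, map its normalized
    coordinate in [-1, 1]^2 through the 2x3 affine matrix theta and
    sample the input there (nearest neighbor for brevity)."""
    h, w = image.shape
    out = np.zeros_like(image)
    for i in range(h):
        for j in range(w):
            y = 2 * i / (h - 1) - 1            # normalized output coords
            x = 2 * j / (w - 1) - 1
            xs, ys = theta @ np.array([x, y, 1.0])
            ii = round((ys + 1) * (h - 1) / 2)  # back to pixel indices
            jj = round((xs + 1) * (w - 1) / 2)
            if 0 <= ii < h and 0 <= jj < w:
                out[i, j] = image[ii, jj]
    return out
```

Note the key point of the paper: the warp moves pixels but cannot, in general, make a CNN feature map of a transformed image equal the feature map of the original, since features are computed before the warp.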

[CV-83] multiPI-TransBTS: A Multi-Path Learning Framework for Brain Tumor Image Segmentation Based on Multi-Physical Information

链接: https://arxiv.org/abs/2409.12167
作者: Hongjun Zhu,Jiaohang Huang,Kuo Chen,Xuehui Ying,Ying Qian
关键词-EN: Brain Tumor Segmentation, Brain Tumor, treatment planning, Tumor, plays a critical
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Brain Tumor Segmentation (BraTS) plays a critical role in clinical diagnosis, treatment planning, and monitoring the progression of brain tumors. However, due to the variability in tumor appearance, size, and intensity across different MRI modalities, automated segmentation remains a challenging task. In this study, we propose a novel Transformer-based framework, multiPI-TransBTS, which integrates multi-physical information to enhance segmentation accuracy. The model leverages spatial information, semantic information, and multi-modal imaging data, addressing the inherent heterogeneity in brain tumor characteristics. The multiPI-TransBTS framework consists of an encoder, an Adaptive Feature Fusion (AFF) module, and a multi-source, multi-scale feature decoder. The encoder incorporates a multi-branch architecture to separately extract modality-specific features from different MRI sequences. The AFF module fuses information from multiple sources using channel-wise and element-wise attention, ensuring effective feature recalibration. The decoder combines both common and task-specific features through a Task-Specific Feature Introduction (TSFI) strategy, producing accurate segmentation outputs for Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET) regions. Comprehensive evaluations on the BraTS2019 and BraTS2020 datasets demonstrate the superiority of multiPI-TransBTS over the state-of-the-art methods. The model consistently achieves better Dice coefficients, Hausdorff distances, and Sensitivity scores, highlighting its effectiveness in addressing the BraTS challenges. Our results also indicate the need for further exploration of the balance between precision and recall in the ET segmentation task. The proposed framework represents a significant advancement in BraTS, with potential implications for improving clinical outcomes for brain tumor patients.

[CV-84] Autopet III challenge: Incorporating anatomical knowledge into nnUNet for lesion segmentation in PET/CT

链接: https://arxiv.org/abs/2409.12155
作者: Hamza Kalisch,Fabian Hörst,Ken Herrmann,Jens Kleesiek,Constantin Seibold
关键词-EN: supports personalized treatment, personalized treatment planning, precise tumor characterization, enhances diagnostic precision, precision in oncology
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: AutoPET III challenge submission

点击查看摘要

Abstract:Lesion segmentation in PET/CT imaging is essential for precise tumor characterization, which supports personalized treatment planning and enhances diagnostic precision in oncology. However, accurate manual segmentation of lesions is time-consuming and prone to inter-observer variability. Given the rising demand and clinical use of PET/CT, automated segmentation methods, particularly deep-learning-based approaches, have become increasingly more relevant. The autoPET III Challenge focuses on advancing automated segmentation of tumor lesions in PET/CT images in a multitracer multicenter setting, addressing the clinical need for quantitative, robust, and generalizable solutions. Building on previous challenges, the third iteration of the autoPET challenge introduces a more diverse dataset featuring two different tracers (FDG and PSMA) from two clinical centers. To this extent, we developed a classifier that identifies the tracer of the given PET/CT based on the Maximum Intensity Projection of the PET scan. We trained two individual nnUNet-ensembles for each tracer where anatomical labels are included as a multi-label task to enhance the model’s performance. Our final submission achieves cross-validation Dice scores of 76.90% and 61.33% for the publicly available FDG and PSMA datasets, respectively. The code is available at this https URL .
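The Maximum Intensity Projection used as the tracer classifier's input is a one-line reduction over the PET volume, sketched here for clarity; the classifier architecture itself is not specified in the abstract.

```python
import numpy as np

def maximum_intensity_projection(volume, axis=0):
    """Collapse a 3-D volume to a 2-D image by keeping, along each ray
    in `axis`, only the brightest voxel. For PET, hot lesions and
    tracer-specific uptake patterns survive the projection, which is
    what makes the MIP a usable classifier input."""
    return np.max(volume, axis=axis)
```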

[CV-85] Optimal Visual Search with Highly Heuristic Decision Rules

链接: https://arxiv.org/abs/2409.12124
作者: Anqi Zhang,Wilson S. Geisler
关键词-EN: fundamental natural task, fundamental natural, humans, Bayesian-optimal decision process, potential target-object locations
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Visual search is a fundamental natural task for humans and other animals. We investigated the decision processes humans use when searching briefly presented displays having well-separated potential target-object locations. Performance was compared with the Bayesian-optimal decision process under the assumption that the information from the different potential target locations is statistically independent. Surprisingly, humans performed slightly better than optimal, despite humans’ substantial loss of sensitivity in the fovea, and the implausibility of the human brain replicating the optimal computations. We show that three factors can quantitatively explain these seemingly paradoxical results. Most importantly, simple and fixed heuristic decision rules reach near optimal search performance. Secondly, foveal neglect primarily affects only the central potential target location. Finally, spatially correlated neural noise causes search performance to exceed that predicted for independent noise. These findings have far-reaching implications for understanding visual search tasks and other identification tasks in humans and other animals.
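Under the independence assumption, the Bayesian-optimal rule reduces to comparing posteriors across locations. The sketch below assumes an equal-variance Gaussian signal-detection model (observation mean d' at the target location, 0 elsewhere), which is a standard simplification, not necessarily the paper's exact observer model.

```python
import numpy as np

def optimal_search_decision(observations, d_prime, prior=None):
    """Bayesian-optimal target localization over independent locations:
    pick the location with the largest posterior, combining the prior
    with each location's log likelihood ratio of 'target here' vs
    'target absent' under the Gaussian model."""
    n = len(observations)
    prior = prior if prior is not None else np.full(n, 1.0 / n)
    # log N(x; d', 1) - log N(x; 0, 1) = d' * x - d'^2 / 2
    llr = d_prime * observations - d_prime**2 / 2
    log_posterior = np.log(prior) + llr
    return int(np.argmax(log_posterior))
```

With equal priors and equal d' across locations, this collapses to "pick the location with the largest response", which is exactly the kind of simple, fixed heuristic rule the paper finds to be near-optimal.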

[CV-86] Denoising diffusion models for high-resolution microscopy image restoration

链接: https://arxiv.org/abs/2409.12078
作者: Pamela Osuna-Vargas,Maren H. Wehrheim,Lucas Zinz,Johanna Rahm,Ashwin Balakrishnan,Alexandra Kaminer,Mike Heilemann,Matthias Kaschube
关键词-EN: unraveling intricate details, Advances in microscopy, imaging enable researchers, microscopy imaging enable, microscopy imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advances in microscopy imaging enable researchers to visualize structures at the nanoscale level thereby unraveling intricate details of biological organization. However, challenges such as image noise, photobleaching of fluorophores, and low tolerability of biological samples to high light doses remain, restricting temporal resolutions and experiment durations. Reduced laser doses enable longer measurements at the cost of lower resolution and increased noise, which hinders accurate downstream analyses. Here we train a denoising diffusion probabilistic model (DDPM) to predict high-resolution images by conditioning the model on low-resolution information. Additionally, the probabilistic aspect of the DDPM allows for repeated generation of images that tend to further increase the signal-to-noise ratio. We show that our model achieves a performance that is better or similar to the previously best-performing methods, across four highly diverse datasets. Importantly, while any of the previous methods show competitive performance for some, but not all datasets, our method consistently achieves high performance across all four data sets, suggesting high generalizability.
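The DDPM forward (noising) process that the model is trained to invert has a closed form; a minimal sketch follows. The conditioning on low-resolution inputs enters only in the learned reverse model, which is not shown, and the beta schedule below is illustrative.

```python
import numpy as np

def ddpm_forward_sample(x0, t, betas, rng):
    """Closed-form DDPM forward process:
        x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
    where abar_t is the cumulative product of (1 - beta_s) up to step t
    and eps ~ N(0, I). Training fits a network to denoise such x_t."""
    alphas_bar = np.cumprod(1.0 - betas)
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
```

The probabilistic reverse process is what allows the repeated generation mentioned in the abstract: each sampling run draws fresh noise, so averaging runs raises the signal-to-noise ratio.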

[CV-87] Tumor aware recurrent inter-patient deformable image registration of computed tomography scans with lung cancer

链接: https://arxiv.org/abs/2409.11910
作者: Jue Jiang,Chloe Min Seo Choi,Maria Thor,Joseph O. Deasy,Harini Veeraraghavan
关键词-EN: population level radiotherapy, outcomes modeling requires, modeling requires topology, requires topology preserving, topology preserving inter-patient
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Minor revision under the journal of Medical Physics

点击查看摘要

Abstract:Background: Voxel-based analysis (VBA) for population level radiotherapy (RT) outcomes modeling requires topology preserving inter-patient deformable image registration (DIR) that preserves tumors on moving images while avoiding unrealistic deformations due to tumors occurring on fixed images. Purpose: We developed a tumor-aware recurrent registration (TRACER) deep learning (DL) method and evaluated its suitability for VBA. Methods: TRACER consists of encoder layers implemented with stacked 3D convolutional long short term memory network (3D-CLSTM) followed by decoder and spatial transform layers to compute dense deformation vector field (DVF). Multiple CLSTM steps are used to compute a progressive sequence of deformations. Input conditioning was applied by including tumor segmentations with 3D image pairs as input channels. Bidirectional tumor rigidity, image similarity, and deformation smoothness losses were used to optimize the network in an unsupervised manner. TRACER and multiple DL methods were trained with 204 3D CT image pairs from patients with lung cancers (LC) and evaluated using (a) Dataset I (N = 308 pairs) with DL segmented LCs, (b) Dataset II (N = 765 pairs) with manually delineated LCs, and (c) Dataset III with 42 LC patients treated with RT. Results: TRACER accurately aligned normal tissues. It best preserved tumors, indicated by the smallest tumor volume difference of 0.24%, 0.40%, and 0.13% and mean square error in CT intensities of 0.005, 0.005, 0.004, computed between original and resampled moving image tumors, for Datasets I, II, and III, respectively. It resulted in the smallest planned RT tumor dose difference computed between original and resampled moving images of 0.01 Gy and 0.013 Gy when using a female and a male reference.

[CV-88] NT-ViT: Neural Transcoding Vision Transformers for EEG-to-fMRI Synthesis ECCV24

链接: https://arxiv.org/abs/2409.11836
作者: Romeo Lanzino,Federico Fontana,Luigi Cinque,Francesco Scarcello,Atsuto Maki
关键词-EN: Transcoding Vision Transformer, Neural Transcoding Vision, functional Magnetic Resonance, Magnetic Resonance Imaging, Vision Transformer
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV24 Workshop on Synthetic Data for Computer Vision

点击查看摘要

Abstract:This paper introduces the Neural Transcoding Vision Transformer (NT-ViT), a generative model designed to estimate high-resolution functional Magnetic Resonance Imaging (fMRI) samples from simultaneous Electroencephalography (EEG) data. A key feature of NT-ViT is its Domain Matching (DM) sub-module which effectively aligns the latent EEG representations with those of fMRI volumes, enhancing the model’s accuracy and reliability. Unlike previous methods that tend to struggle with fidelity and reproducibility of images, NT-ViT addresses these challenges by ensuring methodological integrity and higher-quality reconstructions which we showcase through extensive evaluation on two benchmark datasets; NT-ViT outperforms the current state-of-the-art by a significant margin in both cases, e.g. achieving a 10× reduction in RMSE and a 3.14× increase in SSIM on the Oddball dataset. An ablation study also provides insights into the contribution of each component to the model’s overall effectiveness. This development is critical in offering a new approach to lessen the time and financial constraints typically linked with high-resolution brain imaging, thereby aiding in the swift and precise diagnosis of neurological disorders. Although it is not a replacement for actual fMRI but rather a step towards making such imaging more accessible, we believe that it represents a pivotal advancement in clinical practice and neuroscience research. Code is available at this https URL.

[CV-89] Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation using Rein to Fine-tune Vision Foundation Models

链接: https://arxiv.org/abs/2409.11752
作者: Pengzhou Cai,Xueyuan Zhang,Ze Zhao
关键词-EN: digital pathology images, recent years, significant progress, digital pathology, made in tumor
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, significant progress has been made in tumor segmentation within the field of digital pathology. However, variations in organs, tissue preparation methods, and image acquisition processes can lead to domain discrepancies among digital pathology images. To address this problem, in this paper, we use Rein, a fine-tuning method, to parametrically and efficiently fine-tune various vision foundation models (VFMs) for MICCAI 2024 Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation (COSAS2024). The core of Rein consists of a set of learnable tokens, which are directly linked to instances, improving functionality at the instance level in each layer. In the data environment of the COSAS2024 Challenge, extensive experiments demonstrate that Rein fine-tuned the VFMs to achieve satisfactory results. Specifically, we used Rein to fine-tune ConvNeXt and DINOv2. Our team used the former to achieve scores of 0.7719 and 0.7557 on the preliminary test phase and final test phase in task1, respectively, while the latter achieved scores of 0.8848 and 0.8192 on the preliminary test phase and final test phase in task2. Code is available at GitHub.

[CV-90] Adaptive Selection of Sampling-Reconstruction in Fourier Compressed Sensing ECCV2024

链接: https://arxiv.org/abs/2409.11738
作者: Seongmin Hong,Jaehyeok Bae,Jongho Lee,Se Young Chun
关键词-EN: Compressed sensing, inefficiency of Nyquist, Nyquist sampling, sampling, emerged to overcome
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 30 pages, Accepted to ECCV 2024

点击查看摘要

Abstract:Compressed sensing (CS) has emerged to overcome the inefficiency of Nyquist sampling. However, traditional optimization-based reconstruction is slow and can not yield an exact image in practice. Deep learning-based reconstruction has been a promising alternative to optimization-based reconstruction, outperforming it in accuracy and computation speed. Finding an efficient sampling method with deep learning-based reconstruction, especially for Fourier CS remains a challenge. Existing joint optimization of sampling-reconstruction works (H1) optimize the sampling mask but have low potential as it is not adaptive to each data point. Adaptive sampling (H2) has also disadvantages of difficult optimization and Pareto sub-optimality. Here, we propose a novel adaptive selection of sampling-reconstruction (H1.5) framework that selects the best sampling mask and reconstruction network for each input data. We provide theorems that our method has a higher potential than H1 and effectively solves the Pareto sub-optimality problem in sampling-reconstruction by using separate reconstruction networks for different sampling masks. To select the best sampling mask, we propose to quantify the high-frequency Bayesian uncertainty of the input, using a super-resolution space generation model. Our method outperforms joint optimization of sampling-reconstruction (H1) and adaptive sampling (H2) by achieving significant improvements on several Fourier CS problems.

[CV-91] LFIC-DRASC: Deep Light Field Image Compression Using Disentangled Representation and Asymmetrical Strip Convolution

链接: https://arxiv.org/abs/2409.11711
作者: Shiyu Feng,Yun Zhang,Linwei Zhu,Sam Kwong
关键词-EN: Asymmetrical Strip Convolution, realistically presenting spatial, Image Compression, light rays, capable of realistically
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Light-Field (LF) image is emerging 4D data of light rays that is capable of realistically presenting spatial and angular information of 3D scene. However, the large data volume of LF images becomes the most challenging issue in real-time processing, transmission, and storage. In this paper, we propose an end-to-end deep LF Image Compression method Using Disentangled Representation and Asymmetrical Strip Convolution (LFIC-DRASC) to improve coding efficiency. Firstly, we formulate the LF image compression problem as learning a disentangled LF representation network and an image encoding-decoding network. Secondly, we propose two novel feature extractors that leverage the structural prior of LF data by integrating features across different dimensions. Meanwhile, disentangled LF representation network is proposed to enhance the LF feature disentangling and decoupling. Thirdly, we propose the LFIC-DRASC for LF image compression, where two Asymmetrical Strip Convolution (ASC) operators, i.e. horizontal and vertical, are proposed to capture long-range correlation in LF feature space. These two ASC operators can be combined with the square convolution to further decouple LF features, which enhances the model ability in representing intricate spatial relationships. Experimental results demonstrate that the proposed LFIC-DRASC achieves an average of 20.5% bit rate reductions comparing with the state-of-the-art methods.
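The asymmetric strip convolutions at the heart of LFIC-DRASC are easy to picture: one kernel spans a 1×k horizontal window, the other a k×1 vertical window, so long-range correlation is captured along each axis separately. Below is a minimal pure-Python sketch of that idea; the grid, the summing kernel, and the valid-mode sliding are illustrative assumptions, not the paper's implementation.

```python
def strip_conv(grid, kernel, direction):
    """Valid-mode convolution of a 2D grid with a 1D strip kernel.

    direction 'h' slides the kernel along rows (1 x k receptive field),
    'v' slides it along columns (k x 1 receptive field).
    """
    k = len(kernel)
    rows, cols = len(grid), len(grid[0])
    if direction == "h":
        return [[sum(grid[r][c + i] * kernel[i] for i in range(k))
                 for c in range(cols - k + 1)] for r in range(rows)]
    return [[sum(grid[r + i][c] * kernel[i] for i in range(k))
             for c in range(cols)] for r in range(rows - k + 1)]

feature = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
strip = [1, 1, 1]                              # a 3-tap summing strip kernel
horizontal = strip_conv(feature, strip, "h")   # 1 x 3 windows per row
vertical = strip_conv(feature, strip, "v")     # 3 x 1 windows per column
```

Combining both directions (optionally with a square convolution, as the paper does) lets a model decouple row-wise and column-wise structure in the light-field feature space.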

[CV-92] Few-Shot Learning Approach on Tuberculosis Classification Based on Chest X-Ray Images

链接: https://arxiv.org/abs/2409.11644
作者: A.A.G. Yogi Pramana,Faiz Ihza Permana,Muhammad Fazil Maulana,Dzikri Rahadian Fudholi
关键词-EN: bacterium Mycobacterium tuberculosis, Mycobacterium tuberculosis, bacterium Mycobacterium, primarily affecting, affecting the lungs
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages. Pre-print

点击查看摘要

Abstract:Tuberculosis (TB) is caused by the bacterium Mycobacterium tuberculosis, primarily affecting the lungs. Early detection is crucial for improving treatment effectiveness and reducing transmission risk. Artificial intelligence (AI), particularly through image classification of chest X-rays, can assist in TB detection. However, class imbalance in TB chest X-ray datasets presents a challenge for accurate classification. In this paper, we propose a few-shot learning (FSL) approach using the Prototypical Network algorithm to address this issue. We compare the performance of ResNet-18, ResNet-50, and VGG16 in feature extraction from the TBX11K Chest X-ray dataset. Experimental results demonstrate classification accuracies of 98.93% for ResNet-18, 98.60% for ResNet-50, and 33.33% for VGG16. These findings indicate that the proposed method outperforms others in mitigating data imbalance, which is particularly beneficial for disease classification applications.
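The Prototypical Network step described above reduces to two operations: average each class's support embeddings into a prototype, then assign a query to the nearest prototype. A toy sketch follows; the 2-way, 2-shot episode and the feature values are invented for illustration and are not from the TBX11K experiments.

```python
import math

def prototypes(support):
    """Class prototype = mean of that class's support embeddings."""
    return {label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
            for label, vecs in support.items()}

def classify(query, protos):
    """Assign the query to the nearest prototype (Euclidean distance)."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))

support = {                      # toy 2-way, 2-shot episode
    "tb":     [[0.9, 0.1], [1.1, 0.2]],
    "normal": [[0.1, 0.8], [0.2, 1.0]],
}
protos = prototypes(support)
pred = classify([1.0, 0.0], protos)   # lies close to the "tb" prototype
```

Because prototypes are class means, the method needs only a handful of labeled examples per class, which is what makes it attractive for the imbalanced TB setting.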

[CV-93] Hyperspectral Image Classification Based on Faster Residual Multi-branch Spiking Neural Network

链接: https://arxiv.org/abs/2409.11619
作者: Yang Liu,Yahui Li,Rui Li,Liming Zhou,Lanxue Dang,Huiyu Mu,Qiang Ge
关键词-EN: HSI classification tasks, Convolutional neural network, HSI classification, HSI classification algorithms, high energy consumption
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 15pages,12figures

点击查看摘要

Abstract:Convolutional neural network (CNN) performs well in Hyperspectral Image (HSI) classification tasks, but its high energy consumption and complex network structure make it difficult to directly apply it to edge computing devices. At present, spiking neural networks (SNN) have developed rapidly in HSI classification tasks due to their low energy consumption and event driven characteristics. However, it usually requires a longer time step to achieve optimal accuracy. In response to the above problems, this paper builds a spiking neural network (SNN-SWMR) based on the leaky integrate-and-fire (LIF) neuron model for HSI classification tasks. The network uses the spiking width mixed residual (SWMR) module as the basic unit to perform feature extraction operations. The spiking width mixed residual module is composed of spiking mixed convolution (SMC), which can effectively extract spatial-spectral features. Secondly, this paper designs a simple and efficient arcsine approximate derivative (AAD), which solves the non-differentiable problem of spike firing by fitting the Dirac function. Through AAD, we can directly train supervised spike neural networks. Finally, this paper conducts comparative experiments with multiple advanced HSI classification algorithms based on spiking neural networks on six public hyperspectral data sets. Experimental results show that the AAD function has strong robustness and a good fitting effect. Meanwhile, compared with other algorithms, SNN-SWMR requires a time step reduction of about 84%, training time, and testing time reduction of about 63% and 70% at the same accuracy. This study solves the key problem of SNN based HSI classification algorithms, which has important practical significance for promoting the practical application of HSI classification algorithms in edge devices such as spaceborne and airborne devices.
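The two ingredients named in the abstract, the LIF neuron and the arcsine approximate derivative, can be sketched in a few lines. The decay factor, threshold, hard reset, and the exact surrogate form below are assumptions: the paper fits the Dirac function with an arcsine, which we approximate here by differentiating a scaled arcsine.

```python
import math

def lif_forward(currents, decay=0.5, threshold=1.0):
    """Leaky integrate-and-fire: membrane decays, integrates input,
    emits a spike and hard-resets when it crosses the threshold."""
    v, spikes = 0.0, []
    for i in currents:
        v = decay * v + i
        if v >= threshold:
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

def aad_surrogate(x, a=2.0):
    """Assumed arcsine-style surrogate gradient:
    d/dx (1/pi) * asin(a*x) = a / (pi * sqrt(1 - (a*x)^2)),
    clipped to stay inside the arcsine's domain."""
    ax = max(-0.999, min(0.999, a * x))
    return a / (math.pi * math.sqrt(1.0 - ax * ax))

spikes = lif_forward([0.6, 0.6, 0.2, 0.9])
```

The surrogate matters because the spike itself is non-differentiable; replacing its gradient with a smooth peak like this is what allows direct supervised training of the SNN.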

[CV-94] Multi-Domain Data Aggregation for Axon and Myelin Segmentation in Histology Images

链接: https://arxiv.org/abs/2409.11552
作者: Armand Collin,Arthur Boschet,Mathieu Boudreau,Julien Cohen-Adad
关键词-EN: Quantifying axon, neurodegenerative diseases, provide useful information, information about microstructural, microstructural changes caused
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Quantifying axon and myelin properties (e.g., axon diameter, myelin thickness, g-ratio) in histology images can provide useful information about microstructural changes caused by neurodegenerative diseases. Automatic tissue segmentation is an important tool for these datasets, as a single stained section can contain up to thousands of axons. Advances in deep learning have made this task quick and reliable with minimal overhead, but a deep learning model trained by one research group will hardly ever be usable by other groups due to differences in their histology training data. This is partly due to subject diversity (different body parts, species, genetics, pathologies) and also to the range of modern microscopy imaging techniques resulting in a wide variability of image features (i.e., contrast, resolution). There is a pressing need to make AI accessible to neuroscience researchers to facilitate and accelerate their workflow, but publicly available models are scarce and poorly maintained. Our approach is to aggregate data from multiple imaging modalities (bright field, electron microscopy, Raman spectroscopy) and species (mouse, rat, rabbit, human), to create an open-source, durable tool for axon and myelin segmentation. Our generalist model makes it easier for researchers to process their data and can be fine-tuned for better performance on specific domains. We study the benefits of different aggregation schemes. This multi-domain segmentation model performs better than single-modality dedicated learners (p=0.03077), generalizes better on out-of-distribution data and is easier to use and maintain. Importantly, we package the segmentation tool into a well-maintained open-source software ecosystem (see this https URL).

[CV-95] NCT-CRC-HE: Not All Histopathological Datasets Are Equally Useful

链接: https://arxiv.org/abs/2409.11546
作者: Andrey Ignatov,Grigory Malivenko
关键词-EN: Numerous deep learning-based, deep learning-based solutions, past years, deep learning-based, Numerous deep
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Numerous deep learning-based solutions have been proposed for histopathological image analysis over the past years. While they usually demonstrate exceptionally high accuracy, one key question is whether their precision might be affected by low-level image properties not related to histopathology but caused by microscopy image handling and pre-processing. In this paper, we analyze a popular NCT-CRC-HE-100K colorectal cancer dataset used in numerous prior works and show that both this dataset and the obtained results may be affected by data-specific biases. The most prominent revealed dataset issues are inappropriate color normalization, severe JPEG artifacts inconsistent between different classes, and completely corrupted tissue samples resulting from incorrect image dynamic range handling. We show that even the simplest model using only 3 features per image (red, green and blue color intensities) can demonstrate over 50% accuracy on this 9-class dataset, while using color histogram not explicitly capturing cell morphology features yields over 82% accuracy. Moreover, we show that a basic EfficientNet-B0 ImageNet pretrained model can achieve over 97.7% accuracy on this dataset, outperforming all previously proposed solutions developed for this task, including dedicated foundation histopathological models and large cell morphology-aware neural networks. The NCT-CRC-HE dataset is publicly available and can be freely used to replicate the presented results. The codes and pre-trained models used in this paper are available at this https URL
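The striking baseline in this abstract, three color-intensity features reaching over 50% accuracy, can be reproduced in spirit with a few lines. The toy "classes" and pixel values below are invented solely to illustrate how mean-color statistics alone can separate classes whose color normalization differs.

```python
def rgb_means(pixels):
    """The 3-feature representation from the paper: mean R, G, B intensity."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

# Two toy "tissue classes" whose color statistics alone separate them,
# mimicking the normalization bias the paper reports.
stroma = [(200, 120, 180), (210, 130, 190)]
mucus = [(90, 200, 150), (100, 210, 160)]
f_stroma, f_mucus = rgb_means(stroma), rgb_means(mucus)

def nearest_class(img):
    f = rgb_means(img)
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return "stroma" if d(f, f_stroma) < d(f, f_mucus) else "mucus"
```

If a classifier this crude scores well above chance, the benchmark is measuring acquisition artifacts rather than cell morphology, which is exactly the paper's warning.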

[CV-96] Unsupervised Hybrid framework for ANomaly Detection (HAND) – applied to Screening Mammogram

链接: https://arxiv.org/abs/2409.11534
作者: Zhemin Zhang,Bhavika Patel,Bhavik Patel,Imon Banerjee
关键词-EN: OOD samples, OOD, detection is crucial, OOD samples exhibit, crucial for enhancing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is crucial for enhancing the generalization of AI models used in mammogram screening. Given the challenge of limited prior knowledge about OOD samples in external datasets, unsupervised generative learning is a preferable solution which trains the model to discern the normal characteristics of in-distribution (ID) data. The hypothesis is that during inference, the model aims to reconstruct ID samples accurately, while OOD samples exhibit poorer reconstruction due to their divergence from normality. Inspired by state-of-the-art (SOTA) hybrid architectures combining CNNs and transformers, we developed a novel backbone - HAND, for detecting OOD from large-scale digital screening mammogram studies. To boost the learning efficiency, we incorporated synthetic OOD samples and a parallel discriminator in the latent space to distinguish between ID and OOD samples. Gradient reversal to the OOD reconstruction loss penalizes the model for learning OOD reconstructions. An anomaly score is computed by weighting the reconstruction and discriminator loss. On internal RSNA mammogram held-out test and external Mayo clinic hand-curated dataset, the proposed HAND model outperformed encoder-based and GAN-based baselines, and interestingly, it also outperformed the hybrid CNN+transformer baselines. Therefore, the proposed HAND pipeline offers an automated efficient computational solution for domain-specific quality checks in external screening mammograms, yielding actionable insights without direct exposure to the private medical imaging data.

[CV-97] Retinal Vessel Segmentation with Deep Graph and Capsule Reasoning

链接: https://arxiv.org/abs/2409.11508
作者: Xinxu Wei,Xi Lin,Haiyun Liu,Shixuan Zhao,Yongjie Li
关键词-EN: Graph Capsule Convolution, Capsule Convolution Network, global contextual awareness, Effective retinal vessel, Graph Attention Fusion
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Effective retinal vessel segmentation requires a sophisticated integration of global contextual awareness and local vessel continuity. To address this challenge, we propose the Graph Capsule Convolution Network (GCC-UNet), which merges capsule convolutions with CNNs to capture both local and global features. The Graph Capsule Convolution operator is specifically designed to enhance the representation of global context, while the Selective Graph Attention Fusion module ensures seamless integration of local and global information. To further improve vessel continuity, we introduce the Bottleneck Graph Attention module, which incorporates Channel-wise and Spatial Graph Attention mechanisms. The Multi-Scale Graph Fusion module adeptly combines features from various scales. Our approach has been rigorously validated through experiments on widely used public datasets, with ablation studies confirming the efficacy of each component. Comparative results highlight GCC-UNet’s superior performance over existing methods, setting a new benchmark in retinal vessel segmentation. Notably, this work represents the first integration of vanilla, graph, and capsule convolutional techniques in the domain of medical image segmentation.

[CV-98] Machine Learning for Analyzing Atomic Force Microscopy (AFM) Images Generated from Polymer Blends

链接: https://arxiv.org/abs/2409.11438
作者: Aanish Paruchuri,Yunfei Wang,Xiaodan Gu,Arthi Jayaraman
关键词-EN: atomic force microscopy, force microscopy images, microscopy images obtained, unsupervised learning techniques, AFM images
类目: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Soft Condensed Matter (cond-mat.soft); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 39 pages, 13 figures, 4 tables

点击查看摘要

Abstract:In this paper we present a new machine learning workflow with unsupervised learning techniques to identify domains within atomic force microscopy images obtained from polymer films. The goal of the workflow is to identify the spatial location of the two types of polymer domains with little to no manual intervention and calculate the domain size distributions which in turn can help qualify the phase separated state of the material as macrophase or microphase ordered or disordered domains. We briefly review existing approaches used in other fields, computer vision and signal processing that can be applicable for the above tasks that happen frequently in the field of polymer science and engineering. We then test these approaches from computer vision and signal processing on the AFM image dataset to identify the strengths and limitations of each of these approaches for our first task. For our first domain segmentation task, we found that the workflow using discrete Fourier transform or discrete cosine transform with variance statistics as the feature works the best. The popular ResNet50 deep learning approach from computer vision field exhibited relatively poorer performance in the domain segmentation task for our AFM images as compared to the DFT and DCT based workflows. For the second task, for each of 144 input AFM images, we then used an existing porespy python package to calculate the domain size distribution from the output of that image from DFT based workflow. The information and open source codes we share in this paper can serve as a guide for researchers in the polymer and soft materials fields who need ML modeling and workflows for automated analyses of AFM images from polymer samples that may have crystalline or amorphous domains, sharp or rough interfaces between domains, or micro or macrophase separated domains.

机器学习

[LG-0] DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control

链接: https://arxiv.org/abs/2409.12192
作者: Zichen Jeff Cui,Hengkai Pan,Aadhithya Iyer,Siddhant Haldar,Lerrel Pinto
关键词-EN: complex visuomotor policies, training complex visuomotor, visuomotor policies, powerful tool, tool for training
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodied datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives, and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate over key components of DynaMo and measure its impact on downstream policy performance. Robot videos are best viewed at this https URL
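The self-supervised objective sketched in the abstract couples an inverse dynamics model (infer a latent action from consecutive embeddings) with a forward model (predict the next embedding from state and action), using no ground-truth actions. The closed-form toy below is deliberately trivial, not DynaMo's learned networks, but it shows the shape of the joint objective.

```python
def inverse_dynamics(z_t, z_next):
    """Toy inverse model: latent action = displacement between embeddings."""
    return [b - a for a, b in zip(z_t, z_next)]

def forward_dynamics(z_t, action):
    """Toy forward model: next embedding = state plus latent action."""
    return [a + u for a, u in zip(z_t, action)]

def dynamics_loss(frames):
    """Self-supervised objective: the forward model must reproduce the next
    frame using the action inferred by the inverse model."""
    loss = 0.0
    for z_t, z_next in zip(frames, frames[1:]):
        pred = forward_dynamics(z_t, inverse_dynamics(z_t, z_next))
        loss += sum((p - t) ** 2 for p, t in zip(pred, z_next))
    return loss / (len(frames) - 1)

frames = [[0.0, 0.0], [0.5, 0.1], [1.0, 0.2]]  # toy image-embedding sequence
loss = dynamics_loss(frames)                   # zero by construction here
```

In DynaMo both models are neural networks trained jointly on expert demonstrations, so the loss is minimized rather than identically zero.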

[LG-1] Massively Multi-Person 3D Human Motion Forecasting with Scene Context

链接: https://arxiv.org/abs/2409.12189
作者: Felix B Mueller,Julian Tanke,Juergen Gall
关键词-EN: Forecasting long-term, human behavior makes, generate realistic human, behavior makes, makes it hard
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion. Unlike previous models, our approach can model interactions between both widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at this https URL.

[LG-2] Democratizing MLLMs in Healthcare: TinyLLaVA-Med for Efficient Healthcare Diagnostics in Resource-Constrained Settings

链接: https://arxiv.org/abs/2409.12184
作者: Aya El Mir,Lukelo Thadei Luoga,Boyuan Chen,Muhammad Abdullah Hanif,Muhammad Shafique
关键词-EN: Nvidia Jetson Xavier, Deploying Multi-Modal Large, Multi-Modal Large Language, Large Language Models, Jetson Xavier
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deploying Multi-Modal Large Language Models (MLLMs) in healthcare is hindered by their high computational demands and significant memory requirements, which are particularly challenging for resource-constrained devices like the Nvidia Jetson Xavier. This problem is particularly evident in remote medical settings where advanced diagnostics are needed but resources are limited. In this paper, we introduce an optimization method for the general-purpose MLLM, TinyLLaVA, which we have adapted and renamed TinyLLaVA-Med. This adaptation involves instruction-tuning and fine-tuning TinyLLaVA on a medical dataset by drawing inspiration from the LLaVA-Med training pipeline. Our approach successfully minimizes computational complexity and power consumption, with TinyLLaVA-Med operating at 18.9W and using 11.9GB of memory, while achieving accuracies of 64.54% on VQA-RAD and 70.70% on SLAKE for closed-ended questions. Therefore, TinyLLaVA-Med achieves deployment viability in hardware-constrained environments with low computational resources, maintaining essential functionalities and delivering accuracies close to state-of-the-art models.

[LG-3] To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

链接: https://arxiv.org/abs/2409.12183
作者: Zayne Sprague,Fangcong Yin,Juan Diego Rodriguez,Dongwei Jiang,Manya Wadhwa,Prasann Singhal,Xinyu Zhao,Xi Ye,Kyle Mahowald,Greg Durrett
关键词-EN: large language models, eliciting reasoning capabilities, facto method, method for eliciting, capabilities from large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra “thinking” really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model’s response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT’s gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
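The paper's takeaway, that CoT can be applied selectively to save inference cost, suggests a cheap router. The equals-sign/digit heuristic below is an assumption inspired by the abstract's MMLU observation, not the authors' method.

```python
import re

# Assumed cues for math/symbolic work: operators or digits in the question.
SYMBOLIC = re.compile(r"[=+\-*/^]|\d")

def route(question):
    """Spend the extra chain-of-thought tokens only when symbolic cues appear;
    answer directly otherwise."""
    return "cot" if SYMBOLIC.search(question) else "direct"
```

A production router would inspect the model's draft response as well (the paper's signal includes an equals sign in the response), but even a question-only gate captures the selective-application idea.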

[LG-4] A Controlled Study on Long Context Extension and Generalization in LLMs

链接: https://arxiv.org/abs/2409.12181
作者: Yi Lu,Jing Nathan Yan,Songlin Yang,Justin T. Chiu,Siyu Ren,Fei Yuan,Wenting Zhao,Zhiyong Wu,Alexander M. Rush
关键词-EN: Broad textual understanding, in-context learning require, learning require language, utilize full document, full document contexts
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
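The abstract reaffirms perplexity as the general-purpose indicator even for long-context tasks. Given per-token log-probabilities from any model, it is just the exponentiated average negative log-likelihood:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 8)
```

The controlled protocol in the paper holds the base model and extension data fixed, so differences in this single number can be attributed to the extension method rather than to confounds.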

[LG-5] Finetuning Language Models to Emit Linguistic Expressions of Uncertainty

链接: https://arxiv.org/abs/2409.12180
作者: Arslan Chaudhry,Sridhar Thiagarajan,Dilan Gorur
关键词-EN: Large language models, Large language, decision-making tasks, increasingly employed, employed in information-seeking
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly employed in information-seeking and decision-making tasks. Despite their broad utility, LLMs tend to generate information that conflicts with real-world facts, and their persuasive style can make these inaccuracies appear confident and convincing. As a result, end-users struggle to consistently align the confidence expressed by LLMs with the accuracy of their predictions, often leading to either blind trust in all outputs or a complete disregard for their reliability. In this work, we explore supervised finetuning on uncertainty-augmented predictions as a method to develop models that produce linguistic expressions of uncertainty. Specifically, we measure the calibration of pre-trained models and then fine-tune language models to generate calibrated linguistic expressions of uncertainty. Through experiments on various question-answering datasets, we demonstrate that LLMs are well-calibrated in assessing their predictions, and supervised finetuning based on the model’s own confidence leads to well-calibrated expressions of uncertainty, particularly for single-claim answers.
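Measuring calibration, the first step the authors describe, is commonly done with Expected Calibration Error (ECE): bin predictions by confidence and compare each bin's average confidence to its accuracy. The binning sketch below is a standard formulation, not code from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between bin confidence and bin accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

# Perfectly calibrated toy model: of four 0.75-confidence answers, three correct.
ece = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

A well-calibrated model (ECE near zero) gives the supervision signal the paper needs: its own confidence can then be verbalized as phrases like "fairly sure" or "uncertain" during finetuning.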

[LG-6] Expanding Expressivity in Transformer Models with MöbiusAttention

链接: https://arxiv.org/abs/2409.12175
作者: Anna-Maria Halacheva,Mojtaba Nayyeri,Steffen Staab
关键词-EN: Natural Language Processing, revolutionized Natural Language, Language Processing, Natural Language, enabling exceptional modeling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Attention mechanisms and Transformer architectures have revolutionized Natural Language Processing (NLP) by enabling exceptional modeling of long-range dependencies and capturing intricate linguistic patterns. However, their inherent reliance on linear operations in the form of matrix multiplications limits their ability to fully capture inter-token relationships on their own. We propose MöbiusAttention, a novel approach that integrates Möbius transformations within the attention mechanism of Transformer-based models. Möbius transformations are non-linear operations in spaces over complex numbers with the ability to map between various geometries. By incorporating these properties, MöbiusAttention empowers models to learn more intricate geometric relationships between tokens and capture a wider range of information through complex-valued weight vectors. We build and pre-train a BERT and a RoFormer version enhanced with MöbiusAttention, which we then finetune on the GLUE benchmark. We evaluate empirically our approach against the baseline BERT and RoFormer models on a range of downstream tasks. Our approach compares favorably against the baseline models, even with smaller number of parameters suggesting the enhanced expressivity of MöbiusAttention. This research paves the way for exploring the potential of Möbius transformations in the complex projective space to enhance the expressivity and performance of foundation models.
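A Möbius transformation is the non-linear map f(z) = (a*z + b) / (c*z + d) over complex numbers, valid when ad - bc != 0; MöbiusAttention applies such maps with learnable complex coefficients. The sketch below uses a single scalar token value for illustration, whereas the model works with complex-valued weight vectors.

```python
def mobius(z, a, b, c, d):
    """Mobius transformation f(z) = (a*z + b) / (c*z + d), ad - bc != 0."""
    if a * d - b * c == 0:
        raise ValueError("degenerate transformation")
    return (a * z + b) / (c * z + d)

token = complex(0.5, 0.5)            # a toy complex-valued token weight
same = mobius(token, 1, 0, 0, 1)     # identity map: a = d = 1, b = c = 0
inv = mobius(token, 0, 1, 1, 0)      # inversion f(z) = 1/z
```

Because the family includes rotations, translations, scalings, and inversions as special cases, letting attention learn (a, b, c, d) gives it strictly more geometric flexibility than a linear projection.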

[LG-7] The Unreliability of Acoustic Systems in Alzheimer's Speech Datasets with Heterogeneous Recording Conditions

链接: https://arxiv.org/abs/2409.12170
作者: Lara Gauder,Pablo Riera,Andrea Slachevsky,Gonzalo Forno,Adolfo M. Garcia,Luciana Ferrer
关键词-EN: Automated speech analysis, detect early markers, Alzheimer disease, markers of Alzheimer, Automated speech
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 1 figure, 1 table

点击查看摘要

Abstract:Automated speech analysis is a thriving approach to detect early markers of Alzheimer’s disease (AD). Yet, recording conditions in most AD datasets are heterogeneous, with patients and controls often evaluated in different acoustic settings. While this is not a problem for analyses based on speech transcription or features obtained from manual alignment, it does cast serious doubts on the validity of acoustic features, which are strongly influenced by acquisition conditions. We examined this issue in the ADreSSo dataset, derived from the widely used Pitt corpus. We show that systems based on two acoustic features, MFCCs and Wav2vec 2.0 embeddings, can discriminate AD patients from controls with above-chance performance when using only the non-speech part of the audio signals. We replicated this finding in a separate dataset of Spanish speakers. Thus, in these datasets, the class can be partly predicted by recording conditions. Our results are a warning against the use of acoustic systems for identifying patients based on non-standardized recordings. We propose that acoustically heterogeneous datasets for dementia studies should be either (a) analyzed using only transcripts or other features derived from manual annotations, or (b) replaced by datasets collected with strictly controlled acoustic conditions.
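The core warning can be made concrete with a toy sketch: if two recording sites differ in noise floor, a trivial statistic computed on silence alone already separates the classes. The synthetic "silence" samples and threshold below are invented for illustration and have nothing to do with the ADreSSo data.

```python
import statistics

# Toy confound demo: patients and controls recorded in different rooms means
# even the non-speech segments carry the class label. Two synthetic "sites"
# differ only in noise floor; a hand-picked threshold separates them.

def noise_floor(non_speech_samples):
    return statistics.mean(abs(x) for x in non_speech_samples)

site_a_silences = [[0.010, -0.012, 0.011], [0.009, -0.011, 0.010]]  # class A
site_b_silences = [[0.030, -0.031, 0.029], [0.032, -0.030, 0.031]]  # class B

threshold = 0.02
preds = [noise_floor(s) > threshold for s in site_a_silences + site_b_silences]
# Perfect "accuracy" from silence alone: the acoustic feature is confounded.
```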

[LG-8] LogoRA: Local-Global Representation Alignment for Robust Time Series Classification

链接: https://arxiv.org/abs/2409.12169
作者: Huanyu Zhang,Yi-Fan Zhang,Zhang Zhang,Qingsong Wen,Liang Wang
关键词-EN: disregarding domain-specific differences, Unsupervised domain adaptation, identify consistent patterns, time series UDA, time series aims
类目: Machine Learning (cs.LG)
*备注: Accepted by IEEE Transactions on Knowledge and Data Engineering

点击查看摘要

Abstract:Unsupervised domain adaptation (UDA) of time series aims to teach models to identify consistent patterns across various temporal scenarios, disregarding domain-specific differences, which can maintain their predictive accuracy and effectively adapt to new domains. However, existing UDA methods struggle to adequately extract and align both global and local features in time series data. To address this issue, we propose the Local-Global Representation Alignment framework (LogoRA), which employs a two-branch encoder, comprising a multi-scale convolutional branch and a patching transformer branch. The encoder enables the extraction of both local and global representations from time series. A fusion module is then introduced to integrate these representations, enhancing domain-invariant feature alignment from multi-scale perspectives. To achieve effective alignment, LogoRA employs strategies like invariant feature learning on the source domain, utilizing triplet loss for fine alignment and dynamic time warping-based feature alignment. Additionally, it reduces source-target domain gaps through adversarial training and per-class prototype alignment. Our evaluations on four time-series datasets demonstrate that LogoRA outperforms strong baselines by up to 12.52%, showcasing its superiority in time series UDA tasks.

[LG-9] Decoding Style: Efficient Fine-Tuning of LLMs for Image-Guided Outfit Recommendation with Preference CIKM2024

链接: https://arxiv.org/abs/2409.12150
作者: Najmeh Forouzandehmehr,Nima Farrokhsiar,Ramin Giahi,Evren Korpeoglu,Kannan Achan
关键词-EN: large language models, fashion compatibility understanding, Multimodal Large Language, large language, complex challenge
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: CIKM 2024

点击查看摘要

Abstract:Personalized outfit recommendation remains a complex challenge, demanding both fashion compatibility understanding and trend awareness. This paper presents a novel framework that harnesses the expressive power of large language models (LLMs) for this task, mitigating their “black box” and static nature through fine-tuning and direct feedback integration. We bridge the visual-textual gap in item descriptions by employing image captioning with a Multimodal Large Language Model (MLLM). This enables the LLM to extract style and color characteristics from human-curated fashion images, forming the basis for personalized recommendations. The LLM is efficiently fine-tuned on the open-source Polyvore dataset of curated fashion images, optimizing its ability to recommend stylish outfits. A direct preference mechanism using negative examples is employed to enhance the LLM’s decision-making process. This creates a self-enhancing AI feedback loop that continuously refines recommendations in line with seasonal fashion trends. Our framework is evaluated on the Polyvore dataset, demonstrating its effectiveness in two key tasks: fill-in-the-blank, and complementary item retrieval. These evaluations underline the framework’s ability to generate stylish, trend-aligned outfit suggestions, continuously improving through direct feedback. The evaluation results demonstrated that our proposed framework significantly outperforms the base LLM, creating more cohesive outfits. The improved performance in these tasks underscores the proposed framework’s potential to enhance the shopping experience with accurate suggestions, proving its effectiveness over the vanilla LLM-based outfit generation.

[LG-10] GRIN: GRadient-INformed MoE

链接: https://arxiv.org/abs/2409.12136
作者: Liyuan Liu,Young Jin Kim,Shuohang Wang,Chen Liang,Yelong Shen,Hao Cheng,Xiaodong Liu,Masahiro Tanaka,Xiaoxia Wu,Wenxiang Hu,Vishrav Chaudhary,Zeqi Lin,Chenruidong Zhang,Jilong Xue,Hany Awadalla,Jianfeng Gao,Weizhu Chen
关键词-EN: selectively activating, expert routing, scale more effectively, small subset, sparse computation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 58 pages

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16 \times 3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
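As context for "top-2 expert routing": the sketch below shows the standard routing step (softmax gate, keep the two highest-scoring experts, renormalize, mix their outputs). GRIN's actual contribution, sparse gradient estimation through this discrete choice, is not modeled here; the experts are toy scalar functions for illustration.

```python
import math

# Standard top-2 MoE routing on toy scalar "experts". The discrete top-k
# selection is exactly the step that blocks ordinary backpropagation, which
# is what GRIN's sparse gradient estimation addresses (not shown here).

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top2_route(logits, experts, x):
    """Dispatch x to the two highest-gated experts, renormalize, and mix."""
    gates = softmax(logits)
    top2 = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:2]
    norm = sum(gates[i] for i in top2)
    return sum(gates[i] / norm * experts[i](x) for i in top2)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
logits = [0.1, 2.0, 1.5, -1.0]        # experts 1 and 2 win the gate
y = top2_route(logits, experts, 3.0)  # weighted mix of 2*3 and 3**2
```

Only the selected experts run, which is where the compute savings over a dense model come from.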

[LG-11] Almost Sure Convergence of Linear Temporal Difference Learning with Arbitrary Features

链接: https://arxiv.org/abs/2409.12135
作者: Jiuqi Wang,Shangtong Zhang
关键词-EN: Temporal difference, powerful prediction algorithm, reinforcement learning, linear function approximation, classic and powerful
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 30 pages, 0 figures

点击查看摘要

Abstract:Temporal difference (TD) learning with linear function approximation, abbreviated as linear TD, is a classic and powerful prediction algorithm in reinforcement learning. While it is well understood that linear TD converges almost surely to a unique point, this convergence traditionally requires the assumption that the features used by the approximator are linearly independent. However, this linear independence assumption does not hold in many practical scenarios. This work is the first to establish the almost sure convergence of linear TD without requiring linearly independent features. In fact, we do not make any assumptions on the features. We prove that the approximated value function converges to a unique point and the weight iterates converge to a set. We also establish a notion of local stability of the weight iterates. Importantly, we do not need to introduce any other additional assumptions and do not need to make any modification to the linear TD algorithm. Key to our analysis is a novel characterization of bounded invariant sets of the mean ODE of linear TD.
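For concreteness, here is the linear TD(0) update the analysis concerns, run on a toy two-state chain with deliberately linearly dependent features (the third feature duplicates the first), the regime the paper newly covers. The chain, step size, and iteration count are illustrative choices.

```python
# Linear TD(0): on a transition (s, r, s'), move the weights along phi(s)
# by the TD error. The features below are linearly DEPENDENT on purpose,
# so no unique weight vector exists, yet the value estimates still converge.

def td0_update(w, phi_s, r, gamma, phi_next, alpha):
    v_s = sum(wi * fi for wi, fi in zip(w, phi_s))
    v_next = sum(wi * fi for wi, fi in zip(w, phi_next))
    delta = r + gamma * v_next - v_s               # TD error
    return [wi + alpha * delta * fi for wi, fi in zip(w, phi_s)]

phi = {0: [1.0, 0.0, 1.0], 1: [0.0, 1.0, 0.0]}    # third feature copies the first
w = [0.0, 0.0, 0.0]
gamma, alpha, s = 0.9, 0.05, 0
for _ in range(5000):                              # s0 <-> s1, reward 1 per step
    s_next = 1 - s
    w = td0_update(w, phi[s], 1.0, gamma, phi[s_next], alpha)
    s = s_next

v0 = sum(wi * fi for wi, fi in zip(w, phi[0]))     # both approach 1/(1-gamma) = 10
v1 = sum(wi * fi for wi, fi in zip(w, phi[1]))
```

This matches the paper's claim: the approximated value function converges to a unique point even though the weight vector itself is only determined up to the null space of the features.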

[LG-12] Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

链接: https://arxiv.org/abs/2409.12122
作者: An Yang,Beichen Zhang,Binyuan Hui,Bofei Gao,Bowen Yu,Chengpeng Li,Dayiheng Liu,Jianhong Tu,Jingren Zhou,Junyang Lin,Keming Lu,Mingfeng Xue,Runji Lin,Tianyu Liu,Xingzhang Ren,Zhenru Zhang
关键词-EN: math-specific large language, large language models, math-specific large, SFT model, SFT
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it’s possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in the Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model’s performance. Qwen2.5-Math-Instruct supports both Chinese and English, and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade school level to math competition problems.
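Point (3), RM-guided sampling at inference time, can be sketched as best-of-N selection: draw several candidates and keep the one the reward model scores highest. The toy sampler and scorer below are stand-ins; Qwen2.5-Math's actual RM is a trained network and its sampler a full LLM.

```python
import random

# Best-of-N selection guided by a reward model. Both the "generator" and
# the "reward model" here are toys: the sampler emits noisy answers to
# 17 * 24 = 408, and the reward prefers answers closer to the truth.

def best_of_n(sample_fn, reward_fn, n=8, seed=0):
    rng = random.Random(seed)
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)

def sample_fn(rng):                       # noisy "generations" around 408
    return 408 + rng.randint(-5, 5)

def reward_fn(answer):                    # toy RM: closer to truth is better
    return -abs(answer - 408)

best = best_of_n(sample_fn, reward_fn, n=16)
```

Larger N can only improve the selected candidate's reward under a fixed scorer, which is why RM-guided sampling helps at inference without retraining.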

[LG-13] Stronger Baseline Models – A Key Requirement for Aligning Machine Learning Research with Clinical Utility

链接: https://arxiv.org/abs/2409.12116
作者: Nathan Wolfrath,Joel Wolfrath,Hengrui Hu,Anjishnu Banerjee,Anai N. Kothari
关键词-EN: Machine Learning, diverse application domains, recent years, application domains, increased substantially
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 18 pages, 6 figures

点击查看摘要

Abstract:Machine Learning (ML) research has increased substantially in recent years, due to the success of predictive modeling across diverse application domains. However, well-known barriers exist when attempting to deploy ML models in high-stakes, clinical settings, including lack of model transparency (or the inability to audit the inference process), large training data requirements with siloed data sources, and complicated metrics for measuring model utility. In this work, we show empirically that including stronger baseline models in healthcare ML evaluations has important downstream effects that aid practitioners in addressing these challenges. Through a series of case studies, we find that the common practice of omitting baselines or comparing against a weak baseline model (e.g. a linear model with no optimization) obscures the value of ML methods proposed in the research literature. Using these insights, we propose some best practices that will enable practitioners to more effectively study and deploy ML models in clinical settings.

[LG-14] Pareto Data Framework: Steps Towards Resource-Efficient Decision Making Using Minimum Viable Data (MVD)

链接: https://arxiv.org/abs/2409.12112
作者: Tashfain Ahmed,Josh Siegel
关键词-EN: Minimum Viable Data, Pareto Data Framework, Internet of Things, enabling machine learning, machine learning applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:This paper introduces the Pareto Data Framework, an approach for identifying and selecting the Minimum Viable Data (MVD) required for enabling machine learning applications on constrained platforms such as embedded systems, mobile devices, and Internet of Things (IoT) devices. We demonstrate that strategic data reduction can maintain high performance while significantly reducing bandwidth, energy, computation, and storage costs. The framework identifies Minimum Viable Data (MVD) to optimize efficiency across resource-constrained environments without sacrificing performance. It addresses common inefficient practices in an IoT application such as overprovisioning of sensors, overprecision, and oversampling of signals, proposing scalable solutions for optimal sensor selection, signal extraction and transmission, and data representation. An experimental methodology demonstrates effective acoustic data characterization after downsampling, quantization, and truncation to simulate reduced-fidelity sensors and network and storage constraints; results show that performance can be maintained up to 95% with sample rates reduced by 75% and bit depths and clip length reduced by 50%, which translates into substantial cost and resource reduction. These findings have implications for the design and development of constrained systems. The paper also discusses broader implications of the framework, including the potential to democratize advanced AI technologies across IoT applications and sectors such as agriculture, transportation, and manufacturing to improve access and multiply the benefits of data-driven insights.
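A minimal version of the degradation methodology (shorter clips, lower sample rate, coarser quantization) might look like the following; the signal is synthetic and the parameters simply mirror the reductions quoted in the abstract, not the paper's actual pipeline.

```python
import math

# Simulating a reduced-fidelity sensor: truncate the clip, downsample, and
# uniformly quantize. keep_every=4 cuts the sample rate by 75% and
# keep_fraction=0.5 halves the clip length, echoing the abstract's numbers.

def degrade(signal, keep_every=4, bits=8, keep_fraction=0.5):
    truncated = signal[: int(len(signal) * keep_fraction)]   # shorter clip
    downsampled = truncated[::keep_every]                    # lower sample rate
    levels = 2 ** bits
    # Uniform quantization of values assumed to lie in [-1, 1].
    return [round((x + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1
            for x in downsampled]

signal = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
cheap = degrade(signal)   # 1000 samples -> 125, each snapped to 256 levels
```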

[LG-15] FedLF: Adaptive Logit Adjustment and Feature Optimization in Federated Long-Tailed Learning ACML2024

链接: https://arxiv.org/abs/2409.12105
作者: Xiuhua Lu,Peng Li,Xuefeng Jiang
关键词-EN: Federated learning offers, distributed machine learning, Federated learning, offers a paradigm, challenge of preserving
类目: Machine Learning (cs.LG)
*备注: Accepted by ACML 2024

点击查看摘要

Abstract:Federated learning offers a paradigm to the challenge of preserving privacy in distributed machine learning. However, datasets distributed across each client in the real world are inevitably heterogeneous, and if the datasets can be globally aggregated, they tend to be long-tailed distributed, which greatly affects the performance of the model. The traditional approach to federated learning primarily addresses the heterogeneity of data among clients, yet it fails to address the phenomenon of class-wise bias in global long-tailed data. This results in the trained model focusing on the head classes while neglecting the equally important tail classes. Consequently, it is essential to develop a methodology that considers classes holistically. To address the above problems, we propose a new method FedLF, which introduces three modifications in the local training phase: adaptive logit adjustment, continuous class centred optimization, and feature decorrelation. We compare seven state-of-the-art methods with varying degrees of data heterogeneity and long-tailed distribution. Extensive experiments on benchmark datasets CIFAR-10-LT and CIFAR-100-LT demonstrate that our approach effectively mitigates the problem of model performance degradation due to data heterogeneity and long-tailed distribution. Our code is available at this https URL.
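One building block behind "adaptive logit adjustment" is the static, class-prior form of logit adjustment, sketched below; the adaptive, federated variant is the paper's own and is not reproduced. The logits and priors are toy numbers.

```python
import math

# Static logit adjustment for long-tailed data: subtract tau * log(prior)
# from each class logit so head classes lose their frequency advantage,
# then softmax. This is the basic form; FedLF's adaptive version differs.

def adjusted_probs(logits, class_priors, tau=1.0):
    adj = [l - tau * math.log(p) for l, p in zip(logits, class_priors)]
    m = max(adj)
    exps = [math.exp(a - m) for a in adj]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 2.0]    # the model is indifferent between the classes...
priors = [0.9, 0.1]    # ...but class 0 dominated training
probs = adjusted_probs(logits, priors)
# The adjustment now favors the tail class (index 1).
```

With tau=1 and equal logits, the adjusted posterior is proportional to the inverse prior, so the tail class ends up with probability 0.9 here.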

[LG-16] Symmetry-Enriched Learning: A Category-Theoretic Framework for Robust Machine Learning Models

链接: https://arxiv.org/abs/2409.12100
作者: Ronald Katende
关键词-EN: integrates higher-order symmetries, manuscript presents, framework that integrates, integrates higher-order, category theory
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This manuscript presents a novel framework that integrates higher-order symmetries and category theory into machine learning. We introduce new mathematical constructs, including hyper-symmetry categories and functorial representations, to model complex transformations within learning algorithms. Our contributions include the design of symmetry-enriched learning models, the development of advanced optimization techniques leveraging categorical symmetries, and the theoretical analysis of their implications for model robustness, generalization, and convergence. Through rigorous proofs and practical applications, we demonstrate that incorporating higher-dimensional categorical structures enhances both the theoretical foundations and practical capabilities of modern machine learning algorithms, opening new directions for research and innovation.

[LG-17] Skill matching at scale: freelancer-project alignment for efficient multilingual candidate retrieval

链接: https://arxiv.org/abs/2409.12097
作者: Warren Jouanneau,Marc Palyart,Emma Jouffroy
关键词-EN: Finding the perfect, perform at scale, perfect match, job proposal, easy task
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Finding the perfect match between a job proposal and a set of freelancers is not an easy task to perform at scale, especially in multiple languages. In this paper, we propose a novel neural retriever architecture that tackles this problem in a multilingual setting. Our method encodes project descriptions and freelancer profiles by leveraging pre-trained multilingual language models. The latter are used as the backbone for a custom transformer architecture that aims to keep the structure of the profiles and projects. This model is trained with a contrastive loss on historical data. Through several experiments, we show that this approach effectively captures skill matching similarity and facilitates efficient matching, outperforming traditional methods.
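Downstream of the contrastively trained encoder, retrieval reduces to ranking freelancers by embedding similarity to the project. The sketch below uses hand-made 3-dimensional embeddings purely for illustration; in the paper these come from the multilingual transformer.

```python
import math

# Ranking freelancers by cosine similarity between embeddings. The vectors
# are toy stand-ins for encoder outputs; only the retrieval step is shown.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

project = [0.9, 0.1, 0.0]                 # e.g. a backend-heavy project
freelancers = {
    "alice": [0.8, 0.2, 0.1],             # backend-leaning profile
    "bob":   [0.0, 0.1, 0.9],             # design-leaning profile
}
ranked = sorted(freelancers, key=lambda f: cosine(project, freelancers[f]),
                reverse=True)             # alice first
```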

[LG-18] The Impact of Element Ordering on LM Agent Performance

链接: https://arxiv.org/abs/2409.12089
作者: Wayne Chi,Ameet Talwalkar,Chris Donahue
关键词-EN: navigate virtual environments, surge of interest, ordering, environments, navigate virtual
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There has been a surge of interest in language model agents that can navigate virtual environments such as the web or desktop. To navigate such environments, agents benefit from information on the various elements (e.g., buttons, text, or images) present. It remains unclear which element attributes have the greatest impact on agent performance, especially in environments that only provide a graphical representation (i.e., pixels). Here we find that the ordering in which elements are presented to the language model is surprisingly impactful–randomizing element ordering in a webpage degrades agent performance comparably to removing all visible text from an agent’s state representation. While a webpage provides a hierarchical ordering of elements, there is no such ordering when parsing elements directly from pixels. Moreover, as tasks become more challenging and models more sophisticated, our experiments suggest that the impact of ordering increases. Finding an effective ordering is non-trivial. We investigate the impact of various element ordering methods in web and desktop environments. We find that dimensionality reduction provides a viable ordering for pixel-only environments. We train a UI element detection model to derive elements from pixels and apply our findings to an agent benchmark–OmniACT–where we only have access to pixels. Our method completes more than two times as many tasks on average relative to the previous state-of-the-art.
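The "dimensionality reduction provides a viable ordering" finding can be sketched as projecting element positions onto their principal axis and sorting along it. The hand-rolled 2-D PCA and toy elements below are illustrative assumptions; the paper's method operates on real UI detections with richer features.

```python
import math

# Ordering UI elements detected from pixels: project each element's center
# onto the principal axis of all centers and sort along that axis. A tiny
# hand-rolled 2x2 PCA suffices for (x, y) positions.

def principal_axis(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

def order_elements(elements):
    ax, ay = principal_axis([e["center"] for e in elements])
    return sorted(elements, key=lambda e: e["center"][0] * ax + e["center"][1] * ay)

elements = [
    {"name": "submit", "center": (50, 300)},
    {"name": "header", "center": (50, 10)},
    {"name": "field", "center": (50, 150)},
]
ordered = [e["name"] for e in order_elements(elements)]   # top-to-bottom
```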

[LG-19] Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques

链接: https://arxiv.org/abs/2409.12087
作者: Yubo Li,Saba Al-Sayouri,Rema Padman
关键词-EN: Chronic Kidney Disease, End-Stage Renal Disease, Kidney Disease, Renal Disease, Chronic Kidney
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10pages, 4 figures, AMIA 2024

点击查看摘要

Abstract:This study explores the potential of utilizing administrative claims data, combined with advanced machine learning and deep learning techniques, to predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major health insurance organization to develop prediction models for multiple observation windows using traditional machine learning methods such as Random Forest and XGBoost as well as deep learning approaches such as Long Short-Term Memory (LSTM) networks. Our findings demonstrate that the LSTM model, particularly with a 24-month observation window, exhibits superior performance in predicting ESRD progression, outperforming existing models in the literature. We further apply SHapley Additive exPlanations (SHAP) analysis to enhance interpretability, providing insights into the impact of individual features on predictions at the individual patient level. This study underscores the value of leveraging administrative claims data for CKD management and predicting ESRD progression.

[LG-20] Unsupervised Domain Adaptation Via Data Pruning

链接: https://arxiv.org/abs/2409.12076
作者: Andrea Napoli,Paul White
关键词-EN: machine learning models, removal of carefully-selected, recently emerged, improving the robustness, robustness of machine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The removal of carefully-selected examples from training data has recently emerged as an effective way of improving the robustness of machine learning models. However, the best way to select these examples remains an open question. In this paper, we consider the problem from the perspective of unsupervised domain adaptation (UDA). We propose AdaPrune, a method for UDA whereby training examples are removed to attempt to align the training distribution to that of the target data. By adopting the maximum mean discrepancy (MMD) as the criterion for alignment, the problem can be neatly formulated and solved as an integer quadratic program. We evaluate our approach on a real-world domain shift task of bioacoustic event detection. As a method for UDA, we show that AdaPrune outperforms related techniques, and is complementary to other UDA algorithms such as CORAL. Our analysis of the relationship between the MMD and model accuracy, along with t-SNE plots, validates the proposed method as a principled and well-founded way of performing data pruning.
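For reference, the alignment criterion itself, a (biased) empirical MMD with an RBF kernel, is easy to state in code; the integer-quadratic-program pruning built on top of it is the paper's step and is not shown. The data and kernel bandwidth below are toy choices.

```python
import math

# Biased empirical estimate of squared MMD with an RBF kernel:
#   MMD^2 = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]
# Smaller values mean the two samples look more alike under the kernel.

def rbf(x, y, sigma=1.0):
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx - 2 * kxy + kyy

source = [0.0, 0.1, -0.1, 0.2]
target_near = [0.05, -0.05, 0.15]   # matched distribution -> small MMD
target_far = [3.0, 3.1, 2.9]        # shifted distribution -> large MMD
```

Pruning source examples to minimize this quantity against the target sample is exactly the objective AdaPrune formulates as an integer quadratic program.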

[LG-21] Dual-Layer Training and Decoding of Large Language Model with Simultaneously Thinking and Speaking

链接: https://arxiv.org/abs/2409.12059
作者: Ningyuan Xi,Xiaoyu Wang,Yetao Wu,Teng Chen,Qingqing Gu,Jinxian Qu,Zhonglin Jiang,Yong Chen,Luo Ji
关键词-EN: Large Language Model, generate human expressions, Large Language, human expressions, Language Model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 5 figures

点击查看摘要

Abstract:Large Language Models can reasonably understand and generate human expressions but may lack thorough thinking and reasoning mechanisms. Recently there have been several studies which enhance the thinking ability of language models but most of them are not data-driven or training-based. In this paper, we are motivated by the cognitive mechanism in the natural world, and design a novel model architecture called TaS which allows it to first consider the thoughts and then express the response based upon the query. We design several pipelines to annotate or generate the thought contents from prompt-response samples, then add language heads in a middle layer which behaves as the thinking layer. We train the language model by the thoughts-augmented data and successfully let the thinking layer automatically generate reasonable thoughts and finally output more reasonable responses. Both qualitative examples and quantitative results validate the effectiveness and performance of TaS. Our code is available at https://anonymous.4open.science/r/TadE.

[LG-22] Extended Deep Submodular Functions

链接: https://arxiv.org/abs/2409.12053
作者: Seyed Mohammad Hosseini,Arash Jamshid,Seyed Mahdi Noormousavi,Mahdi Jafari Siavoshani,Naeimeh Omidvar
关键词-EN: Extended Deep Submodular, called Extended Deep, Deep Submodular functions, functions called Extended, Extended Deep
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注:

点击查看摘要

Abstract:We introduce a novel category of set functions called Extended Deep Submodular functions (EDSFs), which are neural network-representable. EDSFs serve as an extension of Deep Submodular Functions (DSFs), inheriting crucial properties from DSFs while addressing innate limitations. It is known that DSFs can represent a limiting subset of submodular functions. In contrast, through an analysis of polymatroid properties, we establish that EDSFs possess the capability to represent all monotone submodular functions, a notable enhancement compared to DSFs. Furthermore, our findings demonstrate that EDSFs can represent any monotone set function, indicating the family of EDSFs is equivalent to the family of all monotone set functions. Additionally, we prove that EDSFs maintain the concavity inherent in DSFs when the components of the input vector are non-negative real numbers, an essential feature in certain combinatorial optimization problems. Through extensive experiments, we illustrate that EDSFs exhibit significantly lower empirical generalization error than DSFs in the learning of coverage functions. This suggests that EDSFs present a promising advancement in the representation and learning of set functions with improved generalization capabilities.

[LG-23] Handling Long-Term Safety and Uncertainty in Safe Reinforcement Learning

链接: https://arxiv.org/abs/2409.12045
作者: Jonas Günster,Puze Liu,Jan Peters,Davide Tateo
关键词-EN: key issues preventing, reinforcement learning techniques, Safe Reinforcement Learning, Reinforcement Learning area, reinforcement learning
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Safety is one of the key issues preventing the deployment of reinforcement learning techniques in real-world robots. While most approaches in the Safe Reinforcement Learning area do not require prior knowledge of constraints and robot kinematics and rely solely on data, it is often difficult to deploy them in complex real-world settings. Instead, model-based approaches that incorporate prior knowledge of the constraints and dynamics into the learning framework have proven capable of deploying the learning algorithm directly on the real robot. Unfortunately, while an approximated model of the robot dynamics is often available, the safety constraints are task-specific and hard to obtain: they may be too complicated to encode analytically, too expensive to compute, or it may be difficult to envision a priori the long-term safety requirements. In this paper, we bridge this gap by extending the safe exploration method, ATACOM, with learnable constraints, with a particular focus on ensuring long-term safety and handling of uncertainty. Our approach is competitive or superior to state-of-the-art methods in final performance while maintaining safer behavior during training.

[LG-24] Understanding the Effects of the Baidu-ULTR Logging Policy on Two-Tower Models RECSYS’24

链接: https://arxiv.org/abs/2409.12043
作者: Morris de Haan,Philipp Hager
关键词-EN: recent work suggests, logging policy confounding, learning to rank, recent work, industry applications
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Despite the popularity of the two-tower model for unbiased learning to rank (ULTR) tasks, recent work suggests that it suffers from a major limitation that could lead to its collapse in industry applications: the problem of logging policy confounding. Several potential solutions have even been proposed; however, the evaluation of these methods was mostly conducted using semi-synthetic simulation experiments. This paper bridges the gap between theory and practice by investigating the confounding problem on the largest real-world dataset, Baidu-ULTR. Our main contributions are threefold: 1) we show that the conditions for the confounding problem are given on Baidu-ULTR, 2) the confounding problem bears no significant effect on the two-tower model, and 3) we point to a potential mismatch between expert annotations, the golden standard in ULTR, and user click behavior.
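As background, the two-tower factorization models a click as relevance (query-document tower) times examination (bias tower, driven by rank); dividing observed click rates by the examination propensity recovers relevance. All numbers below are illustrative, not from Baidu-ULTR.

```python
# Position-based click model behind two-tower ULTR: a click requires the
# item to be examined AND relevant. Propensities and relevances are toys.

def p_click(relevance, examination):
    return relevance * examination

examination_by_rank = {1: 0.9, 2: 0.6, 3: 0.3}   # assumed propensities
relevance = {"doc_a": 0.8, "doc_b": 0.8}          # equally relevant documents

top = p_click(relevance["doc_a"], examination_by_rank[1])   # clicked often
low = p_click(relevance["doc_b"], examination_by_rank[3])   # clicked rarely
# Inverse propensity weighting undoes the position bias:
debiased = low / examination_by_rank[3]           # recovers relevance 0.8
```

The confounding the paper studies arises when the logging policy ties rank to relevance, so the two towers can no longer be identified separately from clicks alone.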

[LG-25] A Unified Framework for Neural Computation and Learning Over Time

链接: https://arxiv.org/abs/2409.12038
作者: Stefano Melacci,Alessandro Betti,Michele Casoni,Tommaso Guidi,Matteo Tiezzi,Marco Gori
关键词-EN: proposes Hamiltonian Learning, paper proposes Hamiltonian, Learning, possibly infinite stream, future information
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper proposes Hamiltonian Learning, a novel unified framework for learning with neural networks “over time”, i.e., from a possibly infinite stream of data, in an online manner, without having access to future information. Existing works focus on the simplified setting in which the stream has a known finite length or is segmented into smaller sequences, leveraging well-established learning strategies from statistical machine learning. In this paper, the problem of learning over time is rethought from scratch, leveraging tools from optimal control theory, which yield a unifying view of the temporal dynamics of neural computations and learning. Hamiltonian Learning is based on differential equations that: (i) can be integrated without the need of external software solvers; (ii) generalize the well-established notion of gradient-based learning in feed-forward and recurrent networks; (iii) open to novel perspectives. The proposed framework is showcased by experimentally proving how it can recover gradient-based learning, comparing it to out-of-the-box optimizers, and describing how it is flexible enough to switch from fully-local to partially/non-local computational schemes, possibly distributed over multiple devices, and BackPropagation without storing activations. Hamiltonian Learning is easy to implement and can help researchers approach the problem of learning over time in a principled and innovative manner.
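The textbook special case such frameworks generalize is worth seeing once: integrating the gradient-flow ODE w'(t) = −dL/dw with forward Euler is exactly gradient descent, needing no external solver. The quadratic loss below is a toy choice; Hamiltonian Learning's actual equations are richer and are not reproduced here.

```python
# "Learning as ODE integration": forward-Euler steps on the gradient-flow
# ODE w'(t) = -dL/dw coincide with gradient descent. Toy loss L(w) = (w-3)^2.

def grad_L(w):
    return 2 * (w - 3.0)

def integrate(w0, step=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - step * grad_L(w)   # one Euler step == one GD step
    return w

w_final = integrate(0.0)           # converges to the minimizer w = 3
```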

[LG-26] Topological Deep Learning with State-Space Models: A Mamba Approach for Simplicial Complexes

链接: https://arxiv.org/abs/2409.12033
作者: Marco Montagna,Simone Scardapane,Lev Telyatnikov
关键词-EN: Graph Neural Networks, Neural Networks based, Neural Networks, handling graph-structured data, Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks based on the message-passing (MP) mechanism are a dominant approach for handling graph-structured data. However, they are inherently limited to modeling only pairwise interactions, making it difficult to explicitly capture the complexity of systems with n -body relations. To address this, topological deep learning has emerged as a promising field for studying and modeling higher-order interactions using various topological domains, such as simplicial and cellular complexes. While these new domains provide powerful representations, they introduce new challenges, such as effectively modeling the interactions among higher-order structures through higher-order MP. Meanwhile, structured state-space sequence models have proven to be effective for sequence modeling and have recently been adapted for graph data by encoding the neighborhood of a node as a sequence, thereby avoiding the MP mechanism. In this work, we propose a novel architecture designed to operate with simplicial complexes, utilizing the Mamba state-space model as its backbone. Our approach generates sequences for the nodes based on the neighboring cells, enabling direct communication between all higher-order structures, regardless of their rank. We extensively validate our model, demonstrating that it achieves competitive performance compared to state-of-the-art models developed for simplicial complexes.

[LG-27] On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery

链接: https://arxiv.org/abs/2409.12026
作者: BW Sheffield,Jeffrey Ellen,Ben Whitmore
关键词-EN: presents unique challenges, Side-scan sonar, imagery presents unique, Convolutional Neural Networks, presents unique
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or ripple sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by utilizing a self-attention mechanism to capture global information in image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures, such as ResNet and ConvNext, for binary classification tasks in SSS imagery. The dataset encompasses diverse geographical seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across f1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments like underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.
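
For readers reproducing such comparisons, the four reported metrics (precision, recall, F1, accuracy) reduce to simple arithmetic on confusion-matrix counts; the counts below are made up for illustration:

```python
# Minimal sketch of the binary-classification metrics reported in the paper,
# computed from raw confusion-matrix counts (illustrative values).

def binary_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

m = binary_metrics(tp=80, fp=20, fn=10, tn=90)
```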

[LG-28] Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization

链接: https://arxiv.org/abs/2409.12020
作者: Zhi Chen,Lingxiao Jiang
关键词-EN: rapidly evolving field, organizations presents significant, presents significant challenges, significant challenges due, training
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper accepted to the ASE 2024 Conference Research Track

点击查看摘要

Abstract:In the rapidly evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. The exploration of effective collaborative training settings capable of leveraging valuable knowledge from distributed and isolated datasets is increasingly crucial. This study investigates key factors that impact the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code, demonstrating the promise of such methods. Additionally, we evaluate the memorization of different participant training data across various collaborative training settings, including centralized, federated, and incremental training, highlighting their potential risks in leaking data. Our findings indicate that the size and diversity of code datasets are pivotal factors influencing the success of collaboratively trained code models. We show that federated learning achieves competitive performance compared to centralized training while offering better data protection, as evidenced by lower memorization ratios in the generated code. However, federated learning can still produce verbatim code snippets from hidden training data, potentially violating privacy or copyright. Our study further explores effectiveness and memorization patterns in incremental learning, emphasizing the sequence in which individual participant datasets are introduced. We also identify cross-organizational clones as a prevalent challenge in both centralized and federated learning scenarios. Our findings highlight the persistent risk of data leakage during inference, even when training data remains unseen. We conclude with recommendations for practitioners and researchers to optimize multisource datasets, propelling cross-organizational collaboration forward.
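
One common way to operationalise a "memorization ratio" (a hedged sketch; the paper's exact metric may differ) is the fraction of n-grams in generated code that appear verbatim in the training corpus:

```python
# Hypothetical memorization measure: share of generated token n-grams that
# occur verbatim in the training data. The n-gram size is an assumption.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def memorization_ratio(generated_tokens, training_tokens, n=4):
    gen = ngrams(generated_tokens, n)
    train = ngrams(training_tokens, n)
    if not gen:
        return 0.0
    return len(gen & train) / len(gen)

train = "def add ( a , b ) : return a + b".split()
gen = "def add ( a , b ) : return a * b".split()
ratio = memorization_ratio(gen, train, n=4)
```

A ratio near 1.0 would indicate near-verbatim reproduction of hidden training data, the privacy risk the abstract warns about.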

[LG-29] Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2409.12001
作者: Claude Formanek,Louise Beyers,Callum Rhys Tilbury,Jonathan P. Shock,Arnu Pretorius
关键词-EN: multi-agent reinforcement learning, find optimal control, optimal control policies, Offline multi-agent reinforcement, multi-agent systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems. Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results. We first substantiate this claim by surveying the literature, showing how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets. We then show why neglecting the nature of the data is problematic, through salient examples of how tightly algorithmic performance is coupled to the dataset used, necessitating a common foundation for experiments in the field. In response, we take a big step towards improving data usage and data awareness in offline MARL, with three key contributions: (1) a clear guideline for generating novel datasets; (2) a standardisation of over 80 existing datasets, hosted in a publicly available repository, using a consistent storage format and easy-to-use API; and (3) a suite of analysis tools that allow us to understand these datasets better, aiding further development.

[LG-30] “It Might be Technically Impressive But It’s Practically Useless to Us”: Practices, Challenges and Opportunities for Cross-Functional Collaboration around AI within the News Industry

链接: https://arxiv.org/abs/2409.12000
作者: Qing Xiao,Xianzhe Fan,Felix M. Simon,Bingbing Zhang,Motahhare Eslami
关键词-EN: integrated artificial intelligence, artificial intelligence, cross-functional collaboration, increasing number, integrated artificial
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 16 pages

点击查看摘要

Abstract:Recently, an increasing number of news organizations have integrated artificial intelligence (AI) into their workflows, leading to a further influx of AI technologists and data workers into the news industry. This has initiated cross-functional collaborations between these professionals and journalists. While prior research has explored the impact of AI-related roles entering the news industry, there is a lack of studies on how cross-functional collaboration unfolds between AI professionals and journalists. Through interviews with 17 journalists, 6 AI technologists, and 3 AI workers with cross-functional experience from leading news organizations, we investigate the current practices, challenges, and opportunities for cross-functional collaboration around AI in today’s news industry. We first study how journalists and AI professionals perceive existing cross-collaboration strategies. We further explore the challenges of cross-functional collaboration and provide recommendations for enhancing future cross-functional collaboration around AI in the news industry.

[LG-31] Unraveling the Hessian: A Key to Smooth Convergence in Loss Function Landscapes

链接: https://arxiv.org/abs/2409.11995
作者: Nikita Kiselev,Andrey Grabovoy
关键词-EN: improving their performance, neural loss landscapes, critical aspect, understanding its properties, properties is essential
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The loss landscape of neural networks is a critical aspect of their training, and understanding its properties is essential for improving their performance. In this paper, we investigate how the loss surface changes when the sample size increases, a previously unexplored issue. We theoretically analyze the convergence of the loss landscape in a fully connected neural network and derive upper bounds for the difference in loss function values when adding a new object to the sample. Our empirical study confirms these results on various datasets, demonstrating the convergence of the loss function surface for image classification tasks. Our findings provide insights into the local geometry of neural loss landscapes and have implications for the development of sample size determination techniques.
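
The paper's bound concerns the whole loss surface via the Hessian; the arithmetic below only illustrates the underlying intuition for a pointwise empirical mean loss, where adding one sample shifts the mean by (l_new - L_n)/(n + 1), an O(1/n) change:

```python
# Illustrative only (not the paper's Hessian-based bound): incremental
# update of an empirical mean loss when one new sample is added.

def mean_loss_update(mean_loss, n, new_loss):
    """L_{n+1} = L_n + (l_new - L_n) / (n + 1)."""
    return mean_loss + (new_loss - mean_loss) / (n + 1)

losses = [0.9, 1.1, 1.0, 0.8, 1.2]
mean = losses[0]
for i, l in enumerate(losses[1:], start=1):
    mean = mean_loss_update(mean, i, l)
```

For large n the per-sample change shrinks as 1/(n + 1), which is the convergence behaviour the abstract describes empirically.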

[LG-32] An Efficient Model-Agnostic Approach for Uncertainty Estimation in Data-Restricted Pedometric Applications ICMLA

链接: https://arxiv.org/abs/2409.11985
作者: Viacheslav Barkov,Jonas Schmidinger,Robin Gebbers,Martin Atzmueller
关键词-EN: digital soil mapping, enhance uncertainty estimation, model-agnostic approach designed, uncertainty estimation, paper introduces
类目: Machine Learning (cs.LG)
*备注: To be published in the proceedings of ICMLA 2024: 23rd International Conference on Machine Learning and Applications

点击查看摘要

Abstract:This paper introduces a model-agnostic approach designed to enhance uncertainty estimation in the predictive modeling of soil properties, a crucial factor for advancing pedometrics and the practice of digital soil mapping. For addressing the typical challenge of data scarcity in soil studies, we present an improved technique for uncertainty estimation. This method is based on the transformation of regression tasks into classification problems, which not only allows for the production of reliable uncertainty estimates but also enables the application of established machine learning algorithms with competitive performance that have not yet been utilized in pedometrics. Empirical results from datasets collected from two German agricultural fields showcase the practical application of the proposed methodology. Our results and findings suggest that the proposed approach has the potential to provide better uncertainty estimation than the models commonly used in pedometrics.
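
The core trick described above, turning regression into classification so class probabilities double as uncertainty, can be sketched minimally; the bin edges and entropy measure below are assumptions, not the paper's exact design:

```python
# Minimal sketch: discretise a continuous target into bins, then use any
# classifier's class-probability vector as an uncertainty estimate.
import math

def to_bins(y, edges):
    """Map each continuous target to the index of its bin."""
    labels = []
    for v in y:
        idx = 0
        while idx < len(edges) and v >= edges[idx]:
            idx += 1
        labels.append(idx)
    return labels

def predictive_entropy(probs):
    """Entropy of a class-probability vector as an uncertainty score."""
    return -sum(p * math.log(p) for p in probs if p > 0)

labels = to_bins([0.2, 1.5, 2.7], edges=[1.0, 2.0])
confident = predictive_entropy([0.9, 0.05, 0.05])
uncertain = predictive_entropy([1 / 3, 1 / 3, 1 / 3])
```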

[LG-33] Metric-Semantic Factor Graph Generation based on Graph Neural Networks ICRA2025

链接: https://arxiv.org/abs/2409.11972
作者: Jose Andres Millan-Romera,Hriday Bavle,Muhammad Shaheer,Holger Voos,Jose Luis Sanchez-Lopez
关键词-EN: building accurate models, crucial for building, building accurate, accurate models, factor graph
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to ICRA 2025

点击查看摘要

Abstract:Understanding the relationships between geometric structures and semantic concepts is crucial for building accurate models of complex environments. Indoors, certain spatial constraints, such as the relative positioning of planes, remain consistent despite variations in layout. This paper explores how these invariant relationships can be captured in a graph SLAM framework by representing high-level concepts like rooms and walls, linking them to geometric elements like planes through an optimizable factor graph. Several efforts have tackled this issue with ad-hoc solutions for each concept generation and with manually-defined factors. This paper proposes a novel method for metric-semantic factor graph generation which includes defining a semantic scene graph, integrating geometric information, and learning the interconnecting factors, all based on Graph Neural Networks (GNNs). An edge classification network (G-GNN) sorts the edges between planes into same room, same wall or none types. The resulting relations are clustered, generating a room or wall for each cluster. A second family of networks (F-GNN) infers the geometrical origin of the new nodes. The definition of the factors employs the same F-GNN used for the metric attribute of the generated nodes. Furthermore, the new factor graph is shared with the S-Graphs+ algorithm, extending its graph expressiveness and scene representation with the ultimate goal of improving SLAM performance. The complexity of the environments is increased to N-plane rooms by training the networks on L-shaped rooms. The framework is evaluated in synthetic and simulated scenarios as no real datasets of the required complex layouts are available.

[LG-34] Efficacy of Synthetic Data as a Benchmark

链接: https://arxiv.org/abs/2409.11968
作者: Gaurav Maheshwari,Dmitry Ivanov,Kevin El Haddad
关键词-EN: Large language models, few-shot learning settings, Large language, learning settings, including the generation
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have enabled a range of applications in zero-shot and few-shot learning settings, including the generation of synthetic datasets for training and testing. However, to reliably use these synthetic datasets, it is essential to understand how representative they are of real-world data. We investigate this by assessing the effectiveness of generating synthetic data through LLM and using it as a benchmark for various NLP tasks. Our experiments across six datasets, and three different tasks, show that while synthetic data can effectively capture performance of various methods for simpler tasks, such as intent classification, it falls short for more complex tasks like named entity recognition. Additionally, we propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used to both generate benchmarking data and to perform the tasks. We find that smaller LLMs exhibit biases towards their own generated data, whereas larger models do not. Overall, our findings suggest that the effectiveness of synthetic data as a benchmark varies depending on the task, and that practitioners should rely on data generated from multiple larger models whenever possible.
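
A hypothetical sketch of the "bias factor" idea, in the spirit described above (the paper's exact definition may differ): compare a model's score on a benchmark it generated itself against its mean score on benchmarks generated by other models.

```python
# Hedged, illustrative definition: a value above 1 suggests the model is
# favoured by its own generated benchmark data. Scores are made up.

def bias_factor(score_on_own_data, mean_score_on_others_data):
    return score_on_own_data / mean_score_on_others_data

bf_small = bias_factor(0.92, 0.80)   # small LLM: inflated on its own data
bf_large = bias_factor(0.85, 0.84)   # large LLM: roughly unbiased
```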

[LG-35] Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction

链接: https://arxiv.org/abs/2409.11964
作者: Jin Jie Sean Yeo,Ee-Leng Tan,Jisheng Bai,Santi Peksi,Woon-Seng Gan
关键词-EN: Acoustic Scene Classification, SNTL-NTU team submission, Data-Efficient Low-Complexity Acoustic, submission for Task, Low-Complexity Acoustic Scene
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:In this technical report, we describe the SNTL-NTU team’s submission for Task 1 Data-Efficient Low-Complexity Acoustic Scene Classification of the detection and classification of acoustic scenes and events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For small training splits, we explored reducing the complexity of the provided baseline model by reducing the number of base channels. We introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained on the original sampling rate of 44.1 kHz. We use Knowledge Distillation to distill the ensemble model to the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded the highest average testing accuracy of (62.21, 59.82, 56.81, 53.03, 47.97)% on split (100, 50, 25, 10, 5)% respectively over the three systems.
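
The knowledge-distillation step mentioned above trains the student towards the ensemble teacher's temperature-softened class probabilities via a KL-divergence term; the logits and temperature below are illustrative, not the submission's actual settings:

```python
# Sketch of a standard KD loss term: KL between temperature-softened
# teacher and student distributions. Values are assumptions.
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.8, 1.1, 0.2]
T = 2.0
distill_loss = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
```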

[LG-36] Reinforcement Learning with Lie Group Orientations for Robotics ICRA2025

链接: https://arxiv.org/abs/2409.11935
作者: Martin Schuck,Jan Brüdigam,Sandra Hirche,Angela Schoellig
关键词-EN: Handling orientations, robots and objects, crucial aspect, Handling, orientations
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to ICRA 2025

点击查看摘要

Abstract:Handling orientations of robots and objects is a crucial aspect of many applications. Yet, ever so often, there is a lack of mathematical correctness when dealing with orientations, especially in learning pipelines involving, for example, artificial neural networks. In this paper, we investigate reinforcement learning with orientations and propose a simple modification of the network’s input and output that adheres to the Lie group structure of orientations. As a result, we obtain an easy and efficient implementation that is directly usable with existing learning libraries and achieves significantly better performance than other common orientation representations. We briefly introduce Lie theory specifically for orientations in robotics to motivate and outline our approach. Subsequently, a thorough empirical evaluation of different combinations of orientation representations for states and actions demonstrates the superior performance of our proposed approach in different scenarios, including: direct orientation control, end effector orientation control, and pick-and-place tasks.

[LG-37] Reinforcement Learning as an Improvement Heuristic for Real-World Production Scheduling ICMLA

链接: https://arxiv.org/abs/2409.11933
作者: Arthur Müller,Lukas Vollenkemper
关键词-EN: Reinforcement Learning, solving optimization problems, integration of Reinforcement, search process, emerging trend
类目: Machine Learning (cs.LG)
*备注: This paper was accepted at ICMLA 2024

点击查看摘要

Abstract:The integration of Reinforcement Learning (RL) with heuristic methods is an emerging trend for solving optimization problems, which leverages RL’s ability to learn from the data generated during the search process. One promising approach is to train an RL agent as an improvement heuristic, starting with a suboptimal solution that is iteratively improved by applying small changes. We apply this approach to a real-world multiobjective production scheduling problem. Our approach utilizes a network architecture that includes Transformer encoding to learn the relationships between jobs. Afterwards, a probability matrix is generated from which pairs of jobs are sampled and then swapped to improve the solution. We benchmarked our approach against other heuristics using real data from our industry partner, demonstrating its superior performance.
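
The improvement-heuristic loop described above can be sketched as a toy: start from a suboptimal job order and repeatedly swap a sampled pair, keeping the swap if it lowers the objective. The objective (sum of weighted completion times) and uniform pair sampling below stand in for the paper's real objective and learned probability matrix:

```python
# Toy improvement heuristic: sample a job pair, swap, accept if better.
# Objective and sampling scheme are illustrative assumptions.
import random

def weighted_completion_time(order, durations, weights):
    t, total = 0, 0
    for job in order:
        t += durations[job]
        total += weights[job] * t
    return total

def improve(order, durations, weights, iters=500, seed=0):
    rng = random.Random(seed)
    best = list(order)
    best_cost = weighted_completion_time(best, durations, weights)
    for _ in range(iters):
        i, j = rng.sample(range(len(best)), 2)
        cand = list(best)
        cand[i], cand[j] = cand[j], cand[i]
        cost = weighted_completion_time(cand, durations, weights)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

durations = [3, 1, 2, 4]
weights = [1, 4, 2, 1]
start = [0, 1, 2, 3]
best, best_cost = improve(start, durations, weights)
```

In the paper, the pair to swap is sampled from a probability matrix produced by a Transformer-based policy rather than uniformly.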

[LG-38] An Explainable Machine Learning Approach to Traffic Accident Fatality Prediction

链接: https://arxiv.org/abs/2409.11929
作者: Md. Asif Khan Rifat,Ahmedul Kabir,Armana Sabiha Huq
关键词-EN: health threat worldwide, significant public health, public health threat, pose a significant, threat worldwide
类目: Machine Learning (cs.LG)
*备注: 10 Pages, 6 figures, 2 tables, 28th International Conference on Knowledge-Based and Intelligent Information Engineering Systems (KES 2024)

点击查看摘要

Abstract:Road traffic accidents (RTA) pose a significant public health threat worldwide, leading to considerable loss of life and economic burdens. This is particularly acute in developing countries like Bangladesh. Building reliable models to forecast crash outcomes is crucial for implementing effective preventive measures. To aid in developing targeted safety interventions, this study presents a machine learning-based approach for classifying fatal and non-fatal road accident outcomes using data from the Dhaka metropolitan traffic crash database from 2017 to 2022. Our framework utilizes a range of machine learning classification algorithms, comprising Logistic Regression, Support Vector Machines, Naive Bayes, Random Forest, Decision Tree, Gradient Boosting, LightGBM, and Artificial Neural Network. We prioritize model interpretability by employing the SHAP (SHapley Additive exPlanations) method, which elucidates the key factors influencing accident fatality. Our results demonstrate that LightGBM outperforms other models, achieving a ROC-AUC score of 0.72. The global, local, and feature dependency analyses are conducted to acquire deeper insights into the behavior of the model. SHAP analysis reveals that casualty class, time of accident, location, vehicle type, and road type play pivotal roles in determining fatality risk. These findings offer valuable insights for policymakers and road safety practitioners in developing countries, enabling the implementation of evidence-based strategies to reduce traffic crash fatalities.

[LG-39] Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models

链接: https://arxiv.org/abs/2409.11920
作者: Lorenzo Mandelli,Stefano Berretti
关键词-EN: address the challenge, challenge of generating, human motion, human motion contained, generating realistic
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state-of-the-art.

[LG-40] Less Memory Means smaller GPUs: Backpropagation with Compressed Activations ECMLPKDD2024

链接: https://arxiv.org/abs/2409.11902
作者: Daniel Barley,Holger Fröning
关键词-EN: deep neural networks, computational resource requirements, equally rapid growth, Large Language Models, prominently Large Language
类目: Machine Learning (cs.LG)
*备注: Presented at ITEM workshop co-located with ECML PKDD 2024, Vilnius LT

点击查看摘要

Abstract:The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements. Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators, such as GPUs or TPUs. Beyond the vast number of floating point operations, the memory footprint of DNNs is also exploding. In contrast, GPU architectures are notoriously short on memory. Even comparatively small architectures like some EfficientNet variants cannot be trained on a single consumer-grade GPU at reasonable mini-batch sizes. During training, intermediate input activations have to be stored until backpropagation for gradient calculation. These make up the vast majority of the memory footprint. In this work we therefore consider compressing activation maps for the backward pass using pooling, which can reduce both the memory footprint and the amount of data movement. The forward computation remains uncompressed. We empirically show convergence and study effects on feature detection, using the common vision architecture ResNet as an example. With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule, while maintaining prediction accuracy compared to an uncompressed baseline.
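
The memory-saving idea above can be sketched conceptually: store only a pooled version of the forward activation and "unpool" it when the backward pass needs it. A 1-D average pool with nearest-neighbour upsampling is used for illustration; the paper's exact pooling scheme may differ:

```python
# Conceptual sketch of backpropagation with compressed activations
# (1-D average pooling; illustrative only).

def avg_pool1d(x, k):
    """Compress: keep one mean value per window of k activations."""
    return [sum(x[i:i + k]) / k for i in range(0, len(x), k)]

def unpool1d(pooled, k):
    """Decompress: repeat each stored mean k times for the backward pass."""
    return [v for v in pooled for _ in range(k)]

activation = [1.0, 3.0, 2.0, 6.0, 5.0, 7.0]
stored = avg_pool1d(activation, k=2)   # 3 numbers stored instead of 6
approx = unpool1d(stored, k=2)         # stands in for `activation` in backprop
```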

[LG-41] Multi-Grid Graph Neural Networks with Self-Attention for Computational Mechanics

链接: https://arxiv.org/abs/2409.11899
作者: Paul Garnier,Jonathan Viquerat,Elie Hachem
关键词-EN: Computational Fluid Dynamics, driving research efforts, Computational Fluid, Convolutional Neural Networks, Graph Neural Networks
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Advancements in finite element methods have become essential in various disciplines, and in particular for Computational Fluid Dynamics (CFD), driving research efforts for improved precision and efficiency. While Convolutional Neural Networks (CNNs) have found success in CFD by mapping meshes into images, recent attention has turned to leveraging Graph Neural Networks (GNNs) for direct mesh processing. This paper introduces a novel model merging Self-Attention with Message Passing in GNNs, achieving a 15% reduction in RMSE on the well-known flow past a cylinder benchmark. Furthermore, a dynamic mesh pruning technique based on Self-Attention is proposed, that leads to a robust GNN-based multigrid approach, also reducing RMSE by 15%. Additionally, a new self-supervised training method based on BERT is presented, resulting in a 25% RMSE reduction. The paper includes an ablation study and outperforms state-of-the-art models on several challenging datasets, promising advancements similar to those recently achieved in natural language and image processing. Finally, the paper introduces a dataset with meshes larger than existing ones by at least an order of magnitude. Code and Datasets will be released at this https URL.

[LG-42] Secure Control Systems for Autonomous Quadrotors against Cyber-Attacks

链接: https://arxiv.org/abs/2409.11897
作者: Samuel Belkadi
关键词-EN: extensively studied, safety for robotic, robotic systems, problem of safety, quadrotor
类目: Robotics (cs.RO); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: The paper is based on an undergraduate thesis and is not intended for publication in a journal

点击查看摘要

Abstract:The problem of safety for robotic systems has been extensively studied. However, little attention has been given to security issues for three-dimensional systems, such as quadrotors. Malicious adversaries can compromise robot sensors and communication networks, causing incidents, achieving illegal objectives, or even injuring people. This study first designs an intelligent control system for autonomous quadrotors. Then, it investigates the problems of optimal false data injection attack scheduling and countermeasure design for unmanned aerial vehicles. Using a state-of-the-art deep learning-based approach, an optimal false data injection attack scheme is proposed to deteriorate a quadrotor’s tracking performance with limited attack energy. Subsequently, an optimal tracking control strategy is learned to mitigate attacks and recover the quadrotor’s tracking performance. We base our work on Agilicious, a state-of-the-art quadrotor recently deployed for autonomous settings. This paper is the first in the United Kingdom to deploy this quadrotor and implement reinforcement learning on its platform. Therefore, to promote easy reproducibility with minimal engineering overhead, we further provide (1) a comprehensive breakdown of this quadrotor, including software stacks and hardware alternatives; (2) a detailed reinforcement-learning framework to train autonomous controllers on Agilicious agents; and (3) a new open-source environment that builds upon PyFlyt for future reinforcement learning research on Agilicious platforms. Both simulated and real-world experiments are conducted to show the effectiveness of the proposed frameworks in section 5.2.

[LG-43] Recent Advances in OOD Detection: Problems and Approaches

链接: https://arxiv.org/abs/2409.11884
作者: Shuo Lu,YingSheng Wang,LuJun Sheng,AiHua Zheng,LinXiao He,Jian Liang
关键词-EN: machine learning systems, detect test samples, building reliable machine, reliable machine learning, OOD detection
类目: Machine Learning (cs.LG)
*备注: September 18, 2024

点击查看摘要

Abstract:Out-of-distribution (OOD) detection aims to detect test samples outside the training category space, which is an essential component in building reliable machine learning systems. Existing reviews on OOD detection primarily focus on method taxonomy, surveying the field by categorizing various approaches. However, many recent works concentrate on non-traditional OOD detection scenarios, such as test-time adaptation, multi-modal data sources and other novel contexts. In this survey, we uniquely review recent advances in OOD detection from the problem scenario perspective for the first time. According to whether the training process is completely controlled, we divide OOD detection methods into training-driven and training-agnostic. Besides, considering the rapid development of pre-trained models, large pre-trained model-based OOD detection is also regarded as an important category and discussed separately. Furthermore, we provide a discussion of the evaluation scenarios, a variety of applications, and several future research directions. We believe this survey with new taxonomy will benefit the proposal of new methods and the expansion of more practical scenarios. A curated list of related papers is provided in the GitHub repository: this https URL
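
A minimal example from the simplest training-agnostic family such surveys cover: the classic maximum-softmax-probability (MSP) baseline, which flags a sample as OOD when the classifier's top class probability is low. The threshold below is an assumption:

```python
# MSP baseline sketch: low top-class probability -> flagged as OOD.
import math

def softmax(logits):
    exps = [math.exp(l - max(logits)) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def is_ood(logits, threshold=0.7):
    return max(softmax(logits)) < threshold

in_dist = is_ood([5.0, 0.1, 0.2])   # confident prediction -> in-distribution
ood = is_ood([1.0, 0.9, 1.1])       # near-uniform prediction -> flagged OOD
```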

[LG-44] Location based Probabilistic Load Forecasting of EV Charging Sites: Deep Transfer Learning with Multi-Quantile Temporal Convolutional Network

链接: https://arxiv.org/abs/2409.11862
作者: Mohammad Wazed Ali(Intelligent Embedded Systems (IES), University of Kassel, Kassel, Germany),Asif bin Mustafa(School of CIT, Technical University of Munich, Munich, Germany),Md. Aukerul Moin Shuvo(Dept. of Computer Science and Engineering, Rajshahi University of Engg. & Technology, Rajshahi, Bangladesh),Bernhard Sick(Intelligent Embedded Systems (IES), University of Kassel, Kassel, Germany)
关键词-EN: lessening environmental pollution, reducing fossil fuel, fossil fuel usage, Electrification of vehicles, environmental pollution
类目: Machine Learning (cs.LG)
*备注: 11 pages, 10 figures

点击查看摘要

Abstract:Electrification of vehicles is a potential way of reducing fossil fuel usage and thus lessening environmental pollution. Electric Vehicles (EVs) of various types for different transport modes (including air, water, and land) are evolving. Moreover, different EV user groups (commuters, commercial or domestic users, drivers) may use different charging infrastructures (public, private, home, and workplace) at various times. Therefore, usage patterns and energy demand are very stochastic. Characterizing and forecasting the charging demand of these diverse EV usage profiles is essential in preventing power outages. Previously developed data-driven load models are limited to specific use cases and locations. None of these models are simultaneously adaptive enough to transfer knowledge of day-ahead forecasting among EV charging sites of diverse locations, trained with limited data, and cost-effective. This article presents a location-based load forecasting of EV charging sites using a deep Multi-Quantile Temporal Convolutional Network (MQ-TCN) to overcome the limitations of earlier models. We conducted our experiments on data from four charging sites, namely Caltech, JPL, Office-1, and NREL, which have diverse EV user types like students, full-time and part-time employees, random visitors, etc. With a Prediction Interval Coverage Probability (PICP) score of 93.62%, our proposed deep MQ-TCN model exhibited a remarkable 28.93% improvement over the XGBoost model for a day-ahead load forecasting at the JPL charging site. By transferring knowledge with the inductive Transfer Learning (TL) approach, the MQ-TCN model achieved a 96.88% PICP score for the load forecasting task at the NREL site using only two weeks of data.
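
Two ingredients behind the multi-quantile forecasting described above can be sketched: the pinball (quantile) loss used to train one output per quantile, and the PICP score used to evaluate the resulting prediction intervals. Values are illustrative:

```python
# Sketch of the quantile loss and Prediction Interval Coverage Probability
# (PICP) used in multi-quantile forecasting. Numbers are made up.

def pinball_loss(y_true, y_pred, q):
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)

def picp(y_true, lower, upper):
    """Fraction of observations falling inside the predicted interval."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)

loss_under = pinball_loss(10.0, 8.0, q=0.9)   # under-prediction penalised more
loss_over = pinball_loss(10.0, 12.0, q=0.9)
coverage = picp([5, 7, 9, 11], [4, 6, 10, 10], [6, 8, 11, 12])
```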

[LG-45] Tight and Efficient Upper Bound on Spectral Norm of Convolutional Layers ECCV2024

链接: https://arxiv.org/abs/2409.11859
作者: Ekaterina Grishina,Mikhail Gorbunov,Maxim Rakhuba
关键词-EN: Controlling the spectral, spectral norm, Jacobian matrix, robustness in CNNs, stability and robustness
类目: Machine Learning (cs.LG)
*备注: ECCV 2024

点击查看摘要

Abstract:Controlling the spectral norm of the Jacobian matrix, which is related to the convolution operation, has been shown to improve generalization, training stability and robustness in CNNs. Existing methods for computing the norm either tend to overestimate it or their performance may deteriorate quickly as the input and kernel sizes increase. In this paper, we demonstrate that the tensor version of the spectral norm of a four-dimensional convolution kernel, up to a constant factor, serves as an upper bound for the spectral norm of the Jacobian matrix associated with the convolution operation. This new upper bound is independent of the input image resolution, differentiable and can be efficiently calculated during training. Through experiments, we demonstrate how this new bound can be used to improve the performance of convolutional architectures.
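For context, with circular padding the spectral norm of a single-channel convolution's Jacobian can be computed exactly from the kernel's 2-D DFT; note the dependence on the input resolution n, which is exactly what the paper's resolution-independent bound avoids. This is a sketch of the classical computation, not the paper's bound:

```python
import numpy as np

def circular_conv_spectral_norm(kernel, n):
    """Exact spectral norm of the Jacobian of a single-channel circular
    convolution on n x n inputs: the singular values are the magnitudes
    of the zero-padded kernel's 2-D DFT."""
    padded = np.zeros((n, n))
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    return np.abs(np.fft.fft2(padded)).max()

# An averaging kernel with positive entries attains its norm at zero frequency,
# where the DFT equals the sum of the kernel entries (here 4 * 0.25 = 1).
norm = circular_conv_spectral_norm(np.full((2, 2), 0.25), 8)
```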

[LG-46] Edge-Based Graph Component Pooling KDD2024 ECML ALT

链接: https://arxiv.org/abs/2409.11856
作者: T. Snelleman,B.M. Renting,H.H. Hoos,J.N. van Rijn
关键词-EN: Graph-structured data naturally, data naturally occurs, Graph-structured data, research fields, chemistry and sociology
类目: Machine Learning (cs.LG)
*备注: 15 pages, presented at 21st International Workshop on Mining and Learning with Graphs, AstraZenica Bio Healthcare award Paper, ECML PKDD 2024 Vilnius

点击查看摘要

Abstract:Graph-structured data naturally occurs in many research fields, such as chemistry and sociology. The relational information contained therein can be leveraged to statistically model graph properties through geometrical deep learning. Graph neural networks employ techniques, such as message-passing layers, to propagate local features through a graph. However, message-passing layers can be computationally expensive when dealing with large and sparse graphs. Graph pooling operators offer the possibility of removing or merging nodes in such graphs, thus lowering computational costs. However, pooling operators that remove nodes cause data loss, and pooling operators that merge nodes are often computationally expensive. We propose a pooling operator that merges nodes so as not to cause data loss but is also conceptually simple and computationally inexpensive. We empirically demonstrate that the proposed pooling operator performs statistically significantly better than edge pool on four popular benchmark datasets while reducing time complexity and the number of trainable parameters by 70.6% on average. Compared to another maximally powerful method named Graph Isomorphism Network, we show that we outperform it on two popular benchmark datasets while reducing the number of learnable parameters on average by 60.9%.
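A merge-style pooling step of the general kind contrasted above can be sketched on a dense adjacency matrix. The operator below (union of edges, summed features) is a hypothetical simplification for illustration, not the paper's operator:

```python
import numpy as np

def merge_pair(adj, feats, i, j):
    """Merge node j into node i: i inherits the union of both nodes' edges,
    features are summed (so no feature information is discarded outright),
    then row/column j is removed."""
    adj, feats = adj.copy(), feats.copy()
    adj[i] = np.maximum(adj[i], adj[j])          # union of outgoing edges
    adj[:, i] = np.maximum(adj[:, i], adj[:, j])  # union of incoming edges
    adj[i, i] = 0.0                               # drop the self-loop
    feats[i] = feats[i] + feats[j]
    keep = [k for k in range(adj.shape[0]) if k != j]
    return adj[np.ix_(keep, keep)], feats[keep]

# Path graph 0-1-2: merging node 1 into node 0 leaves a single edge 0-2.
adj = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
feats = np.array([[1.0], [2.0], [4.0]])
new_adj, new_feats = merge_pair(adj, feats, 0, 1)
```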

[LG-47] An efficient wavelet-based physics-informed neural networks for singularly perturbed problems

链接: https://arxiv.org/abs/2409.11847
作者: Himanshu Pandey,Anshima Singh,Ratikanta Behera
关键词-EN: limited data availability, involve limited data, differential equations, Physics-informed neural networks, deep learning models
类目: Machine Learning (cs.LG)
*备注: 17 pages, 12 figures

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) are a class of deep learning models that utilize physics as differential equations to address complex problems, including ones that may involve limited data availability. However, tackling solutions of differential equations with oscillations or singular perturbations and shock-like structures becomes challenging for PINNs. Considering these challenges, we designed an efficient wavelet-based PINNs (W-PINNs) model to solve singularly perturbed differential equations. Here, we represent the solution in wavelet space using a family of smooth-compactly supported wavelets. This framework represents the solution of a differential equation with significantly fewer degrees of freedom while still capturing, identifying, and analyzing the local structure of complex physical phenomena. The architecture allows the training process to search for a solution within wavelet space, making the process faster and more accurate. The proposed model does not rely on automatic differentiations for derivatives involved in differential equations and does not require any prior information regarding the behavior of the solution, such as the location of abrupt features. Thus, through a strategic fusion of wavelets with PINNs, W-PINNs excel at capturing localized nonlinear information, making them well-suited for problems showing abrupt behavior in certain regions, such as singularly perturbed problems. The efficiency and accuracy of the proposed neural network model are demonstrated in various test problems, i.e., highly singularly perturbed nonlinear differential equations, the FitzHugh-Nagumo (FHN), and Predator-prey interaction models. The proposed model compares favorably with traditional PINNs and with the recently developed wavelet-based PINNs, which use wavelets as an activation function for solving nonlinear differential equations.

[LG-48] Graph Neural Network-State Predictive Information Bottleneck (GNN-SPIB) approach for learning molecular thermodynamics and kinetics

链接: https://arxiv.org/abs/2409.11843
作者: Ziyue Zou,Dedi Wang,Pratyush Tiwary
关键词-EN: Molecular dynamics simulations, face timescale limitations, dynamics simulations offer, simulations offer detailed, offer detailed insights
类目: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft); Statistical Mechanics (cond-mat.stat-mech)
*备注:

点击查看摘要

Abstract:Molecular dynamics simulations offer detailed insights into atomic motions but face timescale limitations. Enhanced sampling methods have addressed these challenges but even with machine learning, they often rely on pre-selected expert-based features. In this work, we present the Graph Neural Network-State Predictive Information Bottleneck (GNN-SPIB) framework, which combines graph neural networks and the State Predictive Information Bottleneck to automatically learn low-dimensional representations directly from atomic coordinates. Tested on three benchmark systems, our approach predicts essential structural, thermodynamic and kinetic information for slow processes, demonstrating robustness across diverse systems. The method shows promise for complex systems, enabling effective enhanced sampling without requiring pre-defined reaction coordinates or input features.

[LG-49] RaggeDi: Diffusion-based State Estimation of Disordered Rags Sheets Towels and Blankets

链接: https://arxiv.org/abs/2409.11831
作者: Jikai Ye,Wanze Li,Shiraz Khan,Gregory S. Chirikjian
关键词-EN: Cloth state estimation, Cloth state, estimating cloth state, Cloth, cloth state accurately
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cloth state estimation is an important problem in robotics. It is essential for the robot to know the accurate state to manipulate cloth and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, estimating cloth state accurately remains challenging due to its high flexibility and self-occlusion. This paper proposes a diffusion model-based pipeline that formulates the cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a pre-defined flattened mesh and the deformed mesh in a canonical space. Then we train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed.
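The translation-map representation described above — encoding per-vertex translations between a pre-defined flattened mesh and the deformed mesh as an RGB image — can be sketched as follows. The normalization bound and mesh resolution are assumptions for illustration, not values from the paper:

```python
import numpy as np

def translation_map(flat, deformed, bound=1.0):
    """Encode per-vertex 3-D translations (deformed - flat) as an RGB image:
    each channel maps one axis from [-bound, bound] onto [0, 255]."""
    t = deformed - flat                      # (H, W, 3) translation field
    img = (t / (2.0 * bound) + 0.5) * 255.0  # affine map to pixel range
    return np.clip(img, 0.0, 255.0).astype(np.uint8)

flat = np.zeros((2, 2, 3))
deformed = np.full((2, 2, 3), 1.0)  # every vertex shifted by +bound on each axis
img = translation_map(flat, deformed)
```

A conditional image-generation model can then predict this map from an observation, and the deformed mesh is recovered by inverting the affine encoding.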

[LG-50] Optimizing Job Shop Scheduling in the Furniture Industry: A Reinforcement Learning Approach Considering Machine Setup Batch Variability and Intralogistics

链接: https://arxiv.org/abs/2409.11820
作者: Malte Schneevogt,Karsten Binninger,Noah Klarmann
关键词-EN: Deep Reinforcement Learning, Deep Reinforcement, Reinforcement Learning, application of Deep, Shop Scheduling Problem
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 18 pages, 8 figures

点击查看摘要

Abstract:This paper explores the potential application of Deep Reinforcement Learning in the furniture industry. To offer a broad product portfolio, most furniture manufacturers are organized as a job shop, which ultimately results in the Job Shop Scheduling Problem (JSSP). The JSSP is addressed with a focus on extending traditional models to better represent the complexities of real-world production environments. Existing approaches frequently fail to consider critical factors such as machine setup times or varying batch sizes. A concept for a model is proposed that provides a higher level of information detail to enhance scheduling accuracy and efficiency. The concept introduces the integration of DRL for production planning, particularly suited to batch production industries such as the furniture industry. The model extends traditional approaches to JSSPs by including job volumes, buffer management, transportation times, and machine setup times. This enables more precise forecasting and analysis of production flows and processes, accommodating the variability and complexity inherent in real-world manufacturing processes. The RL agent learns to optimize scheduling decisions. It operates within a discrete action space, making decisions based on detailed observations. A reward function guides the agent’s decision-making process, thereby promoting efficient scheduling and meeting production deadlines. Two integration strategies for implementing the RL agent are discussed: episodic planning, which is suitable for low-automation environments, and continuous planning, which is ideal for highly automated plants. While episodic planning can be employed as a standalone solution, the continuous planning approach necessitates the integration of the agent with ERP and Manufacturing Execution Systems. This integration enables real-time adjustments to production schedules based on dynamic changes.

[LG-51] Constraint Guided AutoEncoders for Joint Optimization of Condition Indicator Estimation and Anomaly Detection in Machine Condition Monitoring

链接: https://arxiv.org/abs/2409.11807
作者: Maarten Meire,Quinten Van Baelen,Ted Ooijevaar,Peter Karsmakers
关键词-EN: machine condition monitoring, industrial applications, condition monitoring, main goal, machine condition
类目: Machine Learning (cs.LG)
*备注: 32 pages, 7 figures, 4 tables

点击查看摘要

Abstract:The main goal of machine condition monitoring is, as the name implies, to monitor the condition of industrial applications. The objective of this monitoring can be mainly split into two problems. A diagnostic problem, where normal data should be distinguished from anomalous data, otherwise called Anomaly Detection (AD), or a prognostic problem, where the aim is to predict the evolution of a Condition Indicator (CI) that reflects the condition of an asset throughout its life time. When considering machine condition monitoring, it is expected that this CI shows a monotonic behavior, as the condition of a machine gradually degrades over time. This work proposes an extension to Constraint Guided AutoEncoders (CGAE), which is a robust AD method, that enables building a single model that can be used for both AD and CI estimation. For the purpose of improved CI estimation the extension incorporates a constraint that enforces the model to have monotonically increasing CI predictions over time. Experimental results indicate that the proposed algorithm performs similar, or slightly better, than CGAE, with regards to AD, while improving the monotonic behavior of the CI.
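The monotonicity constraint on CI predictions can be enforced with a simple hinge penalty on consecutive decreases; this is a minimal sketch of the idea, not the paper's CGAE formulation:

```python
import numpy as np

def monotonicity_penalty(ci):
    """Hinge penalty on any decrease between consecutive condition-indicator
    predictions; it is zero iff the sequence is non-decreasing, matching the
    expectation that a machine's condition gradually degrades over time."""
    return float(np.sum(np.maximum(-np.diff(ci), 0.0)))

# The dip from 0.2 to 0.15 is the only violation, so the penalty is 0.05.
p = monotonicity_penalty(np.array([0.1, 0.2, 0.15, 0.3]))
```

Added to an autoencoder's reconstruction loss, such a term steers the model toward monotonically increasing CI estimates without changing the anomaly-detection objective.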

[LG-52] The Factuality of Large Language Models in the Legal Domain CIKM2024

链接: https://arxiv.org/abs/2409.11798
作者: Rajaa El Hamdani,Thomas Bonald,Fragkiskos Malliaros,Nils Holzenberger,Fabian Suchanek
关键词-EN: large language models, realistic usage scenario, language models, model abstain, usage scenario
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: CIKM 2024, short paper

点击查看摘要

Abstract:This paper investigates the factuality of large language models (LLMs) as knowledge bases in the legal domain, in a realistic usage scenario: we allow for acceptable variations in the answer, and let the model abstain from answering when uncertain. First, we design a dataset of diverse factual questions about case law and legislation. We then use the dataset to evaluate several LLMs under different evaluation methods, including exact, alias, and fuzzy matching. Our results show that the performance improves significantly under the alias and fuzzy matching methods. Further, we explore the impact of abstaining and in-context examples, finding that both strategies enhance precision. Finally, we demonstrate that additional pre-training on legal documents, as seen with SaulLM, further improves factual precision from 63% to 81%.
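The exact/alias/fuzzy matching scheme with abstention can be sketched with Python's standard difflib; the 0.8 similarity threshold is an assumption for illustration, not the paper's setting:

```python
from difflib import SequenceMatcher

def grade(pred, gold, aliases=(), fuzzy_threshold=0.8):
    """Grade a model answer against a gold answer and its aliases with
    exact and fuzzy matching; the model abstains by returning None."""
    if pred is None:
        return "abstain"
    candidates = [gold.lower(), *[a.lower() for a in aliases]]
    norm = pred.strip().lower()
    if norm in candidates:
        return "exact"
    if any(SequenceMatcher(None, norm, c).ratio() >= fuzzy_threshold
           for c in candidates):
        return "fuzzy"
    return "wrong"
```

Counting only "exact" matches understates factuality for near-identical case names; allowing "fuzzy" and "abstain" outcomes is what the abstract reports as improving measured precision.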

[LG-53] Consistent Estimation of a Class of Distances Between Covariance Matrices

链接: https://arxiv.org/abs/2409.11761
作者: Roberto Pereira,Xavier Mestre,Davig Gregoratti
关键词-EN: covariance matrices directly, problem of estimating, covariance matrices, covariance matrices lie, matrices directly
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work considers the problem of estimating the distance between two covariance matrices directly from the data. Particularly, we are interested in the family of distances that can be expressed as sums of traces of functions that are separately applied to each covariance matrix. This family of distances is particularly useful as it takes into consideration the fact that covariance matrices lie in the Riemannian manifold of positive definite matrices, thereby including a variety of commonly used metrics, such as the Euclidean distance, Jeffreys’ divergence, and the log-Euclidean distance. Moreover, a statistical analysis of the asymptotic behavior of this class of distance estimators has also been conducted. Specifically, we present a central limit theorem that establishes the asymptotic Gaussianity of these estimators and provides closed form expressions for the corresponding means and variances. Empirical evaluations demonstrate the superiority of our proposed consistent estimator over conventional plug-in estimators in multivariate analytical contexts. Additionally, the central limit theorem derived in this study provides a robust statistical framework to assess the accuracy of these estimators.
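One member of the distance family named above, the log-Euclidean distance, can be computed directly from eigendecompositions. This is the conventional plug-in computation the paper compares against, not the proposed consistent estimator:

```python
import numpy as np

def log_euclidean_distance(A, B):
    """Log-Euclidean distance between symmetric positive definite matrices:
    the Frobenius norm of the difference of their matrix logarithms."""
    def logm_spd(M):
        w, V = np.linalg.eigh(M)        # eigenvalues w > 0 for SPD input
        return (V * np.log(w)) @ V.T    # V diag(log w) V^T
    return float(np.linalg.norm(logm_spd(A) - logm_spd(B)))

# log(diag(1, e)) = diag(0, 1) and log(I) = 0, so the distance is exactly 1.
d = log_euclidean_distance(np.diag([1.0, np.e]), np.eye(2))
```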

[LG-54] NPAT Null-Space Projected Adversarial Training Towards Zero Deterioration

链接: https://arxiv.org/abs/2409.11754
作者: Hanyi Hu,Qiao Han,Kui Chen,Yao Yang
关键词-EN: effective defense strategy, Projected Data Augmentation, Projected Gradient Descent, Null-space Projected Data, defense strategy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:To mitigate the susceptibility of neural networks to adversarial attacks, adversarial training has emerged as a prevalent and effective defense strategy. Intrinsically, this countermeasure incurs a trade-off, as it sacrifices the model’s accuracy in processing normal samples. To reconcile the trade-off, we pioneer the incorporation of null-space projection into adversarial training and propose two innovative Null-space Projection based Adversarial Training (NPAT) algorithms tackling sample generation and gradient optimization, named Null-space Projected Data Augmentation (NPDA) and Null-space Projected Gradient Descent (NPGD), to search for overarching optimal solutions, which enhance robustness with almost zero deterioration in generalization performance. Adversarial samples and perturbations are constrained within the null-space of the decision boundary utilizing a closed-form null-space projector, effectively mitigating the threat of attacks stemming from unreliable features. Subsequently, we conducted experiments on the CIFAR10 and SVHN datasets and reveal that our methodology can seamlessly combine with adversarial training methods and obtain comparable robustness while keeping generalization close to a high-accuracy model.
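The closed-form null-space projector P = I − W⁺W at the heart of this idea can be illustrated on a linear map; the paper applies the analogous projector to the decision boundary of a neural network:

```python
import numpy as np

def null_space_projector(W):
    """Closed-form projector onto the null space of W: P = I - W^+ W.
    Any projected vector satisfies W @ (P @ x) = 0, so a perturbation
    projected this way cannot change a linear score W @ x."""
    return np.eye(W.shape[1]) - np.linalg.pinv(W) @ W

W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
P = null_space_projector(W)
delta = P @ np.array([1.0, 2.0, 3.0])  # perturbation restricted to the null space
```

Here W constrains the first two coordinates, so only the third-coordinate component of the perturbation survives the projection.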

[LG-55] HARP: Human-Assisted Regrouping with Permutation Invariant Critic for Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2409.11741
作者: Huawen Hu,Enze Shi,Chenxi Yue,Shuocun Yang,Zihao Wu,Yiwei Li,Tianyang Zhong,Tuo Zhang,Tianming Liu,Shu Zhang
关键词-EN: provide critical guidance, complex fields, Permutation Invariant Critic, expertise to accelerate, provide critical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Human-in-the-loop reinforcement learning integrates human expertise to accelerate agent learning and provide critical guidance and feedback in complex fields. However, many existing approaches focus on single-agent tasks and require continuous human involvement during the training process, significantly increasing the human workload and limiting scalability. In this paper, we propose HARP (Human-Assisted Regrouping with Permutation Invariant Critic), a multi-agent reinforcement learning framework designed for group-oriented tasks. HARP integrates automatic agent regrouping with strategic human assistance during deployment, enabling non-experts to offer effective guidance with minimal intervention. During training, agents dynamically adjust their groupings to optimize collaborative task completion. When deployed, they actively seek human assistance and utilize the Permutation Invariant Group Critic to evaluate and refine human-proposed groupings, allowing non-expert users to contribute valuable suggestions. In multiple collaboration scenarios, our approach is able to leverage limited guidance from non-experts and enhance performance. The project can be found at this https URL.

[LG-56] From Lists to Emojis: How Format Bias Affects Model Alignment

链接: https://arxiv.org/abs/2409.11704
作者: Xuanchang Zhang,Wei Xiong,Lichang Chen,Tianyi Zhou,Heng Huang,Tong Zhang
关键词-EN: LMSYS Chatbot Arena, RLHF, human feedback, biases, format
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Working in progress

点击查看摘要

Abstract:In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely-used preference models, including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark, exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is equal to or lower than shorter, competing responses. However, format biases beyond verbosity remain largely underexplored in the literature. In this work, we extend the study of biases in preference learning beyond the commonly recognized length bias, offering a comprehensive analysis of a wider range of format biases. Additionally, we show that with a small amount of biased data (less than 1%), we can inject significant bias into the reward model. Moreover, these format biases can also be easily exploited by downstream alignment algorithms, such as best-of-n sampling and online iterative DPO, as it is usually easier to manipulate the format than to improve the quality of responses. Our findings emphasize the need to disentangle format and content both for designing alignment algorithms and evaluating models.
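Detecting the format patterns this study analyzes (lists, links, bold text, length) in model responses is straightforward; a minimal sketch with assumed markdown-style patterns, not the paper's feature extractor:

```python
import re

def format_features(text):
    """Count surface format patterns that preference models have been
    shown to favor, independent of answer quality."""
    return {
        "list_items": len(re.findall(r"^\s*(?:[-*]|\d+\.)\s", text, flags=re.M)),
        "links": len(re.findall(r"\[[^\]]+\]\([^)]+\)", text)),
        "bold": len(re.findall(r"\*\*[^*]+\*\*", text)),
        "words": len(text.split()),
    }

answer = "**Summary**\n- point one\n- point two\nSee [docs](https://example.com)."
feats = format_features(answer)
```

Comparing such counts between chosen and rejected responses in a preference dataset is one simple way to expose the biases the abstract describes.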

[LG-57] Monomial Matrix Group Equivariant Neural Functional Networks

链接: https://arxiv.org/abs/2409.11697
作者: Hoang V. Tran,Thieu N. Vo,Tho H. Tran,An T. Nguyen,Tan Minh Nguyen
关键词-EN: recently gained significant, gained significant attention, significant attention due, implicit neural representation, classifying implicit neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural functional networks (NFNs) have recently gained significant attention due to their diverse applications, ranging from predicting network generalization and network editing to classifying implicit neural representation. Previous NFN designs often depend on permutation symmetries in neural networks’ weights, which traditionally arise from the unordered arrangement of neurons in hidden layers. However, these designs do not take into account the weight scaling symmetries of ReLU networks, and the weight sign flipping symmetries of sin or tanh networks. In this paper, we extend the study of the group action on the network weights from the group of permutation matrices to the group of monomial matrices by incorporating scaling/sign-flipping symmetries. Particularly, we encode these scaling/sign-flipping symmetries by designing our corresponding equivariant and invariant layers. We name our new family of NFNs the Monomial Matrix Group Equivariant Neural Functional Networks (Monomial-NFN). Because of the expansion of the symmetries, Monomial-NFN has much fewer independent trainable parameters compared to the baseline NFNs in the literature, thus enhancing the model’s efficiency. Moreover, for fully connected and convolutional neural networks, we theoretically prove that all groups that leave these networks invariant while acting on their weight spaces are some subgroups of the monomial matrix group. We provide empirical evidence to demonstrate the advantages of our model over existing baselines, achieving competitive performance and efficiency.
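The weight-scaling symmetry of ReLU networks that Monomial-NFN encodes is easy to verify numerically: for a positive diagonal matrix D, relu(Dz) = D·relu(z), so replacing (W1, W2) with (D·W1, W2·D⁻¹) leaves the network function unchanged:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def mlp(x, W1, W2):
    """Two-layer ReLU network without biases, for a minimal symmetry check."""
    return W2 @ relu(W1 @ x)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

D = np.diag([2.0, 0.5, 3.0, 1.5])  # positive rescaling of the hidden units
y_original = mlp(x, W1, W2)
y_rescaled = mlp(x, D @ W1, W2 @ np.linalg.inv(D))
```

With negative diagonal entries this identity fails for ReLU but holds for odd activations like sin or tanh, which is exactly the sign-flipping symmetry the abstract mentions.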

[LG-58] Detecting Underdiagnosed Medical Conditions with Deep Learning-Based Opportunistic CT Imaging

链接: https://arxiv.org/abs/2409.11686
作者: Asad Aali,Andrew Johnston,Louis Blankemeier,Dave Van Veen,Laura T Derry,David Svec,Jason Hom,Robert D. Boutin,Akshay S. Chaudhari
关键词-EN: Abdominal computed tomography, Abdominal computed, computed tomography, frequently performed, clinical settings
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Abdominal computed tomography (CT) scans are frequently performed in clinical settings. Opportunistic CT involves repurposing routine CT images to extract diagnostic information and is an emerging tool for detecting underdiagnosed conditions such as sarcopenia, hepatic steatosis, and ascites. This study utilizes deep learning methods to promote accurate diagnosis and clinical documentation. We analyze 2,674 inpatient CT scans to identify discrepancies between imaging phenotypes (characteristics derived from opportunistic CT scans) and their corresponding documentation in radiology reports and ICD coding. Through our analysis, we find that only 0.5%, 3.2%, and 30.7% of scans diagnosed with sarcopenia, hepatic steatosis, and ascites (respectively) through either opportunistic imaging or radiology reports were ICD-coded. Our findings demonstrate opportunistic CT’s potential to enhance diagnostic precision and accuracy of risk adjustment models, offering advancements in precision medicine.

[LG-59] Recurrent Interpolants for Probabilistic Time Series Prediction

链接: https://arxiv.org/abs/2409.11684
作者: Yu Chen,Marin Biloš,Sarthak Mittal,Wei Deng,Kashif Rasul,Anderson Schneider
关键词-EN: tools for multivariate, range of datasets, Sequential models, wide range, multivariate time series
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Sequential models such as recurrent neural networks or transformer-based models became de facto tools for multivariate time series forecasting in a probabilistic fashion, with applications to a wide range of datasets, such as finance, biology, medicine, etc. Despite their adeptness in capturing dependencies, assessing prediction uncertainty, and efficiency in training, challenges emerge in modeling high-dimensional complex distributions and cross-feature dependencies. To tackle these issues, recent works delve into generative modeling by employing diffusion or flow-based models. Notably, the integration of stochastic differential equations or probability flow successfully extends these methods to probabilistic time series imputation and forecasting. However, scalability issues necessitate a computational-friendly framework for large-scale generative model-based predictions. This work proposes a novel approach by blending the computational efficiency of recurrent neural networks with the high-quality probabilistic modeling of the diffusion model, which addresses challenges and advances generative models’ application in time series forecasting. Our method relies on the foundation of stochastic interpolants and the extension to a broader conditional generation framework with additional control features, offering insights for future developments in this dynamic field.

[LG-60] An Enhanced-State Reinforcement Learning Algorithm for Multi-Task Fusion in Large-Scale Recommender Systems

链接: https://arxiv.org/abs/2409.11678
作者: Peng Liu,Jiawei Zhu,Cong Xu,Ming Zhao,Bin Wang
关键词-EN: Recommender Systems, multiple scores predicted, combining multiple scores, Multi-Task Fusion, stage of Recommender
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2404.17589

点击查看摘要

Abstract:As the last key stage of Recommender Systems (RSs), Multi-Task Fusion (MTF) is in charge of combining multiple scores predicted by Multi-Task Learning (MTL) into a final score to maximize user satisfaction, which decides the ultimate recommendation results. In recent years, to maximize long-term user satisfaction within a recommendation session, Reinforcement Learning (RL) is widely used for MTF in large-scale RSs. However, limited by their modeling pattern, all the current RL-MTF methods can only utilize user features as the state to generate actions for each user, but are unable to make use of item features and other valuable features, which leads to suboptimal results. Addressing this problem is a challenge that requires breaking through the current modeling pattern of RL-MTF. To solve this problem, we propose a novel method called Enhanced-State RL for MTF in RSs. Unlike the existing methods mentioned above, our method first defines user features, item features, and other valuable features collectively as the enhanced state; then proposes a novel actor and critic learning process to utilize the enhanced state to take much better actions for each user-item pair. To the best of our knowledge, this novel modeling pattern is being proposed for the first time in the field of RL-MTF. We conduct extensive offline and online experiments in a large-scale RS. The results demonstrate that our model outperforms other models significantly. Enhanced-State RL has been fully deployed in our RS for more than half a year, improving +3.84% user valid consumption and +0.58% user duration time compared to baseline.

[LG-61] Hypergraph-based Motion Generation with Multi-modal Interaction Relational Reasoning

链接: https://arxiv.org/abs/2409.11676
作者: Keshu Wu,Yang Zhou,Haotian Shi,Dominique Lord,Bin Ran,Xinyue Ye
关键词-EN: presents considerable challenges, presents considerable, future states, intricate nature, accurately predicting
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:The intricate nature of real-world driving environments, characterized by dynamic and diverse interactions among multiple vehicles and their possible future states, presents considerable challenges in accurately predicting the motion states of vehicles and handling the uncertainty inherent in the predictions. Addressing these challenges requires comprehensive modeling and reasoning to capture the implicit relations among vehicles and the corresponding diverse behaviors. This research introduces an integrated framework for autonomous vehicles (AVs) motion prediction to address these complexities, utilizing a novel Relational Hypergraph Interaction-informed Neural mOtion generator (RHINO). RHINO leverages hypergraph-based relational reasoning by integrating a multi-scale hypergraph neural network to model group-wise interactions among multiple vehicles and their multi-modal driving behaviors, thereby enhancing motion prediction accuracy and reliability. Experimental validation using real-world datasets demonstrates the superior performance of this framework in improving predictive accuracy and fostering socially aware automated driving in dynamic traffic scenarios.

[LG-62] Few-Shot Class-Incremental Learning with Non-IID Decentralized Data

链接: https://arxiv.org/abs/2409.11657
作者: Cuiwei Liu,Siang Xu,Huaijun Qiu,Jing Zhang,Zhi Liu,Liang Zhao
关键词-EN: adaptive intelligent systems, Few-shot class-incremental learning, minimal annotated data, previously accumulated knowledge, data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Few-shot class-incremental learning is crucial for developing scalable and adaptive intelligent systems, as it enables models to acquire new classes with minimal annotated data while safeguarding the previously accumulated knowledge. Nonetheless, existing methods deal with continuous data streams in a centralized manner, limiting their applicability in scenarios that prioritize data privacy and security. To this end, this paper introduces federated few-shot class-incremental learning, a decentralized machine learning paradigm tailored to progressively learn new classes from scarce data distributed across multiple clients. In this learning paradigm, clients locally update their models with new classes while preserving data privacy, and then transmit the model updates to a central server where they are aggregated globally. However, this paradigm faces several issues, such as difficulties in few-shot learning, catastrophic forgetting, and data heterogeneity. To address these challenges, we present a synthetic data-driven framework that leverages replay buffer data to maintain existing knowledge and facilitate the acquisition of new knowledge. Within this framework, a noise-aware generative replay module is developed to fine-tune local models with a balance of new and replay data, while generating synthetic data of new classes to further expand the replay buffer for future tasks. Furthermore, a class-specific weighted aggregation strategy is designed to tackle data heterogeneity by adaptively aggregating class-specific parameters based on local models' performance on synthetic data. This enables effective global model optimization without direct access to client data. Comprehensive experiments across three widely-used datasets underscore the effectiveness and preeminence of the introduced framework.
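Per parameter group, the weighted aggregation described above reduces to a score-weighted average of client parameters. The sketch below is a single-score simplification of the idea (the paper weights per class), with the evaluation scores assumed to come from each local model's accuracy on synthetic data:

```python
import numpy as np

def weighted_aggregate(client_params, client_scores):
    """Server-side aggregation: average per-client parameter vectors,
    weighting each client by its normalized evaluation score rather
    than uniformly, without accessing any client data."""
    w = np.asarray(client_scores, dtype=float)
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, client_params))

agg = weighted_aggregate(
    [np.array([1.0, 1.0]), np.array([3.0, 3.0])],
    [1.0, 3.0],  # hypothetical per-client scores on synthetic data
)
```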

[LG-63] Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

链接: https://arxiv.org/abs/2409.11653
作者: Qian Shao,Jiangrui Kang,Qiyuan Chen,Zepeng Li,Hongxia Xu,Yiwen Cao,Jiajuan Liang,Jian Wu
关键词-EN: human labor, preferred paradigm, deep learning tasks, sample selection, Learning
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Under Review

点击查看摘要

Abstract:Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that how to select samples for labelling also significantly impacts performance, particularly under extremely low-budget settings. The sample selection task in SSL has been under-explored for a long time. To fill in this gap, we propose a Representative and Diverse Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm to minimise a novel criterion α-Maximum Mean Discrepancy (α-MMD), RDSS samples a representative and diverse subset for annotation from the unlabeled data. We demonstrate that minimizing α-MMD enhances the generalization ability of low-budget learning. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms the state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained annotation budgets.
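The α-MMD criterion builds on the standard MMD between the selected subset and the unlabeled pool; a subset whose empirical distribution matches the pool drives the MMD toward zero. Below is a plain RBF-kernel MMD² sketch with the α-weighting omitted, so it illustrates only the "representative" half of the criterion:

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased (V-statistic) estimate of squared Maximum Mean Discrepancy
    between samples X and Y under an RBF kernel; always non-negative."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
pool = rng.normal(size=(50, 2))   # unlabeled pool
subset = pool[:10]                # a candidate subset for annotation
score = rbf_mmd2(subset, pool)
```

A Frank-Wolfe-style selector would greedily grow the subset so as to keep this score small while a diversity term spreads the chosen points apart.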

[LG-64] Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview

链接: https://arxiv.org/abs/2409.11650
作者: Yanshu Wang,Tong Yang,Xiyan Liang,Guoan Wang,Hanning Lu,Xu Zhe,Yaoming Li,Li Weitao
关键词-EN: quantizing large-scale neural, neural network models, large-scale neural network, comprehensive overview, neural network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper provides a comprehensive overview of the principles, challenges, and methodologies associated with quantizing large-scale neural network models. As neural networks have evolved towards larger and more complex architectures to address increasingly sophisticated tasks, the computational and energy costs have escalated significantly. We explore the necessity and impact of model size growth, highlighting the performance benefits as well as the computational challenges and environmental considerations. The core focus is on model quantization as a fundamental approach to mitigate these challenges by reducing model size and improving efficiency without substantially compromising accuracy. We delve into various quantization techniques, including both post-training quantization (PTQ) and quantization-aware training (QAT), and analyze several state-of-the-art algorithms such as LLM-QAT, PEQA(L4Q), ZeroQuant, SmoothQuant, and others. Through comparative analysis, we examine how these methods address issues like outliers, importance weighting, and activation quantization, ultimately contributing to more sustainable and accessible deployment of large-scale models.
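As a concrete reference point for the PTQ methods surveyed above, a minimal uniform affine quantizer (the building block beneath schemes such as ZeroQuant and SmoothQuant) can be sketched as follows. The 8-bit range and per-tensor granularity are illustrative choices, not any specific algorithm from the survey.

```python
import numpy as np

def quantize_affine(w, num_bits=8):
    """Uniform affine post-training quantization of a weight tensor.
    Maps floats in [w_min, w_max] to integers in [0, 2^b - 1] via a
    scale and zero-point."""
    qmax = 2 ** num_bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    zero_point = round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_affine(q, scale, zero_point):
    # Reconstruct floats; the round-trip error is at most scale / 2.
    return (q.astype(np.float32) - zero_point) * scale
```

The round-trip error bound (half the quantization step) is exactly the tradeoff the surveyed methods try to shrink, by choosing scales per channel, clipping outliers, or migrating them into the activations.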

[LG-65] Hard-Label Cryptanalytic Extraction of Neural Network Models

链接: https://arxiv.org/abs/2409.11646
作者: Yi Chen,Xiaoyang Dong,Jian Guo,Yantian Shen,Anyu Wang,Xiaoyun Wang
关键词-EN: neural networks, neural, extracting neural network, networks, machine learning problem
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted by Asiacrypt 2024

点击查看摘要

Abstract:The machine learning problem of extracting neural network parameters has been proposed for nearly three decades. Functionally equivalent extraction is a crucial goal for research on this problem. When the adversary has access to the raw output of neural networks, various attacks, including those presented at CRYPTO 2020 and EUROCRYPT 2024, have successfully achieved this goal. However, this goal is not achieved when neural networks operate under a hard-label setting where the raw output is inaccessible. In this paper, we propose the first attack that theoretically achieves functionally equivalent extraction under the hard-label setting, which applies to ReLU neural networks. The effectiveness of our attack is validated through practical experiments on a wide range of ReLU neural networks, including neural networks trained on two real benchmarking datasets (MNIST, CIFAR10) widely used in computer vision. For a neural network consisting of 10^5 parameters, our attack only requires several hours on a single core.

[LG-66] DAF-Net: A Dual-Branch Feature Decomposition Fusion Network with Domain Adaptive for Infrared and Visible Image Fusion

链接: https://arxiv.org/abs/2409.11642
作者: Jian Xu,Xin He
关键词-EN: comprehensive scene understanding, combine complementary information, visible image fusion, image fusion aims, scene understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:Infrared and visible image fusion aims to combine complementary information from both modalities to provide a more comprehensive scene understanding. However, due to the significant differences between the two modalities, preserving key features during the fusion process remains a challenge. To address this issue, we propose a dual-branch feature decomposition fusion network (DAF-Net) with domain adaptation, which introduces Multi-Kernel Maximum Mean Discrepancy (MK-MMD) into the base encoder and designs a hybrid kernel function suitable for infrared and visible image fusion. The base encoder built on the Restormer network captures global structural information while the detail encoder based on Invertible Neural Networks (INN) focuses on extracting detail texture information. By incorporating MK-MMD, the DAF-Net effectively aligns the latent feature spaces of visible and infrared images, thereby improving the quality of the fused images. Experimental results demonstrate that the proposed method outperforms existing techniques across multiple datasets, significantly enhancing both visual quality and fusion performance. The related Python code is available at this https URL.

[LG-67] Enhancing PM2.5 Data Imputation and Prediction in Air Quality Monitoring Networks Using a KNN-SINDy Hybrid Model

链接: https://arxiv.org/abs/2409.11640
作者: Yohan Choi,Boaz Choi,Jachin Choi
关键词-EN: poses significant risks, necessitating accurate prediction, air quality management, effective air quality, particulate matter
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Air pollution, particularly particulate matter (PM2.5), poses significant risks to public health and the environment, necessitating accurate prediction and continuous monitoring for effective air quality management. However, air quality monitoring (AQM) data often suffer from missing records due to various technical difficulties. This study explores the application of Sparse Identification of Nonlinear Dynamics (SINDy) for imputing missing PM2.5 data, training on data from 2016 and comparing its performance with the established Soft Impute (SI) and K-Nearest Neighbors (KNN) methods.
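The KNN baseline mentioned above can be sketched as a generic nearest-rows imputer that measures distance only over features both rows observe; the distance metric, `k`, and tie handling here are assumptions, not the study's exact configuration.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill each NaN with the mean of that column over the k nearest rows,
    where distance uses only the features both rows have observed."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i in range(len(X)):
        missing = np.where(np.isnan(X[i]))[0]
        if missing.size == 0:
            continue
        dists = []
        for j in range(len(X)):
            if j == i:
                continue
            common = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if not common.any():
                continue
            d = np.sqrt(((X[i, common] - X[j, common]) ** 2).mean())
            dists.append((d, j))
        dists.sort()
        neighbors = [j for _, j in dists[:k]]
        for col in missing:
            vals = [X[j, col] for j in neighbors if not np.isnan(X[j, col])]
            if vals:
                filled[i, col] = float(np.mean(vals))
    return filled
```

For time-series AQM data, rows would typically be time windows of station readings, so "nearest rows" become the most similar pollution episodes.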

[LG-68] Multimodal Generalized Category Discovery

链接: https://arxiv.org/abs/2409.11624
作者: Yuchang Su,Renping Zhou,Siyu Huang,Xingjian Li,Tianyang Wang,Ziyue Wang,Min Xu
关键词-EN: Generalized Category Discovery, Generalized Category, Category Discovery, open-world scientific discoveries, aims to classify
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generalized Category Discovery (GCD) aims to classify inputs into both known and novel categories, a task crucial for open-world scientific discoveries. However, current GCD methods are limited to unimodal data, overlooking the inherently multimodal nature of most real-world data. In this work, we extend GCD to a multimodal setting, where inputs from different modalities provide richer and complementary information. Through theoretical analysis and empirical validation, we identify that the key challenge in multimodal GCD lies in effectively aligning heterogeneous information across modalities. To address this, we propose MM-GCD, a novel framework that aligns both the feature and output spaces of different modalities using contrastive learning and distillation techniques. MM-GCD achieves new state-of-the-art performance on the UPMC-Food101 and N24News datasets, surpassing previous methods by 11.5% and 4.7%, respectively.

[LG-69] PieClam: A Universal Graph Autoencoder Based on Overlapping Inclusive and Exclusive Communities

链接: https://arxiv.org/abs/2409.11618
作者: Daniel Zilberg,Ron Levie
关键词-EN: Exclusive Cluster Affiliation, Cluster Affiliation Model, Inclusive Exclusive Cluster, Cluster Affiliation, Exclusive Cluster
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose PieClam (Prior Inclusive Exclusive Cluster Affiliation Model): a probabilistic graph model for representing any graph as overlapping generalized communities. Our method can be interpreted as a graph autoencoder: nodes are embedded into a code space by an algorithm that maximizes the log-likelihood of the decoded graph, given the input graph. PieClam is a community affiliation model that extends well-known methods like BigClam in two main manners. First, instead of the decoder being defined via pairwise interactions between the nodes in the code space, we also incorporate a learned prior on the distribution of nodes in the code space, turning our method into a graph generative model. Secondly, we generalize the notion of communities by allowing not only sets of nodes with strong connectivity, which we call inclusive communities, but also sets of nodes with strong disconnection, which we call exclusive communities. To model both types of communities, we propose a new type of decoder based on the Lorentz inner product, which we prove to be much more expressive than standard decoders based on standard inner products or norm distances. By introducing a new graph similarity measure, that we call the log cut distance, we show that PieClam is a universal autoencoder, able to uniformly approximately reconstruct any graph. Our method is shown to obtain competitive performance in graph anomaly detection benchmarks.
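The Lorentz inner product at the heart of this kind of decoder negates one coordinate's contribution, which is what lets shared "exclusive" mass push an edge score down while shared "inclusive" mass pushes it up. A toy sketch follows; the sigmoid link and the convention of a single exclusive coordinate are illustrative assumptions, not PieClam's exact decoder.

```python
import math

def lorentz_inner(x, y):
    """Lorentz (Minkowski-style) inner product:
    <x, y>_L = -x[0]*y[0] + sum_{i>=1} x[i]*y[i]."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def edge_prob(x, y):
    """Toy decoder: squash the Lorentz inner product through a sigmoid.
    Coordinates after the first act 'inclusively' (shared mass raises
    the edge probability); the first coordinate acts 'exclusively'
    (shared mass lowers it)."""
    return 1.0 / (1.0 + math.exp(-lorentz_inner(x, y)))
```

Two nodes with weight only in the inclusive coordinate get a high edge probability; the same weight moved to the exclusive coordinate flips the sign of the score and drives the probability toward zero.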

[LG-70] Time-Series Forecasting Knowledge Distillation and Refinement within a Multimodal PDE Foundation Model

链接: https://arxiv.org/abs/2409.11609
作者: Derek Jollie,Jingmin Sun,Zecheng Zhang,Hayden Schaeffer
关键词-EN: distinct time-series data, embed additional information, multi-operator learning, embed additional, differential equations
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Symbolic encoding has been used in multi-operator learning as a way to embed additional information for distinct time-series data. For spatiotemporal systems described by time-dependent partial differential equations, the equation itself provides an additional modality to identify the system. The utilization of symbolic expressions alongside time-series samples allows for the development of multimodal predictive neural networks. A key challenge with current approaches is that the symbolic information, i.e. the equations, must be manually preprocessed (simplified, rearranged, etc.) to match and relate to the existing token library, which increases costs and reduces flexibility, especially when dealing with new differential equations. We propose a new token library based on SymPy to encode differential equations as an additional modality for time-series models. The proposed approach incurs minimal cost, is automated, and maintains high prediction accuracy for forecasting tasks. Additionally, we include a Bayesian filtering module that connects the different modalities to refine the learned equation. This improves the accuracy of the learned symbolic representation and the predicted time-series.

[LG-71] No Saved Kaleidosope: an 100% Jitted Neural Network Coding Language with Pythonic Syntax

链接: https://arxiv.org/abs/2409.11600
作者: Augusto Seben da Rosa,Marlon Daniel Angeli,Jorge Aikes Junior,Alef Iury Ferreira,Lucas Rafael Gris,Anderson da Silva Soares,Arnaldo Candido Junior,Frederico Santos de Oliveira,Gabriel Trevisan Damke,Rafael Teixeira Sousa
关键词-EN: training Artificial Neural, LLVM and Cuda, Artificial Neural Networks, training Artificial, Artificial Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注: 12 pages, 3 figures and 3 tables

点击查看摘要

Abstract:We developed a jitted compiler for training Artificial Neural Networks using C++, LLVM and Cuda. It features object-oriented characteristics, strong typing, parallel workers for data pre-processing, pythonic syntax for expressions, PyTorch like model declaration and Automatic Differentiation. We implement the mechanisms of cache and pooling in order to manage VRAM, cuBLAS for high performance matrix multiplication and cuDNN for convolutional layers. In our experiments with Residual Convolutional Neural Networks on ImageNet, we reach similar speed but degraded performance. Also, the GRU network experiments show similar accuracy, but our compiler has degraded speed on that task. However, our compiler demonstrates promising results at the CIFAR-10 benchmark, in which we reach the same performance and about the same speed as PyTorch. We make the code publicly available at: this https URL

[LG-72] The Sample Complexity of Smooth Boosting and the Tightness of the Hardcore Theorem

链接: https://arxiv.org/abs/2409.11597
作者: Guy Blanc,Alexandre Hayderi,Caleb Koch,Li-Yang Tan
关键词-EN: Smooth boosters generate, boosters generate distributions, Machine Learning, Smooth boosters, smooth boosting
类目: Computational Complexity (cs.CC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 46 pages, FOCS 2024

点击查看摘要

Abstract:Smooth boosters generate distributions that do not place too much weight on any given example. Originally introduced for their noise-tolerant properties, such boosters have also found applications in differential privacy, reproducibility, and quantum learning theory. We study and settle the sample complexity of smooth boosting: we exhibit a class that can be weak learned to γ-advantage over smooth distributions with m samples, for which strong learning over the uniform distribution requires Ω̃(1/γ²)·m samples. This matches the overhead of existing smooth boosters and provides the first separation from the setting of distribution-independent boosting, for which the corresponding overhead is O(1/γ). Our work also sheds new light on Impagliazzo's hardcore theorem from complexity theory, all known proofs of which can be cast in the framework of smooth boosting. For a function f that is mildly hard against size-s circuits, the hardcore theorem provides a set of inputs on which f is extremely hard against size-s′ circuits. A downside of this important result is the loss in circuit size, i.e. that s′ ≪ s. Answering a question of Trevisan, we show that this size loss is necessary and in fact, the parameters achieved by known proofs are the best possible.

[LG-73] Self-Contrastive Forward-Forward Algorithm

链接: https://arxiv.org/abs/2409.11593
作者: Xing Chen,Dongshu Liu,Jeremie Laydevant,Julie Grollier
关键词-EN: updates weights locally, purely forward-mode learning, purely forward-mode, updates weights, weights locally
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:The Forward-Forward (FF) algorithm is a recent, purely forward-mode learning method, that updates weights locally and layer-wise and supports supervised as well as unsupervised learning. These features make it ideal for applications such as brain-inspired learning, low-power hardware neural networks, and distributed learning in large models. However, while FF has shown promise on written digit recognition tasks, its performance on natural images and time-series remains a challenge. A key limitation is the need to generate high-quality negative examples for contrastive learning, especially in unsupervised tasks, where versatile solutions are currently lacking. To address this, we introduce the Self-Contrastive Forward-Forward (SCFF) method, inspired by self-supervised contrastive learning. SCFF generates positive and negative examples applicable across different datasets, surpassing existing local forward algorithms for unsupervised classification accuracy on MNIST (MLP: 98.7%), CIFAR-10 (CNN: 80.75%), and STL-10 (CNN: 77.3%). Additionally, SCFF is the first to enable FF training of recurrent neural networks, opening the door to more complex tasks and continuous-time video and text processing.

[LG-74] Advances in APPFL: A Comprehensive and Extensible Federated Learning Framework

链接: https://arxiv.org/abs/2409.11585
作者: Zilinghan Li,Shilan He,Ze Yang,Minseok Ryu,Kibaek Kim,Ravi Madduri
关键词-EN: paradigm enabling collaborative, enabling collaborative model, collaborative model training, machine learning paradigm, learning paradigm enabling
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a distributed machine learning paradigm enabling collaborative model training while preserving data privacy. In today's landscape, where most data is proprietary, confidential, and distributed, FL has become a promising approach to leverage such data effectively, particularly in sensitive domains such as medicine and the electric grid. Heterogeneity and security are the key challenges in FL; however, most existing FL frameworks either fail to address these challenges adequately or lack the flexibility to incorporate new solutions. To this end, we present the recent advances in developing APPFL, an extensible framework and benchmarking suite for federated learning, which offers comprehensive solutions for heterogeneity and security concerns, as well as user-friendly interfaces for integrating new algorithms or adapting to new applications. We demonstrate the capabilities of APPFL through extensive experiments evaluating various aspects of FL, including communication efficiency, privacy preservation, computational performance, and resource utilization. We further highlight the extensibility of APPFL through case studies in vertical, hierarchical, and decentralized FL. APPFL is open-sourced at this https URL.
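At its core, any FedAvg-style framework such as APPFL aggregates client updates on the server as a sample-size-weighted average. A minimal sketch of that aggregation step follows; this is not APPFL's actual API, and the dict-of-arrays parameter layout and function name are assumptions.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Weighted average of client model parameters (FedAvg).
    client_params: list of dicts mapping layer name -> ndarray.
    client_sizes: number of local samples per client (the weights)."""
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]
    agg = {}
    for name in client_params[0]:
        agg[name] = sum(w * p[name] for w, p in zip(weights, client_params))
    return agg
```

A client holding three times as much data contributes three times the weight, which is one reason data heterogeneity across clients matters so much in practice.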

[LG-75] Preference Tuning with Human Feedback on Language Speech and Vision Tasks: A Survey

链接: https://arxiv.org/abs/2409.11564
作者: Genta Indra Winata,Hanyang Zhao,Anirban Das,Wenpin Tang,David D. Yao,Shi-Xiong Zhang,Sambit Sahu
关键词-EN: aligning deep generative, Preference tuning, deep generative models, preference tuning tasks, Preference
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Survey paper

点击查看摘要

Abstract:Preference tuning is a crucial process for aligning deep generative models with human preferences. This survey offers a thorough overview of recent advancements in preference tuning and the integration of human feedback. The paper is organized into three main sections: 1) introduction and preliminaries: an introduction to reinforcement learning frameworks, preference tuning tasks, models, and datasets across various modalities: language, speech, and vision, as well as different policy approaches, 2) in-depth examination of each preference tuning approach: a detailed analysis of the methods used in preference tuning, and 3) applications, discussion, and future directions: an exploration of the applications of preference tuning in downstream tasks, including evaluation methods for different modalities, and an outlook on future research directions. Our objective is to present the latest methodologies in preference tuning and model alignment, enhancing the understanding of this field for researchers and practitioners. We hope to encourage further engagement and innovation in this area.

[LG-76] A Property Encoder for Graph Neural Networks

链接: https://arxiv.org/abs/2409.11554
作者: Anwar Said,Xenofon Koutsoukos
关键词-EN: Graph machine learning, node features, node, machine learning, fundamentally relies
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: conference paper

点击查看摘要

Abstract:Graph machine learning, particularly using graph neural networks, fundamentally relies on node features. Nevertheless, numerous real-world systems, such as social and biological networks, often lack node features due to various reasons, including privacy concerns, incomplete or missing data, and limitations in data collection. In such scenarios, researchers typically resort to methods like structural and positional encoding to construct node features. However, the length of such features is contingent on the maximum value within the property being encoded, for example, the highest node degree, which can be exceedingly large in applications like scale-free networks. Furthermore, these encoding schemes are limited to categorical data and might not be able to encode metrics returning other types of values. In this paper, we introduce a novel, universally applicable encoder, termed PropEnc, which constructs expressive node embedding from any given graph metric. PropEnc leverages histogram construction combined with reverse index encoding, offering a flexible method for node feature initialization. It supports flexible encoding in terms of both dimensionality and type of input, demonstrating its effectiveness across diverse applications. PropEnc allows encoding metrics in low-dimensional space which effectively avoids the issue of sparsity and enhances the efficiency of the models. We show that PropEnc can construct node features that either exactly replicate one-hot encoding or closely approximate indices under various settings. Our extensive evaluations in the graph classification setting across multiple social networks that lack node features support our hypothesis. The empirical results conclusively demonstrate that PropEnc is both an efficient and effective mechanism for constructing node features from a diverse set of graph metrics.
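The histogram idea can be illustrated in a few lines: bin the metric's empirical distribution and emit a fixed-length one-hot over bins, so feature length is set by the bin count rather than the metric's maximum value. This sketch omits the paper's reverse index encoding and assumes plain equal-width bins.

```python
import numpy as np

def prop_encode(values, num_bins=8):
    """Encode a scalar node metric (e.g. degree) as a fixed-length one-hot
    over histogram bins, decoupling feature length from the metric's
    maximum value."""
    values = np.asarray(values, dtype=float)
    edges = np.histogram_bin_edges(values, bins=num_bins)
    # digitize against the inner edges yields a bin index in [0, num_bins-1]
    idx = np.digitize(values, edges[1:-1])
    feats = np.zeros((len(values), num_bins))
    feats[np.arange(len(values)), idx] = 1.0
    return feats
```

With raw one-hot degree encoding, a single hub of degree 1000 would force 1000-plus dimensions; here the feature stays `num_bins` wide no matter how heavy the degree tail is.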

[LG-77] VALO: A Versatile Anytime Framework for LiDAR-based Object Detection Deep Neural Networks

链接: https://arxiv.org/abs/2409.11542
作者: Ahmet Soyyigit,Shuochao Yao,Heechul Yun
关键词-EN: LiDAR object detection, LiDAR object, object detection, object detection deep, object detection DNNs
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work addresses the challenge of adapting dynamic deadline requirements for LiDAR object detection deep neural networks (DNNs). The computing latency of object detection is critically important to ensure safe and efficient navigation. However, state-of-the-art LiDAR object detection DNNs often exhibit significant latency, hindering their real-time performance on resource-constrained edge platforms. Therefore, a tradeoff between detection accuracy and latency should be dynamically managed at runtime to achieve optimum results. In this paper, we introduce VALO (Versatile Anytime algorithm for LiDAR Object detection), a novel data-centric approach that enables anytime computing of 3D LiDAR object detection DNNs. VALO employs a deadline-aware scheduler to selectively process input regions, making execution time and accuracy tradeoffs without architectural modifications. Additionally, it leverages efficient forecasting of past detection results to mitigate possible loss of accuracy due to partial processing of input. Finally, it utilizes a novel input reduction technique within its detection heads to significantly accelerate execution without sacrificing accuracy. We implement VALO on state-of-the-art 3D LiDAR object detection networks, namely CenterPoint and VoxelNext, and demonstrate its dynamic adaptability to a wide range of time constraints while achieving higher accuracy than the prior state-of-the-art. Code is available at https://github.com/CSL-KU/VALO.

[LG-78] Balancing Optimality and Diversity: Human-Centered Decision Making through Generative Curation

链接: https://arxiv.org/abs/2409.11535
作者: Michael Lingzhi Li,Shixiang Zhu
关键词-EN: array of choices, surge in data, data availability, availability has inundated, inundated decision-makers
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The surge in data availability has inundated decision-makers with an overwhelming array of choices. While existing approaches focus on optimizing decisions based on quantifiable metrics, practical decision-making often requires balancing measurable quantitative criteria with unmeasurable qualitative factors embedded in the broader context. In such cases, algorithms can generate high-quality recommendations, but the final decision rests with the human, who must weigh both dimensions. We define the process of selecting the optimal set of algorithmic recommendations in this context as human-centered decision making. To address this challenge, we introduce a novel framework called generative curation, which optimizes the true desirability of decision options by integrating both quantitative and qualitative aspects. Our framework uses a Gaussian process to model unknown qualitative factors and derives a diversity metric that balances quantitative optimality with qualitative diversity. This trade-off enables the generation of a manageable subset of diverse, near-optimal actions that are robust to unknown qualitative preferences. To operationalize this framework, we propose two implementation approaches: a generative neural network architecture that produces a distribution π to efficiently sample a diverse set of near-optimal actions, and a sequential optimization method to iteratively generate solutions that can be easily incorporated into complex optimization formulations. We validate our approach with extensive datasets, demonstrating its effectiveness in enhancing decision-making processes across a range of complex environments, with significant implications for policy and management.

[LG-79] Adaptive Anomaly Detection in Network Flows with Low-Rank Tensor Decompositions and Deep Unrolling

链接: https://arxiv.org/abs/2409.11529
作者: Lukas Schynol,Marius Pesavento
关键词-EN: future communication systems, Anomaly detection, increasingly recognized, key component, component for ensuring
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 17 pages, 7 figures

点击查看摘要

Abstract:Anomaly detection (AD) is increasingly recognized as a key component for ensuring the resilience of future communication systems. While deep learning has shown state-of-the-art AD performance, its application in critical systems is hindered by concerns regarding training data efficiency, domain adaptation and interpretability. This work considers AD in network flows using incomplete measurements, leveraging a robust tensor decomposition approach and deep unrolling techniques to address these challenges. We first propose a novel block-successive convex approximation algorithm based on a regularized model-fitting objective where the normal flows are modeled as low-rank tensors and anomalies as sparse. An augmentation of the objective is introduced to decrease the computational cost. We apply deep unrolling to derive a novel deep network architecture based on our proposed algorithm, treating the regularization parameters as learnable weights. Inspired by Bayesian approaches, we extend the model architecture to perform online adaptation to per-flow and per-time-step statistics, improving AD performance while maintaining a low parameter count and preserving the problem’s permutation equivariances. To optimize the deep network weights for detection performance, we employ a homotopy optimization approach based on an efficient approximation of the area under the receiver operating characteristic curve. Extensive experiments on synthetic and real-world data demonstrate that our proposed deep network architecture exhibits a high training data efficiency, outperforms reference methods, and adapts seamlessly to varying network topologies.
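The low-rank-plus-sparse model underlying this approach has a simple matrix analogue: alternate a truncated SVD for the normal (low-rank) component with soft-thresholding for the anomalous (sparse) one. The sketch below is that classical baseline, not the paper's tensor decomposition or unrolled network; `rank`, the threshold `tau`, and the iteration count are assumptions.

```python
import numpy as np

def soft_threshold(X, tau):
    # Shrink entries toward zero; entries below tau in magnitude vanish.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def lowrank_plus_sparse(Y, rank=2, tau=0.5, iters=30):
    """Alternately fit Y ~ L + S with L low-rank (truncated SVD) and
    S sparse (soft-thresholding): a matrix analogue of robust
    low-rank-plus-sparse anomaly models."""
    S = np.zeros_like(Y, dtype=float)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        S = soft_threshold(Y - L, tau)
    return L, S
```

In the network-flow setting, L would model the smooth traffic matrix and the surviving large entries of S flag anomalous flows; the paper's unrolling idea turns `tau` and related regularizers into learnable weights.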

[LG-80] Unlocking NACE Classification Embeddings with OpenAI for Enhanced Analysis and Processing

链接: https://arxiv.org/abs/2409.11524
作者: Andrea Vidali,Nicola Jean,Giacomo Le Pera
关键词-EN: European Community, European Union, standard classification system, NACE classification, industrial activities
类目: Machine Learning (cs.LG); General Economics (econ.GN); Statistical Finance (q-fin.ST)
*备注:

点击查看摘要

Abstract:The Statistical Classification of Economic Activities in the European Community (NACE) is the standard classification system for the categorization of economic and industrial activities within the European Union. This paper proposes a novel approach to transform the NACE classification into low-dimensional embeddings, using state-of-the-art models and dimensionality reduction techniques. The primary challenge is the preservation of the hierarchical structure inherent within the original NACE classification while reducing the number of dimensions. To address this issue, we introduce custom metrics designed to quantify the retention of hierarchical relationships throughout the embedding and reduction processes. The evaluation of these metrics demonstrates the effectiveness of the proposed methodology in retaining the structural information essential for insightful analysis. This approach not only facilitates the visual exploration of economic activity relationships, but also increases the efficacy of downstream tasks, including clustering, classification, integration with other classifications, and others. Through experimental validation, the utility of our proposed framework in preserving hierarchical structures within the NACE classification is showcased, thereby providing a valuable tool for researchers and policymakers to understand and leverage any hierarchical data.
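One simple way to quantify "retention of hierarchical relationships" is to check how often an item's nearest neighbour in embedding space shares its parent code. The paper defines its own custom metrics; this nearest-neighbour variant, and reading the parent as the first two characters of a NACE-like code, are only illustrative assumptions.

```python
import numpy as np

def parent_retention(codes, embeddings):
    """Fraction of items whose nearest other embedding shares the same
    parent code (here: the first two characters of the code string)."""
    E = np.asarray(embeddings, dtype=float)
    parents = [c[:2] for c in codes]
    hits = 0
    for i in range(len(E)):
        d = np.linalg.norm(E - E[i], axis=1)
        d[i] = np.inf                      # exclude the item itself
        hits += parents[int(d.argmin())] == parents[i]
    return hits / len(E)
```

Comparing this score before and after dimensionality reduction gives a quick read on how much hierarchical structure an embedding pipeline preserves.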

[LG-81] Partially Observable Contextual Bandits with Linear Payoffs

链接: https://arxiv.org/abs/2409.11521
作者: Sihan Zeng,Sujay Bhatt,Alec Koppel,Sumitra Ganesh
关键词-EN: framework assumes fully, bandit framework assumes, assumes fully observable, standard contextual bandit, contextual bandit framework
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The standard contextual bandit framework assumes fully observable and actionable contexts. In this work, we consider a new bandit setting with partially observable, correlated contexts and linear payoffs, motivated by the applications in finance where decision making is based on market information that typically displays temporal correlation and is not fully observed. We make the following contributions marrying ideas from statistical signal processing with bandits: (i) We propose an algorithmic pipeline named EMKF-Bandit, which integrates system identification, filtering, and classic contextual bandit algorithms into an iterative method alternating between latent parameter estimation and decision making. (ii) We analyze EMKF-Bandit when we select Thompson sampling as the bandit algorithm and show that it incurs a sub-linear regret under conditions on filtering. (iii) We conduct numerical simulations that demonstrate the benefits and practical applicability of the proposed pipeline.

[LG-82] Learning-Augmented Frequency Estimation in Sliding Windows

链接: https://arxiv.org/abs/2409.11516
作者: Rana Shahout,Ibrahim Sabek,Michael Mitzenmacher
关键词-EN: frequency estimation problems, utilize machine learning, machine learning approaches, approximate frequency estimation, sliding window
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We show how to utilize machine learning approaches to improve sliding window algorithms for approximate frequency estimation problems, under the "algorithms with predictions" framework. In this dynamic environment, previous learning-augmented algorithms are less effective, since properties in sliding window resolution can differ significantly from the properties of the entire stream. Our focus is on the benefits of predicting and filtering out items with large next arrival times – that is, there is a large gap until their next appearance – from the stream, which we show improves the memory-accuracy tradeoffs significantly. We provide theorems that provide insight into how and by how much our technique can improve the sliding window algorithm, as well as experimental results using real-world data sets. Our work demonstrates that predictors can be useful in the challenging sliding window setting.
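The filtering idea can be illustrated with a toy sliding-window counter that consults a next-arrival predictor and simply declines to track items predicted to reappear only after the window has passed. An exact `Counter` stands in for the approximate sketch a real implementation would use, and `predict_gap` is an assumed oracle interface, not anything from the paper.

```python
from collections import Counter, deque

class PredictiveWindowCounter:
    """Sliding-window frequency counter that skips items the predictor
    expects not to reappear within the window (they would waste memory
    in a real approximate sketch; an exact Counter stands in here)."""

    def __init__(self, window, predict_gap):
        self.window = window
        self.predict_gap = predict_gap   # item -> predicted steps to next arrival
        self.buffer = deque()            # (timestamp, item) pairs we chose to keep
        self.counts = Counter()
        self.t = 0

    def insert(self, item):
        self.t += 1
        # Expire entries that have slid out of the window.
        while self.buffer and self.buffer[0][0] <= self.t - self.window:
            _, old = self.buffer.popleft()
            self.counts[old] -= 1
        # Filter: skip items predicted to next appear beyond the window.
        if self.predict_gap(item) <= self.window:
            self.buffer.append((self.t, item))
            self.counts[item] += 1

    def estimate(self, item):
        return self.counts[item]
```

Items the predictor marks as "one-off" never consume memory, which is exactly the memory-accuracy tradeoff the paper analyzes (with the added subtlety that real predictors err).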

[LG-83] FedNE: Surrogate-Assisted Federated Neighbor Embedding for Dimensionality Reduction

Link: https://arxiv.org/abs/2409.11509
Authors: Ziwei Li,Xiaoqi Wang,Hong-You Chen,Han-Wei Shen,Wei-Lun Chao
Keywords-EN: enables collaborative model, collaborative model training, Federated learning, rapidly evolved, promising paradigm
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Federated learning (FL) has rapidly evolved as a promising paradigm that enables collaborative model training across distributed participants without exchanging their local data. Despite its broad applications in fields such as computer vision, graph learning, and natural language processing, the development of a data projection model that can be effectively used to visualize data in the context of FL is crucial yet remains heavily under-explored. Neighbor embedding (NE) is an essential technique for visualizing complex high-dimensional data, but collaboratively learning a joint NE model is difficult. The key challenge lies in the objective function, as effective visualization algorithms like NE require computing loss functions among pairs of data. In this paper, we introduce FedNE, a novel approach that integrates the FedAvg framework with the contrastive NE technique, without any requirements of shareable data. To address the lack of inter-client repulsion which is crucial for the alignment in the global embedding space, we develop a surrogate loss function that each client learns and shares with each other. Additionally, we propose a data-mixing strategy to augment the local data, aiming to relax the problems of invisible neighbors and false neighbors constructed by the local k-NN graphs. We conduct comprehensive experiments on both synthetic and real-world datasets. The results demonstrate that our FedNE can effectively preserve the neighborhood data structures and enhance the alignment in the global embedding space compared to several baseline methods.
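
For context, the FedAvg aggregation step that FedNE builds on can be sketched as plain weighted parameter averaging. This is a generic illustration of FedAvg, not the paper's surrogate-loss method; the flat parameter lists are a simplifying assumption.

```python
# Minimal FedAvg aggregation sketch: the server averages client parameter
# vectors, weighted by each client's local dataset size.
def fedavg(client_params, client_sizes):
    """client_params: list of equal-length lists of floats (one per client);
    client_sizes: number of local training samples per client."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    avg = [0.0] * dim
    for params, n in zip(client_params, client_sizes):
        w = n / total                 # client weight proportional to data size
        for k in range(dim):
            avg[k] += w * params[k]
    return avg
```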

[LG-84] Chess Rating Estimation from Moves and Clock Times Using a CNN-LSTM

Link: https://arxiv.org/abs/2409.11506
Authors: Michael Omori,Prasad Tadepalli
Keywords-EN: player true strength, rapidly improving players, Current rating systems, update ratings incrementally, Current rating
Subjects: Machine Learning (cs.LG)
*Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Current rating systems update ratings incrementally and may not always accurately reflect a player’s true strength at all times, especially for rapidly improving players or very rusty players. To overcome this, we explore a method to estimate player ratings directly from game moves and clock times. We compiled a benchmark dataset from Lichess, encompassing various time controls and including move sequences and clock times. Our model architecture comprises a CNN to learn positional features, which are then integrated with clock-time data into a bidirectional LSTM, predicting player ratings after each move. The model achieved an MAE of 182 rating points in the test data. Additionally, we applied our model to the 2024 IEEE Big Data Cup Chess Puzzle Difficulty Competition dataset, predicted puzzle ratings and achieved competitive results. This model is the first to use no hand-crafted features to estimate chess ratings and also the first to output a rating prediction for each move. Our method highlights the potential of using move-based rating estimation for enhancing rating systems and potentially other applications such as cheating detection.

[LG-85] Preventing Representational Rank Collapse in MPNNs by Splitting the Computational Graph

Link: https://arxiv.org/abs/2409.11504
Authors: Andreas Roth,Franka Bause,Nils M. Kriege,Thomas Liebig
Keywords-EN: fit complex functions, message-passing neural networks, simple makes representations, rank collapse, special case
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The ability of message-passing neural networks (MPNNs) to fit complex functions over graphs is limited, as each iteration of message-passing over a simple graph makes representations more similar, a phenomenon known as rank collapse, with over-smoothing as a special case. Most approaches to mitigate over-smoothing extend common message-passing schemes, e.g., the graph convolutional network, by utilizing residual connections, gating mechanisms, normalization, or regularization techniques. Our work contrarily proposes to directly tackle the cause of this issue by modifying the message-passing scheme and exchanging different types of messages using multi-relational graphs. We identify the necessary and sufficient condition to ensure linearly independent node representations. As one instantiation, we show that operating on multiple directed acyclic graphs always satisfies our condition and propose to obtain these by defining a strict partial ordering of the nodes. We conduct comprehensive experiments that confirm the benefits of operating on multi-relational graphs to achieve more informative node representations.
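
One ingredient of the abstract can be sketched directly: a strict ordering of the nodes splits an undirected graph's edges into two edge-disjoint DAGs (forward and backward orientations), on which messages can then be passed separately. This toy uses a total order for simplicity; the paper allows any strict partial order.

```python
# Split an undirected graph's edges into two DAGs using a node ordering:
# every edge is oriented strictly "up" in rank (forward) or strictly
# "down" (backward), so neither orientation can contain a cycle.
def split_into_dags(edges, order):
    rank = {v: i for i, v in enumerate(order)}
    forward, backward = [], []
    for u, v in edges:
        if rank[u] < rank[v]:
            forward.append((u, v))
        else:
            backward.append((u, v))
    return forward, backward
```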

[LG-86] Super Resolution On Global Weather Forecasts

Link: https://arxiv.org/abs/2409.11502
Authors: Bryan Zhang,Dhruv Rao,Adam Yang,Lawrence Zhang,Rodz Andrie Amor
Keywords-EN: disaster response planning, vitally important tool, planning day, day activities, response planning
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*Comments:

Click to view abstract

Abstract:Weather forecasting is a vitally important tool for tasks ranging from planning day to day activities to disaster response planning. However, modeling weather has proven to be a challenging task due to its chaotic and unpredictable nature. Every variable, from temperature to precipitation to wind, influences the path the environment will take. As a result, all models tend to rapidly lose accuracy as the temporal range of their forecasts increases. Classical forecasting methods use a myriad of physics-based, numerical, and stochastic techniques to predict the change in weather variables over time. However, such forecasts often require a very large amount of data and are extremely computationally expensive. Furthermore, as climate and global weather patterns change, classical models are substantially more difficult and time-consuming to update for changing environments. Fortunately, with recent advances in deep learning and publicly available high quality weather datasets, deploying learning methods for estimating these complex systems has become feasible. The current state-of-the-art deep learning models have comparable accuracy to the industry standard numerical models and are becoming more ubiquitous in practice due to their adaptability. Our group seeks to improve upon existing deep learning based forecasting methods by increasing spatial resolutions of global weather predictions. Specifically, we are interested in performing super resolution (SR) on GraphCast temperature predictions by increasing the global resolution from 1 degree to 0.5 degrees, which corresponds to approximately 111 km and 55 km respectively.

[LG-87] Beyond Algorithmic Fairness: A Guide to Develop and Deploy Ethical AI-Enabled Decision-Support Tools

Link: https://arxiv.org/abs/2409.11489
Authors: Rosemarie Santa Gonzalez,Ryan Piansky,Sue M Bae,Justin Biddle,Daniel Molzahn
Keywords-EN: hold substantial promise, optimization hold substantial, artificial intelligence, improving the efficiency, integration of artificial
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The integration of artificial intelligence (AI) and optimization hold substantial promise for improving the efficiency, reliability, and resilience of engineered systems. Due to the networked nature of many engineered systems, ethically deploying methodologies at this intersection poses challenges that are distinct from other AI settings, thus motivating the development of ethical guidelines tailored to AI-enabled optimization. This paper highlights the need to go beyond fairness-driven algorithms to systematically address ethical decisions spanning the stages of modeling, data curation, results analysis, and implementation of optimization-based decision support tools. Accordingly, this paper identifies ethical considerations required when deploying algorithms at the intersection of AI and optimization via case studies in power systems as well as supply chain and logistics. Rather than providing a prescriptive set of rules, this paper aims to foster reflection and awareness among researchers and encourage consideration of ethical implications at every step of the decision-making process.

[LG-88] Two Stage Segmentation of Cervical Tumors using PocketNet

Link: https://arxiv.org/abs/2409.11456
Authors: Awj Twam,Megan Jacobsen,Rachel Glenn,Ann Klopp,Aradhana M. Venkatesan,David Fuentes
Keywords-EN: includes external beam, external beam radiation, locally advanced cervical, definitive treatment regimen, mainstay definitive treatment
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Cervical cancer remains the fourth most common malignancy amongst women worldwide [1]. Concurrent chemoradiotherapy (CRT) serves as the mainstay definitive treatment regimen for locally advanced cervical cancers and includes external beam radiation followed by brachytherapy [2]. Integral to radiotherapy treatment planning is the routine contouring of both the target tumor at the level of the cervix, associated gynecologic anatomy and the adjacent organs at risk (OARs). However, manual contouring of these structures is both time and labor intensive and associated with known interobserver variability that can impact treatment outcomes. While multiple tools have been developed to automatically segment OARs and the high-risk clinical tumor volume (HR-CTV) using computed tomography (CT) images [3,4,5,6], the development of deep learning-based tumor segmentation tools using routine T2-weighted (T2w) magnetic resonance imaging (MRI) addresses an unmet clinical need to improve the routine contouring of both anatomical structures and cervical cancers, thereby increasing quality and consistency of radiotherapy planning. This work applied a novel deep-learning model (PocketNet) to segment the cervix, vagina, uterus, and tumor(s) on T2w MRI. The performance of the PocketNet architecture was evaluated, when trained on data via 5-fold cross validation. PocketNet achieved a mean Dice-Sorensen similarity coefficient (DSC) exceeding 70% for tumor segmentation and 80% for organ segmentation. These results suggest that PocketNet is robust to variations in contrast protocols, providing reliable segmentation of the ROIs.

[LG-89] Golden Ratio Search: A Low-Power Adversarial Attack for Deep Learning based Modulation Classification

Link: https://arxiv.org/abs/2409.11454
Authors: Deepsayan Sadhukhan,Nitin Priyadarshini Shankar,Sheetal Kalyani
Keywords-EN: Automatic Modulation Classification, Deep Learning based, Learning based Automatic, based Automatic Modulation, Modulation Classification
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: 5 pages, 1 figure, 3 tables

Click to view abstract

Abstract:We propose a minimal power white box adversarial attack for Deep Learning based Automatic Modulation Classification (AMC). The proposed attack uses the Golden Ratio Search (GRS) method to find powerful attacks with minimal power. We evaluate the efficacy of the proposed method by comparing it with existing adversarial attack approaches. Additionally, we test the robustness of the proposed attack against various state-of-the-art architectures, including defense mechanisms such as adversarial training, binarization, and ensemble methods. Experimental results demonstrate that the proposed attack is powerful, requires minimal power, and can be generated in less time, significantly challenging the resilience of current AMC methods.
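
As a hedged illustration of the named technique (the paper's exact procedure is not reproduced here), classic golden-section search shrinks a bracket by the golden ratio to minimize a unimodal scalar function, e.g. searching for the smallest perturbation power that still fools the classifier in the paper's setting.

```python
import math

# Golden-section search: minimize a unimodal function f on [a, b] by
# repeatedly shrinking the bracket with the golden-ratio split, so only
# one new function evaluation per iteration is needed in optimized variants.
def golden_section_search(f, a, b, tol=1e-6):
    inv_phi = (math.sqrt(5) - 1) / 2   # 1/phi, about 0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c                # minimum lies in [a, d]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                # minimum lies in [c, b]
            d = a + inv_phi * (b - a)
    return (a + b) / 2
```

For example, minimizing the attack-power objective over a bounded power range takes only O(log(1/tol)) function evaluations.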

[LG-90] Learning a Terrain- and Robot-Aware Dynamics Model for Autonomous Mobile Robot Navigation

Link: https://arxiv.org/abs/2409.11452
Authors: Jan Achterhold,Suresh Guttikonda,Jens U. Kreber,Haolong Li,Joerg Stueckler
Keywords-EN: navigation, terrain, robot, properties, variations
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: Submitted to Robotics and Autonomous Systems. arXiv admin note: substantial text overlap with arXiv:2307.09206

Click to view abstract

Abstract:Mobile robots should be capable of planning cost-efficient paths for autonomous navigation. Typically, the terrain and robot properties are subject to variations. For instance, properties of the terrain such as friction may vary across different locations. Also, properties of the robot may change such as payloads or wear and tear, e.g., causing changing actuator gains or joint friction. Autonomous navigation approaches should thus be able to adapt to such variations. In this article, we propose a novel approach for learning a probabilistic, terrain- and robot-aware forward dynamics model (TRADYN) which can adapt to such variations and demonstrate its use for navigation. Our learning approach extends recent advances in meta-learning forward dynamics models based on Neural Processes for mobile robot navigation. We evaluate our method in simulation for 2D navigation of a robot with uni-cycle dynamics with varying properties on terrain with spatially varying friction coefficients. In our experiments, we demonstrate that TRADYN has lower prediction error over long time horizons than model ablations which do not adapt to robot or terrain variations. We also evaluate our model for navigation planning in a model-predictive control framework and under various sources of noise. We demonstrate that our approach yields improved performance in planning control-efficient paths by taking robot and terrain properties into account.
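
The unicycle kinematics underlying the paper's 2D navigation experiments can be sketched in a few lines. The multiplicative `terrain_gain` is an assumption standing in for the learned terrain- and robot-dependent effects, not the paper's TRADYN model.

```python
import math

# One Euler step of unicycle kinematics: state (x, y, heading),
# controls (forward speed v, turn rate omega). A terrain-dependent
# gain scales the achievable speed, e.g. mimicking friction.
def unicycle_step(state, control, dt=0.1, terrain_gain=1.0):
    x, y, theta = state
    v, omega = control
    v_eff = terrain_gain * v
    return (x + v_eff * math.cos(theta) * dt,
            y + v_eff * math.sin(theta) * dt,
            theta + omega * dt)
```

Rolling this step forward under candidate control sequences is the basic operation a model-predictive planner would perform with the learned dynamics.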

[LG-91] Evaluation of pretrained language models on music understanding

Link: https://arxiv.org/abs/2409.11449
Authors: Yannis Vasilakis,Rachel Bittner,Johan Pauwels
Keywords-EN: Music Information Research, Music-text multimodal systems, Information Research, text-based song generation, Music-text multimodal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*Comments:

Click to view abstract

Abstract:Music-text multimodal systems have enabled new approaches to Music Information Research (MIR) applications such as audio-to-text and text-to-audio retrieval, text-based song generation, and music captioning. Despite the reported success, little effort has been put into evaluating the musical knowledge of Large Language Models (LLM). In this paper, we demonstrate that LLMs suffer from 1) prompt sensitivity, 2) inability to model negation (e.g. ‘rock song without guitar’), and 3) sensitivity towards the presence of specific words. We quantified these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. We leveraged the Audioset ontology to generate triplets consisting of an anchor, a positive (relevant) label, and a negative (less relevant) label for the genre and instruments sub-tree. We evaluated the triplet-based musical knowledge for six general-purpose Transformer-based models. The triplets obtained through this methodology required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
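
The triplet-based accuracy described above reduces to a simple check: for each (anchor, positive, negative) triplet, the model's similarity function should score the positive label higher than the negative one. A minimal sketch, with the similarity function left abstract:

```python
# Triplet-based accuracy: fraction of (anchor, positive, negative)
# triplets where sim(anchor, positive) > sim(anchor, negative).
def triplet_accuracy(triplets, sim):
    correct = sum(1 for a, p, n in triplets if sim(a, p) > sim(a, n))
    return correct / len(triplets)
```

In the paper's setup, `sim` would be, e.g., cosine similarity between LLM embeddings of the label strings; here any callable works.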

[LG-92] Volvo Discovery Challenge at ECML-PKDD 2024 ECML KDD2024

Link: https://arxiv.org/abs/2409.11446
Authors: Mahmoud Rahat,Peyman Sheikholharam Mashhadi,Sławomir Nowaczyk,Shamik Choudhury,Leo Petrin,Thorsteinn Rognvaldsson,Andreas Voskou,Carlo Metta,Claudio Savelli
Keywords-EN: Volvo Discovery Challenge, Volvo Discovery, Discovery Challenge, paper presents, presents an overview
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: ECML/PKDD 2024, Discovery Challenge

Click to view abstract

Abstract:This paper presents an overview of the Volvo Discovery Challenge, held during the ECML-PKDD 2024 conference. The challenge’s goal was to predict the failure risk of an anonymized component in Volvo trucks using a newly published dataset. The test data included observations from two generations (gen1 and gen2) of the component, while the training data was provided only for gen1. The challenge attracted 52 data scientists from around the world who submitted a total of 791 entries. We provide a brief description of the problem definition, challenge setup, and statistics about the submissions. In the section on winning methodologies, the first, second, and third-place winners of the competition briefly describe their proposed methods and provide GitHub links to their implemented code. The shared code can be interesting as an advanced methodology for researchers in the predictive maintenance domain. The competition was hosted on the Codabench platform.

[LG-93] Jailbreaking Large Language Models with Symbolic Mathematics

Link: https://arxiv.org/abs/2409.11445
Authors: Emet Bethany,Mazal Bethany,Juan Arturo Nolazco Flores,Sumit Kumar Jha,Peyman Najafirad
Keywords-EN: unsafe content generation, mitigate unsafe content, Recent advancements, large language models, content generation
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs’ advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack’s success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.

[LG-94] Fault Detection and Identification via Monitoring Modules Based on Clusters of Interacting Measurements

Link: https://arxiv.org/abs/2409.11444
Authors: Enrique Luna Villagomez,Vladimir Mahalec
Keywords-EN: monitoring methodology based, control-aware distributed process, Principal Component Analysis, Tennessee Eastman Process, process monitoring methodology
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
*Comments: Submitted to CACE 19/08/2024

Click to view abstract

Abstract:This work introduces a novel control-aware distributed process monitoring methodology based on modules comprised of clusters of interacting measurements. The methodology relies on the process flow diagram (PFD) and control system structure without requiring cross-correlation data to create monitoring modules. The methodology is validated on the Tennessee Eastman Process benchmark using full Principal Component Analysis (f-PCA) in the monitoring modules. The results are comparable to nonlinear techniques implemented in a centralized manner such as Kernel PCA (KPCA), Autoencoders (AE), and Recurrent Neural Networks (RNN), or distributed techniques like the Distributed Canonical Correlation Analysis (DCCA). Temporal plots of fault detection by different modules show clearly the magnitude and propagation of the fault through each module, pinpointing the module where the fault originates, and separating controllable faults from other faults. This information, combined with PCA contribution plots, helps detection and identification as effectively as more complex nonlinear centralized or distributed methods.

[LG-95] A Green Multi-Attribute Client Selection for Over-The-Air Federated Learning: A Grey-Wolf-Optimizer Approach

Link: https://arxiv.org/abs/2409.11442
Authors: Maryam Ben Driss,Essaid Sabir,Halima Elbiaze,Abdoulaye Baniré Diallo,Mohamed Sadik
Keywords-EN: train machine learning, centralizing sensitive data, Federated Learning, machine learning models, machine learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Click to view abstract

Abstract:Federated Learning (FL) has gained attention across various industries for its capability to train machine learning models without centralizing sensitive data. While this approach offers significant benefits such as privacy preservation and decreased communication overhead, it presents several challenges, including deployment complexity and interoperability issues, particularly in heterogeneous scenarios or resource-constrained environments. Over-the-air (OTA) FL was introduced to tackle these challenges by disseminating model updates without necessitating direct device-to-device connections or centralized servers. However, OTA-FL brought forth limitations associated with heightened energy consumption and network latency. In this paper, we propose a multi-attribute client selection framework employing the grey wolf optimizer (GWO) to strategically control the number of participants in each round and optimize the OTA-FL process while considering accuracy, energy, delay, reliability, and fairness constraints of participating devices. We evaluate the performance of our multi-attribute client selection approach in terms of model loss minimization, convergence time reduction, and energy efficiency. In our experimental evaluation, we assessed and compared the performance of our approach against the existing state-of-the-art methods. Our results demonstrate that the proposed GWO-based client selection outperforms these baselines across various metrics. Specifically, our approach achieves a notable reduction in model loss, accelerates convergence time, and enhances energy efficiency while maintaining high fairness and reliability indicators.
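
The grey wolf optimizer itself is simple to sketch: candidate solutions ("wolves") move toward the three best solutions found so far (alpha, beta, delta), with an exploration coefficient that decays over iterations. This is a generic GWO on a toy continuous objective, not the paper's multi-attribute client-selection formulation; the bounds and objective are assumptions.

```python
import random

# Generic grey wolf optimizer (GWO) sketch: each non-leader wolf updates
# every coordinate toward the average of positions suggested by the three
# best wolves; the coefficient `a` decays from 2 to 0 to shift from
# exploration to exploitation.
def gwo(objective, dim, n_wolves=10, iters=50, lo=-1.0, hi=1.0, seed=0):
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    for t in range(iters):
        wolves.sort(key=objective)
        alpha, beta, delta = wolves[0], wolves[1], wolves[2]
        a = 2 - 2 * t / iters
        for w in wolves[3:]:
            for k in range(dim):
                pos = 0.0
                for leader in (alpha, beta, delta):
                    r1, r2 = rng.random(), rng.random()
                    A = 2 * a * r1 - a
                    C = 2 * r2
                    pos += leader[k] - A * abs(C * leader[k] - w[k])
                w[k] = min(hi, max(lo, pos / 3))
    return min(wolves, key=objective)
```

For client selection, the objective would score a candidate participant subset by the weighted loss, energy, delay, reliability, and fairness terms the abstract lists.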

[LG-96] Continual Learning of Conjugated Visual Representations through Higher-order Motion Flows

Link: https://arxiv.org/abs/2409.11441
Authors: Simone Marullo,Matteo Tiezzi,Marco Gori,Stefano Melacci
Keywords-EN: visual information presents, presents several challenges, challenges due, visual information, information presents
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Currently under review

Click to view abstract

Abstract:Learning with neural networks from a continuous stream of visual information presents several challenges due to the non-i.i.d. nature of the data. However, it also offers novel opportunities to develop representations that are consistent with the information flow. In this paper we investigate the case of unsupervised continual learning of pixel-wise features subject to multiple motion-induced constraints, therefore named motion-conjugated feature representations. Differently from existing approaches, motion is not a given signal (either ground-truth or estimated by external modules), but is the outcome of a progressive and autonomous learning process, occurring at various levels of the feature hierarchy. Multiple motion flows are estimated with neural networks and characterized by different levels of abstractions, spanning from traditional optical flow to other latent signals originating from higher-level features, hence called higher-order motions. Continuously learning to develop consistent multi-order flows and representations is prone to trivial solutions, which we counteract by introducing a self-supervised contrastive loss, spatially-aware and based on flow-induced similarity. We assess our model on photorealistic synthetic streams and real-world videos, comparing to pre-trained state-of-the-art feature extractors (also based on Transformers) and to recent unsupervised learning models, significantly outperforming these alternatives.

[LG-97] Machine listening in a neonatal intensive care unit

Link: https://arxiv.org/abs/2409.11439
Authors: Modan Tailleur(LS2N, Nantes Univ - ECN, LS2N - équipe SIMS),Vincent Lostanlen(LS2N, LS2N - équipe SIMS, Nantes Univ - ECN),Jean-Philippe Rivière(Nantes Univ, Nantes Univ - UFR FLCE, LS2N, LS2N - équipe PACCE),Pierre Aumond
Keywords-EN: common sound sources, alarm devices, common sound, sound sources, Oxygenators
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments:

Click to view abstract

Abstract:Oxygenators, alarm devices, and footsteps are some of the most common sound sources in a hospital. Detecting them has scientific value for environmental psychology but comes with challenges of its own: namely, privacy preservation and limited labeled data. In this paper, we address these two challenges via a combination of edge computing and cloud computing. For privacy preservation, we have designed an acoustic sensor which computes third-octave spectrograms on the fly instead of recording audio waveforms. For sample-efficient machine learning, we have repurposed a pretrained audio neural network (PANN) via spectral transcoding and label space adaptation. A small-scale study in a neonatological intensive care unit (NICU) confirms that the time series of detected events align with another modality of measurement: i.e., electronic badges for parents and healthcare professionals. Hence, this paper demonstrates the feasibility of polyphonic machine listening in a hospital ward while guaranteeing privacy by design.

[LG-98] Towards Opinion Shaping: A Deep Reinforcement Learning Approach in Bot-User Interactions

Link: https://arxiv.org/abs/2409.11426
Authors: Farbod Siahkali,Saba Samadi,Hamed Kebriaei
Keywords-EN: Bounded Confidence Model, Stochastic Bounded Confidence, Confidence Model, Stochastic Bounded, Bounded Confidence
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments: 5 pages, 3 figures, 2 tables

Click to view abstract

Abstract:This paper aims to investigate the impact of interference in social network algorithms via user-bot interactions, focusing on the Stochastic Bounded Confidence Model (SBCM). This paper explores two approaches: positioning bots controlled by agents into the network, and targeted advertising under various circumstances, operating with an advertising budget. This study integrates the Deep Deterministic Policy Gradient (DDPG) algorithm and its variants to experiment with different Deep Reinforcement Learning (DRL) approaches. Finally, experimental results demonstrate that this approach can result in efficient opinion shaping, indicating its potential in deploying advertising resources on social platforms.
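
For readers unfamiliar with bounded confidence dynamics, here is a minimal sketch of one deterministic update step (the SBCM in the paper adds stochasticity to this idea): each agent averages only with opinions that lie within a confidence threshold of its own.

```python
# One deterministic bounded-confidence step (Hegselmann-Krause style):
# agent i moves to the mean of all opinions within eps of its own.
def bounded_confidence_step(opinions, eps):
    new = []
    for x in opinions:
        close = [y for y in opinions if abs(x - y) <= eps]
        new.append(sum(close) / len(close))
    return new
```

Iterating this step makes nearby opinions cluster while distant ones stay apart, which is the mechanism an opinion-shaping bot would try to exploit.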

[LG-99] Generated Data with Fake Privacy: Hidden Dangers of Fine-tuning Large Language Models on Generated Data

Link: https://arxiv.org/abs/2409.11423
Authors: Atilla Akkus,Mingjie Li,Junjie Chu,Michael Backes,Yang Zhang,Sinem Sav
Keywords-EN: Large language models, shown considerable success, Large language, data, fine-tuning
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Large language models (LLMs) have shown considerable success in a range of domain-specific tasks, especially after fine-tuning. However, fine-tuning with real-world data usually leads to privacy risks, particularly when the fine-tuning samples exist in the pre-training data. To avoid the shortcomings of real data, developers often employ methods to automatically generate synthetic data for fine-tuning, as data generated by traditional models often differs substantially from real-world data. However, given the advanced capabilities of LLMs, the distinction between real data and LLM-generated data has become negligible, which may also lead to privacy risks like real data. In this paper, we present an empirical analysis of this underexplored issue by investigating a key question: “Does fine-tuning with LLM-generated data enhance privacy, or does it pose additional privacy risks?” Based on the structure of the LLM’s generated data, our research focuses on two primary approaches to fine-tuning with generated data: supervised fine-tuning with unstructured generated data and self-instruct tuning. The number of successful Personal Information Identifier (PII) extractions for Pythia rose by over 20% after fine-tuning on our generated data. Furthermore, the ROC-AUC score of membership inference attacks on Pythia-6.9b after self-instruct tuning also improves by more than 40% over the base model. The results indicate the potential privacy risks in LLMs when fine-tuning with generated data.
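
The ROC-AUC metric reported for the membership inference attack has a simple probabilistic reading: it equals the probability that a randomly chosen member receives a higher attack score than a randomly chosen non-member (ties counting half). A minimal sketch of that computation:

```python
# ROC-AUC via the Mann-Whitney interpretation: the fraction of
# (member, non-member) pairs where the member's attack score is higher,
# with ties counted as 0.5.
def roc_auc(member_scores, nonmember_scores):
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))
```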

[LG-100] Three Pillars Towards Next-Generation Routing System

Link: https://arxiv.org/abs/2409.11412
Authors: Lei Li,Mengxuan Zhang,Zizhuo Xu,Yehong Xu,Xiaofang Zhou
Keywords-EN: increasingly important role, traffic congestion unintentionally, generate traffic congestion, routing results, traffic
Subjects: Networking and Internet Architecture (cs.NI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The routing results are playing an increasingly important role in transportation efficiency, but they could generate traffic congestion unintentionally. This is because the traffic condition and routing system are disconnected components in the current routing paradigm. In this paper, we propose a next-generation routing paradigm that could reduce traffic congestion by considering the influence of the routing results in real-time. Specifically, we regard the routing results as the root cause of the future traffic flow, which at the same time is identified as the root cause of traffic conditions. To implement such a system, we identify three essential components: 1) the traffic condition simulation that establishes the relation between traffic flow and traffic condition with guaranteed accuracy; 2) the future route management that supports efficient simulation with dynamic route update; 3) the global routing optimization that improves the overall transportation system efficiency. Preliminary design and experimental results will be presented, and the corresponding challenges and research directions will also be discussed.

[LG-101] AIvril: AI-Driven RTL Generation With Verification In-The-Loop

Link: https://arxiv.org/abs/2409.11411
Authors: Mubashir ul Islam,Humza Sami,Pierre-Emmanuel Gaillardon,Valerio Tenace
Keywords-EN: Large Language Models, computational models capable, performing complex natural, complex natural language, natural language processing
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are computational models capable of performing complex natural language processing tasks. Leveraging these capabilities, LLMs hold the potential to transform the entire hardware design stack, with predictions suggesting that front-end and back-end tasks could be fully automated in the near future. Currently, LLMs show great promise in streamlining Register Transfer Level (RTL) generation, enhancing efficiency, and accelerating innovation. However, their probabilistic nature makes them prone to inaccuracies - a significant drawback in RTL design, where reliability and precision are essential. To address these challenges, this paper introduces AIvril, an advanced framework designed to enhance the accuracy and reliability of RTL-aware LLMs. AIvril employs a multi-agent, LLM-agnostic system for automatic syntax correction and functional verification, significantly reducing - and in many cases, completely eliminating - instances of erroneous code generation. Experimental results conducted on the VerilogEval-Human dataset show that our framework improves code quality by nearly 2x when compared to previous works, while achieving an 88.46% success rate in meeting verification objectives. This represents a critical step toward automating and optimizing hardware design workflows, offering a more dependable methodology for AI-driven RTL design.

[LG-102] CyberNFTs: Conceptualizing a decentralized and reward-driven intrusion detection system with ML

链接: https://arxiv.org/abs/2409.11409
作者: Synim Selimi,Blerim Rexha,Kamer Vishi
关键词-EN: rapid evolution, people interact, interact and share, share data, Internet
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, 1 table, 1 algorithm, 1 listing, journal article

点击查看摘要

Abstract:The rapid evolution of the Internet, particularly the emergence of Web3, has transformed the ways people interact and share data. Web3, although still not well defined, is thought to be a return to the decentralization of corporations’ power over user data. Despite the obsolescence of the idea of building systems to detect and prevent cyber intrusions, this is still a topic of interest. This paper proposes a novel conceptual approach for implementing decentralized collaborative intrusion detection networks (CIDN) through a proof-of-concept. The study employs an analytical and comparative methodology, examining the synergy between cutting-edge Web3 technologies and information security. The proposed model incorporates blockchain concepts, cyber non-fungible token (cyberNFT) rewards, machine learning algorithms, and publish/subscribe architectures. Finally, the paper discusses the strengths and limitations of the proposed system, offering insights into the potential of decentralized cybersecurity models.

[LG-103] Towards Signal Processing In Large Language Models

链接: https://arxiv.org/abs/2406.10254
作者: Prateek Verma,Mert Pilanci
关键词-EN: Large Language Model, Large Language, Language Model, applying signal processing, signal processing inside
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 12 pages, 3 figures

点击查看摘要

Abstract:This paper introduces the idea of applying signal processing inside a Large Language Model (LLM). With the recent explosion of generative AI, our work can help bridge two fields together, namely the field of signal processing and large language models. We draw parallels between classical Fourier-Transforms and Fourier Transform-like learnable time-frequency representations for every intermediate activation signal of an LLM. Once we decompose every activation signal across tokens into a time-frequency representation, we learn how to filter and reconstruct them, with all components learned from scratch, to predict the next token given the previous context. We show that for GPT-like architectures, our work achieves faster convergence and significantly increases performance by adding a minuscule number of extra parameters when trained for the same epochs. We hope this work paves the way for algorithms exploring signal processing inside the signals found in neural architectures like LLMs and beyond.
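The core move above, decomposing an activation signal across tokens into a time-frequency representation, filtering it, and reconstructing, can be illustrated with a fixed FFT low-pass filter. This is only a toy sketch: the paper learns its transforms and filters, whereas everything here (the function name, the cutoff rule) is our illustration.

```python
import numpy as np

# Treat one activation channel across the token axis as a 1-D signal,
# move it to the frequency domain, zero the high-frequency bins, and
# reconstruct. A learned model would replace this fixed low-pass filter.
def filter_activation(signal, keep_fraction=0.2):
    spectrum = np.fft.rfft(signal)
    cutoff = max(1, int(len(spectrum) * keep_fraction))
    spectrum[cutoff:] = 0.0            # crude low-pass "filter"
    return np.fft.irfft(spectrum, n=len(signal))

n = 64
t = np.arange(n)
slow = np.sin(2 * np.pi * t / n)                     # low-frequency part
activation = slow + 0.1 * np.sin(2 * np.pi * 17 * t / n)  # plus a fast part
smoothed = filter_activation(activation)             # fast part removed
```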

[LG-104] Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

链接: https://arxiv.org/abs/2105.00335
作者: Prateek Verma,Jonathan Berger
关键词-EN: learning hierarchical organizations, produced compelling models, CNN architectures, perception and cognition, learning hierarchical
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 4 figures; Under review WASPAA 2021

点击查看摘要

Abstract:Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On a standard dataset of Free Sound 50K, comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant because, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set, with respect to mean average precision benchmarks, we show a significant improvement. We further improve the performance of Transformer architectures by using techniques such as pooling inspired by convolutional networks designed in the past few years. In addition, we show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve results. Finally, we show that our model learns a non-linear, non-constant-bandwidth filter-bank, providing an adaptable time-frequency front-end representation for audio understanding, distinct from what is learned for other tasks, e.g., pitch estimation.

[LG-105] Scale-covariant and scale-invariant Gaussian derivative networks

链接: https://arxiv.org/abs/2011.14759
作者: Tony Lindeberg
关键词-EN: deep learning architecture, deep learning, coupling parameterized scale-space, parameterized scale-space operations, multiple scale channels
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages, 10 figures

点击查看摘要

Abstract:This paper presents a hybrid approach between scale-space theory and deep learning, where a deep learning architecture is constructed by coupling parameterized scale-space operations in cascade. By sharing the learnt parameters between multiple scale channels, and by using the transformation properties of the scale-space primitives under scaling transformations, the resulting network becomes provably scale covariant. By in addition performing max pooling over the multiple scale channels, a resulting network architecture for image classification also becomes provably scale invariant. We investigate the performance of such networks on the MNISTLargeScale dataset, which contains rescaled images from original MNIST over a factor of 4 concerning training data and over a factor of 16 concerning testing data. It is demonstrated that the resulting approach allows for scale generalization, enabling good performance for classifying patterns at scales not present in the training data.

[LG-106] Understanding when spatial transformer networks do not support invariance and what to do about it

链接: https://arxiv.org/abs/2004.11678
作者: Lukas Finnveden,Ylva Jansson,Tony Lindeberg
关键词-EN: enable convolutional neural, CNN feature maps, convolutional neural networks, CNN feature, convolutional neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Spatial transformer networks (STNs) were designed to enable convolutional neural networks (CNNs) to learn invariance to image transformations. STNs were originally proposed to transform CNN feature maps as well as input images. This enables the use of more complex features when predicting transformation parameters. However, since STNs perform a purely spatial transformation, they do not, in the general case, have the ability to align the feature maps of a transformed image with those of its original. STNs are therefore unable to support invariance when transforming CNN feature maps. We present a simple proof for this and study the practical implications, showing that this inability is coupled with decreased classification accuracy. We therefore investigate alternative STN architectures that make use of complex features. We find that while deeper localization networks are difficult to train, localization networks that share parameters with the classification network remain stable as they grow deeper, which allows for higher classification accuracy on difficult datasets. Finally, we explore the interaction between localization network complexity and iterative image alignment.

[LG-107] Denoising diffusion models for high-resolution microscopy image restoration

链接: https://arxiv.org/abs/2409.12078
作者: Pamela Osuna-Vargas,Maren H. Wehrheim,Lucas Zinz,Johanna Rahm,Ashwin Balakrishnan,Alexandra Kaminer,Mike Heilemann,Matthias Kaschube
关键词-EN: unraveling intricate details, Advances in microscopy, imaging enable researchers, microscopy imaging enable, microscopy imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advances in microscopy imaging enable researchers to visualize structures at the nanoscale level thereby unraveling intricate details of biological organization. However, challenges such as image noise, photobleaching of fluorophores, and low tolerability of biological samples to high light doses remain, restricting temporal resolutions and experiment durations. Reduced laser doses enable longer measurements at the cost of lower resolution and increased noise, which hinders accurate downstream analyses. Here we train a denoising diffusion probabilistic model (DDPM) to predict high-resolution images by conditioning the model on low-resolution information. Additionally, the probabilistic aspect of the DDPM allows for repeated generation of images that tend to further increase the signal-to-noise ratio. We show that our model achieves a performance that is better or similar to the previously best-performing methods, across four highly diverse datasets. Importantly, while any of the previous methods show competitive performance for some, but not all datasets, our method consistently achieves high performance across all four data sets, suggesting high generalizability.
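The abstract's point about repeated generation increasing the signal-to-noise ratio is, at heart, the standard averaging argument: if each generated image is signal plus roughly independent noise, averaging n samples shrinks the noise standard deviation by sqrt(n). A pure-NumPy toy (no diffusion model involved):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.ones((64, 64))                 # "ground-truth" image

def noisy_sample():
    # Stand-in for one generated image: signal + independent Gaussian noise.
    return signal + rng.normal(0.0, 0.5, size=signal.shape)

single = noisy_sample()
averaged = np.mean([noisy_sample() for _ in range(25)], axis=0)

err_single = (single - signal).std()       # about 0.5
err_avg = (averaged - signal).std()        # about 0.5 / sqrt(25) = 0.1
```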

[LG-108] Fitting Multilevel Factor Models

链接: https://arxiv.org/abs/2409.12067
作者: Tetiana Parshakova,Trevor Hastie,Stephen Boyd
关键词-EN: multilevel low rank, multilevel factor model, MLR matrix, multilevel factor, PSD MLR matrix
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Mathematical Software (cs.MS); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We examine a special case of the multilevel factor model, with covariance given by a multilevel low rank (MLR) matrix (Parshakova et al., 2023). We develop a novel, fast implementation of the expectation-maximization (EM) algorithm, tailored for multilevel factor models, to maximize the likelihood of the observed data. This method accommodates any hierarchical structure and maintains linear time and storage complexities per iteration. This is achieved through a new efficient technique for computing the inverse of the positive definite MLR matrix. We show that the inverse of an invertible PSD MLR matrix is also an MLR matrix with the same sparsity in factors, and we use the recursive Sherman-Morrison-Woodbury matrix identity to obtain the factors of the inverse. Additionally, we present an algorithm that computes the Cholesky factorization of an expanded matrix with linear time and space complexities, yielding the covariance matrix as its Schur complement. This paper is accompanied by an open-source package that implements the proposed methods.
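The key computational identity the abstract leans on is Sherman-Morrison-Woodbury, which trades an n x n inverse for an r x r one when the matrix is a cheap-to-invert part plus a low-rank update. A small NumPy check (sizes are illustrative, not the paper's):

```python
import numpy as np

# Verify (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
rng = np.random.default_rng(0)
n, r = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, n))   # easy-to-invert (diagonal) part
U = rng.standard_normal((n, r))
C = np.eye(r)
V = U.T                                  # symmetric low-rank update

A_inv = np.diag(1.0 / np.diag(A))
small = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)   # only r x r work
woodbury_inv = A_inv - A_inv @ U @ small @ V @ A_inv

direct_inv = np.linalg.inv(A + U @ C @ V)
print(np.allclose(woodbury_inv, direct_inv))   # True
```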

[LG-109] Cartan moving frames and the data manifolds

链接: https://arxiv.org/abs/2409.12057
作者: Eliot Tron,Rita Fioresi,Nicolas Couellan,Stéphane Puechmorel
关键词-EN: Cartan moving frames, data information metric, Riemannian structure, language of Cartan, Cartan moving
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注:

点击查看摘要

Abstract:The purpose of this paper is to employ the language of Cartan moving frames to study the geometry of the data manifolds and its Riemannian structure, via the data information metric and its curvature at data points. Using this framework and through experiments, explanations on the response of a neural network are given by pointing out the output classes that are easily reachable from a given input. This emphasizes how the proposed mathematical relationship between the output of the network and the geometry of its inputs can be exploited as an explainable artificial intelligence tool.

[LG-110] All-in-one foundational models learning across quantum chemical levels

链接: https://arxiv.org/abs/2409.12015
作者: Yuxinxin Chen,Pavlo O. Dral
关键词-EN: potentials typically target, single quantum chemical, provide scalable solutions, Machine learning, potentials typically
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) potentials typically target a single quantum chemical (QC) level while the ML models developed for multi-fidelity learning have not been shown to provide scalable solutions for foundational models. Here we introduce the all-in-one (AIO) ANI model architecture based on multimodal learning which can learn an arbitrary number of QC levels. Our all-in-one learning approach offers a more general and easier-to-use alternative to transfer learning. We use it to train the AIO-ANI-UIP foundational model with the generalization capability comparable to semi-empirical GFN2-xTB and DFT with a double-zeta basis set for organic molecules. We show that the AIO-ANI model can learn across different QC levels ranging from semi-empirical to density functional theory to coupled cluster. We also use AIO models to design the foundational model \Delta-AIO-ANI based on \Delta-learning with increased accuracy and robustness compared to AIO-ANI-UIP. The code and the foundational models are available at this https URL they will be integrated into the universal and updatable AI-enhanced QM (UAIQM) library and made available in the MLatom package so that they can be used online at the XACS cloud computing platform (see this https URL for updates).

[LG-111] Accelerating the Training and Improving the Reliability of Machine-Learned Interatomic Potentials for Strongly Anharmonic Materials through Active Learning

链接: https://arxiv.org/abs/2409.11808
作者: Kisung Kang,Thomas A. R. Purcell,Christian Carbogno,Matthias Scheffler
关键词-EN: initio molecular dynamics, employing machine-learned interatomic, urgently needed complement, machine-learned interatomic potentials, Molecular dynamics
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注: 15 pages, 13 figures

点击查看摘要

Abstract:Molecular dynamics (MD) employing machine-learned interatomic potentials (MLIPs) serve as an efficient, urgently needed complement to ab initio molecular dynamics (aiMD). By training these potentials on data generated from ab initio methods, their averaged predictions can exhibit comparable performance to ab initio methods at a fraction of the cost. However, insufficient training sets might lead to an improper description of the dynamics in strongly anharmonic materials, because critical effects might be overlooked in relevant cases, or only incorrectly captured, or hallucinated by the MLIP when they are not actually present. In this work, we show that an active learning scheme that combines MD with MLIPs (MLIP-MD) and uncertainty estimates can avoid such problematic predictions. In short, efficient MLIP-MD is used to explore configuration space quickly, whereby an acquisition function based on uncertainty estimates and on energetic viability is employed to maximize the value of the newly generated data and to focus on the most unfamiliar but reasonably accessible regions of phase space. To verify our methodology, we screen over 112 materials and identify 10 examples experiencing the aforementioned problems. Using CuI and AgGaSe _2 as archetypes for these problematic materials, we discuss the physical implications for strongly anharmonic effects and demonstrate how the developed active learning scheme can address these issues.

[LG-112] Symmetry-Based Structured Matrices for Efficient Approximately Equivariant Networks

链接: https://arxiv.org/abs/2409.11772
作者: Ashwin Samudre,Mircea Petrache,Brian D. Nord,Shubhendu Trivedi
关键词-EN: designing symmetry-aware neural, symmetry-aware neural networks, recent interest, interest in designing, designing symmetry-aware
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 20 pages

点击查看摘要

Abstract:There has been much recent interest in designing symmetry-aware neural networks (NNs) exhibiting relaxed equivariance. Such NNs aim to interpolate between being exactly equivariant and being fully flexible, affording consistent performance benefits. In a separate line of work, certain structured parameter matrices – those with displacement structure, characterized by low displacement rank (LDR) – have been used to design small-footprint NNs. Displacement structure enables fast function and gradient evaluation, but permits accurate approximations via compression primarily to classical convolutional neural networks (CNNs). In this work, we propose a general framework – based on a novel construction of symmetry-based structured matrices – to build approximately equivariant NNs with significantly reduced parameter counts. Our framework integrates the two aforementioned lines of work via the use of so-called Group Matrices (GMs), a forgotten precursor to the modern notion of regular representations of finite groups. GMs allow the design of structured matrices – resembling LDR matrices – which generalize the linear operations of a classical CNN from cyclic groups to general finite groups and their homogeneous spaces. We show that GMs can be employed to extend all the elementary operations of CNNs to general discrete groups. Further, the theory of structured matrices based on GMs provides a generalization of LDR theory focussed on matrices with cyclic structure, providing a tool for implementing approximate equivariance for discrete groups. We test GM-based architectures on a variety of tasks in the presence of relaxed symmetry. We report that our framework consistently performs competitively compared to approximately equivariant NNs, and other structured matrix-based compression frameworks, sometimes with a one or two orders of magnitude lower parameter count.
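For the cyclic group Z_n, a Group Matrix reduces to a familiar object: a circulant matrix, whose (i, j) entry depends only on (j - i) mod n and whose action on a vector is a cyclic convolution, i.e., exactly the weight sharing of a 1-D CNN. A minimal illustration (the indexing convention is ours, not necessarily the paper's):

```python
import numpy as np

def cyclic_group_matrix(weights):
    """Group matrix for Z_n: entry (i, j) = weights[(j - i) mod n]."""
    n = len(weights)
    return np.array([[weights[(j - i) % n] for j in range(n)] for i in range(n)])

G = cyclic_group_matrix([1, 2, 3])
v = np.array([1.0, 0.0, 0.0])

# The circulant structure makes G commute with cyclic shifts (equivariance):
lhs = G @ np.roll(v, 1)
rhs = np.roll(G @ v, 1)
```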

[LG-113] From exponential to finite/fixed-time stability: Applications to optimization

链接: https://arxiv.org/abs/2409.11713
作者: Ibrahim K. Ozaslan,Mihailo R. Jovanović
关键词-EN: typically involves study, algorithms typically involves, specific problem instances, optimization algorithms typically, fixed-time stable algorithm
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
*备注: 6 pages; 1 figure

点击查看摘要

Abstract:The development of finite/fixed-time stable optimization algorithms typically involves study of specific problem instances. The lack of a unified framework hinders understanding of more sophisticated algorithms, e.g., primal-dual gradient flow dynamics. The purpose of this paper is to address the following question: Given an exponentially stable optimization algorithm, can it be modified to obtain a finite/fixed-time stable algorithm? We provide an affirmative answer, demonstrate how the solution can be computed on a finite-time interval via a simple scaling of the right-hand-side of the original dynamics, and certify the desired properties of the modified algorithm using the Lyapunov function that proves exponential stability of the original system. Finally, we examine nonsmooth composite optimization problems and smooth problems with linear constraints to demonstrate the merits of our approach.
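One standard way to turn an exponentially stable flow into a finite-time one by scaling the right-hand side is a time change; the derivation below is our illustration of the mechanism, not necessarily the paper's exact scaling.

```latex
\text{If } \dot{x} = F(x) \text{ satisfies } \|x(t) - x^\star\| \le c\, e^{-\lambda t},
\text{ define the time change } t(s) = -\tau \ln(1 - s/T), \quad s \in [0, T),
\text{ and set } y(s) = x(t(s)). \text{ Then}
\quad \dot{y}(s) = \frac{\tau}{T - s}\, F\big(y(s)\big),
\quad \text{and since } e^{-\lambda t(s)} = (1 - s/T)^{\lambda \tau},
\quad \|y(s) - x^\star\| \le c\,(1 - s/T)^{\lambda \tau} \xrightarrow{\, s \to T \,} 0,
```

so the scaled dynamics reach the optimizer within the prescribed finite time T, with the convergence certificate inherited from the original system's exponential-stability bound.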

[LG-114] How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities

链接: https://arxiv.org/abs/2409.11654
作者: Charlotte Bunne,Yusuf Roohani,Yanay Rosen,Ankit Gupta,Xikun Zhang,Marcel Roed,Theo Alexandrov,Mohammed AlQuraishi,Patricia Brennan,Daniel B. Burkhardt,Andrea Califano,Jonah Cool,Abby F. Dernburg,Kirsty Ewing,Emily B. Fox,Matthias Haury,Amy E. Herr,Eric Horvitz,Patrick D. Hsu,Viren Jain,Gregory R. Johnson,Thomas Kalil,David R. Kelley,Shana O. Kelley,Anna Kreshuk,Tim Mitchison,Stephani Otte,Jay Shendure,Nicholas J. Sofroniew,Fabian Theis,Christina V. Theodoris,Srigokul Upadhyayula,Marc Valer,Bo Wang,Eric Xing,Serena Yeung-Levy,Marinka Zitnik,Theofanis Karaletsos,Aviv Regev,Emma Lundberg,Jure Leskovec,Stephen R. Quake
关键词-EN: Virtual Cells, arguably the smallest, smallest unit, unit of life, cells
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:The cell is arguably the smallest unit of life and is central to understanding biology. Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease. Recent advances in artificial intelligence (AI), combined with the ability to generate large-scale experimental data, present novel opportunities to model cells. Here we propose a vision of AI-powered Virtual Cells, where robust representations of cells and cellular systems under different conditions are directly learned from growing biological data across measurements and scales. We discuss desired capabilities of AI Virtual Cells, including generating universal representations of biological entities across scales, and facilitating interpretable in silico experiments to predict and understand their behavior using Virtual Instruments. We further address the challenges, opportunities and requirements to realize this vision including data needs, evaluation strategies, and community standards and engagement to ensure biological accuracy and broad utility. We envision a future where AI Virtual Cells help identify new drug targets, predict cellular responses to perturbations, as well as scale hypothesis exploration. With open science collaborations across the biomedical ecosystem that includes academia, philanthropy, and the biopharma and AI industries, a comprehensive predictive understanding of cell mechanisms and interactions is within reach.

[LG-115] DiffESM: Conditional Emulation of Temperature and Precipitation in Earth System Models with 3D Diffusion Models

链接: https://arxiv.org/abs/2409.11601
作者: Seth Bassetti,Brian Hutchinson,Claudia Tebaldi,Ben Kravitz
关键词-EN: Earth System Models, Earth System, Earth climate, System Models, Earth
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Geophysics (physics.geo-ph)
*备注: Accepted for publication in Journal of Advances in Modeling Earth Systems

点击查看摘要

Abstract:Earth System Models (ESMs) are essential for understanding the interaction between human activities and the Earth’s climate. However, the computational demands of ESMs often limit the number of simulations that can be run, hindering the robust analysis of risks associated with extreme weather events. While low-cost climate emulators have emerged as an alternative to emulate ESMs and enable rapid analysis of future climate, many of these emulators only provide output on at most a monthly frequency. This temporal resolution is insufficient for analyzing events that require daily characterization, such as heat waves or heavy precipitation. We propose using diffusion models, a class of generative deep learning models, to effectively downscale ESM output from a monthly to a daily frequency. Trained on a handful of ESM realizations, reflecting a wide range of radiative forcings, our DiffESM model takes monthly mean precipitation or temperature as input, and is capable of producing daily values with statistical characteristics close to ESM output. Combined with a low-cost emulator providing monthly means, this approach requires only a small fraction of the computational resources needed to run a large ensemble. We evaluate model behavior using a number of extreme metrics, showing that DiffESM closely matches the spatio-temporal behavior of the ESM output it emulates in terms of the frequency and spatial characteristics of phenomena such as heat waves, dry spells, or rainfall intensity.

[LG-116] Outlier Detection with Cluster Catch Digraphs

链接: https://arxiv.org/abs/2409.11596
作者: Rui Shi,Nedret Billor,Elvan Ceyhan
关键词-EN: Mutual Catch Graph, Cluster Catch Digraphs, Mutual Catch, Catch Graph, varying cluster shapes
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 73 pages, 146 figures

点击查看摘要

Abstract:This paper introduces a novel family of outlier detection algorithms based on Cluster Catch Digraphs (CCDs), specifically tailored to address the challenges of high dimensionality and varying cluster shapes, which deteriorate the performance of most traditional outlier detection methods. We propose the Uniformity-Based CCD with Mutual Catch Graph (U-MCCD), the Uniformity- and Neighbor-Based CCD with Mutual Catch Graph (UN-MCCD), and their shape-adaptive variants (SU-MCCD and SUN-MCCD), which are designed to detect outliers in data sets with arbitrary cluster shapes and high dimensions. We present the advantages and shortcomings of these algorithms and provide the motivation or need to define each particular algorithm. Through comprehensive Monte Carlo simulations, we assess their performance and demonstrate the robustness and effectiveness of our algorithms across various settings and contamination levels. We also illustrate the use of our algorithms on various real-life data sets. The U-MCCD algorithm efficiently identifies outliers while maintaining high true negative rates, and the SU-MCCD algorithm shows substantial improvement in handling non-uniform clusters. Additionally, the UN-MCCD and SUN-MCCD algorithms address the limitations of existing methods in high-dimensional spaces by utilizing Nearest Neighbor Distances (NND) for clustering and outlier detection. Our results indicate that these novel algorithms offer substantial advancements in the accuracy and adaptability of outlier detection, providing a valuable tool for various real-world applications. Keywords: Outlier detection, Graph-based clustering, Cluster catch digraphs, k-nearest-neighborhood, Mutual catch graphs, Nearest neighbor distance.
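The Nearest Neighbor Distance ingredient used by the UN-MCCD and SUN-MCCD variants can be shown in isolation: score each point by its distance to its k-th nearest neighbor and flag points with unusually large scores. The actual CCD algorithms are graph-based; this sketch covers only the NND idea, with names of our choosing.

```python
import numpy as np

def knn_distance_scores(X, k=3):
    """Distance from each point to its k-th nearest neighbor."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists.sort(axis=1)               # column 0 is each point's distance to itself
    return dists[:, k]

rng = np.random.default_rng(1)
cluster = rng.normal(0.0, 0.5, size=(50, 2))   # a tight cluster near the origin
outlier = np.array([[8.0, 8.0]])                # one planted outlier
X = np.vstack([cluster, outlier])

scores = knn_distance_scores(X, k=3)
flagged = int(np.argmax(scores))                # index 50: the planted outlier
```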

[LG-117] Automating proton PBS treatment planning for head and neck cancers using policy gradient-based deep reinforcement learning

链接: https://arxiv.org/abs/2409.11576
作者: Qingqing Wang,Chang Chang
关键词-EN: pencil beam scanning, Proton pencil beam, planning, planning objectives, beam scanning
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Proton pencil beam scanning (PBS) treatment planning for head and neck (HN) cancers is a time-consuming and experience-demanding task where a large number of planning objectives are involved. Deep reinforcement learning (DRL) has recently been introduced to the planning processes of intensity-modulated radiation therapy and brachytherapy for prostate, lung, and cervical cancers. However, existing approaches are built upon the Q-learning framework and weighted linear combinations of clinical metrics, suffering from poor scalability and flexibility and only capable of adjusting a limited number of planning objectives in discrete action spaces. We propose an automatic treatment planning model using the proximal policy optimization (PPO) algorithm and a dose distribution-based reward function for proton PBS treatment planning of HN cancers. Specifically, a set of empirical rules is used to create auxiliary planning structures from target volumes and organs-at-risk (OARs), along with their associated planning objectives. These planning objectives are fed into an in-house optimization engine to generate the spot monitor unit (MU) values. A decision-making policy network trained using PPO is developed to iteratively adjust the involved planning objective parameters in a continuous action space and refine the PBS treatment plans using a novel dose distribution-based reward function. Proton HN treatment plans generated by the model show improved OAR sparing with equal or superior target coverage when compared with human-generated plans. Moreover, additional experiments on liver cancer demonstrate that the proposed method can be successfully generalized to other treatment sites. To the best of our knowledge, this is the first DRL-based automatic treatment planning model capable of achieving human-level performance for HN cancers.

[LG-118] Discrete Unit based Masking for Improving Disentanglement in Voice Conversion

链接: https://arxiv.org/abs/2409.11560
作者: Philip H. Lee,Ismail Rasim Ulgen,Berrak Sisman
关键词-EN: Voice conversion, speaker identity, speaker, Voice, speaker features
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to IEEE SLT 2024

点击查看摘要

Abstract:Voice conversion (VC) aims to modify the speaker’s identity while preserving the linguistic content. Commonly, VC methods use an encoder-decoder architecture, where disentangling the speaker’s identity from linguistic information is crucial. However, the disentanglement approaches used in these methods are limited, as the speaker features depend on the phonetic content of the utterance, compromising disentanglement. This dependency is amplified with attention-based methods. To address this, we introduce a novel masking mechanism in the input before speaker encoding, masking certain discrete speech units that correlate highly with phoneme classes. Our work aims to reduce the phonetic dependency of speaker features by restricting access to some phonetic information. Furthermore, since our approach operates at the input level, it is applicable to any encoder-decoder based VC framework. Our approach improves disentanglement and conversion performance across multiple VC methods, showing significant effectiveness, particularly in the attention-based method, with a 44% relative improvement in objective intelligibility.
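The input-level masking idea can be sketched directly: given a sequence of discrete speech units, replace the units that correlate most strongly with phoneme classes by a mask token before the speaker encoder sees them. The unit IDs and the "highly phonetic" set below are made up for illustration.

```python
import numpy as np

def mask_phonetic_units(units, phonetic_units, mask_id=0):
    """Replace units that carry strong phonetic information with a mask token."""
    units = np.asarray(units)
    return np.where(np.isin(units, list(phonetic_units)), mask_id, units)

units = [5, 12, 7, 12, 3, 9]          # discrete speech units for one utterance
highly_phonetic = {12, 9}             # assumed: units tied to phoneme classes
print(mask_phonetic_units(units, highly_phonetic).tolist())   # [5, 0, 7, 0, 3, 0]
```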

[LG-119] Machine Learning for Analyzing Atomic Force Microscopy (AFM) Images Generated from Polymer Blends

链接: https://arxiv.org/abs/2409.11438
作者: Aanish Paruchuri,Yunfei Wang,Xiaodan Gu,Arthi Jayaraman
关键词-EN: atomic force microscopy, force microscopy images, microscopy images obtained, unsupervised learning techniques, AFM images
类目: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Soft Condensed Matter (cond-mat.soft); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 39 pages, 13 figures, 4 tables

点击查看摘要

Abstract:In this paper we present a new machine learning workflow with unsupervised learning techniques to identify domains within atomic force microscopy (AFM) images obtained from polymer films. The goal of the workflow is to identify the spatial location of the two types of polymer domains with little to no manual intervention and to calculate the domain size distributions, which in turn can help qualify the phase-separated state of the material as macrophase- or microphase-ordered or disordered domains. We briefly review existing approaches from other fields, computer vision and signal processing, that are applicable to these tasks, which arise frequently in polymer science and engineering. We then test these computer vision and signal processing approaches on the AFM image dataset to identify the strengths and limitations of each for our first task. For this domain segmentation task, we found that a workflow using the discrete Fourier transform (DFT) or discrete cosine transform (DCT) with variance statistics as the feature works best. The popular ResNet50 deep learning approach from the computer vision field exhibited relatively poorer performance on domain segmentation for our AFM images compared to the DFT- and DCT-based workflows. For the second task, we used the existing porespy Python package to calculate the domain size distribution from the DFT-based workflow's output for each of the 144 input AFM images. The information and open-source code we share in this paper can serve as a guide for researchers in the polymer and soft materials fields who need ML modeling and workflows for automated analysis of AFM images from polymer samples that may have crystalline or amorphous domains, sharp or rough interfaces between domains, or micro- or macrophase-separated domains.

[LG-120] Federated Learning with Quantum Computing and Fully Homomorphic Encryption: A Novel Computing Paradigm Shift in Privacy-Preserving ML

链接: https://arxiv.org/abs/2409.11430
作者: Siddhant Dutta,Pavana P Karanth,Pedro Maciel Xavier,Iago Leal de Freitas,Nouhaila Innan,Sadok Ben Yahia,Muhammad Shafique,David E. Bernal Neira
关键词-EN: information security worldwide, Fully Homomorphic Encryption, widespread deployment, deployment of products, products powered
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:The widespread deployment of products powered by machine learning models is raising concerns around data privacy and information security worldwide. To address this issue, Federated Learning was first proposed as a privacy-preserving alternative to conventional methods that allow multiple learning clients to share model knowledge without disclosing private data. A complementary approach known as Fully Homomorphic Encryption (FHE) is a quantum-safe cryptographic system that enables operations to be performed on encrypted weights. However, implementing mechanisms such as these in practice often comes with significant computational overhead and can expose potential security threats. Novel computing paradigms, such as analog, quantum, and specialized digital hardware, present opportunities for implementing privacy-preserving machine learning systems while enhancing security and mitigating performance loss. This work instantiates these ideas by applying the FHE scheme to a Federated Learning Neural Network architecture that integrates both classical and quantum layers.
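该摘要中联邦学习的核心思想(只交换模型权重、不交换原始数据)可以用一个最小的 FedAvg 草图来说明。以下为示意性代码,仅演示普通的联邦平均,不包含论文中的 FHE 加密层或量子层,也不是作者的实现;数据与模型均为虚构的玩具示例。

```python
import numpy as np

def local_update(weights: np.ndarray, X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1, steps: int = 10) -> np.ndarray:
    """Each client runs a few gradient steps on its private data."""
    w = weights.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # linear-regression gradient
        w -= lr * grad
    return w

def fed_avg(client_weights: list[np.ndarray]) -> np.ndarray:
    """The server only ever sees (and averages) model weights."""
    return np.mean(client_weights, axis=0)

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):                           # three clients, each with private data
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w))

global_w = np.zeros(2)
for _round in range(20):                     # federated rounds
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = fed_avg(updates)

print(np.round(global_w, 2))  # ≈ [ 2. -1.]
```

在真实系统中,`local_update` 返回的权重会先经 FHE 加密再上传,服务器在密文上完成平均;此处省略该步骤以保持示例自含。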

信息检索

[IR-0] Generalized compression and compressive search of large datasets

链接: https://arxiv.org/abs/2409.12161
作者: Morgan E. Prior,Thomas Howard III,Emily Light,Najib Ishaq,Noah M. Daniels
关键词-EN: Big Data explosion, Toggle, Big Data, search, Data
类目: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The Big Data explosion has necessitated the development of search algorithms that scale sub-linearly in time and memory. While compression algorithms and search algorithms do exist independently, few algorithms offer both, and those which do are domain-specific. We present panCAKES, a novel approach to compressive search, i.e., a way to perform k-NN and ρ-NN search on compressed data while only decompressing a small, relevant portion of the data. panCAKES assumes the manifold hypothesis and leverages the low-dimensional structure of the data to compress and search it efficiently. panCAKES is generic over any distance function for which the distance between two points is proportional to the memory cost of storing an encoding of one in terms of the other. This property holds for many widely-used distance functions, e.g. string edit distances (Levenshtein, Needleman-Wunsch, etc.) and set dissimilarity measures (Jaccard, Dice, etc.). We benchmark panCAKES on a variety of datasets, including genomic, proteomic, and set data. We compare compression ratios to gzip, and search performance between the compressed and uncompressed versions of the same dataset. panCAKES achieves compression ratios close to those of gzip, while offering sub-linear time performance for k-NN and ρ-NN search. We conclude that panCAKES is an efficient, general-purpose algorithm for exact compressive search on large datasets that obey the manifold hypothesis. We provide an open-source implementation of panCAKES in the Rust programming language.
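摘要中"距离正比于编码代价"这一性质可以用编辑距离直观说明:两个字符串的 Levenshtein 距离正好是把一个串改写成另一个串所需最小编辑脚本的长度。以下为示意性玩具代码(panCAKES 本身用 Rust 实现,此处只演示该性质下的精确 k-NN 搜索,非作者代码)。

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance; also the size of a minimal edit script from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def knn(query: str, points: list[str], k: int) -> list[str]:
    """Exact k-NN under edit distance (sorted is stable, so ties keep input order)."""
    return sorted(points, key=lambda p: levenshtein(query, p))[:k]

points = ["GATTACA", "GATTACC", "AATTACA", "CCCCGGG"]
print(knn("GATTACA", points, 2))  # ['GATTACA', 'GATTACC']
```

在压缩场景下,每个点可只存相对某个近邻"中心"的编辑脚本,查询时仅解压与查询邻近的少量候选,这正是压缩式搜索的出发点。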

[IR-1] Decoding Style: Efficient Fine-Tuning of LLMs for Image-Guided Outfit Recommendation with Preference CIKM2024

链接: https://arxiv.org/abs/2409.12150
作者: Najmeh Forouzandehmehr,Nima Farrokhsiar,Ramin Giahi,Evren Korpeoglu,Kannan Achan
关键词-EN: large language models, fashion compatibility understanding, Multimodal Large Language, large language, complex challenge
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: CIKM 2024

点击查看摘要

Abstract:Personalized outfit recommendation remains a complex challenge, demanding both fashion compatibility understanding and trend awareness. This paper presents a novel framework that harnesses the expressive power of large language models (LLMs) for this task, mitigating their “black box” and static nature through fine-tuning and direct feedback integration. We bridge the visual-textual gap in item descriptions by employing image captioning with a Multimodal Large Language Model (MLLM). This enables the LLM to extract style and color characteristics from human-curated fashion images, forming the basis for personalized recommendations. The LLM is efficiently fine-tuned on the open-source Polyvore dataset of curated fashion images, optimizing its ability to recommend stylish outfits. A direct preference mechanism using negative examples is employed to enhance the LLM’s decision-making process. This creates a self-enhancing AI feedback loop that continuously refines recommendations in line with seasonal fashion trends. Our framework is evaluated on the Polyvore dataset, demonstrating its effectiveness in two key tasks: fill-in-the-blank and complementary item retrieval. These evaluations underline the framework’s ability to generate stylish, trend-aligned outfit suggestions, continuously improving through direct feedback. The evaluation results demonstrated that our proposed framework significantly outperforms the base LLM, creating more cohesive outfits. The improved performance in these tasks underscores the proposed framework’s potential to enhance the shopping experience with accurate suggestions, proving its effectiveness over vanilla LLM-based outfit generation.

[IR-2] Skill matching at scale: freelancer-project alignment for efficient multilingual candidate retrieval

链接: https://arxiv.org/abs/2409.12097
作者: Warren Jouanneau,Marc Palyart,Emma Jouffroy
关键词-EN: Finding the perfect, perform at scale, perfect match, job proposal, easy task
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Finding the perfect match between a job proposal and a set of freelancers is not an easy task to perform at scale, especially in multiple languages. In this paper, we propose a novel neural retriever architecture that tackles this problem in a multilingual setting. Our method encodes project descriptions and freelancer profiles by leveraging pre-trained multilingual language models. The latter are used as backbone for a custom transformer architecture that aims to keep the structure of the profiles and project. This model is trained with a contrastive loss on historical data. Thanks to several experiments, we show that this approach effectively captures skill matching similarity and facilitates efficient matching, outperforming traditional methods.
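该摘要描述的"编码双方、按相似度检索"的流程可用一个极简草图说明。以下代码用词袋余弦相似度代替论文中的多语言预训练 Transformer 编码器,人名与资料均为虚构示例,仅为演示检索结构。

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'encoder' standing in for a transformer."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_freelancers(project: str, profiles: dict[str, str]) -> list[str]:
    """Rank candidate profiles by similarity to the project description."""
    q = embed(project)
    return sorted(profiles, key=lambda name: cosine(q, embed(profiles[name])), reverse=True)

profiles = {
    "alice": "python machine learning nlp",
    "bob": "graphic design illustration branding",
}
print(rank_freelancers("nlp python project", profiles))  # alice ranks first
```

论文中的模型用对比损失在历史匹配数据上训练编码器,使匹配的(项目, 自由职业者)对在向量空间中靠得更近;上面的排序步骤保持不变,只是换成学到的向量。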

[IR-3] Understanding the Effects of the Baidu-ULTR Logging Policy on Two-Tower Models RECSYS’24

链接: https://arxiv.org/abs/2409.12043
作者: Morris de Haan,Philipp Hager
关键词-EN: recent work suggests, logging policy confounding, learning to rank, recent work, industry applications
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted at the CONSEQUENCES '24 workshop, co-located with ACM RecSys '24

点击查看摘要

Abstract:Despite the popularity of the two-tower model for unbiased learning to rank (ULTR) tasks, recent work suggests that it suffers from a major limitation that could lead to its collapse in industry applications: the problem of logging policy confounding. Several potential solutions have even been proposed; however, the evaluation of these methods was mostly conducted using semi-synthetic simulation experiments. This paper bridges the gap between theory and practice by investigating the confounding problem on the largest real-world dataset, Baidu-ULTR. Our main contributions are threefold: 1) we show that the conditions for the confounding problem are met on Baidu-ULTR, 2) the confounding problem has no significant effect on the two-tower model, and 3) we point to a potential mismatch between expert annotations, the golden standard in ULTR, and user click behavior.
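摘要讨论的双塔分解可以用一个数值玩具直观表示:点击概率 = 相关性塔 × 检查(examination)塔,位置偏置由检查侧解释。以下代码纯属示意,检查曲线取 1/排名,并非 Baidu-ULTR 数据上的真实参数。

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def click_probability(relevance_logit: float, position: int) -> float:
    """Two-tower factorization: relevance(query, doc) x examination(position)."""
    examination = 1.0 / position          # toy position-bias curve (1/rank)
    return sigmoid(relevance_logit) * examination

# A highly relevant doc shown at rank 3 vs. a weak doc at rank 1:
p_good_low = click_probability(2.0, position=3)
p_weak_top = click_probability(-1.0, position=1)
print(round(p_good_low, 3), round(p_weak_top, 3))  # 0.294 0.269
```

"日志策略混淆"问题正是指:当产生日志的排序策略已按相关性排过序时,位置与相关性强相关,两座塔便难以分离这两个因素。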

[IR-4] AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

链接: https://arxiv.org/abs/2409.11905
作者: Zhaxizhuoma,Pengan Chen,Ziniu Wu,Jiawei Sun,Dong Wang,Peng Zhou,Nieqing Cao,Yan Ding,Bin Zhao,Xuelong Li
关键词-EN: paper presents AlignBot, paper presents, framework designed, designed to optimize, robots by effectively
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders-such as personalized preferences, corrective guidance, and contextual assistance-into structured instruction-formatted cues that prompt GPT-4o in generating customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in real-world household environments, which are constructed within the laboratory to replicate typical household settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders, achieving 86.8% success rate compared to the vanilla GPT-4o baseline at 21.6%, reflecting a 65% improvement and over four times greater effectiveness. Supplementary materials are available at: this https URL

[IR-5] Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation

链接: https://arxiv.org/abs/2409.11860
作者: Kasra Hosseini,Thomas Kober,Josip Krapac,Roland Vollgraf,Weiwei Cheng,Ana Peleteiro Ramallo
关键词-EN: Evaluating production-level retrieval, well-trained human annotators, challenging task due, Large Language Models, production-level retrieval systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
*备注: 13 pages, 5 figures, 4 Tables

点击查看摘要

Abstract:Evaluating production-level retrieval systems at scale is a crucial yet challenging task due to the limited availability of a large pool of well-trained human annotators. Large Language Models (LLMs) have the potential to address this scaling issue and offer a viable alternative to humans for the bulk of annotation tasks. In this paper, we propose a framework for assessing the product search engines in a large-scale e-commerce setting, leveraging Multimodal LLMs for (i) generating tailored annotation guidelines for individual queries, and (ii) conducting the subsequent annotation task. Our method, validated through deployment on a large e-commerce platform, demonstrates comparable quality to human annotations, significantly reduces time and cost, facilitates rapid problem discovery, and provides an effective solution for production-level quality control at scale.

[IR-6] The Factuality of Large Language Models in the Legal Domain CIKM2024

链接: https://arxiv.org/abs/2409.11798
作者: Rajaa El Hamdani,Thomas Bonald,Fragkiskos Malliaros,Nils Holzenberger,Fabian Suchanek
关键词-EN: large language models, realistic usage scenario, language models, model abstain, usage scenario
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: CIKM 2024, short paper

点击查看摘要

Abstract:This paper investigates the factuality of large language models (LLMs) as knowledge bases in the legal domain, in a realistic usage scenario: we allow for acceptable variations in the answer, and let the model abstain from answering when uncertain. First, we design a dataset of diverse factual questions about case law and legislation. We then use the dataset to evaluate several LLMs under different evaluation methods, including exact, alias, and fuzzy matching. Our results show that the performance improves significantly under the alias and fuzzy matching methods. Further, we explore the impact of abstaining and in-context examples, finding that both strategies enhance precision. Finally, we demonstrate that additional pre-training on legal documents, as seen with SaulLM, further improves factual precision from 63% to 81%.
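摘要中的三种匹配方式(精确、别名、模糊)可以直接写成可运行的评测函数。以下为示意性实现,阈值 0.85 与示例数据均为假设,论文的具体判定标准可能不同。

```python
from difflib import SequenceMatcher

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def alias_match(pred: str, aliases: set[str]) -> bool:
    """Accept any answer in a curated set of acceptable variants."""
    return pred.strip().lower() in {a.lower() for a in aliases}

def fuzzy_match(pred: str, gold: str, threshold: float = 0.85) -> bool:
    """Accept answers above a string-similarity threshold (hypothetical value)."""
    return SequenceMatcher(None, pred.lower(), gold.lower()).ratio() >= threshold

gold = "Court of Justice of the European Union"
aliases = {gold, "CJEU", "European Court of Justice"}
pred = "the Court of Justice of the European Union"
print(exact_match(pred, gold), alias_match("CJEU", aliases), fuzzy_match(pred, gold))
# False True True
```

这一对比也说明了论文的发现:同一个模型输出在精确匹配下被判错,而在别名或模糊匹配下被判对,因此后两种评测方式下的准确率显著更高。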

[IR-7] Active Reconfigurable Intelligent Surface Empowered Synthetic Aperture Radar Imaging

链接: https://arxiv.org/abs/2409.11728
作者: Yifan Sun,Rang Liu,Zhiping Lu,Honghao Luo,Ming Li,Qian Liu
关键词-EN: Synthetic Aperture Radar, achieve higher spatial, higher spatial resolution, spatial resolution imaging, Synthetic Aperture
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Synthetic Aperture Radar (SAR) utilizes the movement of the radar antenna over a specific area of interest to achieve higher spatial resolution imaging. In this paper, we aim to investigate the realization of SAR imaging for a stationary radar system with the assistance of active reconfigurable intelligent surface (ARIS) mounted on an unmanned aerial vehicle (UAV). As the UAV moves along the stationary trajectory, the ARIS can not only build a high-quality virtual line-of-sight (LoS) propagation path, but its mobility can also effectively create a much larger virtual aperture, which can be utilized to realize a SAR system. In this paper, we first present a range-Doppler (RD) imaging algorithm to obtain imaging results for the proposed ARIS-empowered SAR system. Then, to further improve the SAR imaging performance, we attempt to optimize the reflection coefficients of ARIS to maximize the signal-to-noise ratio (SNR) at the stationary radar receiver under the constraints of ARIS maximum power and amplification factor. An effective algorithm based on fractional programming (FP) and majorization minimization (MM) methods is developed to solve the resulting non-convex problem. Simulation results validate the effectiveness of ARIS-assisted SAR imaging and our proposed RD imaging and ARIS optimization algorithms.

[IR-8] FLARE: Fusing Language Models and Collaborative Architectures for Recommender Enhancement

链接: https://arxiv.org/abs/2409.11699
作者: Liam Hebert,Marialena Kyriakidi,Hubert Pham,Krishna Sayana,James Pine,Sukhdeep Sodhi,Ambarish Jash
关键词-EN: Hybrid recommender systems, combining item IDs, Hybrid recommender, textual descriptions, offer potential
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Hybrid recommender systems, combining item IDs and textual descriptions, offer potential for improved accuracy. However, previous work has largely focused on smaller datasets and model architectures. This paper introduces Flare (Fusing Language models and collaborative Architectures for Recommender Enhancement), a novel hybrid recommender that integrates a language model (mT5) with a collaborative filtering model (Bert4Rec) using a Perceiver network. This architecture allows Flare to effectively combine collaborative and content information for enhanced recommendations. We conduct a two-stage evaluation, first assessing Flare’s performance against established baselines on smaller datasets, where it demonstrates competitive accuracy. Subsequently, we evaluate Flare on a larger, more realistic dataset with a significantly larger item vocabulary, introducing new baselines for this setting. Finally, we showcase Flare’s inherent ability to support critiquing, enabling users to provide feedback and refine recommendations. We further leverage critiquing as an evaluation method to assess the model’s language understanding and its transferability to the recommendation task.
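混合推荐的基本思路(ID 协同信号与文本内容信号共同参与打分)可以用一个确定性的拼接玩具来演示。注意:Flare 实际用 Perceiver 网络融合 mT5 与 Bert4Rec 的表示,以下仅为示意,向量均为手造。

```python
import numpy as np

# three items; ID embeddings carry collaborative signal, text embeddings carry content
id_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
text_emb = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
item_emb = np.concatenate([id_emb, text_emb], axis=1)  # fused 4-d item vectors

def recommend(user_vec: np.ndarray, k: int = 2) -> list[int]:
    """Score all items by dot product against the user vector."""
    scores = item_emb @ user_vec
    return [int(i) for i in np.argsort(-scores)[:k]]

# a user with strong collaborative affinity for item 0 and a milder content match with item 2
user_vec = np.array([1.0, 0.0, 0.4, 0.4])
print(recommend(user_vec))  # [0, 2]
```

拼接融合的好处在示例中可见:item 2 没有任何协同信号(如冷启动新品),仍能凭内容侧进入推荐列表,这正是混合架构相对纯 ID 模型的价值所在。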

[IR-9] Basket-Enhanced Heterogenous Hypergraph for Price-Sensitive Next Basket Recommendation

链接: https://arxiv.org/abs/2409.11695
作者: Yuening Zhou,Yulin Wang,Qian Cui,Xinyu Guan,Francisco Cisternas
关键词-EN: Basket Recommendation, Existing NBR models, type of recommender, recommender system, system that predicts
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Next Basket Recommendation (NBR) is a new type of recommender system that predicts combinations of items users are likely to purchase together. Existing NBR models often overlook a crucial factor, which is price, and do not fully capture item-basket-user interactions. To address these limitations, we propose a novel method called Basket-augmented Dynamic Heterogeneous Hypergraph (BDHH). BDHH utilizes a heterogeneous multi-relational graph to capture the intricate relationships among item features, with price as a critical factor. Moreover, our approach includes a basket-guided augmentation network that dynamically enhances item-basket-user interactions. Experiments on real-world datasets demonstrate that BDHH significantly improves recommendation accuracy, providing a more comprehensive understanding of user behavior.

[IR-10] LLM-Powered Text Simulation Attack Against ID-Free Recommender Systems

链接: https://arxiv.org/abs/2409.11690
作者: Zongwei Wang,Min Gao,Junliang Yu,Xinyi Gao,Quoc Viet Hung Nguyen,Shazia Sadiq,Hongzhi Yin
关键词-EN: ID-free recommender systems, ID-free recommendation paradigm, traditional recommender systems, recommender systems struggle, model cold-start users
类目: Information Retrieval (cs.IR)
*备注: 12 pages

点击查看摘要

Abstract:The ID-free recommendation paradigm has been proposed to address the limitation that traditional recommender systems struggle to model cold-start users or items with new IDs. Despite its effectiveness, this study uncovers that ID-free recommender systems are vulnerable to the proposed Text Simulation attack (TextSimu) which aims to promote specific target items. As a novel type of text poisoning attack, TextSimu exploits large language models (LLM) to alter the textual information of target items by simulating the characteristics of popular items. It operates effectively in both black-box and white-box settings, utilizing two key components: a unified popularity extraction module, which captures the essential characteristics of popular items, and an N-persona consistency simulation strategy, which creates multiple personas to collaboratively synthesize refined promotional textual descriptions for target items by simulating the popular items. To withstand TextSimu-like attacks, we further explore the detection approach for identifying LLM-generated promotional text. Extensive experiments conducted on three datasets demonstrate that TextSimu poses a more significant threat than existing poisoning attacks, while our defense method can detect malicious text of target items generated by TextSimu. By identifying the vulnerability, we aim to advance the development of more robust ID-free recommender systems.

[IR-11] An Enhanced-State Reinforcement Learning Algorithm for Multi-Task Fusion in Large-Scale Recommender Systems

链接: https://arxiv.org/abs/2409.11678
作者: Peng Liu,Jiawei Zhu,Cong Xu,Ming Zhao,Bin Wang
关键词-EN: Recommender Systems, multiple scores predicted, combining multiple scores, Multi-Task Fusion, stage of Recommender
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2404.17589

点击查看摘要

Abstract:As the last key stage of Recommender Systems (RSs), Multi-Task Fusion (MTF) is in charge of combining multiple scores predicted by Multi-Task Learning (MTL) into a final score to maximize user satisfaction, which decides the ultimate recommendation results. In recent years, to maximize long-term user satisfaction within a recommendation session, Reinforcement Learning (RL) is widely used for MTF in large-scale RSs. However, limited by their modeling pattern, all the current RL-MTF methods can only utilize user features as the state to generate actions for each user, but are unable to make use of item features and other valuable features, which leads to suboptimal results. Addressing this problem is a challenge that requires breaking through the current modeling pattern of RL-MTF. To solve this problem, we propose a novel method called Enhanced-State RL for MTF in RSs. Unlike the existing methods mentioned above, our method first defines user features, item features, and other valuable features collectively as the enhanced state; then proposes a novel actor and critic learning process to utilize the enhanced state to take much better actions for each user-item pair. To the best of our knowledge, this novel modeling pattern is being proposed for the first time in the field of RL-MTF. We conduct extensive offline and online experiments in a large-scale RS. The results demonstrate that our model outperforms other models significantly. Enhanced-State RL has been fully deployed in our RS more than half a year, improving +3.84% user valid consumption and +0.58% user duration time compared to baseline.
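多任务融合(MTF)本身只是把多个 MTL 头的分数加权合成一个排序分,RL 策略负责按状态选择权重。以下玩具代码演示融合这一步,权重为手选而非学习所得,任务名与数值均为假设。

```python
def fuse(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Multi-Task Fusion: combine per-task scores into one ranking score."""
    return sum(weights[k] * scores[k] for k in scores)

# MTL head outputs for one user-item pair (hypothetical values)
scores = {"click": 0.8, "watch_time": 0.4, "like": 0.1}

# an RL policy over the "enhanced state" (user AND item features) could pick
# different weights per user-item pair; here we just compare two fixed actions:
engagement_weights = {"click": 0.2, "watch_time": 0.7, "like": 0.1}
ctr_weights = {"click": 0.8, "watch_time": 0.1, "like": 0.1}

print(fuse(scores, engagement_weights), fuse(scores, ctr_weights))
```

论文的要点在于:若状态只含用户特征,同一用户的所有候选物品共享一组权重;把物品特征并入状态后,每个 user-item 对都可获得各自的融合权重。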

[IR-12] Designing Interfaces for Multimodal Vector Search Applications CIKM2024

链接: https://arxiv.org/abs/2409.11629
作者: Owen Pendrigh Elliott,Tom Hamer,Jesse Clark
关键词-EN: Multimodal vector search, exposing numerous pieces, Multimodal vector, vector search offers, vector search
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注: 12 pages, 8 figures, CIKM 2024 MMSR Workshop

点击查看摘要

Abstract:Multimodal vector search offers a new paradigm for information retrieval by exposing numerous pieces of functionality which are not possible in traditional lexical search engines. While multimodal vector search can be treated as a drop-in replacement for these traditional systems, the experience can be significantly enhanced by leveraging the unique capabilities of multimodal search. Central to any information retrieval system is a user who expresses an information need. Traditional user interfaces with a single search bar allow users to interact with lexical search systems effectively; however, they are not necessarily optimal for multimodal vector search. In this paper we explore novel capabilities of multimodal vector search applications utilising CLIP models and present implementations and design patterns which better allow users to express their information needs and effectively interact with these systems in an information retrieval context.
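向量搜索界面常见的一种交互模式是把多个查询成分(文本、示例图片、需回避的概念)按权重合成一个查询向量。以下为示意性代码:实际向量应来自 CLIP 等模型,这里用手造的二维向量代替,权重与语义标注均为假设。

```python
import numpy as np

def compose_query(parts: list[tuple[np.ndarray, float]]) -> np.ndarray:
    """Weighted sum of query components, normalized to unit length."""
    q = sum(w * v for v, w in parts)
    return q / np.linalg.norm(q)

text_vec = np.array([1.0, 0.0])    # e.g. a text prompt embedding
avoid_vec = np.array([0.0, 1.0])   # e.g. a concept to steer away from

# positive weight pulls results toward the text, negative weight pushes away
query = compose_query([(text_vec, 1.0), (avoid_vec, -0.5)])
print(np.round(query, 3))  # roughly [0.894, -0.447]
```

这类加权组合查询在纯词法搜索中无法表达,正是摘要所说的"多模态向量搜索独有能力"的一个具体例子。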

[IR-13] Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2409.11598
作者: To Eun Kim,Fernando Diaz
关键词-EN: RAG systems, RAG, language models, models now enhance, enhance their responses
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Many language models now enhance their responses with retrieval capabilities, leading to the widespread adoption of retrieval-augmented generation (RAG) systems. However, despite retrieval being a core component of RAG, much of the research in this area overlooks the extensive body of work on fair ranking, neglecting the importance of considering all stakeholders involved. This paper presents the first systematic evaluation of RAG systems integrated with fair rankings. We focus specifically on measuring the fair exposure of each relevant item across the rankings utilized by RAG systems (i.e., item-side fairness), aiming to promote equitable growth for relevant item providers. To gain a deep understanding of the relationship between item-fairness, ranking quality, and generation quality in the context of RAG, we analyze nine different RAG systems that incorporate fair rankings across seven distinct datasets. Our findings indicate that RAG systems with fair rankings can maintain a high level of generation quality and, in many cases, even outperform traditional RAG systems, despite the general trend of a tradeoff between ensuring fairness and maintaining system-effectiveness. We believe our insights lay the groundwork for responsible and equitable RAG systems and open new avenues for future research. We publicly release our codebase and dataset at this https URL.
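摘要中"相关条目在排序中获得的曝光"可以用一个常见的位置折扣公式来度量(对数折扣是标准选择之一,论文采用的具体度量可能不同)。以下为示意性实现。

```python
import math

def exposure(ranking: list[str]) -> dict[str, float]:
    """Exposure of each item under a logarithmic position discount."""
    return {item: 1.0 / math.log2(rank + 2) for rank, item in enumerate(ranking)}

print(exposure(["doc_a", "doc_b", "doc_c"]))
```

条目侧公平性关注的正是这些曝光量在所有相关条目(及其提供者)之间的分布:若同样相关的两个条目长期获得悬殊的曝光,RAG 系统就在系统性地偏袒部分内容提供者。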

[IR-14] A Framework for Ranking Content Providers Using Prompt Engineering and Self-Attention Network

链接: https://arxiv.org/abs/2409.11511
作者: Gosuddin Kamaruddin Siddiqi,Deven Santhosh Shah,Radhika Bansal,Askar Kamalov
关键词-EN: Content Recommendation System, Recommendation System, Content Providers, rank Content Providers, Content Recommendation
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of ranking Content Providers for Content Recommendation System. Content Providers are the sources of news and other types of content, such as lifestyle, travel, gardening. We propose a framework that leverages explicit user feedback, such as clicks and reactions, and content-based features, such as writing style and frequency of publishing, to rank Content Providers for a given topic. We also use language models to engineer prompts that help us create a ground truth dataset for the previous unsupervised ranking problem. Using this ground truth, we expand with a self-attention based network to train on Learning to Rank ListWise task. We evaluate our framework using online experiments and show that it can improve the quality, credibility, and diversity of the content recommended to users.

[IR-15] Perceptions of Edinburgh: Capturing Neighbourhood Characteristics by Clustering Geoparsed Local News

链接: https://arxiv.org/abs/2409.11505
作者: Andreas Grivas,Claire Grover,Richard Tobin,Clare Llewellyn,Eleojo Oluwaseun Abubakar,Chunyu Zheng,Chris Dibben,Alan Marshall,Jamie Pearce,Beatrice Alex
关键词-EN: hard to define, complex and hard, health, Natural Language Processing, articles
类目: Information Retrieval (cs.IR)
*备注: Preprint - paper under submission

点击查看摘要

Abstract:The communities that we live in affect our health in ways that are complex and hard to define. Moreover, our understanding of the place-based processes affecting health and inequalities is limited. This undermines the development of robust policy interventions to improve local health and well-being. News media provides social and community information that may be useful in health studies. Here we propose a methodology for characterising neighbourhoods by using local news articles. More specifically, we show how we can use Natural Language Processing (NLP) to unlock further information about neighbourhoods by analysing, geoparsing and clustering news articles. Our work is novel because we combine street-level geoparsing tailored to the locality with clustering of full news articles, enabling a more detailed examination of neighbourhood characteristics. We evaluate our outputs and show via a confluence of evidence, both from a qualitative and a quantitative perspective, that the themes we extract from news articles are sensible and reflect many characteristics of the real world. This is significant because it allows us to better understand the effects of neighbourhoods on health. Our findings on neighbourhood characterisation using news data will support a new generation of place-based research which examines a wider set of spatial processes and how they affect health, enabling new epidemiological research.

[IR-16] Evaluation of pretrained language models on music understanding

链接: https://arxiv.org/abs/2409.11449
作者: Yannis Vasilakis,Rachel Bittner,Johan Pauwels
关键词-EN: Music Information Research, Music-text multimodal systems, Information Research, text-based song generation, Music-text multimodal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Music-text multimodal systems have enabled new approaches to Music Information Research (MIR) applications such as audio-to-text and text-to-audio retrieval, text-based song generation, and music captioning. Despite the reported success, little effort has been put into evaluating the musical knowledge of Large Language Models (LLM). In this paper, we demonstrate that LLMs suffer from 1) prompt sensitivity, 2) inability to model negation (e.g. ‘rock song without guitar’), and 3) sensitivity towards the presence of specific words. We quantified these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. We leveraged the Audioset ontology to generate triplets consisting of an anchor, a positive (relevant) label, and a negative (less relevant) label for the genre and instruments sub-tree. We evaluated the triplet-based musical knowledge for six general-purpose Transformer-based models. The triplets obtained through this methodology required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
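摘要所述的三元组准确率可以直接写成几行代码:对每个(锚点, 正例, 负例)三元组,若模型给出的"锚点-正例"相似度高于"锚点-负例"则计为正确。以下用词重叠作为相似度函数的替身(论文中该判断来自 LLM),示例标签为虚构。

```python
def triplet_accuracy(triplets, similarity) -> float:
    """Fraction of triplets where the anchor is rated closer to the positive label."""
    correct = sum(similarity(a, p) > similarity(a, n) for a, p, n in triplets)
    return correct / len(triplets)

def word_overlap(a: str, b: str) -> float:
    """Toy stand-in for an LLM similarity judgment: Jaccard overlap of words."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

triplets = [
    ("rock song with electric guitar", "rock music", "classical piano"),
    ("solo violin piece", "violin music", "heavy metal"),
]
print(triplet_accuracy(triplets, word_overlap))  # 1.0
```

论文正是以这种三元组判对率来量化提示敏感性、否定建模失败等问题:一个在"rock song without guitar"上出错的模型,会在含否定词的三元组上系统性地判错。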

附件下载

点击下载今日全部论文列表