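Latest arXiv Papers for Today (2024-08-16)

This post lists the latest papers retrieved from arXiv.org on 2024-08-16. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data comes from arXiv.org and is refreshed automatically at around 10:30 each morning.

Tip: if you would like to receive the daily papers by email, leave your email address in the comments. The email is also sent automatically at around 10:30 each day.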

Link: https://arxiv.org/abs/2408.08313
Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
Keywords: symbolic graphics programs, Assessing the capabilities, symbolic graphics, graphics content, graphics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report v1 (44 pages, 23 figures, project page: this https URL)

Abstract:Assessing the capabilities of large language models (LLMs) is often challenging, in part, because it is hard to find tasks to which they have not been exposed during training. We take one step to address this challenge by turning to a new task: focusing on symbolic graphics programs, which are a popular representation for graphics content that procedurally generates visual data. LLMs have shown exciting promise towards program synthesis, but do they understand symbolic graphics programs? Unlike conventional programs, symbolic graphics programs can be translated to graphics content. Here, we characterize an LLM’s understanding of symbolic programs in terms of their ability to answer questions related to the graphics content. This task is challenging as the questions are difficult to answer from the symbolic programs alone – yet, they would be easy to answer from the corresponding graphics content as we verify through a human experiment. To understand symbolic programs, LLMs may need to possess the ability to imagine how the corresponding graphics content would look without directly accessing the rendered visual content. We use this task to evaluate LLMs by creating a large benchmark for the semantic understanding of symbolic graphics programs. This benchmark is built via program-graphics correspondence, hence requiring minimal human efforts. We evaluate current LLMs on our benchmark to elucidate a preliminary assessment of their ability to reason about visual scenes from programs. We find that this task distinguishes existing LLMs and models considered good at reasoning perform better. Lastly, we introduce Symbolic Instruction Tuning (SIT) to improve this ability. Specifically, we query GPT4-o with questions and images generated by symbolic programs. Such data are then used to finetune an LLM. We also find that SIT data can improve the general instruction following ability of LLMs.

[NLP-1] ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws

Link: https://arxiv.org/abs/2408.08310
Authors: Ruihang Li, Yixuan Wei, Miaosen Zhang, Nenghai Yu, Han Hu, Houwen Peng
Keywords: large language models, large language, High-quality data, models, High-quality
Categories: Computation and Language (cs.CL)

Abstract: High-quality data is crucial for the pre-training performance of large language models. Unfortunately, existing quality filtering methods rely on a known high-quality dataset as reference, which can introduce potential bias and compromise diversity. In this paper, we propose ScalingFilter, a novel approach that evaluates text quality based on the perplexity difference between two language models trained on the same data, thereby eliminating the influence of the reference dataset in the filtering process. A theoretical analysis shows that ScalingFilter is equivalent to an inverse utilization of scaling laws. Through training models with 1.3B parameters on the same data source processed by various quality filters, we find ScalingFilter can improve zero-shot performance of pre-trained models in downstream tasks. To assess the bias introduced by quality filtering, we introduce semantic diversity, a metric of utilizing text embedding models for semantic representations. Extensive experiments reveal that semantic diversity is a reliable indicator of dataset diversity, and ScalingFilter achieves an optimal balance between downstream performance and semantic diversity.
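
To make the idea of a "perplexity difference between two language models" concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not give the exact quality score, so the plain difference `ppl_small - ppl_large`, the function names, and the `keep_fraction` parameter are all assumptions for illustration. The per-token log-probabilities are assumed to come from a smaller and a larger model trained on the same data.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_gap(small_logprobs, large_logprobs):
    """Hypothetical quality score: how much lower the large model's
    perplexity is than the small model's on the same text."""
    return perplexity(small_logprobs) - perplexity(large_logprobs)

def filter_by_gap(docs, keep_fraction):
    """Keep the documents with the largest perplexity gap.

    docs: list of (doc_id, small_logprobs, large_logprobs) tuples.
    keep_fraction: hypothetical knob, fraction of documents to keep.
    """
    ranked = sorted(docs, key=lambda d: perplexity_gap(d[1], d[2]), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return [doc_id for doc_id, _, _ in ranked[:keep]]
```

No reference "high-quality" corpus appears anywhere in this sketch, which is the property the abstract emphasizes; whether a larger gap really signals higher quality is the paper's claim, not something the sketch demonstrates.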

[NLP-2] Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors

Link: https://arxiv.org/abs/2408.08302
Authors: Usman Syed, Ethan Light, Xingang Guo, Huan Zhang, Lianhui Qin, Yanfeng Ouyang, Bin Hu
Keywords: large language models, transportation engineering problems, selected undergraduate-level transportation, undergraduate-level transportation engineering, transportation engineering
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, GPT-4o, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, Llama 3, and Llama 3.1 in solving some selected undergraduate-level transportation engineering problems. We introduce TransportBench, a benchmark dataset that includes a sample of transportation engineering problems on a wide range of subjects in the context of planning, design, management, and control of transportation systems. This dataset is used by human experts to evaluate the capabilities of various commercial and open-sourced LLMs, especially their accuracy, consistency, and reasoning behaviors, in solving transportation engineering problems. Our comprehensive analysis uncovers the unique strengths and limitations of each LLM, e.g. our analysis shows the impressive accuracy and some unexpected inconsistent behaviors of Claude 3.5 Sonnet in solving TransportBench problems. Our study marks a thrilling first step toward harnessing artificial general intelligence for complex transportation challenges.

[NLP-3] The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community

Link: https://arxiv.org/abs/2408.08291
Authors: Shachar Don-Yehiya, Leshem Choshen, Omri Abend
Keywords: users’ real-world scenarios, real-world scenarios, Human-model conversations provide, provide a window, window into users’
Categories: Computation and Language (cs.CL)

Abstract: Human-model conversations provide a window into users’ real-world scenarios, behavior, and needs, and thus are a valuable resource for model development and research. While for-profit companies collect user data through the APIs of their models, using it internally to improve their own models, the open source and research community lags behind. We introduce the ShareLM collection, a unified set of human conversations with large language models, and its accompanying plugin, a Web extension for voluntarily contributing user-model conversations. Where few platforms share their chats, the ShareLM plugin adds this functionality, thus allowing users to share conversations from most platforms. The plugin allows the user to rate their conversations, both at the conversation and the response levels, and delete conversations they prefer to keep private before they ever leave the user’s local storage. We release the plugin conversations as part of the ShareLM collection, and call for more community effort in the field of open human-model data. The code, plugin, and data are available.

[NLP-4] mhGPT: A Lightweight Generative Pre-Trained Transformer for Mental Health Text Analysis

Link: https://arxiv.org/abs/2408.08261
Authors: Dae-young Kim (1), Rebecca Hwa (2), Muhammad Mahbubur Rahman (1) ((1) Children’s National Hospital, Washington, DC, (2) George Washington University, Washington, DC)
Keywords: lightweight generative pre-trained, generative pre-trained transformer, health-related social media, pre-trained transformer trained, paper introduces mhGPT
Categories: Computation and Language (cs.CL)

Abstract:This paper introduces mhGPT, a lightweight generative pre-trained transformer trained on mental health-related social media and PubMed articles. Fine-tuned for specific mental health tasks, mhGPT was evaluated under limited hardware constraints and compared with state-of-the-art models like MentaLLaMA and Gemma. Despite having only 1.98 billion parameters and using just 5% of the dataset, mhGPT outperformed larger models and matched the performance of models trained on significantly more data. The key contributions include integrating diverse mental health data, creating a custom tokenizer, and optimizing a smaller architecture for low-resource settings. This research could advance AI-driven mental health care, especially in areas with limited computing power.

[NLP-5] Covert Bias: The Severity of Social Views Unalignment Towards Implicit and Explicit Opinion

Link: https://arxiv.org/abs/2408.08212
Authors: Abeer Aldayel, Areej Alokaili, Rehab Alahmadi
Keywords: affects bias amplification, viewpoint affects bias
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: This work is under review

Abstract: While various approaches have recently been studied for bias identification, little is known about how implicit language that does not explicitly convey a viewpoint affects bias amplification in large language models. To examine the severity of bias toward a view, we evaluated the performance of two downstream tasks where the implicit and explicit knowledge of social groups were used. First, we present a stress test evaluation by using a biased model in edge cases of excessive bias scenarios. Then, we evaluate how LLMs calibrate linguistically in response to both implicit and explicit opinions when they are aligned with conflicting viewpoints. Our findings reveal a discrepancy in LLM performance in identifying implicit and explicit opinions, with a general tendency of bias toward explicit opinions of opposing stances. Moreover, the bias-aligned models generate more cautious responses using uncertainty phrases compared to the unaligned (zero-shot) base models. The direct, incautious responses of the unaligned models suggest a need for further refinement of decisiveness by incorporating uncertainty markers to enhance their reliability, especially on socially nuanced topics with high subjectivity.

[NLP-6] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

Link: https://arxiv.org/abs/2408.08152
Authors: Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, Chong Ruan
Keywords: open-source language model, language model designed, inference processes, optimizing both training, training and inference
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

Abstract: We introduce DeepSeek-Prover-V1.5, an open-source language model designed for theorem proving in Lean 4, which enhances DeepSeek-Prover-V1 by optimizing both training and inference processes. Pre-trained on DeepSeekMath-Base with specialization in formal mathematical languages, the model undergoes supervised fine-tuning using an enhanced formal theorem proving dataset derived from DeepSeek-Prover-V1. Further refinement is achieved through reinforcement learning from proof assistant feedback (RLPAF). Beyond the single-pass whole-proof generation approach of DeepSeek-Prover-V1, we propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths. DeepSeek-Prover-V1.5 demonstrates significant improvements over DeepSeek-Prover-V1, achieving new state-of-the-art results on the test set of the high school level miniF2F benchmark (63.5%) and the undergraduate level ProofNet benchmark (25.3%).

[NLP-7] P/D-Serve: Serving Disaggregated Large Language Model at Scale

Link: https://arxiv.org/abs/2408.08147
Authors: Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang
Keywords: Serving disaggregated large, faces multiple challenges, disaggregated large language, Serving disaggregated, large language models
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Serving disaggregated large language models (LLMs) over tens of thousands of xPU devices (GPUs or NPUs) with reliable performance faces multiple challenges. 1) Ignoring the diversity (various prefixes and tidal requests), treating all the prompts in a mixed pool is inadequate. To facilitate the similarity per scenario and minimize the inner mismatch on P/D (prefill and decoding) processing, fine-grained organization is required, dynamically adjusting P/D ratios for better performance. 2) Due to inaccurate estimation on workload (queue status or maintained connections), the global scheduler easily incurs unnecessary timeouts in prefill. 3) Block-fixed device-to-device (D2D) KVCache transfer over cluster-level RDMA (remote direct memory access) fails to achieve desired D2D utilization as expected. To overcome previous problems, this paper proposes an end-to-end system P/D-Serve, complying with the paradigm of MLOps (machine learning operations), which models end-to-end (E2E) P/D performance and enables: 1) fine-grained P/D organization, mapping the service with RoCE (RDMA over converged ethernet) as needed, to facilitate similar processing and dynamic adjustments on P/D ratios; 2) on-demand forwarding upon rejections for idle prefill, decoupling the scheduler from regular inaccurate reports and local queues, to avoid timeouts in prefill; and 3) efficient KVCache transfer via optimized D2D access. P/D-Serve is implemented upon Ascend and MindSpore, has been deployed over tens of thousands of NPUs for more than eight months in commercial use, and further achieves 60%, 42% and 46% improvements on E2E throughput, time-to-first-token (TTFT) SLO (service level objective) and D2D transfer time. As the E2E system with optimizations, P/D-Serve achieves 6.7x increase on throughput, compared with aggregated LLMs.

[NLP-8] KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning

Link: https://arxiv.org/abs/2408.08146
Authors: Kaiqi Zhang, Jing Zhao, Rui Chen
Keywords: Large Language Models, Large Language, Language Models, exhibit high inference, draft head
Categories: Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) exhibit high inference latency due to their autoregressive decoding nature. While the draft head in speculative decoding mitigates this issue, its full potential remains unexplored. In this paper, we introduce KOALA (K-layer Optimized Adversarial Learning Architecture), an orthogonal approach to the draft head. By transforming the conventional single-layer draft head into a multi-layer architecture and incorporating adversarial learning into the traditional supervised training, KOALA significantly improves the accuracy of the draft head in predicting subsequent tokens, thus more closely mirroring the functionality of LLMs. Although this improvement comes at the cost of slightly increased drafting overhead, KOALA substantially unlocks the draft head’s potential, greatly enhancing speculative decoding. We conducted comprehensive evaluations of KOALA, including both autoregressive and non-autoregressive draft heads across various tasks, demonstrating a latency speedup ratio improvement of 0.24x-0.41x, which is 10.57%-14.09% faster than the original draft heads.

[NLP-9] MIDAS: Multi-level Intent, Domain, And Slot Knowledge Distillation for Multi-turn NLU

Link: https://arxiv.org/abs/2408.08144
Authors: Yan Li, So-Eon Kim, Seong-Bae Park, Soyeon Caren Han
Keywords: Large Language Models, contextually relevant text, Large Language, human user query, Natural Language Understanding
Categories: Computation and Language (cs.CL)

Abstract: Although Large Language Models (LLMs) can generate coherent and contextually relevant text, they often struggle to recognise the intent behind the human user’s query. Natural Language Understanding (NLU) models, however, interpret the purpose and key information of user’s input to enable responsive interactions. Existing NLU models generally map individual utterances to a dual-level semantic frame, involving sentence-level intent and word-level slot labels. However, real-life conversations primarily consist of multi-turn conversations, involving the interpretation of complex and extended dialogues. Researchers encounter challenges addressing all facets of multi-turn dialogue conversations using a unified single NLU model. This paper introduces a novel approach, MIDAS, leveraging a multi-level intent, domain, and slot knowledge distillation for multi-turn NLU. To achieve this, we construct distinct teachers for varying levels of conversation knowledge, namely, sentence-level intent detection, word-level slot filling, and conversation-level domain classification. These teachers are then fine-tuned to acquire specific knowledge of their designated levels. A multi-teacher loss is proposed to facilitate the combination of these multi-level teachers, guiding a student model in multi-turn dialogue tasks. The experimental results demonstrate the efficacy of our model in improving the overall multi-turn conversation understanding, showcasing the potential for advancements in NLU models through the incorporation of multi-level dialogue knowledge distillation techniques.
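
The abstract does not spell out the form of the multi-teacher loss, so the following is only a generic sketch of how one teacher per level (intent, slot, domain) might be combined to guide a single student. The weighted sum of KL divergences, the level names used as dictionary keys, and the `weights` argument are assumptions for illustration, not the formulation from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def multi_teacher_loss(student_logits, teacher_logits, weights):
    """Weighted sum of KL(teacher || student), one term per level.

    student_logits, teacher_logits: dict mapping a level name
        (e.g. "intent", "slot", "domain") to a list of logits.
    weights: dict mapping the same level names to hypothetical weights.
    """
    loss = 0.0
    for level, w in weights.items():
        p = softmax(teacher_logits[level])
        q = softmax(student_logits[level])
        loss += w * kl_divergence(p, q)
    return loss
```

The loss is zero when the student reproduces every teacher's distribution and grows as any level diverges, so the per-level weights control which teacher dominates training.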

[NLP-10] Agent Court: Simulating Court with Adversarial Evolvable Lawyer Agents

Link: https://arxiv.org/abs/2408.08089
Authors: Guhong Chen, Liyang Fan, Zihan Gong, Nan Xie, Zixuan Li, Ziqiang Liu, Chengming Li, Qiang Qu, Shiwen Ni, Min Yang
Keywords: simulation system called, system called AgentCourt, entire courtroom process, system called, lawyer agents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:In this paper, we present a simulation system called AgentCourt that simulates the entire courtroom process. The judge, plaintiff’s lawyer, defense lawyer, and other participants are autonomous agents driven by large language models (LLMs). Our core goal is to enable lawyer agents to learn how to argue a case, as well as improving their overall legal skills, through courtroom process simulation. To achieve this goal, we propose an adversarial evolutionary approach for the lawyer-agent. Since AgentCourt can simulate the occurrence and development of court hearings based on a knowledge base and LLM, the lawyer agents can continuously learn and accumulate experience from real court cases. The simulation experiments show that after two lawyer-agents have engaged in a thousand adversarial legal cases in AgentCourt (which can take a decade for real-world lawyers), compared to their pre-evolutionary state, the evolved lawyer agents exhibit consistent improvement in their ability to handle legal tasks. To enhance the credibility of our experimental results, we enlisted a panel of professional lawyers to evaluate our simulations. The evaluation indicates that the evolved lawyer agents exhibit notable advancements in responsiveness, as well as expertise and logical rigor. This work paves the way for advancing LLM-driven agent technology in legal scenarios. Code is available at this https URL.

[NLP-11] Extracting Sentence Embeddings from Pretrained Transformer Models

Link: https://arxiv.org/abs/2408.08073
Authors: Lukas Stankevičius, Mantas Lukoševičius
Keywords: Pre-trained transformer models, Pre-trained transformer, natural language processing, transformer models shine, expected to bear
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Background/introduction: Pre-trained transformer models shine in many natural language processing tasks and therefore are expected to bear the representation of the input sentence or text meaning. These sentence-level embeddings are also important in retrieval-augmented generation. But do commonly used plain averaging or prompt templates surface it enough? Methods: Given the hidden representations of the 110M-parameter BERT from multiple layers and multiple tokens, we tried various ways to extract optimal sentence representations. We tested various token aggregation and representation post-processing techniques. We also tested multiple ways of using a general Wikitext dataset to complement BERT's sentence representations. All methods were tested on 8 Semantic Textual Similarity (STS), 6 short text clustering, and 12 classification tasks. We also evaluated our representation-shaping techniques on other static models, including random token representations. Results: Proposed representation extraction methods improved the performance on STS and clustering tasks for all models considered. Improvements were very high for static token-based models; in particular, random embeddings on STS tasks almost reach the performance of BERT-derived representations. Conclusions: Our work shows that for multiple tasks simple baselines with representation shaping techniques reach or even outperform more complex BERT-based models or are able to contribute to their performance.
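
The "plain averaging" baseline that the abstract questions can be sketched in a few lines. This is a toy illustration only: the token vectors are assumed to be hidden states already extracted from some model, and the function names are hypothetical rather than taken from the paper's code.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors (one list of floats per token) into one
    sentence vector of the same dimensionality."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, as typically used to
    score Semantic Textual Similarity (STS) pairs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Two sentences would be compared by calling `cosine_similarity(mean_pool(tokens_a), mean_pool(tokens_b))`. The aggregation and post-processing variants studied in the paper would replace or follow the `mean_pool` step.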

[NLP-12] I-SHEEP: Self-Alignment of LLM from Scratch through an Iterative Self-Enhancement Paradigm

Link: https://arxiv.org/abs/2408.08072
Authors: Yiming Liang, Ge Zhang, Xingwei Qu, Tianyu Zheng, Jiawei Guo, Xinrun Du, Zhenzhu Yang, Jiaheng Liu, Chenghua Lin, Lei Ma, Wenhao Huang, Jiajun Zhang
Keywords: Large Language Models, Large Language, achieved significant advancements, passive information repositories, common learning paradigm
Categories: Computation and Language (cs.CL)

Abstract: Large Language Models (LLMs) have achieved significant advancements; however, the common learning paradigm treats LLMs as passive information repositories, neglecting their potential for active learning and alignment. Some approaches train LLMs using their own generated synthetic data, exploring the possibility of active alignment. However, there is still a huge gap between these one-time alignment methods and the continuous automatic alignment of humans. In this paper, we introduce I-SHEEP, an Iterative Self-EnHancEmEnt Paradigm. This human-like paradigm enables LLMs to continuously self-align from scratch with nothing. Compared to the one-time alignment method Dromedary (Sun et al., 2023), which refers to the first iteration in this paper, I-SHEEP can significantly enhance capacities on both Qwen and Llama models. I-SHEEP achieves a maximum relative improvement of 78.2% in the Alpaca Eval, 24.0% in the MT Bench, and an absolute increase of 8.88% in the IFEval accuracy over subsequent iterations in Qwen-1.5 72B model. Additionally, I-SHEEP surpasses the base model in various standard benchmark generation tasks, achieving an average improvement of 24.77% in code generation tasks, 12.04% in TriviaQA, and 20.29% in SQuAD. We also provide new insights based on the experiment results. Our codes, datasets, and models are available at https://anonymous.4open.science/r/I-SHEEP.

[NLP-13] RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2408.08067
Authors: Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Jiayang Cheng, Cunxiang Wang, Shichao Sun, Huanyu Li, Zizhao Zhang, Binjie Wang, Jiarong Jiang, Tong He, Zhiguo Wang, Pengfei Liu, Yue Zhang, Zheng Zhang
Keywords: leveraging external knowledge, shown promising capability, RAG systems, external knowledge, reliability of measurements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract: Although Retrieval-Augmented Generation (RAG) has shown promising capability in leveraging external knowledge, a comprehensive evaluation of RAG systems is still challenging due to the modular nature of RAG, evaluation of long-form responses and reliability of measurements. In this paper, we propose a fine-grained evaluation framework, RAGChecker, that incorporates a suite of diagnostic metrics for both the retrieval and generation modules. Meta evaluation verifies that RAGChecker has significantly better correlations with human judgments than other evaluation metrics. Using RAGChecker, we evaluate 8 RAG systems and conduct an in-depth analysis of their performance, revealing insightful patterns and trade-offs in the design choices of RAG architectures. The metrics of RAGChecker can guide researchers and practitioners in developing more effective RAG systems.
摘要：尽管检索增强一代（RAG）在利用外部知识方面表现出了良好的能力，但由于RAG的模块化性质、长式响应的评估和测量的可靠性，对RAG系统的全面评估仍然具有挑战性。在本文中，我们提出了一个细粒度的评估框架RAGSYS，它包含了一套用于检索和生成模块的诊断指标。Meta评估验证RAGSYS与人类判断的相关性明显优于其他评估指标。使用RAGSYS，我们评估了8个RAG系统，并对其性能进行了深入分析，揭示了RAG架构设计选择中的富有洞察力的模式和权衡。RAGSYS的指标可以指导研究人员和从业者开发更有效的RAG系统。

[NLP-14] Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework
Link: https://arxiv.org/abs/2408.08054
Authors: Changyu Du, Sebastian Esser, Stavros Nousias, André Borrmann
Keywords: typically requires designers, process typically requires, conventional BIM authoring, BIM authoring process, BIM authoring
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

Abstract: The conventional BIM authoring process typically requires designers to master complex and tedious modeling commands in order to materialize their design intentions within BIM authoring tools. This additional cognitive burden complicates the design process and hinders the adoption of BIM and model-based design in the AEC (Architecture, Engineering, and Construction) industry. To facilitate the expression of design intentions more intuitively, we propose Text2BIM, an LLM-based multi-agent framework that can generate 3D building models from natural language instructions. This framework orchestrates multiple LLM agents to collaborate and reason, transforming textual user input into imperative code that invokes the BIM authoring tool’s APIs, thereby generating editable BIM models with internal layouts, external envelopes, and semantic information directly in the software. Furthermore, a rule-based model checker is introduced into the agentic workflow, utilizing predefined domain knowledge to guide the LLM agents in resolving issues within the generated models and iteratively improving model quality. Extensive experiments were conducted to compare and analyze the performance of three different LLMs under the proposed framework. The evaluation results demonstrate that our approach can effectively generate high-quality, structurally rational building models that are aligned with the abstract concepts specified by user input. Finally, an interactive software prototype was developed to integrate the framework into the BIM authoring software Vectorworks, showcasing the potential of modeling by chatting.

[NLP-15] Leveraging Web-Crawled Data for High-Quality Fine-Tuning
Link: https://arxiv.org/abs/2408.08003
Authors: Jing Zhou, Chenglin Jiang, Wei Shen, Xiao Zhou, Xiaonan He
Keywords: expensive human-annotated data, expensive human-annotated, guarantee performance, large language models, data
Categories: Computation and Language (cs.CL)

Abstract: Most large language models are fine-tuned using either expensive human-annotated data or GPT-4-generated data, which cannot guarantee performance in certain domains. We argue that although web-crawled data often has formatting errors that cause semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we create a paired training dataset automatically by aligning web-crawled data with a smaller set of high-quality data. By training a language model on this dataset, we can convert web data with irregular formats into high-quality data. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% on Chinese math problems. Additionally, our 7B model outperforms several open-source models larger than 32B and surpasses well-known closed-source models such as GPT-3.5, highlighting the efficacy of our approach.

[NLP-16] FuseChat: Knowledge Fusion of Chat Models
Link: https://arxiv.org/abs/2408.07990
Authors: Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, Xiaojun Quan
Keywords: incurs substantial costs, training large language, lead to redundancy, large language models, LLMs
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract: While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Secondly, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales, including OpenChat-3.5-7B, Starling-LM-7B-alpha, NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and Qwen-1.5-Chat-72B. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes. Our model is even comparable to the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench. Our code, model weights, and data are public at this https URL.
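The second stage above, merging target models with coefficients derived from the size of their parameter updates, can be illustrated with a toy sketch. The abstract does not give the actual formula, so the rule used here (weights proportional to the squared norm of each model's update relative to the shared base) and the helper name `merge_by_update_magnitude` are assumptions for illustration only, not the authors' method.

```python
def merge_by_update_magnitude(base, targets):
    """Merge fine-tuned parameter vectors, weighting each by its update magnitude.

    base:    list of floats, parameters before fine-tuning
    targets: list of parameter vectors (same length as base) after fine-tuning
    Assumption: coefficient_k is proportional to ||target_k - base||^2.
    """
    deltas = [[t_i - b_i for t_i, b_i in zip(t, base)] for t in targets]
    magnitudes = [sum(d * d for d in delta) for delta in deltas]
    total = sum(magnitudes)
    if total == 0.0:
        # No model moved away from the base: fall back to uniform weights.
        coeffs = [1.0 / len(targets)] * len(targets)
    else:
        coeffs = [m / total for m in magnitudes]
    # Add the coefficient-weighted updates back onto the base parameters.
    merged = [
        b_i + sum(c * delta[i] for c, delta in zip(coeffs, deltas))
        for i, b_i in enumerate(base)
    ]
    return merged, coeffs


base = [0.0, 0.0]
targets = [[1.0, 0.0], [0.0, 3.0]]
merged, coeffs = merge_by_update_magnitude(base, targets)
print(coeffs)  # the second model moved further, so it gets the larger weight
print(merged)
```

In a real setting the vectors would be per-tensor model weights rather than two-element lists, and the coefficients might be computed per parameter matrix instead of once per model.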

[NLP-17] ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models
Link: https://arxiv.org/abs/2408.07983
Authors: Faris Hijazi(1), Somayah AlHarbi(1), Abdulaziz AlHussein(1), Harethah Abu Shairah(2), Reem AlZahrani(2), Hebah AlShamlan(1), Omar Knio(2), George Turkiyyah(2) ((1) THIQAH, (2) KAUST)
Keywords: Large Language Models, natural language processing, Language Models, Large Language, advancements in Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract: The rapid advancements in Large Language Models (LLMs) have led to significant improvements in various natural language processing tasks. However, the evaluation of LLMs’ legal knowledge, particularly in non-English languages such as Arabic, remains under-explored. To address this gap, we introduce ArabLegalEval, a multitask benchmark dataset for assessing the Arabic legal knowledge of LLMs. Inspired by the MMLU and LegalBench datasets, ArabLegalEval consists of multiple tasks sourced from Saudi legal documents and synthesized questions. In this work, we aim to analyze the capabilities required to solve legal problems in Arabic and benchmark the performance of state-of-the-art LLMs. We explore the impact of in-context learning and investigate various evaluation methods. Additionally, we explore workflows for generating questions with automatic validation to enhance the dataset’s quality. We benchmark multilingual and Arabic-centric LLMs, such as GPT-4 and Jais, respectively. We also share our methodology for creating the dataset and validation, which can be generalized to other domains. We hope to accelerate AI research in the Arabic legal domain by releasing the ArabLegalEval dataset and code: this https URL

[NLP-18] Coupling without Communication and Drafter-Invariant Speculative Decoding
Link: https://arxiv.org/abs/2408.07978
Authors: Majid Daliri, Christopher Musco, Ananda Theertha Suresh
Keywords: Suppose Alice, Alice and Bob, Bob, Alice, Suppose
Categories: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages

Abstract: Suppose Alice has a distribution $P$ and Bob has a distribution $Q$. Alice wants to generate a sample $a \sim P$ and Bob a sample $b \sim Q$ such that $a = b$ with as high a probability as possible. It is well-known that, by sampling from an optimal coupling between the distributions, Alice and Bob can achieve $\Pr[a = b] = 1 - D_{TV}(P,Q)$, where $D_{TV}(P,Q)$ is the total variation distance. What if Alice and Bob must solve this same problem without communicating at all? Perhaps surprisingly, with access to public randomness, they can still achieve $\Pr[a = b] \geq \frac{1 - D_{TV}(P,Q)}{1 + D_{TV}(P,Q)} \geq 1 - 2D_{TV}(P,Q)$. In fact, this bound can be obtained using a simple protocol based on the Weighted MinHash algorithm. In this work, we explore the communication-free coupling in greater depth. First, we show that an equally simple protocol based on Gumbel sampling matches the worst-case guarantees of the Weighted MinHash approach, but tends to perform better in practice. Conversely, we prove that both approaches are actually sharp: no communication-free protocol can achieve $\Pr[a = b] > \frac{1 - D_{TV}(P,Q)}{1 + D_{TV}(P,Q)}$ in the worst case. Finally, we prove that, for distributions over $n$ items, there exists a scheme that uses just $O(\log(n/\epsilon))$ bits of communication to achieve $\Pr[a = b] = 1 - D_{TV}(P,Q) - \epsilon$, i.e., to essentially match optimal coupling. Beyond our theoretical results, we demonstrate an application of communication-free coupling to speculative decoding, a recent method for accelerating autoregressive large language models [Leviathan, Kalman, Matias, ICML 2023]. We show that communication-free protocols yield a variant of speculative decoding that we call Drafter-Invariant Speculative Decoding, which has the desirable property that the output of the method is fixed given a fixed random seed, regardless of what drafter is used for speculation.
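The Gumbel-sampling protocol mentioned in the abstract can be illustrated with a small simulation. This is a sketch, not the authors' code: the function names, the use of integer seeds as the public randomness, and the example distributions are all assumptions. Both parties derive the same Exp(1) noise from a public seed and each picks the item minimising `e_i / p_i`, which is the same choice as the Gumbel-max rule `argmax(log p_i + g_i)` with `g_i = -log e_i`.

```python
import random


def shared_exponentials(seed, n):
    """Public randomness: n i.i.d. Exp(1) draws reproducible from a shared seed."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0) for _ in range(n)]


def gumbel_max_sample(probs, noise):
    """Return argmin_i noise[i] / probs[i], skipping zero-probability items.

    With g_i = -log(noise[i]) a standard Gumbel variable, this is the same
    choice as argmax_i (log probs[i] + g_i), so the output follows probs.
    """
    best_index, best_score = None, float("inf")
    for i, (p, e) in enumerate(zip(probs, noise)):
        if p <= 0.0:
            continue
        score = e / p
        if score < best_score:
            best_index, best_score = i, score
    return best_index


def total_variation(p, q):
    """Total variation distance between two distributions on the same items."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def agreement_rate(p, q, trials=20000):
    """Fraction of public seeds on which Alice (p) and Bob (q) pick the same item."""
    hits = 0
    for seed in range(trials):
        noise = shared_exponentials(seed, len(p))
        hits += gumbel_max_sample(p, noise) == gumbel_max_sample(q, noise)
    return hits / trials


P = [0.5, 0.3, 0.2]  # Alice's distribution (illustrative)
Q = [0.2, 0.3, 0.5]  # Bob's distribution (illustrative)
tv = total_variation(P, Q)
lower_bound = (1 - tv) / (1 + tv)
rate = agreement_rate(P, Q)
print(f"TV = {tv:.2f}, bound = {lower_bound:.3f}, empirical Pr[a = b] = {rate:.3f}")
```

Alice's choice depends only on `P` and the public noise, so it does not change when Bob's distribution changes. With Alice as the target model and Bob as the drafter, this is the intuition behind the drafter-invariance property described in the abstract.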

[NLP-19] Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models
Link: https://arxiv.org/abs/2408.07975
Authors: Tianyu Wang, Haitao Lin, Junqiu Yu, Yanwei Fu
Keywords: Large Language Models, recent Large Language, open-ended interactive robotic, interactive robotic manipulation, paper investigates
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IROS 2024. 8 pages, 5 figures. See this https URL

Abstract: This paper investigates the task of open-ended interactive robotic manipulation in tabletop scenarios. While recent Large Language Models (LLMs) enhance robots’ comprehension of user instructions, their lack of visual grounding constrains their ability to physically interact with the environment. This is because the robot needs to locate the target object for manipulation within the physical workspace. To this end, we introduce an interactive robotic manipulation framework called Polaris, which integrates perception and interaction by utilizing GPT-4 alongside grounded vision models. For precise manipulation, it is essential that such grounded vision models produce a detailed object pose for the target object, rather than merely identifying the pixels belonging to it in the image. Consequently, we propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline. This pipeline utilizes rendered synthetic data for training and is then transferred to real-world manipulation tasks. The real-world performance demonstrates the efficacy of our proposed pipeline and underscores its potential for extension to more general categories. Moreover, real-robot experiments have showcased the impressive performance of our framework in grasping and executing multiple manipulation tasks. This indicates its potential to generalize to scenarios beyond the tabletop. More information and video results are available here: this https URL

[NLP-20] Predicting Lung Cancer Patient Prognosis with Large Language Models
Link: https://arxiv.org/abs/2408.07971
Authors: Danqing Hu, Bing Liu, Xiang Li, Xiaofeng Zhu, Nan Wu
Keywords: determining optimal treatment, optimal treatment plans, crucial for determining, determining optimal, optimal treatment
Categories: Computation and Language (cs.CL)

Abstract: Prognosis prediction is crucial for determining optimal treatment plans for lung cancer patients. Traditionally, such predictions relied on models developed from retrospective patient data. Recently, large language models (LLMs) have gained attention for their ability to process and generate text based on extensive learned knowledge. In this study, we evaluate the potential of GPT-4o mini and GPT-3.5 in predicting the prognosis of lung cancer patients. We collected two prognosis datasets, i.e., survival and post-operative complication datasets, and designed multiple tasks to assess the models’ performance comprehensively. Logistic regression models were also developed as baselines for comparison. The experimental results demonstrate that LLMs can achieve competitive, and in some tasks superior, performance in lung cancer prognosis prediction compared to data-driven logistic regression models despite not using additional patient data. These findings suggest that LLMs can be effective tools for prognosis prediction in lung cancer, particularly when patient data is limited or unavailable.

[NLP-21] GERestaurant: A German Dataset of Annotated Restaurant Reviews for Aspect-Based Sentiment Analysis
Link: https://arxiv.org/abs/2408.07955
Authors: Nils Constantin Hellwig, Jakob Fehle, Markus Bink, Christian Wolff
Keywords: Aspect-Based Sentiment Analysis, reviews manually annotated, Category Sentiment Analysis, Sentiment Analysis, present GERestaurant
Categories: Computation and Language (cs.CL)
Comments: Accepted in KONVENS 2024. Camera Ready submission

Abstract: We present GERestaurant, a novel dataset consisting of 3,078 German-language restaurant reviews manually annotated for Aspect-Based Sentiment Analysis (ABSA). All reviews were collected from Tripadvisor, covering a diverse selection of restaurants, including regional and international cuisine with various culinary styles. The annotations encompass both implicit and explicit aspects, including all aspect terms, their corresponding aspect categories, and the sentiments expressed towards them. Furthermore, we provide baseline scores for the four ABSA tasks (Aspect Category Detection, Aspect Category Sentiment Analysis, End-to-End ABSA, and Target Aspect Sentiment Detection) as a reference point for future advances. The dataset fills a gap in German-language resources and facilitates exploration of ABSA in the restaurant domain.

[NLP-22] MAG-SQL: Multi-Agent Generative Approach with Soft Schema Linking and Iterative Sub-SQL Refinement for Text-to-SQL
Link: https://arxiv.org/abs/2408.07930
Authors: Wenxuan Xie, Gaochen Wu, Bowen Zhou
Keywords: Recent In-Context Learning, In-Context Learning based, Learning based methods, achieved remarkable success, Recent In-Context
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 14 figures

Abstract: Recent In-Context Learning based methods have achieved remarkable success in the Text-to-SQL task. However, there is still a large gap between the performance of these models and human performance on datasets with complex database schemas and difficult questions, such as BIRD. Besides, existing work has neglected to supervise intermediate steps when solving questions iteratively with question decomposition methods, and the schema linking methods used in these works are very rudimentary. To address these issues, we propose MAG-SQL, a multi-agent generative approach with soft schema linking and iterative Sub-SQL refinement. In our framework, an entity-based method with table summaries is used to select the columns in the database, and a novel targets-conditions decomposition method is introduced to decompose complex questions. Additionally, we build an iterative generating module which includes a Sub-SQL Generator and a Sub-SQL Refiner, introducing external oversight for each step of generation. Through a series of ablation studies, the effectiveness of each agent in our framework has been demonstrated. When evaluated on the BIRD benchmark with GPT-4, MAG-SQL achieves an execution accuracy of 61.08%, compared to the baseline accuracy of 46.35% for vanilla GPT-4 and the baseline accuracy of 57.56% for MAC-SQL. Besides, our approach makes similar progress on Spider.
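Execution accuracy, the metric reported above, counts a predicted query as correct when running it returns the same result as the reference query. Below is a minimal sketch using Python's built-in `sqlite3` module. The toy `species` table, the example queries, and the comparison of results as unordered multisets of rows are assumptions for illustration; the official BIRD evaluator may differ in detail.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as a multiset; None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """pairs: list of (predicted_sql, gold_sql); returns the fraction whose results match."""
    correct = 0
    for predicted, gold in pairs:
        gold_rows = run_query(conn, gold)
        pred_rows = run_query(conn, predicted)
        if pred_rows is not None and pred_rows == gold_rows:
            correct += 1
    return correct / len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE species (name TEXT, wingspan_cm INTEGER)")
conn.executemany(
    "INSERT INTO species VALUES (?, ?)",
    [("sparrow", 21), ("eagle", 200), ("heron", 180)],
)

pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT name FROM species WHERE wingspan_cm >= 180",
     "SELECT name FROM species WHERE wingspan_cm > 100"),
    # Wrong filter: counted as incorrect.
    ("SELECT name FROM species WHERE wingspan_cm < 100",
     "SELECT name FROM species WHERE wingspan_cm > 100"),
    # Invalid SQL: counted as incorrect.
    ("SELEC name FROM species",
     "SELECT name FROM species"),
]
print(execution_accuracy(conn, pairs))
```

Comparing results rather than query text is what lets two differently written queries both count as correct, which matters when a system builds the final SQL from refined sub-queries.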

[NLP-23] DM2RM: Dual-Mode Multimodal Ranking for Target Objects and Receptacles Based on Open-Vocabulary Instructions
Link: https://arxiv.org/abs/2408.07910
Authors: Ryosuke Korekata, Kanta Kaneda, Shunya Nagashima, Yuto Imai, Komei Sugiura
Keywords: domestic service robot, carry everyday objects, service robot, pieces of furniture, open-vocabulary instructions
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract: In this study, we aim to develop a domestic service robot (DSR) that, guided by open-vocabulary instructions, can carry everyday objects to the specified pieces of furniture. Few existing methods handle mobile manipulation tasks with open-vocabulary instructions in the image retrieval setting, and most do not identify both the target objects and the receptacles. We propose the Dual-Mode Multimodal Ranking model (DM2RM), which enables images of both the target objects and receptacles to be retrieved using a single model based on multimodal foundation models. We introduce a switching mechanism that leverages a mode token and phrase identification via a large language model to switch the embedding space based on the prediction target. To evaluate the DM2RM, we construct a novel dataset including real-world images collected from hundreds of building-scale environments and crowd-sourced instructions with referring expressions. The evaluation results show that the proposed DM2RM outperforms previous approaches in terms of standard metrics in image retrieval settings. Furthermore, we demonstrate the application of the DM2RM on a standardized real-world DSR platform including fetch-and-carry actions, where it achieves a task success rate of 82% despite the zero-shot transfer setting. Demonstration videos, code, and more materials are available at this https URL.

[NLP-24] Assessing Language Models Worldview for Fiction Generation
Link: https://arxiv.org/abs/2408.07904
Authors: Aisha Khatun, Daniel G. Brown
Keywords: Large Language Models, Large Language, Language Models, computational creativity, Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Short paper

Abstract: The use of Large Language Models (LLMs) has become ubiquitous, with abundant applications in computational creativity. One such application is fictional story generation. Fiction is a narrative that occurs in a story world that is slightly different than ours. With LLMs becoming writing partners, we question how suitable they are to generate fiction. This study investigates the ability of LLMs to maintain a state of world essential to generate fiction. Through a series of questions to nine LLMs, we find that only two models exhibit a consistent worldview, while the rest are self-conflicting. Subsequent analysis of stories generated by four models revealed a strikingly uniform narrative pattern. This uniformity across models further suggests a lack of ‘state’ necessary for fiction. We highlight the limitations of current LLMs in fiction writing and advocate for future research to test and create story worlds for LLMs to reside in. All code, datasets, and the generated responses can be found in this https URL.

[NLP-25] Fine-tuning Large Language Models with Human-inspired Learning Strategies in Medical Question Answering
Link: https://arxiv.org/abs/2408.07888
Authors: Yushi Yang, Andrew M. Bean, Robert McCraith, Adam Mahdi
Keywords: Training Large Language, incurs substantial data-related, substantial data-related costs, Large Language Models, data-efficient training methods
Categories: Computation and Language (cs.CL)

Abstract: Training Large Language Models (LLMs) incurs substantial data-related costs, motivating the development of data-efficient training methods through optimised data ordering and selection. Human-inspired learning strategies, such as curriculum learning, offer possibilities for efficient training by organising data according to common human learning practices. Despite evidence that fine-tuning with curriculum learning improves the performance of LLMs for natural language understanding tasks, its effectiveness is typically assessed using a single model. In this work, we extend previous research by evaluating both curriculum-based and non-curriculum-based learning strategies across multiple LLMs, using human-defined and automated data labels for medical question answering. Our results indicate a moderate impact of using human-inspired learning strategies for fine-tuning LLMs, with maximum accuracy gains of 1.77% per model and 1.81% per dataset. Crucially, we demonstrate that the effectiveness of these strategies varies significantly across different model-dataset combinations, emphasising that the benefits of a specific human-inspired strategy for fine-tuning LLMs do not generalise. Additionally, we find evidence that curriculum learning using LLM-defined question difficulty outperforms human-defined difficulty, highlighting the potential of using model-generated measures for optimal curriculum design.
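Curriculum learning, as described above, comes down to ordering the fine-tuning examples by a difficulty label before training. The sketch below uses made-up questions and difficulty scores; the field names and both helper functions are hypothetical, and in the paper the labels are either human-defined or LLM-defined.

```python
import random


def curriculum_order(examples, key="difficulty"):
    """Easy-to-hard ordering; ties keep their original relative order."""
    return sorted(examples, key=lambda ex: ex[key])


def shuffled_order(examples, seed=0):
    """Non-curriculum baseline: a random permutation of the same examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


examples = [
    {"question": "Q1", "difficulty": 0.8},
    {"question": "Q2", "difficulty": 0.1},
    {"question": "Q3", "difficulty": 0.5},
]
print([ex["question"] for ex in curriculum_order(examples)])
```

Swapping the `difficulty` values for scores produced by a model instead of by annotators changes only the labels, not the ordering logic, which is the comparison the abstract reports.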

[NLP-26] Instruct Large Language Models to Generate Scientific Literature Survey Step by Step
Link: https://arxiv.org/abs/2408.07884
Authors: Yuxuan Lai, Yupeng Wu, Yidan Wang, Wenpeng Hu, Chen Zheng
Keywords: literature survey, scientific literature surveys, Literature Survey Generation, literature, literature survey poses
Categories: Computation and Language (cs.CL)
Comments: NLPCC 2024

Abstract: Automatically generating scientific literature surveys is a valuable task that can significantly enhance research efficiency. However, the diverse and complex nature of information within a literature survey poses substantial challenges for generative models. In this paper, we design a series of prompts to systematically leverage large language models (LLMs), enabling the creation of comprehensive literature surveys through a step-by-step approach. Specifically, we design prompts to guide LLMs to sequentially generate the title, abstract, hierarchical headings, and the main content of the literature survey. We argue that this design enables the generation of the headings from a high-level perspective. During the content generation process, this design effectively harnesses relevant information while minimizing costs by restricting the length of both input and output content in LLM queries. Our implementation with Qwen-long achieved third place in the NLPCC 2024 Scientific Literature Survey Generation evaluation task, with an overall score only 0.03% lower than the second-place team. Additionally, our soft heading recall is 95.84%, the second best among the submissions. Thanks to the efficient prompt design and the low cost of the Qwen-long API, our method reduces the expense for generating each literature survey to 0.1 RMB, enhancing the practical value of our method.

[NLP-27] Words Matter: Reducing Stigma in Online Conversations about Substance Use with Large Language Models
Link: https://arxiv.org/abs/2408.07873
Authors: Layla Bouzoubaa, Elham Aghakhani, Shadi Rezapour
Keywords: treatment engagement rates, significantly lower treatment, lower treatment engagement, engagement rates, lower treatment
Categories: Computation and Language (cs.CL)

Abstract: Stigma is a barrier to treatment for individuals struggling with substance use disorders (SUD), which leads to significantly lower treatment engagement rates. With only 7% of those affected receiving any form of help, societal stigma not only discourages individuals with SUD from seeking help but isolates them, hindering their recovery journey and perpetuating a cycle of shame and self-doubt. This study investigates how stigma manifests on social media, particularly Reddit, where anonymity can exacerbate discriminatory behaviors. We analyzed over 1.2 million posts, identifying 3,207 that exhibited stigmatizing language towards people who use substances (PWUS). Using Informed and Stylized LLMs, we develop a model for de-stigmatization of these expressions into empathetic language, resulting in 1,649 reformed phrase pairs. Our paper contributes to the field by proposing a computational framework for analyzing stigma and destigmatizing online content, and delving into the linguistic features that propagate stigma towards PWUS. Our work not only enhances understanding of stigma’s manifestations online but also provides practical tools for fostering a more supportive digital environment for those affected by SUD. Code and data will be made publicly available upon acceptance.

[NLP-28] Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

Link: https://arxiv.org/abs/2408.07852
Authors: Jiri Hron, Laura Culp, Gamaleldin Elsayed, Rosanne Liu, Ben Adlam, Maxwell Bileschi, Bernd Bohnet, JD Co-Reyes, Noah Fiedel, C. Daniel Freeman, Izzeddin Gur, Kathleen Kenealy, Jaehoon Lee, Peter J. Liu, Gaurav Mishra, Igor Mordatch, Azade Nova, Roman Novak, Aaron Parisi, Jeffrey Pennington, Alex Rizkowsky, Isabelle Simpson, Hanie Sedghi, Jascha Sohl-dickstein, Kevin Swersky, Sharad Vikram, Tris Warkentin, Lechao Xiao, Kelvin Xu, Jasper Snoek, Simon Kornblith
Keywords: increased training budget, capabilities of language, training budget, fully understood, increased training
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at COLM 2024. 16 pages, 11 figures

Abstract:While many capabilities of language models (LMs) improve with increased training budget, the influence of scale on hallucinations is not yet fully understood. Hallucinations come in many forms, and there is no universally accepted definition. We thus focus on studying only those hallucinations where a correct answer appears verbatim in the training set. To fully control the training data content, we construct a knowledge graph (KG)-based dataset, and use it to train a set of increasingly large LMs. We find that for a fixed dataset, larger and longer-trained LMs hallucinate less. However, hallucinating on $\leq 5\%$ of the training data requires an order of magnitude larger model, and thus an order of magnitude more compute, than Hoffmann et al. (2022) reported was optimal. Given this costliness, we study how hallucination detectors depend on scale. While we see detector size improves performance on fixed LM’s outputs, we find an inverse relationship between the scale of the LM and the detectability of its hallucinations.

[NLP-29] SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition INTERSPEECH2024

Link: https://arxiv.org/abs/2408.07851
Authors: Mohamed Osman, Daniel Z. Kaplan, Tamer Nadeem
Keywords: powerful self-supervised learning, made significant strides, self-supervised learning, made significant, significant strides
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at INTERSPEECH 2024

Abstract:Speech emotion recognition (SER) has made significant strides with the advent of powerful self-supervised learning (SSL) models. However, the generalization of these models to diverse languages and emotional expressions remains a challenge. We propose a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. Our benchmark includes a diverse set of multilingual datasets, focusing on less commonly used corpora to assess generalization to new data. We employ logit adjustment to account for varying class distributions and establish a single dataset cluster for systematic evaluation. Surprisingly, we find that the Whisper model, primarily designed for automatic speech recognition, outperforms dedicated SSL models in cross-lingual SER. Our results highlight the need for more robust and generalizable SER models, and our benchmark serves as a valuable resource to drive future research in this direction.
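
The abstract mentions logit adjustment for varying class distributions but does not give the exact form used. As a rough illustration only, the sketch below uses the common post-hoc variant, which subtracts a scaled log-prior from each class logit so that rare classes are not systematically under-predicted. The function names, the `tau` parameter, and the example numbers are assumptions made for this sketch, not details from the paper.

```python
import math


def logit_adjust(logits, class_priors, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) from each class logit.

    Assumption: this is the generic formulation, not necessarily the one
    used in the SER Evals benchmark.
    """
    return [z - tau * math.log(p) for z, p in zip(logits, class_priors)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical 3-class emotion example: class 0 dominates the training data.
logits = [2.0, 1.8, 0.5]
priors = [0.7, 0.2, 0.1]

adjusted = logit_adjust(logits, priors)
print(argmax(logits), argmax(adjusted))
```

With these made-up numbers the raw prediction is the majority class, while the adjusted prediction moves to the rarer class whose logit was nearly as high.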

[NLP-30] ONSEP: A Novel Online Neural-Symbolic Framework for Event Prediction Based on Large Language Model ACL2024

Link: https://arxiv.org/abs/2408.07840
Authors: Xuanqing Yu, Wangtao Sun, Jingwei Li, Kang Liu, Chengbao Liu, Jie Tan
Keywords: temporal knowledge graph, knowledge graph forecasting, temporal knowledge, graph forecasting, pivotal technique
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: 16 pages, ACL 2024 Findings

Abstract:In the realm of event prediction, temporal knowledge graph forecasting (TKGF) stands as a pivotal technique. Previous approaches face the challenges of not utilizing experience during testing and relying on a single short-term history, which limits adaptation to evolving data. In this paper, we introduce the Online Neural-Symbolic Event Prediction (ONSEP) framework, which innovates by integrating dynamic causal rule mining (DCRM) and dual history augmented generation (DHAG). DCRM dynamically constructs causal rules from real-time data, allowing for swift adaptation to new causal relationships. In parallel, DHAG merges short-term and long-term historical contexts, leveraging a bi-branch approach to enrich event prediction. Our framework demonstrates notable performance enhancements across diverse datasets, with significant Hit@k (k=1,3,10) improvements, showcasing its ability to augment large language models (LLMs) for event prediction without necessitating extensive retraining. The ONSEP framework not only advances the field of TKGF but also underscores the potential of neural-symbolic approaches in adapting to dynamic data environments.
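
For readers unfamiliar with the Hit@k metric reported above, here is a minimal sketch of how it is usually computed: the fraction of test queries whose correct answer appears among the top k ranked candidates. The rank list is invented for illustration, and the paper's evaluation protocol (for example, any filtering of candidates) is not described in the abstract.

```python
def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer has rank <= k (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the true event for five test queries.
ranks = [1, 4, 2, 11, 3]

for k in (1, 3, 10):
    print(f"Hit@{k} = {hits_at_k(ranks, k):.2f}")
```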

[NLP-31] Language Driven Slice Discovery and Error Rectification

Link: https://arxiv.org/abs/2408.07832
Authors: Shantanu Ghosh, Chenyu Wang, Kayhan Batmanghelich
Keywords: discovery associates structured, associates structured patterns, slice discovery associates, discover error slices, Error slice discovery
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Error slice discovery associates structured patterns with model errors. Existing methods discover error slices by clustering the error-prone samples with similar patterns or assigning discrete attributes to each sample for post-hoc analysis. While these methods aim for interpretability and easier mitigation through reweighting or rebalancing, they may not capture the full complexity of error patterns due to incomplete or missing attributes. Contrary to the existing approach, this paper utilizes the reasoning capabilities of the Large Language Model (LLM) to analyze complex error patterns and generate testable hypotheses. This paper proposes LADDER: Language Driven slice Discovery and Error Rectification. It first projects the model’s representation into a language-aligned feature space (e.g., CLIP) to preserve semantics in the original model feature space. This ensures the accurate retrieval of sentences that highlight the model’s errors. Next, the LLM utilizes the sentences and generates hypotheses to discover error slices. Finally, we mitigate the error by fine-tuning the classification head by creating a group-balanced dataset using the hypotheses. Our entire method does not require any attribute annotation, either explicitly or through external tagging models. We validate our method with five image classification datasets. The code is available at this https URL

[NLP-32] Enhancing Supply Chain Visibility with Knowledge Graphs and Large Language Models

Link: https://arxiv.org/abs/2408.07705
Authors: Sara AlMahri, Liming Xu, Alexandra Brintrup
Keywords: today globalized economy, supply chain, Large Language Models, comprehensive supply chain, supply
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In today’s globalized economy, comprehensive supply chain visibility is crucial for effective risk management. Achieving visibility remains a significant challenge due to limited information sharing among supply chain partners. This paper presents a novel framework leveraging Knowledge Graphs (KGs) and Large Language Models (LLMs) to enhance supply chain visibility without relying on direct stakeholder information sharing. Our zero-shot, LLM-driven approach automates the extraction of supply chain information from diverse public sources and constructs KGs to capture complex interdependencies between supply chain entities. We employ zero-shot prompting for Named Entity Recognition (NER) and Relation Extraction (RE) tasks, eliminating the need for extensive domain-specific training. We validate the framework with a case study on electric vehicle supply chains, focusing on tracking critical minerals for battery manufacturing. Results show significant improvements in supply chain mapping, extending visibility beyond tier-2 suppliers. The framework reveals critical dependencies and alternative sourcing options, enhancing risk management and strategic planning. With high accuracy in NER and RE tasks, it provides an effective tool for understanding complex, multi-tiered supply networks. This research offers a scalable, flexible method for constructing domain-specific supply chain KGs, addressing longstanding challenges in visibility and paving the way for advancements in digital supply chain surveillance.

[NLP-33] What Color Scheme is More Effective in Assisting Readers to Locate Information in a Color-Coded Article?

Link: https://arxiv.org/abs/2408.06494
Authors: Ho Yin Ng, Zeyu He, Ting-Hao ‘Kenneth’ Huang
Keywords: human cognitive activities, aiding human cognitive, Large Language Models, cluster information types, assigning specific colors
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Color coding, a technique assigning specific colors to cluster information types, has proven advantages in aiding human cognitive activities, especially reading and comprehension. The rise of Large Language Models (LLMs) has streamlined document coding, enabling simple automatic text labeling with various schemes. This has the potential to make color-coding more accessible and benefit more users. However, the impact of color choice on information seeking is understudied. We conducted a user study assessing various color schemes’ effectiveness in LLM-coded text documents, standardizing contrast ratios to approximately 5.55:1 across schemes. Participants performed timed information-seeking tasks in color-coded scholarly abstracts. Results showed non-analogous and yellow-inclusive color schemes improved performance, with the latter also being more preferred by participants. These findings can inform better color scheme choices for text annotation. As LLMs advance document coding, we advocate for more research focusing on the “color” aspect of color-coding techniques.
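
The abstract reports contrast ratios standardized to roughly 5.55:1 but does not say how they were measured. A plausible reading, stated here as an assumption, is the WCAG 2.x contrast-ratio definition, which compares the relative luminance of two sRGB colors. The sketch below implements that definition; the helper names are made up for illustration.

```python
def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) tuple with 8-bit channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio between two colors; ranges from 1.0 to 21.0."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


white, black = (255, 255, 255), (0, 0, 0)
print(round(contrast_ratio(white, black), 2))  # 21.0, the maximum possible ratio
```

Under this definition, a highlight color could be tuned against the text color until `contrast_ratio` returns a value near the target, which is one way a fixed ratio across schemes might be obtained.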

[NLP-34] Natural Language Outlines for Code: Literate Programming in the LLM Era

Link: https://arxiv.org/abs/2408.04820
Authors: Kensen Shi, Deniz Altınbüken, Saswat Anand, Mihai Christodorescu, Katja Grünwedel, Alexa Koenings, Sai Naidu, Anurag Pathak, Marc Rasi, Fredde Ribeiro, Brandon Ruffin, Siddhant Sanyam, Maxim Tabachnyk, Sara Toth, Roy Tu, Tobias Welp, Pengcheng Yin, Manzil Zaheer, Satish Chandra, Charles Sutton
Keywords: software development process, natural language outlines, development process, natural language, modality and interaction
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:We propose using natural language outlines as a novel modality and interaction surface for providing AI assistance to developers throughout the software development process. An NL outline for a code function comprises multiple statements written in concise prose, which partition the code and summarize its main ideas in the style of literate programming. Crucially, we find that modern LLMs can generate accurate and high-quality NL outlines in practice. Moreover, NL outlines enable a bidirectional sync between code and NL, allowing changes in one to be automatically reflected in the other. We discuss many use cases for NL outlines: they can accelerate understanding and navigation of code and diffs, simplify code maintenance, augment code search, steer code generation, and more. We then propose and compare multiple LLM prompting techniques for generating outlines and ask professional developers to judge outline quality. Finally, we present two case studies applying NL outlines toward code review and the difficult task of malware detection.

[NLP-35] Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words

Link: https://arxiv.org/abs/2408.08027
Authors: Kento Nozawa, Takashi Masuko, Toru Taniguchi
Keywords: based automatic speech, large language model, automatic speech recognition, based automatic, develop a large
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 13 pages, 1 figure, and 7 tables

Abstract:We develop a large language model (LLM) based automatic speech recognition (ASR) system that can be contextualized by providing keywords as prior information in text prompts. We adopt decoder-only architecture and use our in-house LLM, PLaMo-100B, pre-trained from scratch using datasets dominated by Japanese and English texts as the decoder. We adopt a pre-trained Whisper encoder as an audio encoder, and the audio embeddings from the audio encoder are projected to the text embedding space by an adapter layer and concatenated with text embeddings converted from text prompts to form inputs to the decoder. By providing keywords as prior information in the text prompts, we can contextualize our LLM-based ASR system without modifying the model architecture to transcribe ambiguous words in the input audio accurately. Experimental results demonstrate that providing keywords to the decoder can significantly improve the recognition performance of rare and ambiguous words.

Artificial Intelligence

[AI-0] Can Large Language Models Understand Symbolic Graphics Programs?

Link: https://arxiv.org/abs/2408.08313
Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
Keywords: symbolic graphics programs, Assessing the capabilities, symbolic graphics, graphics content, graphics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report v1 (44 pages, 23 figures, project page: this https URL)

[AI-1] HyperTaxel: Hyper-Resolution for Taxel-Based Tactile Signals Through Contrastive Learning IROS2024

Link: https://arxiv.org/abs/2408.08312
Authors: Hongyu Li, Snehal Dikhale, Jinda Cui, Soshi Iba, Nawid Jamali
Keywords: achieve dexterity comparable, intelligently process tactile, robots must intelligently, tactile sensor data, Taxel-based tactile signals
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by IROS 2024

Abstract:To achieve dexterity comparable to that of humans, robots must intelligently process tactile sensor data. Taxel-based tactile signals often have low spatial-resolution, with non-standardized representations. In this paper, we propose a novel framework, HyperTaxel, for learning a geometrically-informed representation of taxel-based tactile signals to address challenges associated with their spatial resolution. We use this representation and a contrastive learning objective to encode and map sparse low-resolution taxel signals to high-resolution contact surfaces. To address the uncertainty inherent in these signals, we leverage joint probability distributions across multiple simultaneous contacts to improve taxel hyper-resolution. We evaluate our representation by comparing it with two baselines and present results that suggest our representation outperforms the baselines. Furthermore, we present qualitative results that demonstrate the learned representation captures the geometric features of the contact surface, such as flatness, curvature, and edges, and generalizes across different objects and sensor configurations. Moreover, we present results that suggest our representation improves the performance of various downstream tasks, such as surface classification, 6D in-hand pose estimation, and sim-to-real transfer.

[AI-2] Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors

Link: https://arxiv.org/abs/2408.08302
Authors: Usman Syed, Ethan Light, Xingang Guo, Huan Zhang, Lianhui Qin, Yanfeng Ouyang, Bin Hu
Keywords: large language models, transportation engineering problems, selected undergraduate-level transportation, undergraduate-level transportation engineering, transportation engineering
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

[AI-3] SLCA: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training ICCV23

Link: https://arxiv.org/abs/2408.08295
Authors: Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, Yunchao Wei
Keywords: received widespread interest, recent years, widespread interest, training from scratch, received widespread
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper is an extension of our ICCV 23 paper (arXiv:2303.05118)

Abstract:In recent years, continual learning with pre-training (CLPT) has received widespread interest, instead of its traditional focus of training from scratch. The use of strong pre-trained models (PTMs) can greatly facilitate knowledge transfer and alleviate catastrophic forgetting, but also suffers from progressive overfitting of pre-trained knowledge into specific downstream tasks. A majority of current efforts often keep the PTMs frozen and incorporate task-specific prompts to instruct representation learning, coupled with a prompt selection process for inference. However, due to the limited capacity of prompt parameters, this strategy demonstrates only sub-optimal performance in continual learning. In comparison, tuning all parameters of PTMs often provides the greatest potential for representation learning, making sequential fine-tuning (Seq FT) a fundamental baseline that has been overlooked in CLPT. To this end, we present an in-depth analysis of the progressive overfitting problem from the lens of Seq FT. Considering that the overly fast representation learning and the biased classification layer constitute this particular problem, we introduce the advanced Slow Learner with Classifier Alignment (SLCA++) framework to unleash the power of Seq FT, serving as a strong baseline approach for CLPT. Our approach involves a Slow Learner to selectively reduce the learning rate of backbone parameters, and a Classifier Alignment to align the disjoint classification layers in a post-hoc fashion. We further enhance the efficacy of SL with a symmetric cross-entropy loss, as well as employ a parameter-efficient strategy to implement Seq FT with SLCA++. Across a variety of continual learning scenarios on image classification benchmarks, our approach provides substantial improvements and outperforms state-of-the-art methods by a large margin. Code: this https URL.

[AI-4] Autonomous Behavior Planning For Humanoid Loco-manipulation Through Grounded Language Model IROS2024

Link: https://arxiv.org/abs/2408.08282
Authors: Jin Wang, Arturo Laurenzi, Nikos Tsagarakis
Keywords: achieving embodied intelligence, Enabling humanoid robots, Enabling humanoid, embodied intelligence, perform autonomously loco-manipulation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Paper accepted by IROS 2024

Abstract:Enabling humanoid robots to perform autonomously loco-manipulation in unstructured environments is crucial and highly challenging for achieving embodied intelligence. This involves robots being able to plan their actions and behaviors in long-horizon tasks while using multi-modality to perceive deviations between task execution and high-level planning. Recently, large language models (LLMs) have demonstrated powerful planning and reasoning capabilities for comprehension and processing of semantic information through robot control tasks, as well as the usability of analytical judgment and decision-making for multi-modal inputs. To leverage the power of LLMs towards humanoid loco-manipulation, we propose a novel language-model based framework that enables robots to autonomously plan behaviors and low-level execution under given textual instructions, while observing and correcting failures that may occur during task execution. To systematically evaluate this framework in grounding LLMs, we created the robot ‘action’ and ‘sensing’ behavior library for task planning, and conducted mobile manipulation tasks and experiments in both simulated and real environments using the CENTAURO robot, and verified the effectiveness and application of this approach in robotic tasks with autonomous behavioral planning.

[AI-5] InVAErt networks for amortized inference and identifiability analysis of lumped parameter hemodynamic models

Link: https://arxiv.org/abs/2408.08264
Authors: Guoxiang Grayson Tong, Carlos A. Sing Long, Daniele E. Schiavazzi
Keywords: electronic health records, significant challenge primarily, challenge primarily due, Estimation of cardiovascular, cardiovascular model parameters
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Abstract:Estimation of cardiovascular model parameters from electronic health records (EHR) poses a significant challenge primarily due to lack of identifiability. Structural non-identifiability arises when a manifold in the space of parameters is mapped to a common output, while practical non-identifiability can result due to limited data, model misspecification, or noise corruption. To address the resulting ill-posed inverse problem, optimization-based or Bayesian inference approaches typically use regularization, thereby limiting the possibility of discovering multiple solutions. In this study, we use inVAErt networks, a neural network-based, data-driven framework for enhanced digital twin analysis of stiff dynamical systems. We demonstrate the flexibility and effectiveness of inVAErt networks in the context of physiological inversion of a six-compartment lumped parameter hemodynamic model from synthetic data to real data with missing components.

[AI-6] Snuffy: Efficient Whole Slide Image Classifier ECCV2024

Link: https://arxiv.org/abs/2408.08258
Authors: Hossein Jafarinia, Alireza Alipanah, Danial Hamdi, Saeed Razavi, Nahal Mirzaie, Mohammad Hossein Rohban
Keywords: multiple instance learning, significant computational challenges, faces significant computational, Slide Image, digital pathology faces
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
Comments: Accepted for ECCV 2024

Abstract:Whole Slide Image (WSI) classification with multiple instance learning (MIL) in digital pathology faces significant computational challenges. Current methods mostly rely on extensive self-supervised learning (SSL) for satisfactory performance, requiring long training periods and considerable computational resources. At the same time, no pre-training affects performance due to domain shifts from natural images to WSIs. We introduce Snuffy architecture, a novel MIL-pooling method based on sparse transformers that mitigates performance loss with limited pre-training and enables continual few-shot pre-training as a competitive option. Our sparsity pattern is tailored for pathology and is theoretically proven to be a universal approximator with the tightest probabilistic sharp bound on the number of layers for sparse transformers, to date. We demonstrate Snuffy’s effectiveness on CAMELYON16 and TCGA Lung cancer datasets, achieving superior WSI and patch-level accuracies. The code is available on this https URL.

[AI-7] Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding

Link: https://arxiv.org/abs/2408.08252
Authors: Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gokcen Eraslan, Surag Nair, Tommaso Biancalani, Aviv Regev, Sergey Levine, Masatoshi Uehara
Keywords: natural design spaces, design spaces, excel at capturing, Diffusion models, Diffusion models excel
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Machine Learning (stat.ML)
Comments: The code is available at this https URL

Abstract:Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences. However, rather than merely generating designs that are natural, we often aim to optimize downstream reward functions while preserving the naturalness of these design spaces. Existing methods for achieving this goal often require “differentiable” proxy models (e.g., classifier guidance or DPS) or involve computationally expensive fine-tuning of diffusion models (e.g., classifier-free guidance, RL-based fine-tuning). In our work, we propose a new method to address these challenges. Our algorithm is an iterative sampling method that integrates soft value functions, which looks ahead to how intermediate noisy states lead to high rewards in the future, into the standard inference procedure of pre-trained diffusion models. Notably, our approach avoids fine-tuning generative models and eliminates the need to construct differentiable models. This enables us to (1) directly utilize non-differentiable features/reward feedback, commonly used in many scientific domains, and (2) apply our method to recent discrete diffusion models in a principled way. Finally, we demonstrate the effectiveness of our algorithm across several domains, including image generation, molecule generation, and DNA/RNA sequence generation. The code is available at this https URL.

[AI-8] Conformalized Answer Set Prediction for Knowledge Graph Embedding

Link: https://arxiv.org/abs/2408.08248
Authors: Yuqicheng Zhu, Nico Potyka, Jiarong Pan, Bo Xiong, Yunjie He, Evgeny Kharlamov, Steffen Staab
Keywords: Knowledge graph embeddings, apply machine learning, provide non-classical reasoning, non-classical reasoning capabilities, reasoning capabilities based
Categories: Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Knowledge graph embeddings (KGE) apply machine learning methods on knowledge graphs (KGs) to provide non-classical reasoning capabilities based on similarities and analogies. The learned KG embeddings are typically used to answer queries by ranking all potential answers, but rankings often lack a meaningful probabilistic interpretation - lower-ranked answers do not necessarily have a lower probability of being true. This limitation makes it difficult to distinguish plausible from implausible answers, posing challenges for the application of KGE methods in high-stakes domains like medicine. We address this issue by applying the theory of conformal prediction that allows generating answer sets, which contain the correct answer with probabilistic guarantees. We explain how conformal prediction can be used to generate such answer sets for link prediction tasks. Our empirical evaluation on four benchmark datasets using six representative KGE methods validates that the generated answer sets satisfy the probabilistic guarantees given by the theory of conformal prediction. We also demonstrate that the generated answer sets often have a sensible size and that the size adapts well with respect to the difficulty of the query.
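
To make the idea of answer sets with probabilistic guarantees more concrete, here is a minimal sketch of generic split conformal prediction. It is not the paper's specific procedure: the nonconformity scores, the candidate entities, and the function names are all assumptions for illustration. A threshold is chosen from held-out calibration scores, and every candidate whose score falls at or below it is returned.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores.

    Uses the standard split-conformal rank ceil((n + 1) * (1 - alpha)).
    If that rank exceeds n, no finite threshold gives the guarantee.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def answer_set(candidate_scores, threshold):
    """All candidates whose nonconformity score is within the threshold."""
    return {c for c, s in candidate_scores.items() if s <= threshold}


# Hypothetical calibration scores (lower means the true answer looked more plausible).
cal_scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.9]
threshold = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical candidate tails for a query such as (France, capital, ?).
candidates = {"Paris": 0.1, "Lyon": 0.5, "Berlin": 0.95}
print(threshold, sorted(answer_set(candidates, threshold)))
```

A harder query tends to give more candidates a score under the threshold, which is the kind of size adaptation the abstract describes.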

[AI-9] A Conflicts-free Speed-lossless KAN-based Reinforcement Learning Decision System for Interactive Driving in Roundabouts

Link: https://arxiv.org/abs/2408.08242
Authors: Zhihao Lin, Zhen Tian, Qi Zhang, Ziyang Ye, Hanyang Zhuang, Jianglin Lan
Keywords: human-driven vehicles coexist, autonomous vehicles, vehicles coexist, human-driven vehicles, safe and efficient
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 15 pages, 12 figures, submitted to an IEEE journal

Abstract:Safety and efficiency are crucial for autonomous driving in roundabouts, especially in the context of mixed traffic where autonomous vehicles (AVs) and human-driven vehicles coexist. This paper introduces a learning-based algorithm tailored to foster safe and efficient driving behaviors across varying levels of traffic flows in roundabouts. The proposed algorithm employs a deep Q-learning network to effectively learn safe and efficient driving strategies in complex multi-vehicle roundabouts. Additionally, a KAN (Kolmogorov-Arnold network) enhances the AVs’ ability to learn their surroundings robustly and precisely. An action inspector is integrated to replace dangerous actions to avoid collisions when the AV interacts with the environment, and a route planner is proposed to enhance the driving efficiency and safety of the AVs. Moreover, a model predictive control is adopted to ensure stability and precision of the driving actions. The results show that our proposed system consistently achieves safe and efficient driving whilst maintaining a stable training process, as evidenced by the smooth convergence of the reward function and the low variance in the training curves across various traffic flows. Compared to state-of-the-art benchmarks, the proposed algorithm achieves a lower number of collisions and reduced travel time to destination.

[AI-10] Explaining an Agent's Future Beliefs through Temporally Decomposing Future Reward Estimators ECAI2024

Link: https://arxiv.org/abs/2408.08230
Authors: Mark Towers, Yali Du, Christopher Freeman, Timothy J. Norman
Keywords: Q-value and state-value, reinforcement learning agents, Future reward estimation, future rewards, state-value functions
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages + 3 pages of supplementary material. Published at ECAI 2024

Abstract:Future reward estimation is a core component of reinforcement learning agents; i.e., Q-value and state-value functions, predicting an agent’s sum of future rewards. Their scalar output, however, obfuscates when or what individual future rewards an agent may expect to receive. We address this by modifying an agent’s future reward estimator to predict their next N expected rewards, referred to as Temporal Reward Decomposition (TRD). This unlocks novel explanations of agent behaviour. Through TRD we can: estimate when an agent may expect to receive a reward, the value of the reward and the agent’s confidence in receiving it; measure an input feature’s temporal importance to the agent’s action decisions; and predict the influence of different actions on future rewards. Furthermore, we show that DQN agents trained on Atari environments can be efficiently retrained to incorporate TRD with minimal impact on performance.
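
As a rough illustration of what a temporally decomposed target could look like, the sketch below splits a discounted return into its first N per-step contributions plus a tail. This is an assumption-laden toy and not the authors' estimator: the function name, the reward sequence, and the discount factor are hypothetical, and the paper's treatment of rewards beyond step N may differ.

```python
def decomposed_return(rewards, gamma, n):
    """Split a discounted return into its first n per-step terms plus a tail."""
    contributions = [gamma**t * r for t, r in enumerate(rewards)]
    head = contributions[:n]          # what a decomposed estimator would predict per step
    tail = sum(contributions[n:])     # everything after the first n steps
    return head, tail

# Hypothetical reward sequence observed after taking an action.
rewards = [0.0, 0.0, 1.0, 0.0, 2.0]
head, tail = decomposed_return(rewards, gamma=0.5, n=3)

# The scalar return a conventional Q-value targets is the sum of all parts.
scalar_return = sum(head) + tail
print(head, tail, scalar_return)
```

The per-step vector shows when the reward arrives (step 3 here), which the scalar return alone hides.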

[AI-11] Evolving A* to Efficiently Solve the k Shortest-Path Problem (Extended Version) ECAI

Link: https://arxiv.org/abs/2408.08227
Authors: Carlos Linares López, Ian Herman
Keywords: shortest path problem, shortest path, widely studied, single shortest path, path problem
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
Comments: 249 plots in 48 figures, and 81 tables. This is an extended version of the paper: Linares López, Carlos and Herman, Ian. 2024. Evolving A* to Efficiently Solve the k Shortest-Path Problem. Proceedings of the European Conference on Artificial Intelligence (ECAI). To appear

Abstract: The problem of finding the shortest path in a graph G(V, E) has been widely studied. However, in many applications it is necessary to compute an arbitrary number of them, k. Even though the problem has raised a lot of interest from different research communities and many applications of it are known, it has not been addressed to the same extent as the single shortest path problem. The best algorithm known for efficiently solving this task has a time complexity of O(|E| + |V| log |V| + k|V|) when computing paths in explicit form, and is based on best-first search. This paper introduces a new search algorithm with the same time complexity, which results from a natural evolution of A*; thus, it preserves all its interesting properties, making it widely applicable to many different domains. Experiments in various testbeds show a significant improvement in performance over the state of the art, often by one or two orders of magnitude.
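
For readers new to the k shortest-path problem, the sketch below enumerates the k cheapest paths with a plain uninformed best-first search. It is a textbook-style baseline, not the algorithm proposed in the paper, and it does not achieve the complexity quoted above. The graph, the function name, and the simplifications (non-negative costs, paths may revisit vertices but stop on first reaching the target) are assumptions made for illustration.

```python
import heapq

def k_shortest_paths(graph, source, target, k):
    """Return up to k cheapest (cost, path) pairs from source to target.

    graph maps each vertex to a list of (neighbour, non-negative cost) pairs.
    """
    expansions = {}               # how many times each vertex has been popped
    found = []
    frontier = [(0, [source])]    # priority queue ordered by path cost
    while frontier and len(found) < k:
        cost, path = heapq.heappop(frontier)
        vertex = path[-1]
        expansions[vertex] = expansions.get(vertex, 0) + 1
        if expansions[vertex] > k:
            continue  # the k cheapest routes into this vertex are already known
        if vertex == target:
            found.append((cost, path))
            continue  # a path stops the first time it reaches the target
        for neighbour, edge_cost in graph.get(vertex, []):
            heapq.heappush(frontier, (cost + edge_cost, path + [neighbour]))
    return found

# Hypothetical toy graph.
graph = {
    "s": [("a", 1), ("b", 4)],
    "a": [("t", 5), ("b", 1)],
    "b": [("t", 1)],
    "t": [],
}
print(k_shortest_paths(graph, "s", "t", k=2))
```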

[AI-12] Predictive Multiplicity of Knowledge Graph Embeddings in Link Prediction

Link: https://arxiv.org/abs/2408.08226
Authors: Yuqicheng Zhu, Nico Potyka, Mojtaba Nayyeri, Bo Xiong, Yunjie He, Evgeny Kharlamov, Steffen Staab
Keywords: Knowledge graph embedding, Knowledge graph, predict missing links, Knowledge, predict missing
Categories: Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract: Knowledge graph embedding (KGE) models are often used to predict missing links for knowledge graphs (KGs). However, multiple KG embeddings can perform almost equally well for link prediction yet suggest conflicting predictions for certain queries, termed predictive multiplicity in the literature. This behavior poses substantial risks for KGE-based applications in high-stakes domains but has been overlooked in KGE research. In this paper, we define predictive multiplicity in link prediction. We introduce evaluation metrics and measure predictive multiplicity for representative KGE methods on commonly used benchmark datasets. Our empirical study reveals significant predictive multiplicity in link prediction, with 8% to 39% of testing queries exhibiting conflicting predictions. To address this issue, we propose leveraging voting methods from social choice theory, significantly mitigating conflicts by 66% to 78% according to our experiments.
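
One well-known voting rule from social choice theory is the Borda count. The sketch below shows how it could merge conflicting rankings from several equally good models. Whether the paper uses Borda specifically is not stated in the abstract, so treat the rule, the function name, and the example rankings as assumptions.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Merge ranked candidate lists into one ranking with the Borda count."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # Top position earns n - 1 points, last position earns 0.
            points[candidate] += n - 1 - position
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(points, key=lambda c: (-points[c], c))

# Hypothetical rankings from three models for the same link-prediction query.
rankings = [
    ["Paris", "Lyon", "Nice"],
    ["Lyon", "Paris", "Nice"],
    ["Paris", "Nice", "Lyon"],
]
print(borda_aggregate(rankings))
```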

[AI-13] The Dawn of KAN in Image-to-Image (I2I) Translation: Integrating Kolmogorov-Arnold Networks with GANs for Unpaired I2I Translation

Link: https://arxiv.org/abs/2408.08216
Authors: Arpan Mahara, Naphtali D. Rishe, Liangdong Deng
Keywords: Generative Artificial Intelligence, Artificial Intelligence, applications spanning healthcare, Generative Adversarial Networks, Generative Artificial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 Figures, 1 Table

Abstract:Image-to-Image translation in Generative Artificial Intelligence (Generative AI) has been a central focus of research, with applications spanning healthcare, remote sensing, physics, chemistry, photography, and more. Among the numerous methodologies, Generative Adversarial Networks (GANs) with contrastive learning have been particularly successful. This study aims to demonstrate that the Kolmogorov-Arnold Network (KAN) can effectively replace the Multi-layer Perceptron (MLP) method in generative AI, particularly in the subdomain of image-to-image translation, to achieve better generative quality. Our novel approach replaces the two-layer MLP with a two-layer KAN in the existing Contrastive Unpaired Image-to-Image Translation (CUT) model, developing the KAN-CUT model. This substitution favors the generation of more informative features in low-dimensional vector representations, which contrastive learning can utilize more effectively to produce high-quality images in the target domain. Extensive experiments, detailed in the results section, demonstrate the applicability of KAN in conjunction with contrastive learning and GANs in Generative AI, particularly for image-to-image translation. This work suggests that KAN could be a valuable component in the broader generative AI domain.

[AI-14] Moving Healthcare AI-Support Systems for Visually Detectable Diseases onto Constrained Devices

Link: https://arxiv.org/abs/2408.08215
Authors: Tess Watt, Christos Chrysoulas, Peter J Barclay
Keywords: reach rural areas, including hard, rural areas, hard to reach, reach rural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures

Abstract:Image classification usually requires connectivity and access to the cloud which is often limited in many parts of the world, including hard to reach rural areas. TinyML aims to solve this problem by hosting AI assistants on constrained devices, eliminating connectivity issues by processing data within the device itself, without internet or cloud access. This pilot study explores the use of tinyML to provide healthcare support with low spec devices in low connectivity environments, focusing on diagnosis of skin diseases and the ethical use of AI assistants in a healthcare setting. To investigate this, 10,000 images of skin lesions were used to train a model for classifying visually detectable diseases (VDDs). The model weights were then offloaded to a Raspberry Pi with a webcam attached, to be used for the classification of skin lesions without internet access. It was found that the developed prototype achieved a test accuracy of 78% and a test loss of 1.08.

[AI-15] Federated Fairness Analytics: Quantifying Fairness in Federated Learning

Link: https://arxiv.org/abs/2408.08214
Authors: Oscar Dilley, Juan Marcelo Parra-Ullauri, Rasheed Hussain, Dimitra Simeonidou
Keywords: Federated Learning, privacy-enhancing technology, technology for distributed, fairness, Federated Fairness Analytics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT); Neural and Evolutionary Computing (cs.NE)

Abstract:Federated Learning (FL) is a privacy-enhancing technology for distributed ML. By training models locally and aggregating updates - a federation learns together, while bypassing centralised data collection. FL is increasingly popular in healthcare, finance and personal computing. However, it inherits fairness challenges from classical ML and introduces new ones, resulting from differences in data quality, client participation, communication constraints, aggregation methods and underlying hardware. Fairness remains an unresolved issue in FL and the community has identified an absence of succinct definitions and metrics to quantify fairness; to address this, we propose Federated Fairness Analytics - a methodology for measuring fairness. Our definition of fairness comprises four notions with novel, corresponding metrics. They are symptomatically defined and leverage techniques originating from XAI, cooperative game-theory and networking engineering. We tested a range of experimental settings, varying the FL approach, ML task and data settings. The results show that statistical heterogeneity and client participation affect fairness and fairness conscious approaches such as Ditto and q-FedAvg marginally improve fairness-performance trade-offs. Using our techniques, FL practitioners can uncover previously unobtainable insights into their system’s fairness, at differing levels of granularity in order to address fairness challenges in FL. We have open-sourced our work at: this https URL.

[AI-16] LLM4DSR: Leveraging Large Language Model for Denoising Sequential Recommendation

Link: https://arxiv.org/abs/2408.08208
Authors: Bohao Wang, Feng Liu, Jiawei Chen, Yudi Wu, Xingyu Lou, Jun Wang, Yan Feng, Chun Chen, Can Wang
Keywords: systems fundamentally rely, users’ historical interaction, recommendation systems fundamentally, historical interaction sequences, systems fundamentally
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract: Sequential recommendation systems fundamentally rely on users’ historical interaction sequences, which are often contaminated by noisy interactions. Identifying these noisy interactions accurately without additional information is particularly difficult due to the lack of explicit supervisory signals to denote noise. Large Language Models (LLMs), equipped with extensive open knowledge and semantic reasoning abilities, present a promising avenue to bridge this information gap. However, employing LLMs for denoising in sequential recommendation introduces notable challenges: 1) Direct application of pretrained LLMs may not be competent for the denoising task, frequently generating nonsensical responses; 2) Even after fine-tuning, the reliability of LLM outputs remains questionable, especially given the complexity of the task and the inherent hallucinatory issue of LLMs. To tackle these challenges, we propose LLM4DSR, a tailored approach for denoising sequential recommendation using LLMs. We constructed a self-supervised fine-tuning task to activate LLMs’ capabilities to identify noisy items and suggest replacements. Furthermore, we developed an uncertainty estimation module that ensures only high-confidence responses are utilized for sequence corrections. Remarkably, LLM4DSR is model-agnostic, allowing the corrected sequences to be flexibly applied across various recommendation models. Extensive experiments validate the superiority of LLM4DSR over existing methods across three datasets and three recommendation backbones.

[AI-17] Scaling Up Natural Language Understanding for Multi-Robots Through the Lens of Hierarchy

Link: https://arxiv.org/abs/2408.08188
Authors: Shaojun Xu, Xusheng Luo, Yutong Huang, Letian Leng, Ruixuan Liu, Changliu Liu
Keywords: computational complexity, Long-horizon planning, Large Language Models, Linear Temporal Logic, uncertainty accumulation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)

Abstract: Long-horizon planning is hindered by challenges such as uncertainty accumulation, computational complexity, delayed rewards and incomplete information. This work proposes an approach to exploit the task hierarchy from human instructions to facilitate multi-robot planning. Using Large Language Models (LLMs), we propose a two-step approach to translate multi-sentence instructions into a structured language, Hierarchical Linear Temporal Logic (LTL), which serves as a formal representation for planning. Initially, LLMs transform the instructions into a hierarchical representation defined as Hierarchical Task Tree, capturing the logical and temporal relations among tasks. Following this, a domain-specific fine-tuning of LLM translates sub-tasks of each task into flat LTL formulas, aggregating them to form hierarchical LTL specifications. These specifications are then leveraged for planning using off-the-shelf planners. Our framework not only bridges the gap between instructions and algorithmic planning but also showcases the potential of LLMs in harnessing hierarchical reasoning to automate multi-robot task planning. Through evaluations in both simulation and real-world experiments involving human participants, we demonstrate that our method can handle more complex instructions compared to existing methods. The results indicate that our approach achieves higher success rates and lower costs in multi-robot task allocation and plan generation. Demo videos are available at this https URL.

[AI-18] Your Turn: Real-World Turning Angle Estimation for Parkinson's Disease Severity Assessment

Link: https://arxiv.org/abs/2408.08182
Authors: Qiushuo Cheng, Catherine Morgan, Arindam Sikdar, Alessandro Masullo, Alan Whone, Majid Mirmehdi
Keywords: experience progressively worsening, progressively worsening gait, People with Parkinson, Parkinson Disease, experience progressively
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: People with Parkinson’s Disease (PD) often experience progressively worsening gait, including changes in how they turn around, as the disease progresses. Existing clinical rating tools are not capable of capturing hour-by-hour variations of PD symptoms, as they are confined to brief assessments within clinic settings. Measuring real-world gait turning angles continuously and passively is a component step towards using gait characteristics as sensitive indicators of disease progression in PD. This paper presents a deep learning-based approach to automatically quantify turning angles by extracting 3D skeletons from videos and calculating the rotation of hip and knee joints. We utilise state-of-the-art human pose estimation models, Fastpose and Strided Transformer, on a total of 1386 turning video clips from 24 subjects (12 people with PD and 12 healthy control volunteers), trimmed from a PD dataset of unscripted free-living videos in a home-like setting (Turn-REMAP). We also curate a turning video dataset, Turn-H3.6M, from the public Human3.6M human pose benchmark with 3D ground truth, to further validate our method. Previous gait research has primarily taken place in clinics or laboratories evaluating scripted gait outcomes, but this work focuses on real-world settings where complexities exist, such as baggy clothing and poor lighting. Due to difficulties in obtaining accurate ground truth data in a free-living setting, we quantise the angle into the nearest 45° bin based on the manual labelling of expert clinicians. Our method achieves a turning calculation accuracy of 41.6%, a Mean Absolute Error (MAE) of 34.7°, and a weighted precision WPrec of 68.3% for Turn-REMAP. This is the first work to explore the use of single monocular camera data to quantify turns by PD patients in a home setting.
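
The quantisation and scoring described above can be sketched in a few lines. This is a guess at the general procedure, not the authors' evaluation code: the function names and sample angles are hypothetical, and the sketch assumes MAE is taken between the raw predicted angle and the clinician's binned label.

```python
import math

def quantise_angle(angle_deg, bin_size=45):
    """Snap an angle in degrees to the nearest multiple of bin_size (ties round up)."""
    return bin_size * math.floor(angle_deg / bin_size + 0.5)

def bin_accuracy(predicted, labels, bin_size=45):
    """Fraction of predictions whose nearest bin equals the labelled bin."""
    hits = sum(quantise_angle(p, bin_size) == y for p, y in zip(predicted, labels))
    return hits / len(labels)

def mean_absolute_error(predicted, labels):
    """Average absolute gap between raw predictions and labels, in degrees."""
    return sum(abs(p - y) for p, y in zip(predicted, labels)) / len(labels)

# Hypothetical predicted turning angles and clinician labels (multiples of 45 degrees).
predicted = [100, 170, 80, 200]
labels = [90, 180, 135, 180]
print(bin_accuracy(predicted, labels), mean_absolute_error(predicted, labels))
```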

[AI-19] Towards flexible perception with visual memory

Link: https://arxiv.org/abs/2408.08172
Authors: Robert Geirhos, Priyank Jaini, Austin Stone, Sourabh Medapati, Xi Yi, George Toderici, Abhijit Ogale, Jonathon Shlens
Keywords: deep neural networks, monolithic endeavor, process is completed, information is distributed, Training a neural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Training a neural network is a monolithic endeavor, akin to carving knowledge into stone: once the process is completed, editing the knowledge in a network is nearly impossible, since all information is distributed across the network’s weights. We here explore a simple, compelling alternative by marrying the representational power of deep neural networks with the flexibility of a database. Decomposing the task of image classification into image similarity (from a pre-trained embedding) and search (via fast nearest neighbor retrieval from a knowledge database), we build a simple and flexible visual memory that has the following key capabilities: (1.) The ability to flexibly add data across scales: from individual samples all the way to entire classes and billion-scale data; (2.) The ability to remove data through unlearning and memory pruning; (3.) An interpretable decision-mechanism on which we can intervene to control its behavior. Taken together, these capabilities comprehensively demonstrate the benefits of an explicit visual memory. We hope that it might contribute to a conversation on how knowledge should be represented in deep vision models – beyond carving it in “stone” weights.
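
The split into "embedding similarity" plus "database search" can be illustrated with a toy memory that supports adding and removing data without retraining. This is a minimal sketch under strong assumptions: it uses a single nearest neighbour with Euclidean distance on hand-written 2-D vectors, whereas the paper relies on a pre-trained embedding and fast retrieval at scale. The class and method names are hypothetical.

```python
import math

class VisualMemory:
    """Toy memory of (embedding, label) pairs queried by nearest neighbour."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, label):
        """Store one labelled embedding; no retraining is involved."""
        self.entries.append((tuple(embedding), label))

    def remove_class(self, label):
        """Forget a whole class by deleting its entries."""
        self.entries = [(e, l) for e, l in self.entries if l != label]

    def classify(self, query):
        """Return the label of the stored embedding closest to the query."""
        _, label = min(self.entries, key=lambda entry: math.dist(entry[0], query))
        return label

# Hypothetical 2-D embeddings standing in for features from a pre-trained model.
memory = VisualMemory()
memory.add([1.0, 0.0], "cat")
memory.add([0.0, 1.0], "dog")
memory.add([0.9, 0.2], "lynx")

query = (0.8, 0.3)
before = memory.classify(query)   # nearest stored entry is the lynx example
memory.remove_class("lynx")       # edit the knowledge by deleting data
after = memory.classify(query)    # falls back to the next closest class
print(before, after)
```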

[AI-20] General-purpose Clothes Manipulation with Semantic Keypoints

Link: https://arxiv.org/abs/2408.08160
Authors: Yuhong Deng, David Hsu
Keywords: clothes manipulation, recent progress, progress in task-specific, clothes manipulation tasks, task-specific clothes manipulation
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:We have seen much recent progress in task-specific clothes manipulation, but generalizable clothes manipulation is still a challenge. Clothes manipulation requires sequential actions, making it challenging to generalize to unseen tasks. Besides, a general clothes state representation method is crucial. In this paper, we adopt language instructions to specify and decompose clothes manipulation tasks, and propose a large language model based hierarchical learning method to enhance generalization. For state representation, we use semantic keypoints to capture the geometry of clothes and outline their manipulation methods. Simulation experiments show that the proposed method outperforms the baseline method in terms of success rate and generalization for clothes manipulation tasks.

[AI-21] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

Link: https://arxiv.org/abs/2408.08152
Authors: Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, Chong Ruan
Keywords: open-source language model, language model designed, inference processes, optimizing both training, training and inference
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

[AI-22] Winning Snake: Design Choices in Multi-Shot ASP

Link: https://arxiv.org/abs/2408.08150
Authors: Elisa Böhl, Stefan Ellmauthaler, Sarah Alice Gaggl
Keywords: Answer set programming, Answer set, knowledge representation paradigm, well-understood and established, established problem-solving
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures, to appear in Theory and Practice of Logic Programming (TPLP), Proceedings of ICLP 2024

Abstract:Answer set programming is a well-understood and established problem-solving and knowledge representation paradigm. It has become more prominent amongst a wider audience due to its multiple applications in science and industry. The constant development of advanced programming and modeling techniques extends the toolset for developers and users regularly. This paper demonstrates different techniques to reuse logic program parts (multi-shot) by solving the arcade game snake. This game is particularly interesting because a victory can be assured by solving the underlying NP-hard problem of Hamiltonian Cycles. We will demonstrate five hands-on implementations in clingo and compare their performance in an empirical evaluation. In addition, our implementation utilizes clingraph to generate a simple yet informative image representation of the game’s progress.

[AI-23] Model-based Workflow for the Automated Generation of PDDL Descriptions

Link: https://arxiv.org/abs/2408.08145
Authors: Hamied Nabizada, Tom Jeleniewski, Felix Gehlhoff, Alexander Fay
Keywords: Domain Definition Language, Planning Domain Definition, Definition Language, Domain Definition, Manually creating Planning
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Manually creating Planning Domain Definition Language (PDDL) descriptions is difficult, error-prone, and requires extensive expert knowledge. However, this knowledge is already embedded in engineering models and can be reused. Therefore, this contribution presents a comprehensive workflow for the automated generation of PDDL descriptions from integrated system and product models. The proposed workflow leverages Model-Based Systems Engineering (MBSE) to organize and manage system and product information, translating it automatically into PDDL syntax for planning purposes. By connecting system and product models with planning aspects, it ensures that changes in these models are quickly reflected in updated PDDL descriptions, facilitating efficient and adaptable planning processes. The workflow is validated within a use case from aircraft assembly.

[AI-24] EXPLAIN, AGREE, LEARN: Scaling Learning for Neural Probabilistic Logic

Link: https://arxiv.org/abs/2408.08133
Authors: Victor Verreet, Lennert De Smet, Luc De Raedt, Emanuele Sansone
Keywords: Neural probabilistic logic, logic systems follow, neural networks, probabilistic logic, follow the neuro-symbolic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Neural probabilistic logic systems follow the neuro-symbolic (NeSy) paradigm by combining the perceptive and learning capabilities of neural networks with the robustness of probabilistic logic. Learning corresponds to likelihood optimization of the neural networks. However, to obtain the likelihood exactly, expensive probabilistic logic inference is required. To scale learning to more complex systems, we therefore propose to instead optimize a sampling based objective. We prove that the objective has a bounded error with respect to the likelihood, which vanishes when increasing the sample count. Furthermore, the error vanishes faster by exploiting a new concept of sample diversity. We then develop the EXPLAIN, AGREE, LEARN (EXAL) method that uses this objective. EXPLAIN samples explanations for the data. AGREE reweighs each explanation in concordance with the neural component. LEARN uses the reweighed explanations as a signal for learning. In contrast to previous NeSy methods, EXAL can scale to larger problem sizes while retaining theoretical guarantees on the error. Experimentally, our theoretical claims are verified and EXAL outperforms recent NeSy methods when scaling up the MNIST addition and Warcraft pathfinding problems.

[AI-25] Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images

Link: https://arxiv.org/abs/2408.08105
Authors: Zhiyuan Li, Heng Wang, Dongnan Liu, Chaoyi Zhang, Ao Ma, Jieting Long, Weidong Cai
Keywords: Large Language Models, Vision Large Language, Language Models, Large Language, showcased exceptional ability
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:Large Language Models (LLMs) have showcased exceptional ability in causal reasoning from textual information. However, will these causalities remain straightforward for Vision Large Language Models (VLLMs) when only visual hints are provided? Motivated by this, we propose a novel Multimodal Causal Reasoning benchmark, namely MuCR, to challenge VLLMs to infer semantic cause-and-effect relationship when solely relying on visual cues such as action, appearance, clothing, and environment. Specifically, we introduce a prompt-driven image synthesis approach to create siamese images with embedded semantic causality and visual cues, which can effectively evaluate VLLMs’ causal reasoning capabilities. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess VLLMs’ comprehension abilities. Our extensive experiments reveal that the current state-of-the-art VLLMs are not as skilled at multimodal causal reasoning as we might have hoped. Furthermore, we perform a comprehensive analysis to understand these models’ shortcomings from different views and suggest directions for future research. We hope MuCR can serve as a valuable resource and foundational benchmark in multimodal causal reasoning research. The project is available at: this https URL

[AI-26] OC3D: Weakly Supervised Outdoor 3D Object Detection with Only Coarse Click Annotation

Link: https://arxiv.org/abs/2408.08092
Authors: Qiming Xia, Hongwei Lin, Wei Ye, Hai Wu, Yadan Luo, Shijia Zhao, Xin Li, Chenglu Wen
Keywords: received widespread attention, LiDAR-based outdoor, widespread attention, received widespread, point cloud
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract: LiDAR-based outdoor 3D object detection has received widespread attention. However, training 3D detectors from the LiDAR point cloud typically relies on expensive bounding box annotations. This paper presents OC3D, an innovative weakly supervised method requiring only coarse clicks on the bird’s-eye view of the 3D point cloud. A key challenge here is the absence of complete geometric descriptions of the target objects from such simple click annotations. To address this problem, our proposed OC3D adopts a two-stage strategy. In the first stage, we initially design a novel dynamic and static classification strategy and then propose the Click2Box and Click2Mask modules to generate box-level and mask-level pseudo-labels for static and dynamic instances, respectively. In the second stage, we design a Mask2Box module, leveraging the learning capabilities of neural networks to update mask-level pseudo-labels, which contain less information, to box-level pseudo-labels. Experimental results on the widely used KITTI and nuScenes datasets demonstrate that our OC3D with only coarse clicks achieves state-of-the-art performance compared to weakly-supervised 3D detection methods. Combining OC3D with a missing click mining strategy, we propose an OC3D++ pipeline, which requires only 0.2% annotation cost in the KITTI dataset to achieve performance comparable to fully supervised methods.

[AI-27] Agent Court: Simulating Court with Adversarial Evolvable Lawyer Agents

Link: https://arxiv.org/abs/2408.08089
Authors: Guhong Chen, Liyang Fan, Zihan Gong, Nan Xie, Zixuan Li, Ziqiang Liu, Chengming Li, Qiang Qu, Shiwen Ni, Min Yang
Keywords: simulation system called, system called AgentCourt, entire courtroom process, system called, lawyer agents
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

[AI-28] An Efficient Replay for Class-Incremental Learning with Pre-trained Models

Link: https://arxiv.org/abs/2408.08084
Authors: Weimin Yin, Bin Chen, Chunzhao Xie, Zhenhao Tan
Keywords: general class-incremental learning, class-incremental learning, tool to avoid, researchers typically, avoid catastrophic forgetting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: In general class-incremental learning, researchers typically use sample sets as a tool to avoid catastrophic forgetting during continuous learning. At the same time, researchers have also noted the differences between class-incremental learning and Oracle training and have attempted to make corrections. In recent years, researchers have begun to develop class-incremental learning algorithms utilizing pre-trained models, achieving significant results. This paper observes that in class-incremental learning, the steady state among the weights guided by each class center is disrupted, which is significantly correlated with catastrophic forgetting. Based on this, we propose a new method for overcoming forgetting. In some cases, by retaining only a single sample unit of each class in memory for replay and applying simple gradient constraints, very good results can be achieved. Experimental results indicate that under the condition of pre-trained models, our method can achieve competitive performance with very low computational cost and by simply using the cross-entropy loss.

[AI-29] Confidence-weighted integration of human and machine judgments for superior decision-making

Link: https://arxiv.org/abs/2408.08083
Authors: Felipe Yáñez, Xiaoliang Luo, Omar Valerio Minero, Bradley C. Love
Keywords: Large language models, Large language, language models, emerged as powerful, powerful tools
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Abstract:Large language models (LLMs) have emerged as powerful tools in various domains. Recent studies have shown that LLMs can surpass humans in certain tasks, such as predicting the outcomes of neuroscience studies. What role does this leave for humans in the overall decision process? One possibility is that humans, despite performing worse than LLMs, can still add value when teamed with them. A human and machine team can surpass each individual teammate when team members’ confidence is well-calibrated and team members diverge in which tasks they find difficult (i.e., calibration and diversity are needed). We simplified and extended a Bayesian approach to combining judgments using a logistic regression framework that integrates confidence-weighted judgments for any number of team members. Using this straightforward method, we demonstrated in a neuroscience forecasting task that, even when humans were inferior to LLMs, their combination with one or more LLMs consistently improved team performance. Our hope is that this simple and effective strategy for integrating the judgments of humans and machines will lead to productive collaborations.
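
A confidence-weighted combination of this kind can be pictured as a logistic model over signed confidences. The sketch below only shows the prediction step with made-up coefficients; in an actual logistic regression these would be fitted on past trials. The function name, the +1/-1 coding of the two options, and all numbers are assumptions and are not taken from the paper.

```python
import math

def team_probability(judgments, weights, bias=0.0):
    """Probability that option A is correct, given confidence-weighted votes.

    judgments: list of (choice, confidence), with choice +1 for A and -1 for B.
    weights: one coefficient per team member (hypothetical values here).
    """
    logit = bias + sum(w * choice * conf for w, (choice, conf) in zip(weights, judgments))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical team: a human (weight 1.0) and an LLM (weight 2.0).
weights = [1.0, 2.0]

# The human confidently picks B, the LLM moderately picks A: the team leans to A.
p_confident_llm = team_probability([(-1, 0.9), (+1, 0.6)], weights)

# Same human vote, but a less confident LLM: the team now leans to B.
p_unsure_llm = team_probability([(-1, 0.9), (+1, 0.4)], weights)
print(p_confident_llm, p_unsure_llm)
```

The second call shows why a weaker teammate can still matter: when the stronger member is unsure, a confident dissenting vote can flip the team decision.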

[AI-30] Treat Stillness with Movement: Remote Sensing Change Detection via Coarse-grained Temporal Foregrounds Mining

Link: https://arxiv.org/abs/2408.08078
Authors: Xixi Wang, Zitian Wang, Jingtao Jiang, Lan Chen, Xiao Wang, Bo Jiang
Keywords: Current works focus, Current works, Coarse-grained Temporal Mining, focus on addressing, Temporal Mining Augmented
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: In Peer Review

Abstract: Current works focus on addressing the remote sensing change detection task using bi-temporal images. Although good performance can be achieved, few of them consider motion cues, which may also be vital. In this work, we revisit the widely adopted bi-temporal images-based framework and propose a novel Coarse-grained Temporal Mining Augmented (CTMA) framework. To be specific, given the bi-temporal images, we first transform them into a video using interpolation operations. Then, a set of temporal encoders is adopted to extract the motion features from the obtained video for coarse-grained changed region prediction. Subsequently, we design a novel Coarse-grained Foregrounds Augmented Spatial Encoder module to integrate both global and local information. We also introduce a motion augmented strategy that leverages motion cues as an additional output to aggregate with the spatial features for improved results. Meanwhile, we feed the input image pairs into the ResNet to get the different features and also the spatial blocks for fine-grained feature learning. More importantly, we propose a mask augmented strategy that utilizes coarse-grained changed regions, incorporating them into the decoder blocks to enhance the final changed prediction. Extensive experiments conducted on multiple benchmark datasets fully validated the effectiveness of our proposed framework for remote sensing image change detection. The source code of this paper will be released on this https URL

[AI-31] A Survey on Integrated Sensing Communication and Computation

Link: https://arxiv.org/abs/2408.08074
Authors: Dingzhu Wen,Yong Zhou,Xiaoyang Li,Yuanming Shi,Kaibin Huang,Khaled B. Letaief
Keywords: traditional data-centric services, wireless technology, promises a revolutionary, forthcoming generation, generation of wireless
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:The forthcoming generation of wireless technology, 6G, promises a revolutionary leap beyond traditional data-centric services. It aims to usher in an era of ubiquitous intelligent services, where everything is interconnected and intelligent. This vision requires the seamless integration of three fundamental modules: Sensing for information acquisition, communication for information sharing, and computation for information processing and decision-making. These modules are intricately linked, especially in complex tasks such as edge learning and inference. However, the performance of these modules is interdependent, creating a resource competition for time, energy, and bandwidth. Existing techniques like integrated communication and computation (ICC), integrated sensing and computation (ISC), and integrated sensing and communication (ISAC) have made partial strides in addressing this challenge, but they fall short of meeting the extreme performance requirements. To overcome these limitations, it is essential to develop new techniques that comprehensively integrate sensing, communication, and computation. This integrated approach, known as Integrated Sensing, Communication, and Computation (ISCC), offers a systematic perspective for enhancing task performance. This paper begins with a comprehensive survey of historic and related techniques such as ICC, ISC, and ISAC, highlighting their strengths and limitations. It then explores the state-of-the-art signal designs for ISCC, along with network resource management strategies specifically tailored for ISCC. Furthermore, this paper discusses the exciting research opportunities that lie ahead for implementing ISCC in future advanced networks. By embracing ISCC, we can unlock the full potential of intelligent connectivity, paving the way for groundbreaking applications and services.

[AI-32] RAGChecker: A Fine-grained Framework for Diagnosing Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2408.08067
Authors: Dongyu Ru,Lin Qiu,Xiangkun Hu,Tianhang Zhang,Peng Shi,Shuaichen Chang,Jiayang Cheng,Cunxiang Wang,Shichao Sun,Huanyu Li,Zizhao Zhang,Binjie Wang,Jiarong Jiang,Tong He,Zhiguo Wang,Pengfei Liu,Yue Zhang,Zheng Zhang
Keywords: leveraging external knowledge, shown promising capability, RAG systems, external knowledge, reliability of measurements
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

[AI-33] Maximally Permissive Reward Machines ECAI

Link: https://arxiv.org/abs/2408.08059
Authors: Giovanni Varricchione,Natasha Alechina,Mehdi Dastani,Brian Logan
Keywords: temporally extended tasks, Reward machines, tasks and behaviors, temporally extended, extended tasks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Paper accepted for publication at the European Conference on Artificial Intelligence (ECAI) 2024

Abstract:Reward machines allow the definition of rewards for temporally extended tasks and behaviors. Specifying “informative” reward machines can be challenging. One way to address this is to generate reward machines from a high-level abstract description of the learning environment, using techniques such as AI planning. However, previous planning-based approaches generate a reward machine based on a single (sequential or partial-order) plan, and do not allow maximum flexibility to the learning agent. In this paper we propose a new approach to synthesising reward machines which is based on the set of partial order plans for a goal. We prove that learning using such “maximally permissive” reward machines results in higher rewards than learning using RMs based on a single plan. We present experimental results which support our theoretical claims by showing that our approach obtains higher rewards than the single-plan approach in practice.

[AI-34] Navigating Data Scarcity using Foundation Models: A Benchmark of Few-Shot and Zero-Shot Learning Approaches in Medical Imaging MICCAI2024

Link: https://arxiv.org/abs/2408.08058
Authors: Stefano Woerner,Christian F. Baumgartner
Keywords: major limiting factor, applying modern machine, modern machine learning, machine learning techniques, major limiting
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted as an oral presentation in MICCAI 2024 2nd International Workshop on Foundation Models for General Medical AI

Abstract:Data scarcity is a major limiting factor for applying modern machine learning techniques to clinical tasks. Although sufficient data exists for some well-studied medical tasks, there remains a long tail of clinically relevant tasks with poor data availability. Recently, numerous foundation models have demonstrated high suitability for few-shot learning (FSL) and zero-shot learning (ZSL), potentially making them more accessible to practitioners. However, it remains unclear which foundation model performs best on FSL medical image analysis tasks and what the optimal methods are for learning from limited data. We conducted a comprehensive benchmark study of ZSL and FSL using 16 pretrained foundation models on 19 diverse medical imaging datasets. Our results indicate that BiomedCLIP, a model pretrained exclusively on medical data, performs best on average for very small training set sizes, while very large CLIP models pretrained on LAION-2B perform best with slightly more training samples. However, simply fine-tuning a ResNet-18 pretrained on ImageNet performs similarly with more than five training examples per class. Our findings also highlight the need for further research on foundation models specifically tailored for medical applications and the collection of more datasets to train these models.

[AI-35] COTODE: COntinuous Trajectory neural Ordinary Differential Equations for modelling event sequences

Link: https://arxiv.org/abs/2408.08055
Authors: Ilya Kuleshov,Galina Boeva,Vladislav Zhuzhel,Evgenia Romanenkova,Evgeni Vorsin,Alexey Zaytsev
Keywords: generate event sequences, event sequences reveals, evolve continuously, Gaussian Process, sequences reveals
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Observation of the underlying actors that generate event sequences reveals that they often evolve continuously. Most modern methods, however, tend to model such processes through at most piecewise-continuous trajectories. To address this, we adopt a way of viewing events not as standalone phenomena but instead as observations of a Gaussian Process, which in turn governs the actor’s dynamics. We propose integrating these obtained dynamics, resulting in a continuous-trajectory modification of the widely successful Neural ODE model. Through Gaussian Process theory, we were able to evaluate the uncertainty in an actor’s representation, which arises from not observing them between events. This estimate led us to develop a novel, theoretically backed negative feedback mechanism. Empirical studies indicate that our model with Gaussian process interpolation and negative feedback achieves state-of-the-art performance, with improvements up to 20% AUROC against similar architectures.

[AI-36] Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework

Link: https://arxiv.org/abs/2408.08054
Authors: Changyu Du,Sebastian Esser,Stavros Nousias,André Borrmann
Keywords: typically requires designers, process typically requires, conventional BIM authoring, BIM authoring process, BIM authoring
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

[AI-37] The Clever Hans Effect in Unsupervised Learning

Link: https://arxiv.org/abs/2408.08041
Authors: Jacob Kauffmann,Jonas Dippel,Lukas Ruff,Wojciech Samek,Klaus-Robert Müller,Grégoire Montavon
Keywords: essential building block, essential building, building block, Unsupervised learning, so-called Clever Hans
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 12 pages + supplement

Abstract:Unsupervised learning has become an essential building block of AI systems. The representations it produces, e.g. in foundation models, are critical to a wide variety of downstream applications. It is therefore important to carefully examine unsupervised models to ensure not only that they produce accurate predictions, but also that these predictions are not “right for the wrong reasons”, the so-called Clever Hans (CH) effect. Using specially developed Explainable AI techniques, we show for the first time that CH effects are widespread in unsupervised learning. Our empirical findings are enriched by theoretical insights, which interestingly point to inductive biases in the unsupervised learning machine as a primary source of CH effects. Overall, our work sheds light on unexplored risks associated with practical applications of unsupervised learning and suggests ways to make unsupervised learning more robust.

[AI-38] Adaptive User Journeys in Pharma E-Commerce with Reinforcement Learning: Insights from SwipeRx KDD KDD2024

Link: https://arxiv.org/abs/2408.08024
Authors: Ana Fernández del Río,Michael Brennan Leong,Paulo Saraiva,Ivan Nazarov,Aditya Rastogi,Moiz Hassan,Dexian Tang,África Periáñez
Keywords: healthcare digital tools, reinforcement learning, tools through personalization, paper introduces, introduces a reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Presented at the Third Workshop on End-to-End Customer Journey Optimization at KDD 2024 (KDD CJ Workshop '24), August 26, Barcelona, Spain

Abstract:This paper introduces a reinforcement learning (RL) platform that enhances end-to-end user journeys in healthcare digital tools through personalization. We explore a case study with SwipeRx, the most popular all-in-one app for pharmacists in Southeast Asia, demonstrating how the platform can be used to personalize and adapt user experiences. Our RL framework is tested through a series of experiments with product recommendations tailored to each pharmacy based on real-time information on their purchasing history and in-app engagement, showing a significant increase in basket size. By integrating adaptive interventions into existing mobile health solutions and enriching user journeys, our platform offers a scalable solution to improve pharmaceutical supply chain management, health worker capacity building, and clinical decision and patient care, ultimately contributing to better healthcare outcomes.

[AI-39] Causal Discovery from Time-Series Data with Short-Term Invariance-Based Convolutional Neural Networks

Link: https://arxiv.org/abs/2408.08023
Authors: Rujia Shen,Boran Wang,Chao Zhao,Yi Guan,Jingchi Jiang
Keywords: time-series data, time-series data aims, Causal discovery, discovery from time-series, time-series data necessitates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Causal discovery from time-series data aims to capture both intra-slice (contemporaneous) and inter-slice (time-lagged) causality between variables within the temporal chain, which is crucial for various scientific disciplines. Compared to causal discovery from non-time-series data, causal discovery from time-series data necessitates more serialized samples with a larger amount of observed time steps. To address the challenges, we propose a novel gradient-based causal discovery approach STIC, which focuses on Short-Term Invariance using Convolutional neural networks to uncover the causal relationships from time-series data. Specifically, STIC leverages both the short-term time and mechanism invariance of causality within each window observation, which possesses the property of independence, to enhance sample efficiency. Furthermore, we construct two causal convolution kernels, which correspond to the short-term time and mechanism invariance respectively, to estimate the window causal graph. To demonstrate the necessity of convolutional neural networks for causal discovery from time-series data, we theoretically derive the equivalence between convolution and the underlying generative principle of time-series data under the assumption that the additive noise model is identifiable. Experimental evaluations conducted on both synthetic and fMRI benchmark datasets demonstrate that our STIC outperforms baselines significantly and achieves the state-of-the-art performance, particularly when the datasets contain a limited number of observed time steps. Code is available at this https URL.

[AI-40] DIVE: Towards Descriptive and Diverse Visual Commonsense Generation EMNLP2023

Link: https://arxiv.org/abs/2408.08021
Authors: Jun-Hyung Park,Hyuntae Park,Youjin Kang,Eojin Jeon,SangKeun Lee(Korea University)
Keywords: visual commonsense generation, visual commonsense, commonsense generation, generate commonsense inferences, commonsense
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 10 figures, EMNLP 2023 (main)

Abstract:Towards human-level visual understanding, visual commonsense generation has been introduced to generate commonsense inferences beyond images. However, current research on visual commonsense generation has overlooked an important human cognitive ability: generating descriptive and diverse inferences. In this work, we propose a novel visual commonsense generation framework, called DIVE, which aims to improve the descriptiveness and diversity of generated inferences. DIVE involves two methods, generic inference filtering and contrastive retrieval learning, which address the limitations of existing visual commonsense resources and training objectives. Experimental results verify that DIVE outperforms state-of-the-art models for visual commonsense generation in terms of both descriptiveness and diversity, while showing a superior quality in generating unique and novel inferences. Notably, DIVE achieves human-level descriptiveness and diversity on Visual Commonsense Graphs. Furthermore, human evaluations confirm that DIVE aligns closely with human judgments on descriptiveness and diversity. Our code and dataset are available at this https URL.

[AI-41] Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization

Link: https://arxiv.org/abs/2408.08019
Authors: Sang-Hoon Lee,Ha-Yeong Choi,Seong-Whan Lee
Keywords: flow matching optimization, flow matching, adversarial flow matching, high-efficient waveform generation, paper introduces PeriodWave-Turbo
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: 9 pages, 9 tables, 1 figure

Abstract:This paper introduces PeriodWave-Turbo, a high-fidelity and highly efficient waveform generation model via adversarial flow matching optimization. Recently, conditional flow matching (CFM) generative models have been successfully adopted for waveform generation tasks, leveraging a single vector field estimation objective for training. Although these models can generate high-fidelity waveform signals, they require significantly more ODE steps compared to GAN-based models, which only need a single generation step. Additionally, the generated samples often lack high-frequency information due to noisy vector field estimation, which fails to ensure high-frequency reproduction. To address this limitation, we enhance pre-trained CFM-based generative models by incorporating a fixed-step generator modification. We utilized reconstruction losses and adversarial feedback to accelerate high-fidelity waveform generation. Through adversarial flow matching optimization, it only requires 1,000 steps of fine-tuning to achieve state-of-the-art performance across various objective metrics. Moreover, we significantly reduce the number of inference steps from 16 to 2 or 4. Additionally, by scaling up the backbone of PeriodWave from 29M to 70M parameters for improved generalization, PeriodWave-Turbo achieves unprecedented performance, with a perceptual evaluation of speech quality (PESQ) score of 4.454 on the LibriTTS dataset. Audio samples, source code and checkpoints will be available at this https URL.

[AI-42] Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

Link: https://arxiv.org/abs/2408.08015
Authors: Shengyuan Ye,Liekang Zeng,Xiaowen Chu,Guoliang Xing,Xu Chen
Keywords: Deep Neural Network, On-device Deep Neural, On-device Deep, Neural Network, Deep Neural
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Accepted by The 30th Annual International Conference on Mobile Computing and Networking (MobiCom’24)

Abstract:On-device Deep Neural Network (DNN) training has been recognized as crucial for privacy-preserving machine learning at the edge. However, the intensive training workload and limited onboard computing resources pose significant challenges to the availability and efficiency of model training. While existing works address these challenges through native resource management optimization, we instead leverage our observation that edge environments usually comprise a rich set of accompanying trusted edge devices with idle resources beyond a single terminal. We propose Asteroid, a distributed edge training system that breaks the resource walls across heterogeneous edge devices for efficient model training acceleration. Asteroid adopts a hybrid pipeline parallelism to orchestrate distributed training, along with a judicious parallelism planning for maximizing throughput under certain resource constraints. Furthermore, a fault-tolerant yet lightweight pipeline replay mechanism is developed to tame the device-level dynamics for training robustness and performance stability. We implement Asteroid on heterogeneous edge devices with both vision and language models, demonstrating up to 12.2x faster training than conventional parallelism methods and 2.1x faster than state-of-the-art hybrid parallelism methods through evaluations. Furthermore, Asteroid can recover training pipeline 14x faster than baseline methods while preserving comparable throughput despite unexpected device exiting and failure.

[AI-43] IIU: Independent Inference Units for Knowledge-based Visual Question Answering

Link: https://arxiv.org/abs/2408.07989
Authors: Yili Li,Jing Yu,Keke Gai,Gang Xiong
Keywords: Knowledge-based visual question, visual question answering, question answering requires, answering requires external, requires external knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Knowledge-based visual question answering requires external knowledge beyond visible content to answer the question correctly. One limitation of existing methods is that they focus more on modeling the inter-modal and intra-modal correlations, which entangles complex multimodal clues by implicit embeddings and lacks interpretability and generalization ability. The key challenge to solve the above problem is to separate the information and process it separately at the functional level. By reusing each processing unit, the generalization ability of the model to deal with different data can be increased. In this paper, we propose Independent Inference Units (IIU) for fine-grained multi-modal reasoning to decompose intra-modal information by the functionally independent units. Specifically, IIU processes each semantic-specific intra-modal clue by an independent inference unit, which also collects complementary information by communication from different units. To further reduce the impact of redundant information, we propose a memory update module to maintain semantic-relevant memory along with the reasoning process gradually. In comparison with existing non-pretrained multi-modal reasoning models on standard datasets, our model achieves a new state-of-the-art, enhancing performance by 3%, and surpassing basic pretrained multi-modal models. The experimental results show that our IIU model is effective in disentangling intra-modal clues as well as reasoning units to provide explainable reasoning evidence. Our code is available at this https URL.

[AI-44] Analytical Uncertainty-Based Loss Weighting in Multi-Task Learning

Link: https://arxiv.org/abs/2408.07985
Authors: Lukas Kirchdorfer,Cathrin Elich,Simon Kutsche,Heiner Stuckenschmidt,Lukas Schott,Jan M. Köhler
Keywords: gained significant relevance, gained significant, significant relevance, neural network training, MTL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the rise of neural networks in various domains, multi-task learning (MTL) gained significant relevance. A key challenge in MTL is balancing individual task losses during neural network training to improve performance and efficiency through knowledge sharing across tasks. To address these challenges, we propose a novel task-weighting method by building on the most prevalent approach of Uncertainty Weighting and computing analytically optimal uncertainty-based weights, normalized by a softmax function with tunable temperature. Our approach yields comparable results to the combinatorially prohibitive, brute-force approach of Scalarization while offering a more cost-effective yet high-performing alternative. We conduct an extensive benchmark on various datasets and architectures. Our method consistently outperforms six other common weighting methods. Furthermore, we report noteworthy experimental findings for the practical application of MTL. For example, larger networks diminish the influence of weighting methods, and tuning the weight decay has a low impact compared to the learning rate.
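
The weighting rule described in the abstract can be sketched as follows. This assumes the common uncertainty-weighted objective, the sum over tasks of L_i / (2σ_i²) + log σ_i, which the abstract does not spell out. Minimizing that objective over σ_i gives σ_i² = L_i, so the analytic weight is proportional to 1 / L_i. The function name and the exact placement of the temperature are illustrative assumptions, not the authors' code.

```python
import math
from typing import List


def softmax_inverse_loss_weights(
    losses: List[float], temperature: float = 1.0
) -> List[float]:
    """Turn per-task losses into task weights that sum to 1.

    Assumption: the analytically optimal uncertainty weight of a task is
    proportional to the inverse of its current loss; the inverse losses
    are then normalized by a softmax with a tunable temperature.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if any(loss <= 0 for loss in losses):
        raise ValueError("losses must be positive")
    scores = [1.0 / (loss * temperature) for loss in losses]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Tasks with smaller losses receive larger weights.
weights = softmax_inverse_loss_weights([0.5, 2.0, 1.0], temperature=2.0)
```

A higher temperature flattens the weights toward uniform, while a lower one concentrates weight on the task with the smallest loss. In a training loop the loss values used here would normally be treated as constants, so that no gradient flows through the weights.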

[AI-45] ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models

Link: https://arxiv.org/abs/2408.07983
Authors: Faris Hijazi(1),Somayah AlHarbi(1),Abdulaziz AlHussein(1),Harethah Abu Shairah(2),Reem AlZahrani(2),Hebah AlShamlan(1),Omar Knio(2),George Turkiyyah(2) ((1) THIQAH, (2) KAUST)
Keywords: Large Language Models, natural language processing, Language Models, Large Language, advancements in Large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

[AI-46] Toward a Dialogue System Using a Large Language Model to Recognize User Emotions with a Camera

Link: https://arxiv.org/abs/2408.07982
Authors: Hiroki Tanioka,Tetsushi Ueta,Masahiko Sano
Keywords: call center operations, Artificial Intelligence agents, multimodal dialogue functions, performance of ChatGPT, improved tremendously
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 4 pages, 5 figures, 1 table, The 1st InterAI: Interactive AI for Human-Centered Robotics workshop in conjunction with IEEE Ro-MAN 2024, Pasadena, CA, USA, Aug. 2024

Abstract:The performance of ChatGPT© and other LLMs has improved tremendously, and in online environments, they are increasingly likely to be used in a wide variety of situations, such as ChatBot on web pages, call center operations using voice interaction, and dialogue functions using agents. In the offline environment, multimodal dialogue functions are also being realized, such as guidance by Artificial Intelligence agents (AI agents) using tablet terminals and dialogue systems in the form of LLMs mounted on robots. In this multimodal dialogue, mutual emotion recognition between the AI and the user will become important. So far, there have been methods for expressing emotions on the part of the AI agent or for recognizing them using textual or voice information of the user’s utterances, but methods for AI agents to recognize emotions from the user’s facial expressions have not been studied. In this study, we examined whether or not LLM-based AI agents can interact with users according to their emotional states by capturing the user in dialogue with a camera, recognizing emotions from facial expressions, and adding such emotion information to prompts. The results confirmed that AI agents can have conversations according to the emotional state for emotional states with relatively high scores, such as Happy and Angry.

[AI-47] LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning

Link: https://arxiv.org/abs/2408.07981
Authors: Jiajie Li,Garrett Skinner,Gene Yang,Brian R Quaranto,Steven D Schwaitzberg,Peter C W Kim,Jinjun Xiong
Keywords: achieved notable success, Multimodal large language, large language models, unimodal images, large language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Multimodal large language models (LLMs) have achieved notable success across various domains, while research in the medical field has largely focused on unimodal images. Meanwhile, current general-domain multimodal models for videos still lack the capabilities to understand and engage in conversations about surgical videos. One major contributing factor is the absence of datasets in the surgical field. In this paper, we create a new dataset, Surg-QA, consisting of 102,000 surgical video-instruction pairs, the largest of its kind so far. To build such a dataset, we propose a novel two-stage question-answer generation pipeline with LLM to learn surgical knowledge in a structured manner from the publicly available surgical lecture videos. The pipeline breaks down the generation process into two stages to significantly reduce the task complexity, allowing us to use a more affordable, locally deployed open-source LLM than the premium paid LLM services. It also mitigates the risk of LLM hallucinations during question-answer generation, thereby enhancing the overall quality of the generated data. We further train LLaVA-Surg, a novel vision-language conversational assistant capable of answering open-ended questions about surgical videos, on this Surg-QA dataset, and conduct comprehensive evaluations on zero-shot surgical video question-answering tasks. We show that LLaVA-Surg significantly outperforms all previous general-domain models, demonstrating exceptional multimodal conversational skills in answering open-ended questions about surgical videos. We will release our code, model, and the instruction-tuning dataset.

[AI-48] Meta SAC-Lag: Towards Deployable Safe Reinforcement Learning via MetaGradient-based Hyperparameter Tuning IROS

Link: https://arxiv.org/abs/2408.07962
Authors: Homayoun Honari,Amir Mehdi Soufi Enayati,Mehran Ghafarian Tamizi,Homayoun Najjaran
Keywords: Safe Reinforcement Learning, Reinforcement Learning, prevalently studied subcategories, Safe Reinforcement, Meta SAC-Lag
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: Main text accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2024, 10 pages, 4 figures, 3 tables

Abstract:Safe Reinforcement Learning (Safe RL) is one of the prevalently studied subcategories of trial-and-error-based methods with the intention to be deployed on real-world systems. In safe RL, the goal is to maximize reward performance while minimizing constraints, often achieved by setting bounds on constraint functions and utilizing the Lagrangian method. However, deploying Lagrangian-based safe RL in real-world scenarios is challenging due to the necessity of threshold fine-tuning, as imprecise adjustments may lead to suboptimal policy convergence. To mitigate this challenge, we propose a unified Lagrangian-based model-free architecture called Meta Soft Actor-Critic Lagrangian (Meta SAC-Lag). Meta SAC-Lag uses meta-gradient optimization to automatically update the safety-related hyperparameters. The proposed method is designed to address safe exploration and threshold adjustment with minimal hyperparameter tuning requirement. In our pipeline, the inner parameters are updated through the conventional formulation and the hyperparameters are adjusted using the meta-objectives which are defined based on the updated parameters. Our results show that the agent can reliably adjust the safety performance due to the relatively fast convergence rate of the safety threshold. We evaluate the performance of Meta SAC-Lag in five simulated environments against Lagrangian baselines, and the results demonstrate its capability to create synergy between parameters, yielding better or competitive results. Furthermore, we conduct a real-world experiment involving a robotic arm tasked with pouring coffee into a cup without spillage. Meta SAC-Lag is successfully trained to execute the task, while minimizing effort constraints.

[AI-49] RandomNet: Clustering Time Series Using Untrained Deep Neural Networks

Link: https://arxiv.org/abs/2408.07956
Authors: Xiaosheng Li,Wenjie Xi,Jessica Lin
Keywords: Neural networks, machine learning, time series, deep neural networks, data mining
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 10 figures

Abstract:Neural networks are widely used in machine learning and data mining. Typically, these networks need to be trained, implying the adjustment of weights (parameters) within the network based on the input data. In this work, we propose a novel approach, RandomNet, that employs untrained deep neural networks to cluster time series. RandomNet uses different sets of random weights to extract diverse representations of time series and then ensembles the clustering relationships derived from these different representations to build the final clustering results. By extracting diverse representations, our model can effectively handle time series with different characteristics. Since all parameters are randomly generated, no training is required during the process. We provide a theoretical analysis of the effectiveness of the method. To validate its performance, we conduct extensive experiments on all of the 128 datasets in the well-known UCR time series archive and perform statistical analysis of the results. These datasets have different sizes, sequence lengths, and they are from diverse fields. The experimental results show that the proposed method is competitive compared with existing state-of-the-art methods.

[AI-50] Solving a Rubik's Cube Using its Local Graph Structure

Link: https://arxiv.org/abs/2408.07945
Authors: Shunyu Yao,Mitchy Lee
Keywords: Rubik's Cube, single-player combination puzzle, reinforcement learning community, scrambled Rubik's Cube, combination puzzle attracting
Categories: Artificial Intelligence (cs.AI)

Abstract:The Rubik's Cube is a 3-dimensional single-player combination puzzle attracting attention in the reinforcement learning community. A Rubik's Cube has six faces and twelve possible actions, leading to a small and unconstrained action space and a very large state space with only one goal state. Modeling such a large state space and storing the information of each state requires exceptional computational resources, which makes it challenging to find the shortest solution to a scrambled Rubik's Cube with limited resources. The Rubik's Cube can be represented as a graph, where states of the cube are nodes and actions are edges. Drawing on graph convolutional networks, we design a new heuristic, weighted convolutional distance, for the A* search algorithm to find the solution to a scrambled Rubik's Cube. This heuristic utilizes the information of neighboring nodes and convolves them with attention-like weights, which creates a deeper search for the shortest path to the solved state.
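
The idea of an A* heuristic that mixes in information from neighboring nodes can be sketched on a toy graph. The abstract does not define the weighted convolutional distance, so `neighbor_weighted` below is a hypothetical stand-in that blends a node's base estimate with the mean estimate of its neighbors using a fixed `self_weight`; it is not the paper's heuristic and is not guaranteed to be admissible.

```python
import heapq
from typing import Callable, Dict, Hashable, List, Optional

Graph = Dict[Hashable, List[Hashable]]
Heuristic = Callable[[Hashable], float]


def neighbor_weighted(graph: Graph, base: Heuristic, self_weight: float = 0.5) -> Heuristic:
    """Blend a node's base estimate with the mean estimate of its neighbors."""
    def heuristic(node: Hashable) -> float:
        neighbors = graph[node]
        if not neighbors:
            return base(node)
        neighbor_mean = sum(base(n) for n in neighbors) / len(neighbors)
        return self_weight * base(node) + (1.0 - self_weight) * neighbor_mean
    return heuristic


def a_star(graph: Graph, start: Hashable, goal: Hashable, heuristic: Heuristic) -> Optional[List[Hashable]]:
    """Return a start-to-goal path over unit-cost edges, or None if unreachable."""
    counter = 0  # tie-breaker so the heap never has to compare nodes
    frontier = [(heuristic(start), counter, start)]
    best_cost = {start: 0}
    parent = {start: None}
    while frontier:
        _, _, node = heapq.heappop(frontier)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph[node]:
            cost = best_cost[node] + 1
            if cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = cost
                parent[nxt] = node
                counter += 1
                heapq.heappush(frontier, (cost + heuristic(nxt), counter, nxt))
    return None


# Toy state graph: six states on a ring, with state 0 as the solved state.
ring: Graph = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
ring_distance = lambda state: min(state, 6 - state)  # hypothetical base estimate
solution = a_star(ring, start=2, goal=0, heuristic=neighbor_weighted(ring, ring_distance))
```

For a real cube the graph is far too large to store, so neighbors would be generated on the fly by applying the twelve moves to the current state.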

[AI-51] Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning

链接: https://arxiv.org/abs/2408.07931
作者: Haofeng Liu,Erli Zhang,Junde Wu,Mingxuan Hong,Yueming Jin
关键词-EN: enhancing surgical quality, Surgical video segmentation, Surgical video, video segmentation, patient outcomes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Image and Video Processing (eess.IV)
*备注: 16 pages, 2 figures

点击查看摘要

Abstract:Surgical video segmentation is a critical task in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has shown superior advancements in image and video segmentation. However, SAM2 struggles with efficiency due to the high computational demands of processing high-resolution images and complex and long-range temporal dynamics in surgical videos. To address these challenges, we introduce Surgical SAM 2 (SurgSAM-2), an advanced model that combines SAM2 with an Efficient Frame Pruning (EFP) mechanism to facilitate real-time surgical video segmentation. The EFP mechanism dynamically manages the memory bank by selectively retaining only the most informative frames, reducing memory usage and computational cost while maintaining high segmentation accuracy. Our extensive experiments demonstrate that SurgSAM-2 significantly improves both efficiency and segmentation accuracy compared to the vanilla SAM2. Remarkably, SurgSAM-2 achieves 3× the FPS of SAM2, while also delivering state-of-the-art performance after fine-tuning with lower-resolution data. These advancements establish SurgSAM-2 as a leading model for surgical video analysis, making real-time surgical video segmentation in resource-constrained environments a feasible reality.

[AI-52] MAG-SQL: Multi-Agent Generative Approach with Soft Schema Linking and Iterative Sub-SQL Refinement for Text-to-SQL

链接: https://arxiv.org/abs/2408.07930
作者: Wenxuan Xie,Gaochen Wu,Bowen Zhou
关键词-EN: Recent In-Context Learning, In-Context Learning based, Learning based methods, achieved remarkable success, Recent In-Context
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 22 pages, 14 figures

点击查看摘要

[AI-53] CEGRL-TKGR: A Causal Enhanced Graph Representation Learning Framework for Improving Temporal Knowledge Graph Extrapolation Reasoning

链接: https://arxiv.org/abs/2408.07911
作者: Jinze Sun,Yongpan Sheng,Lirong He
关键词-EN: increasingly gaining attention, incomplete temporal knowledge, knowledge graph reasoning, inherently incomplete temporal, Temporal knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Temporal knowledge graph reasoning (TKGR) is increasingly gaining attention for its ability to extrapolate new events from historical data, thereby enriching the inherently incomplete temporal knowledge graphs. Existing graph-based representation learning frameworks have made significant strides in developing evolving representations for both entities and relational embeddings. Despite these achievements, there’s a notable tendency in these models to inadvertently learn biased data representations and mine spurious correlations, consequently failing to discern the causal relationships between events. This often leads to incorrect predictions based on these false correlations. To address this, we propose an innovative causal enhanced graph representation learning framework for TKGR (named CEGRL-TKGR). This framework introduces causal structures in graph-based representation learning to unveil the essential causal relationships between events, ultimately enhancing task performance. Specifically, we first disentangle the evolutionary representations of entities and relations in a temporal graph sequence into two distinct components, namely causal representations and confounding representations. Then, drawing on causal intervention theory, we advocate the utilization of causal representations for predictions, aiming to mitigate the effects of erroneous correlations caused by confounding features, thus achieving more robust and accurate predictions. Finally, extensive experimental results on six benchmark datasets demonstrate the superior performance of our model in the link prediction task.

[AI-54] KAN versus MLP on Irregular or Noisy Functions

链接: https://arxiv.org/abs/2408.07906
作者: Chen Zeng,Jiahui Wang,Haoran Shen,Qiao Wang
关键词-EN: Multi-Layer Perceptron, functions, Perceptron, KAN, noisy functions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In this paper, we compare the performance of Kolmogorov-Arnold Networks (KAN) and Multi-Layer Perceptron (MLP) networks on irregular or noisy functions. We control the number of parameters and the size of the training samples to ensure a fair comparison. For clarity, we categorize the functions into six types: regular functions, continuous functions with local non-differentiable points, functions with jump discontinuities, functions with singularities, functions with coherent oscillations, and noisy functions. Our experimental results indicate that KAN does not always perform best. For some types of functions, MLP outperforms or performs comparably to KAN. Furthermore, increasing the size of training samples can improve performance to some extent. When noise is added to functions, the irregular features are often obscured by the noise, making it challenging for both MLP and KAN to extract these features effectively. We hope these experiments provide valuable insights for future neural network research and encourage further investigations to overcome these challenges.

[AI-55] Assessing Language Models Worldview for Fiction Generation

链接: https://arxiv.org/abs/2408.07904
作者: Aisha Khatun,Daniel G. Brown
关键词-EN: Large Language Models, Large Language, Language Models, computational creativity, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Short paper

点击查看摘要

[AI-56] Quantum-inspired Interpretable Deep Learning Architecture for Text Sentiment Analysis

链接: https://arxiv.org/abs/2408.07891
作者: Bingyu Li,Da Zhang,Zhiyuan Zhao,Junyu Gao,Yuan Yuan
关键词-EN: social media, predominant form, form of communication, communication on social, Text
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Text has become the predominant form of communication on social media, embedding a wealth of emotional nuances. Consequently, the extraction of emotional information from text is of paramount importance. Despite previous research making some progress, existing text sentiment analysis models still face challenges in integrating diverse semantic information and lack interpretability. To address these issues, we propose a quantum-inspired deep learning architecture that combines fundamental principles of quantum mechanics (QM principles) with deep learning models for text sentiment analysis. Specifically, we analyze the commonalities between text representation and QM principles to design a quantum-inspired text representation method and further develop a quantum-inspired text embedding layer. Additionally, we design a feature extraction layer based on long short-term memory (LSTM) networks and self-attention mechanisms (SAMs). Finally, we calculate the text density matrix using the quantum complex numbers principle and apply 2D-convolution neural networks (CNNs) for feature condensation and dimensionality reduction. Through a series of visualization, comparative, and ablation experiments, we demonstrate that our model not only shows significant advantages in accuracy and efficiency compared to previous related models but also achieves a certain level of interpretability by integrating QM principles. Our code is available at QISA.

[AI-57] IReCa: Intrinsic Reward-enhanced Context-aware Reinforcement Learning for Human-AI Coordination

链接: https://arxiv.org/abs/2408.07877
作者: Xin Hao,Bahareh Nakisa,Mohmmad Naim Rastgoo,Richard Dazeley
关键词-EN: exhibit asymmetric behaviors, human-AI coordination scenarios, sparse rewards, human-AI coordination, rewards
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In human-AI coordination scenarios, human agents usually exhibit asymmetric behaviors that are significantly sparse and unpredictable compared to those of AI agents. These characteristics introduce two primary challenges to human-AI coordination: the effectiveness of obtaining sparse rewards and the efficiency of training the AI agents. To tackle these challenges, we propose an Intrinsic Reward-enhanced Context-aware (IReCa) reinforcement learning (RL) algorithm, which leverages intrinsic rewards to facilitate the acquisition of sparse rewards and utilizes environmental context to enhance training efficiency. Our IReCa RL algorithm introduces three unique features: (i) it encourages the exploration of sparse rewards by incorporating intrinsic rewards that supplement traditional extrinsic rewards from the environment; (ii) it improves the acquisition of sparse rewards by prioritizing the corresponding sparse state-action pairs; and (iii) it enhances the training efficiency by optimizing the exploration and exploitation through innovative context-aware weights of extrinsic and intrinsic rewards. Extensive simulations executed in the Overcooked layouts demonstrate that our IReCa RL algorithm can increase the accumulated rewards by approximately 20% and reduce the epochs required for convergence by approximately 67% compared to state-of-the-art baselines.

[AI-58] CON-FOLD – Explainable Machine Learning with Confidence

链接: https://arxiv.org/abs/2408.07854
作者: Lachlan McGinness,Peter Baumgartner
关键词-EN: explainable machine learning, training data, data to create, create a set, machine learning classification
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:FOLD-RM is an explainable machine learning classification algorithm that uses training data to create a set of classification rules. In this paper we introduce CON-FOLD which extends FOLD-RM in several ways. CON-FOLD assigns probability-based confidence scores to rules learned for a classification task. This allows users to know how confident they should be in a prediction made by the model. We present a confidence-based pruning algorithm that uses the unique structure of FOLD-RM rules to efficiently prune rules and prevent overfitting. Furthermore, CON-FOLD enables the user to provide pre-existing knowledge in the form of logic program rules that are either (fixed) background knowledge or (modifiable) initial rule candidates. The paper describes our method in detail and reports on practical experiments. We demonstrate the performance of the algorithm on benchmark datasets from the UCI Machine Learning Repository. For that, we introduce a new metric, Inverse Brier Score, to evaluate the accuracy of the produced confidence scores. Finally we apply this extension to a real world example that requires explainability: marking of student responses to a short answer question from the Australian Physics Olympiad.

[AI-59] Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

链接: https://arxiv.org/abs/2408.07852
作者: Jiri Hron,Laura Culp,Gamaleldin Elsayed,Rosanne Liu,Ben Adlam,Maxwell Bileschi,Bernd Bohnet,JD Co-Reyes,Noah Fiedel,C. Daniel Freeman,Izzeddin Gur,Kathleen Kenealy,Jaehoon Lee,Peter J. Liu,Gaurav Mishra,Igor Mordatch,Azade Nova,Roman Novak,Aaron Parisi,Jeffrey Pennington,Alex Rizkowsky,Isabelle Simpson,Hanie Sedghi,Jascha Sohl-dickstein,Kevin Swersky,Sharad Vikram,Tris Warkentin,Lechao Xiao,Kelvin Xu,Jasper Snoek,Simon Kornblith
关键词-EN: increased training budget, capabilities of language, training budget, fully understood, increased training
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Published at COLM 2024. 16 pages, 11 figures

点击查看摘要

[AI-60] SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition INTERSPEECH2024

链接: https://arxiv.org/abs/2408.07851
作者: Mohamed Osman,Daniel Z. Kaplan,Tamer Nadeem
关键词-EN: powerful self-supervised learning, made significant strides, self-supervised learning, made significant, significant strides
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at INTERSPEECH 2024

点击查看摘要

[AI-61] A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites

链接: https://arxiv.org/abs/2408.07846
作者: Andrea Lops,Fedelucio Narducci,Azzurra Ragone,Michelantonio Trizio,Claudio Bartolini
关键词-EN: software testing lifecycle, Large Language Models, ensuring software correctness, Unit tests, Unit tests represent
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Unit tests represent the most basic level of testing within the software testing lifecycle and are crucial to ensuring software correctness. Designing and creating unit tests is a costly and labor-intensive process that is ripe for automation. Recently, Large Language Models (LLMs) have been applied to various aspects of software development, including unit test generation. Although several empirical studies evaluating LLMs’ capabilities in test code generation exist, they primarily focus on simple scenarios, such as the straightforward generation of unit tests for individual methods. These evaluations often involve independent and small-scale test units, providing a limited view of LLMs’ performance in real-world software development scenarios. Moreover, previous studies do not approach the problem at a suitable scale for real-life applications. Generated unit tests are often evaluated via manual integration into the original projects, a process that limits the number of tests executed and reduces overall efficiency. To address these gaps, we have developed an approach for generating and evaluating test suites of more realistic complexity. Our approach focuses on class-level test code generation and automates the entire process from test generation to test assessment. In this work, we present AgoneTest: an automated system for generating test suites for Java projects and a comprehensive and principled methodology for evaluating the generated test suites. Starting from a state-of-the-art dataset (i.e., Methods2Test), we built a new dataset for comparing human-written tests with those generated by LLMs. Our key contributions include a scalable automated software system, a new dataset, and a detailed methodology for evaluating test quality.

[AI-62] Enhancing Equitable Access to AI in Housing and Homelessness System of Care through Federated Learning AAAI

链接: https://arxiv.org/abs/2408.07845
作者: Musa Taib,Jiajun Wu,Steve Drew,Geoffrey G. Messier
关键词-EN: System of Care, Homelessness System, people experiencing homelessness, experiencing homelessness, supportive housing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted at the 2024 AAAI/ACM Conference on AI, Ethics, and Society (AIES)

点击查看摘要

Abstract:The top priority of a Housing and Homelessness System of Care (HHSC) is to connect people experiencing homelessness to supportive housing. An HHSC typically consists of many agencies serving the same population. Information technology platforms differ in type and quality between agencies, so their data are usually isolated from one agency to another. Larger agencies may have sufficient data to train and test artificial intelligence (AI) tools but smaller agencies typically do not. To address this gap, we introduce a Federated Learning (FL) approach enabling all agencies to train a predictive model collaboratively without sharing their sensitive data. We demonstrate how FL can be used within an HHSC to provide all agencies equitable access to quality AI and further assist human decision-makers in the allocation of resources within HHSC. This is achieved while preserving the privacy of the people within the data by not sharing identifying information between agencies without their consent. Our experimental results using real-world HHSC data from Calgary, Alberta, demonstrate that our FL approach offers comparable performance with the idealized scenario of training the predictive model with data fully shared and linked between agencies.
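
The collaborative training described in the abstract above can be illustrated with a generic federated averaging loop. This is a minimal sketch under stated assumptions, not the paper's system: the model is plain linear regression, each "agency" is just an `(X, y)` pair, and the names `local_update` and `federated_average` are hypothetical. Only model weights leave each agency; the raw rows never do.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, steps=50):
    """Run a few gradient-descent steps of least squares on one agency's data."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(datasets, dim, rounds=20):
    """Average locally trained weights, weighted by each agency's sample count."""
    w_global = np.zeros(dim)
    total = sum(len(y) for _, y in datasets)
    for _ in range(rounds):
        # Each agency starts from the shared weights and trains on private data.
        local_ws = [local_update(w_global, X, y) for X, y in datasets]
        # The server only ever sees the resulting weight vectors.
        w_global = sum(len(y) / total * w for w, (_, y) in zip(local_ws, datasets))
    return w_global
```

Weighting by sample count keeps a small agency from dominating the shared model while still letting it benefit from the larger agencies' data.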

[AI-63] SustainDC – Benchmarking for Sustainable Data Center Control NEURIPS2024

链接: https://arxiv.org/abs/2408.07841
作者: Avisek Naug,Antonio Guillen,Ricardo Luna,Vineet Gundecha,Desik Rengarajan,Sahand Ghorbanpour,Sajad Mousavi,Ashwin Ramesh Babu,Dejan Markovikj,Lekhapriya D Kashyap,Soumyendu Sarkar
关键词-EN: Machine learning, consume significant amounts, massive data centers, computational demand, leading to massive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Under review at Advances in Neural Information Processing Systems 2024 (NeurIPS 2024)

点击查看摘要

Abstract:Machine learning has driven an exponential increase in computational demand, leading to massive data centers that consume significant amounts of energy and contribute to climate change. This makes sustainable data center control a priority. In this paper, we introduce SustainDC, a set of Python environments for benchmarking multi-agent reinforcement learning (MARL) algorithms for data centers (DC). SustainDC supports custom DC configurations and tasks such as workload scheduling, cooling optimization, and auxiliary battery management, with multiple agents managing these operations while accounting for the effects of each other. We evaluate various MARL algorithms on SustainDC, showing their performance across diverse DC designs, locations, weather conditions, grid carbon intensity, and workload requirements. Our results highlight significant opportunities for improvement of data center operations using MARL algorithms. Given the increasing use of DC due to AI, SustainDC provides a crucial platform for the development and benchmarking of advanced algorithms essential for achieving sustainable computing and addressing other heterogeneous real-world challenges.

[AI-64] ONSEP: A Novel Online Neural-Symbolic Framework for Event Prediction Based on Large Language Model ACL2024

链接: https://arxiv.org/abs/2408.07840
作者: Xuanqing Yu,Wangtao Sun,Jingwei Li,Kang Liu,Chengbao Liu,Jie Tan
关键词-EN: temporal knowledge graph, knowledge graph forecasting, temporal knowledge, graph forecasting, pivotal technique
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
*备注: 16 pages, ACL 2024 Findings

点击查看摘要

[AI-65] A Culturally-Aware Tool for Crowdworkers: Leveraging Chronemics to Support Diverse Work Styles

链接: https://arxiv.org/abs/2408.07838
作者: Carlos Toxtli,Christopher Curtis,Saiph Savage
关键词-EN: feature standardized interfaces, Crowdsourcing markets, expanding worldwide, negatively impacting, well-being and productivity
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 32 pages, 9 figures, Computer Supported Cooperative Work (CSCW) 2024

点击查看摘要

Abstract:Crowdsourcing markets are expanding worldwide, but often feature standardized interfaces that ignore the cultural diversity of their workers, negatively impacting their well-being and productivity. To transform these workplace dynamics, this paper proposes creating culturally-aware workplace tools, specifically designed to adapt to the cultural dimensions of monochronic and polychronic work styles. We illustrate this approach with “CultureFit,” a tool that we engineered based on extensive research in Chronemics and culture theories. To study and evaluate our tool in the real world, we conducted a field experiment with 55 workers from 24 different countries. Our field experiment revealed that CultureFit significantly improved the earnings of workers from cultural backgrounds often overlooked in design. Our study is among the pioneering efforts to examine culturally aware digital labor interventions. It also provides access to a dataset with over two million data points on culture and digital work, which can be leveraged for future research in this emerging field. The paper concludes by discussing the importance and future possibilities of incorporating cultural insights into the design of tools for digital labor.

[AI-66] SSRFlow: Semantic-aware Fusion with Spatial Temporal Re-embedding for Real-world Scene Flow

链接: https://arxiv.org/abs/2408.07825
作者: Zhiyang Lu,Qinghan Chen,Zhimin Yuan,Ming Cheng
关键词-EN: dynamic scene perception, contemporary scene flow, Scene flow, scene flow methods, scene perception
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 19 pages,12 figures. arXiv admin note: substantial text overlap with arXiv:2403.07032

点击查看摘要

Abstract:Scene flow, which provides the 3D motion field of the first frame from two consecutive point clouds, is vital for dynamic scene perception. However, contemporary scene flow methods face three major challenges. Firstly, they lack global flow embedding or only consider the context of individual point clouds before embedding, leading to embedded points struggling to perceive the consistent semantic relationship of another frame. To address this issue, we propose a novel approach called Dual Cross Attentive (DCA) for the latent fusion and alignment between two frames based on semantic contexts. This is then integrated into Global Fusion Flow Embedding (GF) to initialize flow embedding based on global correlations in both contextual and Euclidean spaces. Secondly, deformations exist in non-rigid objects after the warping layer, which distorts the spatiotemporal relation between the consecutive frames. For a more precise estimation of residual flow at next-level, the Spatial Temporal Re-embedding (STR) module is devised to update the point sequence features at current-level. Lastly, poor generalization is often observed due to the significant domain gap between synthetic and LiDAR-scanned datasets. We leverage novel domain adaptive losses to effectively bridge the gap of motion inference from synthetic to real-world. Experiments demonstrate that our approach achieves state-of-the-art (SOTA) performance across various datasets, with particularly outstanding results in real-world LiDAR-scanned situations. Our code will be released upon publication.

[AI-67] An Efficient and Explanatory Image and Text Clustering System with Multimodal Autoencoder Architecture

链接: https://arxiv.org/abs/2408.07791
作者: Tiancheng Shi,Yuanchen Wei,John R. Kender
关键词-EN: Convolutional-Recurrent Variational Autoencoder, LLM interpreters, international news event, Autoencoders and LLM, demonstrate the efficiencies
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We demonstrate the efficiencies and explanatory abilities of extensions to the common tools of Autoencoders and LLM interpreters, in the novel context of comparing different cultural approaches to the same international news event. We develop a new Convolutional-Recurrent Variational Autoencoder (CRVAE) model that extends the modalities of previous CVAE models, by using fully-connected latent layers to embed in parallel the CNN encodings of video frames, together with the LSTM encodings of their related text derived from audio. We incorporate the model within a larger system that includes frame-caption alignment, latent space vector clustering, and a novel LLM-based cluster interpreter. We measure, tune, and apply this system to the task of summarizing a video into three to five thematic clusters, with each theme described by ten LLM-produced phrases. We apply this system to two news topics, COVID-19 and the Winter Olympics, and five other topics are in progress.

[AI-68] On learning capacities of Sugeno integrals with systems of fuzzy relational equations

链接: https://arxiv.org/abs/2408.07768
作者: Ismaïl Baaj
关键词-EN: underlying a Sugeno, Sugeno integral, training data based, training data, fuzzy relational equations
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this article, we introduce a method for learning a capacity underlying a Sugeno integral according to training data based on systems of fuzzy relational equations. To the training data, we associate two systems of equations: a max-min system and a min-max system. By solving these two systems (in the case that they are consistent) using Sanchez’s results, we show that we can directly obtain the extremal capacities representing the training data. By reducing the max-min (resp. min-max) system of equations to subsets of criteria of cardinality less than or equal to q (resp. of cardinality greater than or equal to n-q), where n is the number of criteria, we give a sufficient condition for deducing, from its potential greatest solution (resp. its potential lowest solution), a q-maxitive (resp. q-minitive) capacity. Finally, if these two reduced systems of equations are inconsistent, we show how to obtain the greatest approximate q-maxitive capacity and the lowest approximate q-minitive capacity, using recent results to handle the inconsistency of systems of fuzzy relational equations.
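
For readers unfamiliar with max-min systems, the classical result attributed to Sanchez can be sketched as follows: for a system A ∘ x = b with max-min composition, the candidate greatest solution is obtained with the Gödel implication, and the system is consistent exactly when that candidate solves it. The sketch below covers only this generic step. How the paper builds the matrix from training data and capacities is not shown, and the function names are hypothetical.

```python
import numpy as np

def godel_implication(a, b):
    """Gödel implication: 1 where a <= b, otherwise b."""
    return np.where(a <= b, 1.0, b)

def max_min_compose(A, x):
    """(A o x)_i = max_j min(A[i, j], x[j])."""
    return np.minimum(A, x[None, :]).max(axis=1)

def greatest_solution(A, b):
    """Return the candidate greatest solution of A o x = b and a consistency flag."""
    # x_hat[j] = min over rows i of (A[i, j] -> b[i]).
    x_hat = godel_implication(A, b[:, None]).min(axis=0)
    consistent = np.allclose(max_min_compose(A, x_hat), b)
    return x_hat, consistent
```

For example, with `A = [[0.8, 0.3], [0.5, 0.9]]` and `b = [0.5, 0.5]` the candidate is `[0.5, 0.5]` and the system is consistent, whereas `A = [[0.2]]`, `b = [0.5]` has no solution because min(0.2, x) can never reach 0.5.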

[AI-69] Enhancing Model Interpretability with Local Attribution over Global Exploration

链接: https://arxiv.org/abs/2408.07736
作者: Zhiyu Zhu,Zhibo Jin,Jiayu Zhang,Huaming Chen
关键词-EN: black boxes’ due, artificial intelligence, black boxes’, internal mechanisms, field of artificial
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by ACMMM 2024

点击查看摘要

Abstract:In the field of artificial intelligence, AI models are frequently described as ‘black boxes’ due to the obscurity of their internal mechanisms. This has ignited research interest in model interpretability, especially in attribution methods that offer precise explanations of model decisions. Current attribution algorithms typically evaluate the importance of each parameter by exploring the sample space. A large number of intermediate states are introduced during the exploration process, which may reach the model’s Out-of-Distribution (OOD) space. Such intermediate states will impact the attribution results, making it challenging to grasp the relative importance of features. In this paper, we first define the local space and its relevant properties, and we propose the Local Attribution (LA) algorithm that leverages these properties. The LA algorithm comprises both targeted and untargeted exploration phases, which are designed to effectively generate intermediate states for attribution that thoroughly encompass the local space. Compared to the state-of-the-art attribution methods, our approach achieves an average improvement of 38.21% in attribution effectiveness. Extensive ablation studies in our experiments also validate the significance of each component in our algorithm. Our code is available at: this https URL

[AI-70] Graph neural network surrogate for strategic transport planning

链接: https://arxiv.org/abs/2408.07726
作者: Nikita Makarov,Santhanakrishnan Narayanan,Constantinos Antoniou
关键词-EN: urban environments continue, Graph Neural Network, Graph Attention Network, advanced Graph Neural, graph convolution networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As the complexities of urban environments continue to grow, the modelling of transportation systems becomes increasingly challenging. This paper explores the application of advanced Graph Neural Network (GNN) architectures as surrogate models for strategic transport planning. Building upon prior work that laid the foundation with graph convolution networks (GCN), our study compares the established GCN with the more expressive Graph Attention Network (GAT). Additionally, we propose a novel GAT variant (namely GATv3) to address over-smoothing issues in graph-based models. Our investigation also includes the exploration of a hybrid model combining both GCN and GAT architectures, aiming to investigate the performance of the mixture. The three models are applied to various experiments to understand their limits. We analyse hierarchical regression setups, combining classification and regression tasks, and introduce fine-grained classification with a proposal of a method to convert outputs to precise values. Results reveal the superior performance of the new GAT in classification tasks. To the best of the authors’ knowledge, this is the first GAT model in the literature to achieve larger depths. Surprisingly, the fine-grained classification task demonstrates the GCN’s unexpected dominance with additional training data. This shows that synthetic data generators can increase the training data without overfitting issues whilst improving model performance. In conclusion, this research advances GNN-based surrogate modelling, providing insights for refining GNN architectures. The findings open avenues for investigating the potential of the newly proposed GAT architecture and the modelling setups for other transportation problems.

[AI-71] Re-Thinking Process Mining in the AI-Based Agents Era

链接: https://arxiv.org/abs/2408.07720
作者: Alessandro Berti,Mayssa Maatallah,Urszula Jessen,Michal Sroka,Sonia Ayachi Ghannouchi
关键词-EN: Large Language Models, Large Language, Language Models, powerful conversational interfaces, shown promising results
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as powerful conversational interfaces, and their application in process mining (PM) tasks has shown promising results. However, state-of-the-art LLMs struggle with complex scenarios that demand advanced reasoning capabilities. In the literature, two primary approaches have been proposed for implementing PM using LLMs: providing textual insights based on a textual abstraction of the process mining artifact, and generating code executable on the original artifact. This paper proposes utilizing the AI-Based Agents Workflow (AgWf) paradigm to enhance the effectiveness of PM on LLMs. This approach allows for: i) the decomposition of complex tasks into simpler workflows, and ii) the integration of deterministic tools with the domain knowledge of LLMs. We examine various implementations of AgWf and the types of AI-based tasks involved. Additionally, we discuss the CrewAI implementation framework and present examples related to process mining.

[AI-72] Operator Feature Neural Network for Symbolic Regression

链接: https://arxiv.org/abs/2408.07719
作者: Yusong Deng,Min Wu,Lina Yu,Jingyi Liu,Shu Wei,Yanjie Li,Weijun Li
关键词-EN: generally involving skeleton, involving skeleton prediction, Symbolic regression, generally involving, constant optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages

点击查看摘要

Abstract:Symbolic regression is a task aimed at identifying patterns in data and representing them through mathematical expressions, generally involving skeleton prediction and constant optimization. Many methods have achieved some success; however, they treat variables and symbols merely as characters of natural language without considering their mathematical essence. This paper introduces the operator feature neural network (OF-Net), which employs operator representation for expressions and proposes an implicit feature encoding method for the intrinsic mathematical operational logic of operators. By substituting operator features for numeric loss, we can predict the combination of operators of target expressions. We evaluate the model on public datasets, and the results demonstrate that the model achieves superior recovery rates and high R^2 scores. In discussing the results, we analyze the merits and demerits of OF-Net and propose optimization schemes.

[AI-73] Impact of Inaccurate Contamination Ratio on Robust Unsupervised Anomaly Detection NEURIPS2024

Link: https://arxiv.org/abs/2408.07718
Authors: Jordan F. Masakuna, DJeff Kanda Nkashama, Arian Soltani, Marc Frappier, Pierre-Martin Tardif, Froduald Kabanza
Keywords (EN): unsupervised anomaly detection, Training data sets, robust unsupervised anomaly, unsupervised anomaly, significantly undermines model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This is an accepted extended abstract at Black in AI Workshop which will be co-located with NeurIPS 2024 in Canada

Abstract:Training data sets intended for unsupervised anomaly detection, typically presumed to be anomaly-free, often contain anomalies (or contamination), a challenge that significantly undermines model performance. Most robust unsupervised anomaly detection models rely on contamination ratio information to tackle contamination. However, in reality, the contamination ratio may be inaccurate. We investigate the impact of inaccurate contamination ratio information on robust unsupervised anomaly detection and verify whether such models are resilient to misinformed contamination ratios. Our investigation on 6 benchmark data sets reveals that such models are not adversely affected by exposure to misinformation. In fact, they can exhibit improved performance when provided with such inaccurate contamination ratios.

[AI-74] An Introduction to Reinforcement Learning: Fundamental Concepts and Practical Applications

Link: https://arxiv.org/abs/2408.07712
Authors: Majid Ghasemi, Amir Hossein Moosavi, Ibrahim Sorkhoh, Anjali Agrawal, Fadi Alzhouri, Dariush Ebrahimi
Keywords (EN): Artificial Intelligence, maximize cumulative rewards, branch of Artificial, Reinforcement Learning, focuses on training
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Reinforcement Learning (RL) is a branch of Artificial Intelligence (AI) which focuses on training agents to make decisions by interacting with their environment to maximize cumulative rewards. An overview of RL is provided in this paper, which discusses its core concepts, methodologies, recent trends, and resources for learning. We provide a detailed explanation of key components of RL such as states, actions, policies, and reward signals so that the reader can build a foundational understanding. The paper also provides examples of various RL algorithms, including model-free and model-based methods. In addition, RL algorithms are introduced and resources for learning and implementing them are provided, such as books, courses, and online communities. This paper offers a comprehensive yet simple introduction for beginners, giving a structured and clear pathway for acquiring and implementing real-time techniques.

[AI-75] Enhancing Supply Chain Visibility with Knowledge Graphs and Large Language Models

Link: https://arxiv.org/abs/2408.07705
Authors: Sara AlMahri, Liming Xu, Alexandra Brintrup
Keywords (EN): today globalized economy, supply chain, Large Language Models, comprehensive supply chain, supply
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[AI-76] Empathic Responding for Digital Interpersonal Emotion Regulation via Content Recommendation

Link: https://arxiv.org/abs/2408.07704
Authors: Akriti Verma, Shama Islam, Valeh Moghaddam, Adnan Anwar, Sharon Horwood
Keywords (EN): Interpersonal communication plays, emotion regulation, Interpersonal Emotion Regulation, managing people emotions, Digital Emotion Regulation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Interpersonal communication plays a key role in managing people’s emotions, especially on digital platforms. Studies have shown that people use social media and consume online content to regulate their emotions and find support for rest and recovery. However, these platforms are not designed for emotion regulation, which limits their effectiveness in this regard. To address this issue, we propose an approach to enhance Interpersonal Emotion Regulation (IER) on online platforms through content recommendation. The objective is to empower users to regulate their emotions while actively or passively engaging in online platforms by crafting media content that aligns with IER strategies, particularly empathic responding. The proposed recommendation system is expected to blend system-initiated and user-initiated emotion regulation, paving the way for real-time IER practices on digital media platforms. To assess the efficacy of this approach, a mixed-method research design is used, including the analysis of text-based social media data and a user survey. Digital applications have served as facilitators in this process, given the widespread recognition of digital media applications for Digital Emotion Regulation (DER). The study collects 37.5K instances of user posts and interactions on Reddit over a year to design a Contextual Multi-Armed Bandits (CMAB) based recommendation system using features from user activity and preferences. The experimentation shows that the empathic recommendations generated by the proposed recommendation system are preferred by users over widely accepted ER strategies such as distraction and avoidance.

[AI-77] Natural Language Outlines for Code: Literate Programming in the LLM Era

Link: https://arxiv.org/abs/2408.04820
Authors: Kensen Shi, Deniz Altınbüken, Saswat Anand, Mihai Christodorescu, Katja Grünwedel, Alexa Koenings, Sai Naidu, Anurag Pathak, Marc Rasi, Fredde Ribeiro, Brandon Ruffin, Siddhant Sanyam, Maxim Tabachnyk, Sara Toth, Roy Tu, Tobias Welp, Pengcheng Yin, Manzil Zaheer, Satish Chandra, Charles Sutton
Keywords (EN): software development process, natural language outlines, development process, natural language, modality and interaction
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

[AI-78] SPEED: Scalable Preprocessing of EEG Data for Self-Supervised Learning

Link: https://arxiv.org/abs/2408.08065
Authors: Anders Gjølbye, Lina Skerath, William Lehn-Schiøler, Nicolas Langer, Lars Kai Hansen
Keywords (EN): narrowly defined objectives, research typically focuses, defined objectives, larger models, range of applications
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: To appear in proceedings of 2024 IEEE International workshop on Machine Learning for Signal Processing

Abstract:Electroencephalography (EEG) research typically focuses on tasks with narrowly defined objectives, but recent studies are expanding into the use of unlabeled data within larger models, aiming for a broader range of applications. This addresses a critical challenge in EEG research. For example, Kostas et al. (2021) show that self-supervised learning (SSL) outperforms traditional supervised methods. Given the high noise levels in EEG data, we argue that further improvements are possible with additional preprocessing. Current preprocessing methods often fail to efficiently manage the large data volumes required for SSL, due to their lack of optimization, reliance on subjective manual corrections, and validation processes or inflexible protocols that limit SSL. We propose a Python-based EEG preprocessing pipeline optimized for self-supervised learning, designed to efficiently process large-scale data. This optimization not only stabilizes self-supervised training but also enhances performance on downstream tasks compared to training with raw data.

[AI-79] Conditional Brownian Bridge Diffusion Model for VHR SAR to Optical Image Translation

Link: https://arxiv.org/abs/2408.07947
Authors: Seon-Hoon Kim, Dae-won Chung
Keywords (EN): Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, imaging technology, conditions and time
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures, 1 table

Abstract:Synthetic Aperture Radar (SAR) imaging technology provides the unique advantage of being able to collect data regardless of weather conditions and time. However, SAR images exhibit complex backscatter patterns and speckle noise, which necessitate expertise for interpretation. To deal with this challenge, research has been conducted on translating SAR images into optical-like representations to aid the interpretation of SAR data. Nevertheless, existing studies have predominantly utilized low-resolution satellite imagery datasets and have largely been based on Generative Adversarial Networks (GANs), which are known for their training instability and low fidelity. To overcome these limitations of low-resolution data usage and GAN-based approaches, this paper introduces a conditional image-to-image translation approach based on the Brownian Bridge Diffusion Model (BBDM). We conducted comprehensive experiments on the MSAW dataset, a collection of paired SAR and optical 0.5m Very-High-Resolution (VHR) images. The experimental results indicate that our method surpasses both the Conditional Diffusion Model (CDM) and the GAN-based models in diverse perceptual quality metrics.

[AI-80] Exploration of LLMs, EEG, and behavioral data to measure and support attention and sleep

Link: https://arxiv.org/abs/2408.07822
Authors: Akane Sano, Judith Amores, Mary Czerwinski
Keywords (EN): large language models, sleep quality based, massive textual data, sleep improvement suggestions, pre-trained models
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:We explore the application of large language models (LLMs), models pre-trained on massive textual data, for detecting and improving these altered states. We investigate the use of LLMs to estimate attention states, sleep stages, and sleep quality and to generate sleep improvement suggestions and adaptive guided imagery scripts based on electroencephalogram (EEG) and physical activity data (e.g. waveforms, power spectrogram images, numerical features). Our results show that LLMs can estimate sleep quality based on human textual behavioral features and provide personalized sleep improvement suggestions and guided imagery scripts; however, detecting attention, sleep stages, and sleep quality based on EEG and activity data requires further training data and domain-specific knowledge.

Computer Vision

[CV-0] Can Large Language Models Understand Symbolic Graphics Programs?

Link: https://arxiv.org/abs/2408.08313
Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
Keywords (EN): symbolic graphics programs, Assessing the capabilities, symbolic graphics, graphics content, graphics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report v1 (44 pages, 23 figures, project page: this https URL )

[CV-1] Understanding the Local Geometry of Generative Model Manifolds

Link: https://arxiv.org/abs/2408.08307
Authors: Ahmed Imtiaz Humayun, Ibtihel Amara, Candice Schumann, Golnoosh Farnadi, Negar Rostamzadeh, Mohammad Havaei
Keywords (EN): Deep generative models, Fréchet Inception Distance, complex data manifolds, generative models learn, Deep generative
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Pre-print. 11 pages main, 8 pages app., 28 figures

Abstract:Deep generative models learn continuous representations of complex data manifolds using a finite number of samples during training. For a pre-trained generative model, the common way to evaluate the quality of the learned manifold representation is to compute global metrics like Fréchet Inception Distance using a large number of generated and real samples. However, generative model performance is not uniform across the learned manifold, e.g., for foundation models like Stable Diffusion, generation performance can vary significantly based on the conditioning or initial noise vector being denoised. In this paper we study the relationship between the local geometry of the learned manifold and downstream generation. Based on the theory of continuous piecewise-linear (CPWL) generators, we use three geometric descriptors - scaling (ψ), rank (ν), and complexity (δ) - to characterize a pre-trained generative model manifold locally. We provide quantitative and qualitative evidence showing that for a given latent, the local descriptors are correlated with generation aesthetics, artifacts, uncertainty, and even memorization. Finally we demonstrate that training a reward model on the local geometry can allow controlling the likelihood of a generated sample under the learned distribution.

[CV-2] Towards Flexible Visual Relationship Segmentation

Link: https://arxiv.org/abs/2408.08305
Authors: Fangrui Zhu, Jianwei Yang, Huaizu Jiang
Keywords (EN): scene graph generation, human-object interaction, scene graph, graph generation, studied separately
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Visual relationship understanding has been studied separately in human-object interaction (HOI) detection, scene graph generation (SGG), and referring relationships (RR) tasks. Given the complexity and interconnectedness of these tasks, it is crucial to have a flexible framework that can effectively address these tasks in a cohesive manner. In this work, we propose FleVRS, a single model that seamlessly integrates the above three aspects in standard and promptable visual relationship segmentation, and further possesses the capability for open-vocabulary segmentation to adapt to novel scenarios. FleVRS leverages the synergy between text and image modalities to ground various types of relationships from images, and uses textual features from vision-language models for visual conceptual understanding. Empirical validation across various datasets demonstrates that our framework outperforms existing models in standard, promptable, and open-vocabulary tasks, e.g., +1.9 mAP on HICO-DET, +11.4 Acc on VRD, +4.7 mAP on unseen HICO-DET. Our FleVRS represents a significant step towards a more intuitive, comprehensive, and scalable understanding of visual relationships.

[CV-3] SLCA: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training ICCV23

[CV-4] HeightLane: BEV Heightmap guided 3D Lane Detection

Link: https://arxiv.org/abs/2408.08270
Authors: Chaesong Park, Eunbin Seo, Jongwoo Lim
Keywords (EN): presents significant challenges, significant challenges due, imperfect ground modeling, images presents significant, monocular images presents
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, 5 tables

Abstract:Accurate 3D lane detection from monocular images presents significant challenges due to depth ambiguity and imperfect ground modeling. Previous attempts to model the ground have often used a planar ground assumption with limited degrees of freedom, making them unsuitable for complex road environments with varying slopes. Our study introduces HeightLane, an innovative method that predicts a height map from monocular images by creating anchors based on a multi-slope assumption. This approach provides a detailed and accurate representation of the ground. HeightLane employs the predicted heightmap along with a deformable attention-based spatial feature transform framework to efficiently convert 2D image features into 3D bird’s eye view (BEV) features, enhancing spatial understanding and lane structure recognition. Additionally, the heightmap is used for the positional encoding of BEV features, further improving their spatial accuracy. This explicit view transformation bridges the gap between front-view perceptions and spatially accurate BEV representations, significantly improving detection performance. To address the lack of the necessary ground truth (GT) height map in the original OpenLane dataset, we leverage the Waymo dataset and accumulate its LiDAR data to generate a height map for the drivable area of each scene. The GT heightmaps are used to train the heightmap extraction module from monocular images. Extensive experiments on the OpenLane validation set show that HeightLane achieves state-of-the-art performance in terms of F-score, highlighting its potential in real-world applications.

[CV-5] Snuffy: Efficient Whole Slide Image Classifier ECCV2024

[CV-6] Computer Vision Model Compression Techniques for Embedded Systems: A Survey

Link: https://arxiv.org/abs/2408.08250
Authors: Alexandre Lopes, Fernando Pereira dos Santos, Diulhio de Oliveira, Mauricio Schiezaro, Helio Pedrini
Keywords (EN): Convolutional Neural Networks, Deep neural networks, neural networks, computer vision problems, consistently represented
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep neural networks have consistently represented the state of the art in most computer vision problems. In these scenarios, larger and more complex models have demonstrated superior performance to smaller architectures, especially when trained with plenty of representative data. With the recent adoption of Vision Transformer (ViT) based architectures and advanced Convolutional Neural Networks (CNNs), the total number of parameters of leading backbone architectures increased from 62M parameters in 2012 with AlexNet to 7B parameters in 2024 with AIM-7B. Consequently, deploying such deep architectures faces challenges in environments with processing and runtime constraints, particularly in embedded systems. This paper covers the main model compression techniques applied for computer vision tasks, enabling modern models to be used in embedded systems. We present the characteristics of compression subareas, compare different approaches, and discuss how to choose the best technique and expected variations when analyzing it on various embedded devices. We also share codes to assist researchers and new practitioners in overcoming initial implementation challenges for each subarea and present trends for Model Compression. Case studies for compression models are available at this https URL.

[CV-7] Comparative Evaluation of 3D Reconstruction Methods for Object Pose Estimation

Link: https://arxiv.org/abs/2408.08234
Authors: Varun Burde, Assia Benbihi, Pavel Burget, Torsten Sattler
Keywords (EN): involving robotic manipulation, industrial applications involving, applications involving robotic, pose estimation, Object pose estimation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Object pose estimation is essential to many industrial applications involving robotic manipulation, navigation, and augmented reality. Current generalizable object pose estimators, i.e., approaches that do not need to be trained per object, rely on accurate 3D models. Predominantly, CAD models are used, which can be hard to obtain in practice. At the same time, it is often possible to acquire images of an object. Naturally, this leads to the question whether 3D models reconstructed from images are sufficient to facilitate accurate object pose estimation. We aim to answer this question by proposing a novel benchmark for measuring the impact of 3D reconstruction quality on pose estimation accuracy. Our benchmark provides calibrated images for object reconstruction registered with the test images of the YCB-V dataset for pose evaluation under the BOP benchmark format. Detailed experiments with multiple state-of-the-art 3D reconstruction and object pose estimation approaches show that the geometry produced by modern reconstruction methods is often sufficient for accurate pose estimation. Our experiments lead to interesting observations: (1) Standard metrics for measuring 3D reconstruction quality are not necessarily indicative of pose estimation accuracy, which shows the need for dedicated benchmarks such as ours. (2) Classical, non-learning-based approaches can perform on par with modern learning-based reconstruction techniques and can even offer a better reconstruction time-pose accuracy tradeoff. (3) There is still a sizable gap between performance with reconstructed and with CAD models. To foster research on closing this gap, our benchmark is publicly available at this https URL.

[CV-8] The Dawn of KAN in Image-to-Image (I2I) Translation: Integrating Kolmogorov-Arnold Networks with GANs for Unpaired I2I Translation

[CV-9] Moving Healthcare AI-Support Systems for Visually Detectable Diseases onto Constrained Devices

[CV-10] WaterSplatting: Fast Underwater 3D Scene Reconstruction Using Gaussian Splatting

Link: https://arxiv.org/abs/2408.08206
Authors: Huapeng Li, Wenxuan Song, Tianao Xu, Alexandre Elsig, Jonas Kulhanek
Keywords (EN): applications ranging, ranging from naval, naval robots, interesting problem, Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Web: this https URL

Abstract:Underwater 3D scene reconstruction is a challenging yet interesting problem with applications ranging from naval robots to VR experiences. The problem was successfully tackled by fully volumetric NeRF-based methods which can model both the geometry and the medium (water). Unfortunately, these methods are slow to train and do not offer real-time rendering. More recently, the 3D Gaussian Splatting (3DGS) method offered a fast alternative to NeRFs. However, because it is an explicit method that renders only the geometry, it cannot render the medium and is therefore unsuited for underwater reconstruction. Therefore, we propose a novel approach that fuses volumetric rendering with 3DGS to handle underwater data effectively. Our method employs 3DGS for explicit geometry representation and a separate volumetric field (queried once per pixel) for capturing the scattering medium. This dual representation further allows the restoration of the scenes by removing the scattering medium. Our method outperforms state-of-the-art NeRF-based methods in rendering quality on the underwater SeaThru-NeRF dataset. Furthermore, it does so while offering real-time rendering performance, addressing the efficiency limitations of existing methods. Web: this https URL

[CV-11] A Multi-task Adversarial Attack Against Face Authentication

Link: https://arxiv.org/abs/2408.08205
Authors: Hanrui Wang, Shuo Wang, Cunjian Chen, Massimo Tistarelli, Zhe Jin
Keywords (EN): identity management systems, identity management, face authentication systems, MTADV, management systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Multimedia (cs.MM)
Comments: Accepted by ACM Transactions on Multimedia Computing, Communications, and Applications

Abstract:Deep-learning-based identity management systems, such as face authentication systems, are vulnerable to adversarial attacks. However, existing attacks are typically designed for single-task purposes, which means they are tailored to exploit vulnerabilities unique to the individual target rather than being adaptable for multiple users or systems. This limitation makes them unsuitable for certain attack scenarios, such as morphing, universal, transferable, and counter attacks. In this paper, we propose a multi-task adversarial attack algorithm called MTADV that is adaptable for multiple users or systems. By interpreting these scenarios as multi-task attacks, MTADV is applicable to both single- and multi-task attacks, and feasible in the white- and gray-box settings. Furthermore, MTADV is effective against various face datasets, including LFW, CelebA, and CelebA-HQ, and can work with different deep learning models, such as FaceNet, InsightFace, and CurricularFace. Importantly, MTADV retains its feasibility as a single-task attack targeting a single user/system. To the best of our knowledge, MTADV is the first adversarial attack method that can target all of the aforementioned scenarios in one algorithm.

[CV-12] Towards Practical Human Motion Prediction with LiDAR Point Clouds

Link: https://arxiv.org/abs/2408.08202
Authors: Xiao Han, Yiming Ren, Yichen Yao, Yujing Sun, Yuexin Ma
Keywords (EN): human-centric multimedia understanding, understanding and interacting, crucial for human-centric, human-centric multimedia, multimedia understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Human motion prediction is crucial for human-centric multimedia understanding and interacting. Current methods typically rely on ground truth human poses as observed input, which is not practical for real-world scenarios where only raw visual sensor data is available. To implement these methods in practice, a preliminary phase of pose estimation is essential. However, such two-stage approaches often lead to performance degradation due to the accumulation of errors. Moreover, reducing raw visual data to sparse keypoint representations significantly diminishes the density of information, resulting in the loss of fine-grained features. In this paper, we propose LiDAR-HMP, the first single-LiDAR-based 3D human motion prediction approach, which receives the raw LiDAR point cloud as input and forecasts future 3D human poses directly. Building upon our novel structure-aware body feature descriptor, LiDAR-HMP adaptively maps the observed motion manifold to future poses and effectively models the spatial-temporal correlations of human motions for further refinement of prediction results. Extensive experiments show that our method achieves state-of-the-art performance on two public benchmarks and demonstrates remarkable robustness and efficacy in real-world deployments.

[CV-13] Heavy Labels Out! Dataset Distillation with Label Space Lightening

Link: https://arxiv.org/abs/2408.08201
Authors: Ruonan Yu, Songhua Liu, Zigeng Chen, Jingwen Ye, Xinchao Wang
Keywords (EN): networks are similar, condensation aims, aims to condense, neural networks, large-scale training dataset
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Dataset distillation or condensation aims to condense a large-scale training dataset into a much smaller synthetic one such that the training performance of distilled and original sets on neural networks are similar. Although the number of training samples can be reduced substantially, current state-of-the-art methods heavily rely on enormous soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO aiming at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between pre-trained and target distributions, so that original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, an effective image optimization method is proposed to further mitigate the potential error between the original and distilled label generators. Extensive experiments demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets. Our code will be available.

[CV-14] Beyond Full Label: Single-Point Prompt for Infrared Small Target Label Generation

Link: https://arxiv.org/abs/2408.08191
Authors: Shuai Yuan, Hanlin Qin, Renke Kou, Xiang Yan, Zechuan Li, Chenxu Peng, Abd-Krim Seghouane
Keywords (EN): infrared small target, label generation, infrared small, small target label, small target
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this work, we make the first attempt to construct a learning-based single-point annotation paradigm for infrared small target label generation (IRSTLG). Our intuition is that label generation requires just one more point prompt than target detection: IRSTLG can be regarded as an infrared small target detection (IRSTD) task with the target location hint. Based on this insight, we introduce an energy double guided single-point prompt (EDGSP) framework, which adeptly transforms the target detection network into a refined label generation method. Specifically, the proposed EDGSP includes: 1) target energy initialization (TEI) to create a foundational outline for sufficient shape evolution of pseudo label, 2) double prompt embedding (DPE) for rapid localization of interested regions and reinforcement of individual differences to avoid label adhesion, and 3) bounding box-based matching (BBM) to eliminate false alarms. Experimental results show that pseudo labels generated by three baselines equipped with EDGSP achieve 100% object-level probability of detection (Pd) and 0% false-alarm rate (Fa) on SIRST, NUDT-SIRST, and IRSTD-1k datasets, with a pixel-level intersection over union (IoU) improvement of 13.28% over state-of-the-art label generation methods. Additionally, the downstream detection task reveals that our centroid-annotated pseudo labels surpass full labels; even with coarse single-point annotations, they still achieve 99.5% of the performance of full labeling.

[CV-15] FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Link: https://arxiv.org/abs/2408.08189
Authors: Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, Yuhui Yin
Keywords (EN): Synthesizing motion-rich, frame-specific textual guidance, textual guidance, Textual Guidance Module, artificial intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model’s capacity to comprehend the temporal logic conveyed in prompts and generate videos with coherent motion is restricted. To tackle this limitation, we introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with the well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific textual guidance. Firstly, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Lastly, TFB boosts the temporal consistency of latent features. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our approach achieves state-of-the-art T2V generation results on the EvalCrafter benchmark and facilitates the synthesis of dynamic and consistent videos. Video results are available at this https URL, and we will make our code and model weights publicly available.

[CV-16] Not Every Image is Worth a Thousand Words: Quantifying Originality in Stable Diffusion ICML2024

Link: https://arxiv.org/abs/2408.08184
Authors: Adi Haviv, Shahar Sarfaty, Uri Hacohen, Niva Elkin-Koren, Roi Livni, Amit H Bermano
Keywords (EN): addresses the challenge, challenge of quantifying, diffusion models, model, stable diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: GenLaw ICML 2024

Abstract:This work addresses the challenge of quantifying originality in text-to-image (T2I) generative diffusion models, with a focus on copyright originality. We begin by evaluating T2I models’ ability to innovate and generalize through controlled experiments, revealing that stable diffusion models can effectively recreate unseen elements with sufficiently diverse training data. Then, our key insight is that concepts and combinations of image elements the model is familiar with, and saw more during training, are more concisly represented in the model’s latent space. We hence propose a method that leverages textual inversion to measure the originality of an image based on the number of tokens required for its reconstruction by the model. Our approach is inspired by legal definitions of originality and aims to assess whether a model can produce original content without relying on specific prompts or having the training data of the model. We demonstrate our method using both a pre-trained stable diffusion model and a synthetic dataset, showing a correlation between the number of tokens and image originality. This work contributes to the understanding of originality in generative models and has implications for copyright infringement cases.

[CV-17] Your Turn: Real-World Turning Angle Estimation for Parkinson's Disease Severity Assessment

Click to view abstract

[CV-18] Towards flexible perception with visual memory

Click to view abstract

[CV-19] Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks

Link: https://arxiv.org/abs/2408.08149
Authors: Jiawei Wu, Zhi Jin
Keywords: Recent research, high-level vision, high-level vision tasks, degraded environments, human perception
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Click to view abstract

Abstract:Recent research tries to extend image restoration capabilities from human perception to machine perception, thereby enhancing the performance of high-level vision tasks in degraded environments. These methods, primarily based on supervised learning, typically involve the retraining of restoration networks or high-level vision networks. However, collecting paired data in real-world scenarios and retraining large-scale models are challenging. To this end, we propose an unsupervised learning method called Variational Translator (VaT), which does not require retraining existing restoration and high-level vision networks. Instead, it establishes a lightweight network that serves as an intermediate bridge between them. By variational inference, VaT approximates the joint distribution of restoration output and high-level vision input, dividing the optimization objective into preserving content and maximizing marginal likelihood associated with high-level vision tasks. By cleverly leveraging self-training paradigms, VaT achieves the above optimization objective without requiring labels. As a result, the translated images maintain a close resemblance to their original content while also demonstrating exceptional performance on high-level vision tasks. Extensive experiments in dehazing and low-light enhancement for detection and classification show the superiority of our method over other state-of-the-art unsupervised counterparts, even significantly surpassing supervised methods in some complex real-world scenarios.

[CV-20] Unlearnable Examples Detection via Iterative Filtering ICANN2024

Link: https://arxiv.org/abs/2408.08143
Authors: Yi Yu, Qichen Zheng, Siyuan Yang, Wenhan Yang, Jun Liu, Shijian Lu, Yap-Peng Tan, Kwok-Yan Lam, Alex Kot
Keywords: Deep neural networks, Deep neural, data poisoning attacks, data poisoning, neural networks
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICANN 2024

Click to view abstract

Abstract:Deep neural networks are proven to be vulnerable to data poisoning attacks. Recently, a specific type of data poisoning attack known as availability attacks has led to the failure of data utilization for model learning by adding imperceptible perturbations to images. Consequently, it is quite beneficial and challenging to detect poisoned samples, also known as Unlearnable Examples (UEs), from a mixed dataset. In response, we propose an Iterative Filtering approach for UEs identification. This method leverages the distinction between the inherent semantic mapping rules and shortcuts, without the need for any additional information. We verify that when training a classifier on a mixed dataset containing both UEs and clean data, the model tends to quickly adapt to the UEs compared to the clean data. Due to the accuracy gaps between training with clean/poisoned samples, we employ a model to misclassify clean samples while correctly identifying the poisoned ones. The incorporation of additional classes and iterative refinement enhances the model’s ability to differentiate between clean and poisoned samples. Extensive experiments demonstrate the superiority of our method over state-of-the-art detection approaches across various attacks, datasets, and poison ratios, significantly reducing the Half Total Error Rate (HTER) compared to existing methods.
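
The abstract reports results as Half Total Error Rate (HTER) without defining it. A minimal sketch, assuming the common definition of HTER as the mean of the two class-wise error rates (missed poisoned samples and falsely flagged clean samples). The function name and argument layout are hypothetical and not taken from the paper.

```python
def half_total_error_rate(is_poisoned, flagged):
    """HTER = (miss rate on poisoned samples + false-alarm rate on clean samples) / 2.

    is_poisoned: ground-truth booleans, True for an unlearnable example.
    flagged:     detector decisions, True when the sample is flagged as poisoned.
    """
    if len(is_poisoned) != len(flagged):
        raise ValueError("inputs must have the same length")
    poisoned_total = sum(1 for y in is_poisoned if y)
    clean_total = len(is_poisoned) - poisoned_total
    if poisoned_total == 0 or clean_total == 0:
        raise ValueError("both classes must be present")
    # Poisoned samples the detector failed to flag.
    misses = sum(1 for y, f in zip(is_poisoned, flagged) if y and not f)
    # Clean samples the detector wrongly flagged.
    false_alarms = sum(1 for y, f in zip(is_poisoned, flagged) if not y and f)
    return (misses / poisoned_total + false_alarms / clean_total) / 2


labels = [True, True, False, False, False, False]
decisions = [True, False, True, False, False, False]
print(half_total_error_rate(labels, decisions))
```

Because the two error rates are averaged per class, the score does not depend on the poison ratio of the mixed dataset, which makes it suitable for comparisons across ratios.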

[CV-21] CorrAdaptor: Adaptive Local Context Learning for Correspondence Pruning ECAI

Link: https://arxiv.org/abs/2408.08134
Authors: Wei Zhu, Yicheng Liu, Yuping He, Tangfei Liao, Kang Zheng, Xiaoqiu Xu, Tao Wang, Tong Lu
Keywords: accurate pixel-level correspondences, enabling advanced tasks, vision and robotics, accurate pixel-level, localization and mapping
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, accepted by ECAI

Click to view abstract

Abstract:In the fields of computer vision and robotics, accurate pixel-level correspondences are essential for enabling advanced tasks such as structure-from-motion and simultaneous localization and mapping. Recent correspondence pruning methods usually focus on learning local consistency through k-nearest neighbors, which makes it difficult to capture robust context for each correspondence. We propose CorrAdaptor, a novel architecture that introduces a dual-branch structure capable of adaptively adjusting local contexts through both explicit and implicit local graph learning. Specifically, the explicit branch uses KNN-based graphs tailored for initial neighborhood identification, while the implicit branch leverages a learnable matrix to softly assign neighbors and adaptively expand the local context scope, significantly enhancing the model’s robustness and adaptability to complex image variations. Moreover, we design a motion injection module to integrate motion consistency into the network to suppress the impact of outliers and refine local context learning, resulting in substantial performance improvements. The experimental results on extensive correspondence-based tasks indicate that our CorrAdaptor achieves state-of-the-art performance both qualitatively and quantitatively. The code and pre-trained models are available at this https URL.

[CV-22] Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification ACM-MM2024

Link: https://arxiv.org/abs/2408.08125
Authors: Jiexuan Yan, Sheng Huang, Nankun Mu, Luwen Huangfu, Bo Liu
Keywords: Real-world data consistently, data consistently exhibits, spanning multiple categories, category-specific visual representations, Real-world data
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2024

Click to view abstract

Abstract:Real-world data consistently exhibits a long-tailed distribution, often spanning multiple categories. This complexity underscores the challenge of content comprehension, particularly in scenarios requiring Long-Tailed Multi-Label image Classification (LTMLC). In such contexts, imbalanced data distribution and multi-object recognition pose significant hurdles. To address this issue, we propose a novel and effective approach for LTMLC, termed Category-Prompt Refined Feature Learning (CPRFL), utilizing semantic correlations between different categories and decoupling category-specific visual representations for each category. Specifically, CPRFL initializes category-prompts from the pretrained CLIP’s embeddings and decouples category-specific visual representations through interaction with visual features, thereby facilitating the establishment of semantic correlations between the head and tail classes. To mitigate the visual-semantic domain bias, we design a progressive Dual-Path Back-Propagation mechanism to refine the prompts by progressively incorporating context-related visual information into prompts. Simultaneously, the refinement process facilitates the progressive purification of the category-specific visual representations under the guidance of the refined prompts. Furthermore, taking into account the negative-positive sample imbalance, we adopt the Asymmetric Loss as our optimization objective to suppress negative samples across all classes and potentially enhance the head-to-tail recognition performance. We validate the effectiveness of our method on two LTMLC benchmarks and extensive experiments demonstrate the superiority of our work over baselines. The code is available at this https URL. 
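
The abstract adopts the Asymmetric Loss to handle the negative-positive imbalance. A minimal sketch, assuming this refers to the widely used formulation for multi-label classification (separate focusing exponents for positives and negatives, plus a probability margin that discards easy negatives). The default hyperparameters below are illustrative assumptions, not values from the paper.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Sum of per-label asymmetric losses for one sample.

    probs:   predicted probabilities in [0, 1], one per label.
    targets: 0/1 ground-truth labels.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style down-weighting with exponent gamma_pos.
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative term: shift the probability by the margin so that very
            # easy negatives (p <= margin) contribute exactly zero loss.
            p_m = max(p - margin, 0.0)
            total -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return total


print(asymmetric_loss([0.9, 0.03, 0.6], [1, 0, 0]))
```

Setting `gamma_neg` larger than `gamma_pos` suppresses the many easy negatives more strongly than the few positives, which is the asymmetry the name refers to.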

[CV-23] Unsupervised Part Discovery via Dual Representation Alignment

Link: https://arxiv.org/abs/2408.08108
Authors: Jiahao Xia, Wenjian Huang, Min Xu, Jianguo Zhang, Haimin Zhang, Ziyu Sheng, Dong Xu
Keywords: Object parts serve, part representations, Object parts, downstream tasks, part
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TPAMI-2024

Click to view abstract

Abstract:Object parts serve as crucial intermediate representations in various downstream tasks, but part-level representation learning still has not received as much attention as other vision tasks. Previous research has established that Vision Transformer can learn instance-level attention without labels, extracting high-quality instance-level representations for boosting downstream tasks. In this paper, we achieve unsupervised part-specific attention learning using a novel paradigm and further employ the part representations to improve part discovery performance. Specifically, paired images are generated from the same image with different geometric transformations, and multiple part representations are extracted from these paired images using a novel module, named PartFormer. These part representations from the paired images are then exchanged to improve geometric transformation invariance. Subsequently, the part representations are aligned with the feature map extracted by a feature map encoder, achieving high similarity with the pixel representations of the corresponding part regions and low similarity in irrelevant regions. Finally, the geometric and semantic constraints are applied to the part representations through the intermediate results in alignment for part-specific attention learning, encouraging the PartFormer to focus locally and the part representations to explicitly include the information of the corresponding parts. Moreover, the aligned part representations can further serve as a series of reliable detectors in the testing phase, predicting pixel masks for part discovery. Extensive experiments are carried out on four widely used datasets, and our results demonstrate that the proposed method achieves competitive performance and robustness due to its part-specific attention.

[CV-24] Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images

Click to view abstract

[CV-25] When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Link: https://arxiv.org/abs/2408.08093
Authors: Pingping Zhang, Jinlong Li, Meng Wang, Nicu Sebe, Sam Kwong, Shiqi Wang
Keywords: eliminate intrinsic redundancies, Multimodal Large Language, Existing codecs, Large Language Models, Video Coding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Click to view abstract

Abstract:Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve a very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including Text-Text-to-Video (TT2V) mode to ensure high-quality semantic information and Image-Text-to-Video (IT2V) mode to achieve superb perceptual consistency. In addition, we propose an efficient frame interpolation model for IT2V mode via Low-Rank Adaptation (LoRA) tuning to guarantee perceptual quality, which allows the generated motion cues to behave smoothly. Experiments on benchmarks indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency. These results highlight potential directions for future research in video coding.
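
The frame interpolation model is tuned with LoRA. A minimal sketch of the general LoRA idea, assuming the standard formulation in which a frozen weight matrix W is augmented by a scaled low-rank product B·A. The helper names are hypothetical, and this does not reproduce the paper's interpolation model.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    w: frozen weight, shape (d_out, d_in)
    b: trainable matrix, shape (d_out, rank)
    a: trainable matrix, shape (rank, d_in)
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # d_out x rank, with rank = 1
a = [[0.5, 0.0]]     # rank x d_in
print(lora_effective_weight(w, a, b, alpha=2.0, rank=1))
```

Only `a` and `b` are trained, so the number of tuned parameters scales with the rank rather than with the full size of `w`.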

[CV-26] OC3D: Weakly Supervised Outdoor 3D Object Detection with Only Coarse Click Annotation

Click to view abstract

[CV-27] HAIR: Hypernetworks-based All-in-One Image Restoration

Link: https://arxiv.org/abs/2408.08091
Authors: Jin Cao, Yi Cao, Li Pang, Deyu Meng, Xiangyong Cao
Keywords: restoration involves recovering, high-quality clean image, Image restoration, Image restoration involves, image restoration tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures, 6 tables

Click to view abstract

Abstract:Image restoration involves recovering a high-quality clean image from its degraded version, which is a fundamental task in computer vision. Recent progress in image restoration has demonstrated the effectiveness of learning models capable of addressing various degradations simultaneously, i.e., the All-in-One image restoration models. However, these existing methods typically use the same parameters for images with different degradation types, which forces the model to trade off between degradation types and therefore impairs overall performance. To solve this problem, we propose HAIR, a Hypernetworks-based plug-in-and-play method that dynamically generates parameters for the corresponding networks based on the contents of input images. HAIR consists of two main components: a Classifier (Cl) and a Hyper Selecting Net (HSN). More specifically, the Classifier is a simple image classification network that generates a Global Information Vector (GIV) containing the degradation information of the input image, and the HSN can be seen as a simple fully connected neural network that receives the GIV and outputs parameters for the corresponding modules. Extensive experiments show that incorporating HAIR into existing architectures can significantly improve the performance of different models on image restoration tasks at a low cost, although HAIR only generates parameters and does not change these models’ logical structures at all. By incorporating HAIR into the popular architecture Restormer, our method obtains superior or at least comparable performance to current state-of-the-art methods on a range of image restoration tasks. Code and pre-trained checkpoints are available at this https URL.

[CV-28] ColorMamba: Towards High-quality NIR-to-RGB Spectral Translation with Mamba

Link: https://arxiv.org/abs/2408.08087
Authors: Huiyu Zhai, Guang Jin, Xingxing Yang, Guosheng Kang
Keywords: Translating NIR, Selective Structured State, Structured State Space, cross-domain complexities, State Space Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Click to view abstract

Abstract:Translating NIR to the visible spectrum is challenging due to cross-domain complexities. Current models struggle to balance a broad receptive field with computational efficiency, limiting practical use. Although the Selective Structured State Space Model, especially the improved version, Mamba, excels in generative tasks by capturing long-range dependencies with linear complexity, its default approach of converting 2D images into 1D sequences neglects local context. In this work, we propose a simple but effective backbone, dubbed ColorMamba, which first introduces Mamba into spectral translation tasks. To explore global long-range dependencies and local context for efficient spectral translation, we introduce learnable padding tokens to enhance the distinction of image boundaries and prevent potential confusion within the sequence model. Furthermore, local convolutional enhancement and agent attention are designed to improve the vanilla Mamba. Moreover, we exploit the HSV color to provide multi-scale guidance in the reconstruction process for more accurate spectral translation. Extensive experiments show that our ColorMamba achieves a 1.02 improvement in terms of PSNR compared with the state-of-the-art method. Our code is available at this https URL.
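
The headline result is stated in terms of PSNR. A minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE), assuming 8-bit images flattened to lists of pixel values. The function name and the `max_value` default are illustrative choices, not details from the paper.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


ground_truth_rgb = [52, 55, 61, 59, 79, 61, 76, 61]
translated_rgb = [54, 55, 60, 58, 80, 63, 75, 60]
print(psnr(ground_truth_rgb, translated_rgb))
```

If the reported gain of 1.02 is in dB (the abstract gives no unit), it corresponds to about 21% lower mean squared error, since 10^(-1.02/10) ≈ 0.79.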

[CV-29] Single-image coherent reconstruction of objects and humans CVPR

Link: https://arxiv.org/abs/2408.08086
Authors: Sarthak Batra, Partha P. Chakrabarti, Simon Hadfield, Armin Mustafa
Keywords: monocular image suffer, severe mesh collisions, interacting occluding objects, suffer from severe, severe mesh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AI for 3D Generation, CVPR Workshop

Click to view abstract

Abstract:Existing methods for reconstructing objects and humans from a monocular image suffer from severe mesh collisions and performance limitations for interacting occluding objects. This paper introduces a method to obtain a globally consistent 3D reconstruction of interacting objects and people from a single image. Our contributions include: 1) an optimization framework, featuring a collision loss, tailored to handle human-object and human-human interactions, ensuring spatially coherent scene reconstruction; and 2) a novel technique to robustly estimate 6 degrees of freedom (DOF) poses, specifically for heavily occluded objects, exploiting image inpainting. Notably, our proposed method operates effectively on images from real-world scenarios, without necessitating scene or object-level 3D supervision. Extensive qualitative and quantitative evaluation against existing methods demonstrates a significant reduction in collisions in the final reconstructions of scenes with multiple interacting humans and objects and a more coherent scene reconstruction.

[CV-30] Treat Stillness with Movement: Remote Sensing Change Detection via Coarse-grained Temporal Foregrounds Mining

Click to view abstract

[CV-31] MambaMIM: Pre-training Mamba with State Space Token-interpolation

Link: https://arxiv.org/abs/2408.08070
Authors: Fenghe Tang, Bingkun Nian, Yingtai Li, Jie Yang, Liu Wei, S. Kevin Zhou
Keywords: Convolutional Neural Networks, Neural Networks, Vision Transformers, Convolutional Neural, learning demonstrates outstanding
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures

Click to view abstract

Abstract:Generative self-supervised learning demonstrates outstanding representation learning capabilities in both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). However, there are currently no generative pre-training methods related to selective state space models (Mamba) that can handle long-range dependencies effectively. To address this challenge, we introduce a generative self-supervised learning method for Mamba (MambaMIM) based on Selective Structure State Space Sequence Token-interpolation (S6T), a general-purpose pre-training method for arbitrary Mamba architectures. Our method, MambaMIM, incorporates a bottom-up 3D hybrid masking strategy in the encoder to maintain masking consistency across different architectures. Additionally, S6T is employed to learn causal relationships between the masked sequence in the state space. MambaMIM can be used on any single or hybrid Mamba architectures to enhance the Mamba long-range representation capability. Extensive downstream experiments reveal the feasibility and advancement of using Mamba for pre-training medical image tasks. The code is available at: this https URL

[CV-32] Navigating Data Scarcity using Foundation Models: A Benchmark of Few-Shot and Zero-Shot Learning Approaches in Medical Imaging MICCAI2024

Click to view abstract

[CV-33] CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection ECCV2024

Link: https://arxiv.org/abs/2408.08050
Authors: Xunfa Lai, Zhiyu Yang, Jie Hu, Shengchuan Zhang, Liujuan Cao, Guannan Jiang, Zhiyu Wang, Songan Zhang, Rongrong Ji
Keywords: Instance-wise Consistency Learning, Pixel-wise Consistency Learning, employs Pixel-wise Consistency, Existing camouflaged object, DRCL minimizes pseudo-label
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ECCV 2024

Click to view abstract

Abstract:Existing camouflaged object detection (COD) methods depend heavily on large-scale pixel-level annotations. However, acquiring such annotations is laborious due to the inherent camouflage characteristics of the objects. Semi-supervised learning offers a promising solution to this challenge. Yet, its application in COD is hindered by significant pseudo-label noise, both pixel-level and instance-level. We introduce CamoTeacher, a novel semi-supervised COD framework, utilizing Dual-Rotation Consistency Learning (DRCL) to effectively address these noise issues. Specifically, DRCL minimizes pseudo-label noise by leveraging the consistency of rotation views at the pixel level and the instance level. First, it employs Pixel-wise Consistency Learning (PCL) to deal with pixel-level noise by reweighting the different parts within the pseudo-label. Second, Instance-wise Consistency Learning (ICL) is used to adjust weights for pseudo-labels, which handles instance-level noise. Extensive experiments on four COD benchmark datasets demonstrate that the proposed CamoTeacher not only achieves state-of-the-art performance compared with semi-supervised learning methods, but also rivals established fully-supervised learning methods. Our code will be available soon.

[CV-34] An Advanced Deep Learning Based Three-Stream Hybrid Model for Dynamic Hand Gesture Recognition

Link: https://arxiv.org/abs/2408.08035
Authors: Md Abdur Rahim, Abu Saleh Musa Miah, Hemel Sharker Akash, Jungpil Shin, Md. Imran Hossain, Md. Najmul Hossain
Keywords: hand gesture recognition, modern context, recognition has emerged, feature, deep learning feature
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Click to view abstract

Abstract:In the modern context, hand gesture recognition has emerged as a focal point. This is due to its wide range of applications, which include comprehending sign language, factories, hands-free devices, and guiding robots. Many researchers have attempted to develop more effective techniques for recognizing these hand gestures. However, there are challenges like dataset limitations, variations in hand forms, external environments, and inconsistent lighting conditions. To address these challenges, we propose a novel three-stream hybrid model that combines RGB pixel and skeleton-based features to recognize hand gestures. In the procedure, we preprocessed the dataset, including augmentation, to make the system independent of rotation, translation, and scaling. We employed a three-stream hybrid model to extract the multi-feature fusion using the power of the deep learning module. In the first stream, we extracted the initial feature using the pre-trained ImageNet module and then enhanced this feature by using multiple layers of GRU and LSTM modules. In the second stream, we extracted the initial feature with the pre-trained ResNet module and enhanced it with various combinations of the GRU and LSTM modules. In the third stream, we extracted the hand pose key points using MediaPipe and then enhanced them using a stacked LSTM to produce the hierarchical feature. After that, we concatenated the three features to produce the final feature. Finally, we employed a classification module to produce the probabilistic map and generate the predicted output. In short, we produced a powerful feature vector by combining the pixel-based deep learning feature with the pose-estimation-based stacked deep learning feature, pairing pre-trained models with a deep learning model trained from scratch to improve gesture detection.

[CV-35] DIVE: Towards Descriptive and Diverse Visual Commonsense Generation EMNLP2023

Click to view abstract

[CV-36] Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

Click to view abstract

[CV-37] Adaptive Learning of Consistency and Inconsistency Information for Fake News Detection

Link: https://arxiv.org/abs/2408.08013
Authors: Aohan Li, Jiaxin Chen, Xin Liao, Dengyong Zhang
Keywords: posing a threat, trust and credibility, rapid advancement, platforms has significantly, significantly reduced
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Click to view abstract

Abstract:The rapid advancement of social media platforms has significantly reduced the cost of information dissemination, yet it has also led to a proliferation of fake news, posing a threat to societal trust and credibility. Most fake news detection research has focused on integrating text and image information to represent the consistency of multiple modes in news content, while paying less attention to inconsistent information. Besides, existing methods that leverage inconsistent information often let one mode overshadow another, leading to ineffective use of inconsistent clues. To address these issues, we propose an adaptive multi-modal feature fusion network (MFF-Net). Inspired by human judgment processes for determining truth and falsity in news, MFF-Net focuses on inconsistent parts when news content is generally consistent and consistent parts when it is generally inconsistent. Specifically, MFF-Net extracts semantic and global features from images and texts respectively, and learns consistency information between modes through a multiple feature fusion module. To deal with the problem of modal information being easily masked, we design a single modal feature filtering strategy to capture inconsistent information from corresponding modes separately. Finally, similarity scores are calculated based on global features with adaptive adjustments made to achieve weighted fusion of consistent and inconsistent features. Extensive experimental results demonstrate that MFF-Net outperforms state-of-the-art methods across three public news datasets derived from real social media.

[CV-38] MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing

Link: https://arxiv.org/abs/2408.08000
Authors: Chenjie Cao, Chaohui Yu, Yanwei Fu, Fan Wang, Xiangyang Xue
Keywords: achieved prominent improvements, recently achieved prominent, generation have recently, prominent improvements, recently achieved
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Novel View Synthesis (NVS) and 3D generation have recently achieved prominent improvements. However, these works mainly focus on confined categories or synthetic 3D assets, which struggle to generalize to challenging in-the-wild scenes and cannot be employed with 2D synthesis directly. Moreover, these methods depend heavily on camera poses, limiting their real-world applications. To overcome these issues, we propose MVInpainter, re-formulating the 3D editing as a multi-view 2D inpainting task. Specifically, MVInpainter partially inpaints multi-view images with the reference guidance rather than intractably generating an entirely novel view from scratch, which largely simplifies the difficulty of in-the-wild NVS and leverages unmasked clues instead of explicit pose conditions. To ensure cross-view consistency, MVInpainter is enhanced by video priors from motion components and appearance guidance from concatenated reference key-value attention. Furthermore, MVInpainter incorporates slot attention to aggregate high-level optical flow features from unmasked regions to control the camera movement with pose-free training and inference. Extensive scene-level experiments on both object-centric and forward-facing datasets verify the effectiveness of MVInpainter, including diverse tasks, such as multi-view object removal, synthesis, insertion, and replacement. The project page is this https URL.

[CV-39] Co-Fix3D: Enhancing 3D Object Detection with Collaborative Refinement

Link: https://arxiv.org/abs/2408.07999
Authors: Wenxuan Li, Qin Zou, Chi Chen, Bo Du, Long Chen
Keywords: autonomous driving,accurately detecting, driving,accurately detecting occluded, presents significant challenges, weak positive samples, presents significant
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Click to view abstract

Abstract:In the realm of autonomous driving, accurately detecting occluded or distant objects, referred to as weak positive samples, presents significant challenges. These challenges predominantly arise during query initialization, where an over-reliance on heatmap confidence often results in a high rate of false positives, consequently masking weaker detections and impairing system performance. To alleviate this issue, we propose a novel approach, Co-Fix3D, which employs a collaborative hybrid multi-stage parallel query generation mechanism for BEV representations. Our method incorporates the Local-Global Feature Enhancement (LGE) module, which refines BEV features to more effectively highlight weak positive samples. It uniquely leverages the Discrete Wavelet Transform (DWT) for accurate noise reduction and feature refinement in localized areas, and incorporates an attention mechanism to more comprehensively optimize global BEV features. Moreover, our method increases the volume of BEV queries through a multi-stage parallel processing of the LGE, significantly enhancing the probability of selecting weak positive samples. This enhancement not only improves training efficiency within the decoder framework but also boosts overall system performance. Notably, Co-Fix3D achieves superior results on the stringent nuScenes benchmark, outperforming all previous models with a 69.1% mAP and 72.9% NDS on the LiDAR-based benchmark, and 72.3% mAP and 74.1% NDS on the multi-modality benchmark, without relying on test-time augmentation or additional datasets. The source code will be made publicly available upon acceptance.
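
The LGE module uses the Discrete Wavelet Transform for noise reduction. A minimal sketch of the underlying idea on a 1D signal, assuming a single-level Haar wavelet with hard thresholding of the detail coefficients. This is a generic illustration with hypothetical function names; the paper applies DWT to 2D BEV features, and its exact wavelet and thresholding scheme are not given in the abstract.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """Single-level Haar transform of an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def denoise(signal, threshold):
    """Zero out small detail coefficients, then reconstruct."""
    approx, detail = haar_dwt(signal)
    kept = [d if abs(d) > threshold else 0.0 for d in detail]
    return haar_idwt(approx, kept)


print(denoise([1.0, 1.2, 5.0, 5.0], threshold=0.5))
```

Small fluctuations between neighbouring values end up in the detail coefficients, so removing the small ones smooths local noise while the approximation coefficients keep the coarse structure.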

[CV-40] Monte Carlo Path Tracing and Statistical Event Detection for Event Camera Simulation

Link: https://arxiv.org/abs/2408.07996
Authors: Yuichiro Manabe, Tatsuya Yatagawa, Shigeo Morishima, Hiroyuki Kubo
Keywords: simulation system fully, system fully based, Monte Carlo path, based Monte Carlo, camera simulation system
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures, Presented at ICCP 2024

Click to view abstract

Abstract:This paper presents a novel event camera simulation system fully based on physically based Monte Carlo path tracing with adaptive path sampling. The adaptive sampling performed in the proposed method is based on a statistical technique, hypothesis testing for the hypothesis whether the difference of logarithmic luminances at two distant periods is significantly larger than a predefined event threshold. To this end, our rendering system collects logarithmic luminances rather than raw luminance in contrast to the conventional rendering system imitating conventional RGB cameras. Then, based on the central limit theorem, we reasonably assume that the distribution of the population mean of logarithmic luminance can be modeled as a normal distribution, allowing us to model the distribution of the difference of logarithmic luminance as a normal distribution. Then, using Student’s t-test, we can test the hypothesis and determine whether to discard the null hypothesis for event non-occurrence. When we sample a sufficiently large number of path samples to satisfy the central limit theorem and obtain a clean set of events, our method achieves significant speed up compared to a simple approach of sampling paths uniformly at every pixel. To our knowledge, we are the first to simulate the behavior of event cameras in a physically accurate manner using an adaptive sampling technique in Monte Carlo path tracing, and we believe this study will contribute to the development of computer vision applications using event cameras.
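
The abstract describes deciding whether an event occurred by testing if the change in mean log luminance significantly exceeds a threshold. A minimal sketch of that decision for one pixel, assuming two sets of path samples and a one-sided test. To stay within the Python standard library it swaps in a large-sample normal approximation in place of the Student's t-test used in the paper; the function name, the `alpha` default, and the sample layout are hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance


def event_detected(log_lum_prev, log_lum_curr, threshold, alpha=0.05):
    """Test H0: |mean change in log luminance| <= threshold (no event).

    Returns True when H0 is rejected at significance level alpha.
    Each input needs at least two samples so a variance can be estimated.
    """
    n_prev, n_curr = len(log_lum_prev), len(log_lum_curr)
    diff = fmean(log_lum_curr) - fmean(log_lum_prev)
    # Standard error of the difference of two independent sample means.
    std_err = math.sqrt(variance(log_lum_prev) / n_prev + variance(log_lum_curr) / n_curr)
    if std_err == 0.0:
        return abs(diff) > threshold
    # Normal approximation (assumption): the paper uses Student's t instead.
    z = (abs(diff) - threshold) / std_err
    p_value = 1.0 - NormalDist().cdf(z)
    return p_value < alpha


dark = [0.0, 0.01, -0.01, 0.0] * 10
bright = [1.0, 1.01, 0.99, 1.0] * 10
print(event_detected(dark, bright, threshold=0.2))
print(event_detected(dark, dark, threshold=0.2))
```

An adaptive sampler could keep drawing paths for a pixel only while the test remains inconclusive, which is where the speed-up over uniform sampling described in the abstract would come from.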

[CV-41] IIU: Independent Inference Units for Knowledge-based Visual Question Answering

[CV-42] Exploring learning environments for label-efficient cancer diagnosis

Link: https://arxiv.org/abs/2408.07988
Authors: Samta Rani, Tanvir Ahmad, Sarfaraz Masood, Chandni Saxena
Keywords: significant research efforts, efforts and advancements, remains a leading, learning, supervised learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the journal

Abstract:Despite significant research efforts and advancements, cancer remains a leading cause of mortality. Early cancer prediction has become a crucial focus in cancer research to streamline patient care and improve treatment outcomes. Manual tumor detection by histopathologists can be time consuming, prompting the need for computerized methods to expedite treatment planning. Traditional approaches to tumor detection rely on supervised learning, which necessitates a large amount of annotated data for model training. However, acquiring such extensive labeled data can be laborious and time-intensive. This research examines three learning environments, supervised learning (SL), semi-supervised learning (Semi-SL), and self-supervised learning (Self-SL), to predict kidney, lung, and breast cancer. Three pre-trained deep learning models (Residual Network-50, Visual Geometry Group-16, and EfficientNetB0) are evaluated based on these learning settings using seven carefully curated training sets. To create the first training set (TS1), SL is applied to all annotated image samples. Five training sets (TS2-TS6) with different ratios of labeled and unlabeled cancer images are used to evaluate Semi-SL. Unlabeled cancer images from the final training set (TS7) are utilized for Self-SL assessment. Among different learning environments, outcomes from the Semi-SL setting show a strong degree of agreement with the outcomes achieved in the SL setting. The uniform pattern of observations from the pre-trained models across all three datasets validates the methodology and techniques of the research. Based on a modest number of labeled samples and minimal computing cost, our study suggests that the Semi-SL option can be a highly viable replacement for the SL option under label annotation constraint scenarios.

[CV-43] Analytical Uncertainty-Based Loss Weighting in Multi-Task Learning

[CV-44] LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning

[CV-45] Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models IROS2024

Link: https://arxiv.org/abs/2408.07975
Authors: Tianyu Wang, Haitao Lin, Junqiu Yu, Yanwei Fu
Keywords: Large Language Models, recent Large Language, open-ended interactive robotic, interactive robotic manipulation, paper investigates
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IROS 2024. 8 pages, 5 figures. See this https URL

[CV-46] FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering

Link: https://arxiv.org/abs/2408.07967
Authors: Guofeng Feng, Siyan Chen, Rong Fu, Zimu Liao, Yi Wang, Tao Liu, Zhilin Pei, Hengjie Li, Xingcheng Zhang, Bo Dai
Keywords: CUDA Python library, open-source CUDA Python, Gaussian Splatting, CUDA Python, Python library
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This work introduces FlashGS, an open-source CUDA Python library, designed to facilitate the efficient differentiable rasterization of 3D Gaussian Splatting through algorithmic and kernel-level optimizations. FlashGS is developed based on the observations from a comprehensive analysis of the rendering process to enhance computational efficiency and bring the technique to wide adoption. The paper includes a suite of optimization strategies, encompassing redundancy elimination, efficient pipelining, refined control and scheduling mechanisms, and memory access optimizations, all of which are meticulously integrated to amplify the performance of the rasterization process. An extensive evaluation of FlashGS’ performance has been conducted across a diverse spectrum of synthetic and real-world large-scale scenes, encompassing a variety of image resolutions. The empirical findings demonstrate that FlashGS consistently achieves an average 4x acceleration over mobile consumer GPUs, coupled with reduced memory consumption. These results underscore the superior performance and resource optimization capabilities of FlashGS, positioning it as a formidable tool in the domain of 3D rendering.

[CV-47] Training Spatial-Frequency Visual Prompts and Probabilistic Clusters for Accurate Black-Box Transfer Learning

Link: https://arxiv.org/abs/2408.07944
Authors: Wonwoo Cho, Kangyeol Kim, Saemee Choi, Jaegul Choo
Keywords: prediction API services, directly applying general, real-world scenarios due, API services, applying general models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM Multimedia 2024

Abstract:Despite the growing prevalence of black-box pre-trained models (PTMs) such as prediction API services, there remains a significant challenge in directly applying general models to real-world scenarios due to the data distribution gap. Considering a data deficiency and constrained computational resource scenario, this paper proposes a novel parameter-efficient transfer learning framework for vision recognition models in the black-box setting. Our framework incorporates two novel training techniques. First, we align the input space (i.e., image) of PTMs to the target data distribution by generating visual prompts of spatial and frequency domain. Along with the novel spatial-frequency hybrid visual prompter, we design a novel training technique based on probabilistic clusters, which can enhance class separation in the output space (i.e., prediction probabilities). In experiments, our model demonstrates superior performance in a few-shot transfer learning setting across extensive visual recognition datasets, surpassing state-of-the-art baselines. Additionally, we show that the proposed method efficiently reduces computational costs for training and inference phases.

[CV-48] Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning

[CV-49] A Deep Features-Based Approach Using Modified ResNet50 and Gradient Boosting for Visual Sentiments Classification

Link: https://arxiv.org/abs/2408.07922
Authors: Muhammad Arslan, Muhammad Mubeen, Arslan Akram, Saadullah Farooq Abbasi, Muhammad Salman Ali, Muhammad Usman Tariq
Keywords: Visual Sentiment Analysis, Sentiment Analysis, rising profile, Visual Sentiment, versatile nature
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 4 pages, 4 figures, 3 tables, IEEE International Conference on Multimedia Information Processing and Retrieval (MIPR) 2024

Abstract:The versatile nature of Visual Sentiment Analysis (VSA) is one reason for its rising profile. It isn’t easy to efficiently manage social media data with visual information since previous research has concentrated on Sentiment Analysis (SA) of single modalities, such as text. In addition, most visual sentiment studies need to adequately classify sentiment because they are mainly focused on simply merging modal attributes without investigating their intricate relationships. This prompted the suggestion of developing a fusion of deep learning and machine learning algorithms. In this research, a deep feature-based method for multiclass classification has been used to extract deep features from a modified ResNet50. Furthermore, a gradient boosting algorithm has been used to classify photos containing emotional content. The approach is thoroughly evaluated on two benchmarked datasets, CrowdFlower and GAPED. Finally, cutting-edge deep learning and machine learning models were used to compare the proposed strategy. When compared to state-of-the-art approaches, the proposed method demonstrates exceptional performance on the datasets presented.

[CV-50] GOReloc: Graph-based Object-Level Relocalization for Visual SLAM

Link: https://arxiv.org/abs/2408.07917
Authors: Yutong Wang, Chaoyang Jiang, Xieyuanli Chen
Keywords: robotic systems, article introduces, lightweight object-level map, object, Abstract
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, accepted by IEEE RAL

Abstract:This article introduces a novel method for object-level relocalization of robotic systems. It determines the pose of a camera sensor by robustly associating the object detections in the current frame with 3D objects in a lightweight object-level map. Object graphs, considering semantic uncertainties, are constructed for both the incoming camera frame and the pre-built map. Objects are represented as graph nodes, and each node employs unique semantic descriptors based on our devised graph kernels. We extract a subgraph from the target map graph by identifying potential object associations for each object detection, then refine these associations and pose estimations using a RANSAC-inspired strategy. Experiments on various datasets demonstrate that our method achieves more accurate data association and significantly increases relocalization success rates compared to baseline methods. The implementation of our method is released at this https URL.

[CV-51] DM2RM: Dual-Mode Multimodal Ranking for Target Objects and Receptacles Based on Open-Vocabulary Instructions

Link: https://arxiv.org/abs/2408.07910
Authors: Ryosuke Korekata, Kanta Kaneda, Shunya Nagashima, Yuto Imai, Komei Sugiura
Keywords: domestic service robot, carry everyday objects, service robot, pieces of furniture, open-vocabulary instructions
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

[CV-52] Persistence Image from 3D Medical Image: Superpixel and Optimized Gaussian Coefficient

Link: https://arxiv.org/abs/2408.07905
Authors: Yanfan Zhu, Yash Singh, Khaled Younis, Shunxing Bao, Yuankai Huo
Keywords: uncovers crucial properties, uncovers crucial, crucial properties, properties of objects, TDA
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Topological data analysis (TDA) uncovers crucial properties of objects in medical imaging. Methods based on persistent homology have demonstrated their advantages in capturing topological features that traditional deep learning methods cannot detect in both radiology and pathology. However, previous research primarily focused on 2D image analysis, neglecting the comprehensive 3D context. In this paper, we propose an innovative 3D TDA approach that incorporates the concept of superpixels to transform 3D medical image features into point cloud data. By utilizing an Optimized Gaussian Coefficient, the proposed 3D TDA method, for the first time, efficiently generates holistic Persistence Images for 3D volumetric data. Our 3D TDA method exhibits superior performance on the MedMNist3D dataset when compared to other traditional methods, showcasing its potential effectiveness in modeling 3D persistent homology-based topological analysis when it comes to classification tasks. The source code is publicly available at this https URL.

[CV-53] Quantum-inspired Interpretable Deep Learning Architecture for Text Sentiment Analysis

[CV-54] MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking

Link: https://arxiv.org/abs/2408.07889
Authors: Simiao Lai, Chang Liu, Jiawen Zhu, Ben Kang, Yang Liu, Dong Wang, Huchuan Lu
Keywords: global interaction capability, Existing RGB-T tracking, Transformer architecture, made remarkable progress, Existing RGB-T
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and face challenges of the intrinsic high quadratic complexity of the attention mechanism, resulting in constrained exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long sequence modeling capabilities and linear computational complexity, this work innovatively proposes a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict the subsequent target states based on local temporal location clues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available.

[CV-55] To Impute or Not: Recommendations for Multibiometric Fusion

Link: https://arxiv.org/abs/2408.07883
Authors: Melissa R Dale, Elliot Singer, Bengt J. Borgström, Arun Ross
Keywords: Combining match scores, improving recognition accuracy, Combining match, recognition accuracy, well-established approach
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Proc. of IEEE International Workshop on Information Forensics and Security (WIFS), (Nuremberg, Germany), December 2023

Abstract:Combining match scores from different biometric systems via fusion is a well-established approach to improving recognition accuracy. However, missing scores can degrade performance as well as limit the possible fusion techniques that can be applied. Imputation is a promising technique in multibiometric systems for replacing missing data. In this paper, we evaluate various score imputation approaches on three multimodal biometric score datasets, viz. NIST BSSR1, BIOCOP2008, and MIT LL Trimodal, and investigate the factors which might influence the effectiveness of imputation. Our studies reveal three key observations: (1) Imputation is preferable over not imputing missing scores, even when the fusion rule does not require complete score data. (2) Balancing the classes in the training data is crucial to mitigate negative biases in the imputation technique towards the under-represented class, even if it involves dropping a substantial number of score vectors. (3) Multivariate imputation approaches seem to be beneficial when scores between modalities are correlated, while univariate approaches seem to benefit scenarios where scores between modalities are less correlated.
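
To make the idea of score imputation before fusion concrete, here is a minimal sketch of one univariate approach (per-modality mean imputation) followed by simple sum fusion. It is only an illustration: the helper names `impute_column_means` and `sum_fusion`, the toy scores, and the assumption that scores are already normalized to a common range are not taken from the paper, which evaluates several imputation approaches.

```python
def impute_column_means(scores):
    """Replace missing scores (None) with the mean of the observed scores
    of the same modality (column). Hypothetical univariate imputer."""
    n_cols = len(scores[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in scores if row[j] is not None]
        col_means.append(sum(observed) / len(observed))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in scores
    ]

def sum_fusion(row):
    """Fuse one complete score vector by summing its modality scores."""
    return sum(row)

# Rows are comparisons, columns are modalities (e.g. face, fingerprint).
raw_scores = [
    [0.9, 0.8],
    [0.2, None],   # second modality missing
    [None, 0.4],   # first modality missing
]

completed = impute_column_means(raw_scores)
fused = [sum_fusion(row) for row in completed]
print(completed)
print(fused)
```

A multivariate imputer would instead predict each missing score from the other modalities of the same row, which is where correlation between modalities starts to matter.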

[CV-56] Continuous Perception Benchmark

Link: https://arxiv.org/abs/2408.07867
Authors: Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy
Keywords: process visual signals, perceive and process, Humans continuously perceive, visual signals, key frames
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Humans continuously perceive and process visual signals. However, current video models typically either sample key frames sparsely or divide videos into chunks and densely sample within each chunk. This approach stems from the fact that most existing video benchmarks can be addressed by analyzing key frames or aggregating information from separate chunks. We anticipate that the next generation of vision models will emulate human perception by processing visual input continuously and holistically. To facilitate the development of such models, we propose the Continuous Perception Benchmark, a video question answering task that cannot be solved by focusing solely on a few frames or by captioning small chunks and then summarizing using language models. Extensive experiments demonstrate that existing models, whether commercial or open-source, struggle with these tasks, indicating the need for new technical advancements in this direction.

[CV-57] Learned Single-Pass Multitasking Perceptual Graphics for Immersive Displays

Link: https://arxiv.org/abs/2408.07836
Authors: Doğa Yılmaz, Towaki Takikawa, Duygu Ceylan, Kaan Akşit
Keywords: utilizing emerging perceptual, delivering perceptually realistic, Immersive displays, advancing rapidly, rapidly in terms
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)

Abstract:Immersive displays are advancing rapidly in terms of delivering perceptually realistic images by utilizing emerging perceptual graphics methods such as foveated rendering. In practice, multiple such methods need to be performed sequentially for enhanced perceived quality. However, the limited power and computational resources of the devices that drive immersive displays make it challenging to deploy multiple perceptual models simultaneously. We address this challenge by proposing a computationally-lightweight, text-guided, learned multitasking perceptual graphics model. Given RGB input images, our model outputs perceptually enhanced images by performing one or more perceptual tasks described by the provided text prompts. Our model supports a variety of perceptual tasks, including foveated rendering, dynamic range enhancement, image denoising, and chromostereopsis, through multitask learning. Uniquely, a single inference step of our model supports different permutations of these perceptual tasks at different prompted rates (i.e., mildly, lightly), eliminating the need for daisy-chaining multiple models to get the desired perceptual effect. We train our model on our new dataset of source and perceptually enhanced images, and their corresponding text prompts. We evaluate our model’s performance on embedded platforms and validate the perceptual quality of our model through a user study. Our method achieves on-par quality with the state-of-the-art task-specific methods using a single inference step, while offering faster inference speeds and flexibility to blend effects at various intensities.

[CV-58] Language Driven Slice Discovery and Error Rectification

Link: https://arxiv.org/abs/2408.07832
Authors: Shantanu Ghosh, Chenyu Wang, Kayhan Batmanghelich
Keywords: discovery associates structured, associates structured patterns, slice discovery associates, discover error slices, Error slice discovery
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

[CV-59] Space-scale Exploration of the Poor Reliability of Deep Learning Models: the Case of the Remote Sensing of Rooftop Photovoltaic Systems

Link: https://arxiv.org/abs/2408.07828
Authors: Gabriel Kasmi, Laurent Dubus, Yves-Marie Saint Drenan, Philippe Blanc
Keywords: energy grows rapidly, deep learning models, deep learning, learning models, grows rapidly
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 13 figures, 5 tables, manuscript submitted to Environmental Data Science

Abstract:Photovoltaic (PV) energy grows rapidly and is crucial for the decarbonization of electric systems. However, centralized registries recording the technical characteristics of rooftop PV systems are often missing, making it difficult to accurately monitor this growth. The lack of monitoring could threaten the integration of PV energy into the grid. To avoid this situation, the remote sensing of rooftop PV systems using deep learning emerged as a promising solution. However, existing techniques are not reliable enough to be used by public authorities or transmission system operators (TSOs) to construct up-to-date statistics on the rooftop PV fleet. The lack of reliability comes from the fact that deep learning models are sensitive to distribution shifts. This work proposes a comprehensive evaluation of the effects of distribution shifts on the classification accuracy of deep learning models trained to detect rooftop PV panels on overhead imagery. We construct a benchmark to isolate the sources of distribution shift and introduce a novel methodology that leverages explainable artificial intelligence (XAI) and decomposition of the input image and model’s decision in terms of scales to understand how distribution shifts affect deep learning models. Finally, based on our analysis, we introduce a data augmentation technique meant to improve the robustness of deep learning classifiers to varying acquisition conditions. We show that our proposed approach outperforms competing methods. We discuss some practical recommendations for mapping PV systems using overhead imagery and deep learning models.

[CV-60] SSRFlow: Semantic-aware Fusion with Spatial Temporal Re-embedding for Real-world Scene Flow

[CV-61] Regularized Contrastive Partial Multi-view Outlier Detection

Link: https://arxiv.org/abs/2408.07819
Authors: Yijia Wang, Qianqian Xu, Yangbangyan Jiang, Siran Dai, Qingming Huang
Keywords: multi-view outlier detection, recent years, advanced significantly, aiming to identify, MVOD
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Proceedings of the 32nd ACM International Conference on Multimedia

Abstract:In recent years, multi-view outlier detection (MVOD) methods have advanced significantly, aiming to identify outliers within multi-view datasets. A key point is to better detect class outliers and class-attribute outliers, which only exist in multi-view data. However, existing methods either fail to reduce the impact of outliers when learning view-consistent information, or struggle in cases with varying neighborhood structures. Moreover, most of them do not apply to partial multi-view data in real-world scenarios. To overcome these drawbacks, we propose a novel method named Regularized Contrastive Partial Multi-view Outlier Detection (RCPMOD). In this framework, we utilize contrastive learning to learn view-consistent information and distinguish outliers by the degree of consistency. Specifically, we propose (1) an outlier-aware contrastive loss with a potential outlier memory bank to eliminate their bias, motivated by a theoretical analysis; (2) a neighbor alignment contrastive loss to capture the view-shared local structural correlation; and (3) a spreading regularization loss to prevent the model from overfitting to outliers. With the Cross-view Relation Transfer technique, we can easily impute the missing view samples based on the features of neighbors. Experimental results on four benchmark datasets demonstrate that our proposed approach can outperform state-of-the-art competitors under different settings.

[CV-62] Algebraic Representations for Faster Predictions in Convolutional Neural Networks

Link: https://arxiv.org/abs/2408.07815
Authors: Johnny Joyce, Jan Verschelde
Keywords: Convolutional neural networks, Convolutional neural, computer vision, deep neural network, popular choice
Categories: Computer Vision and Pattern Recognition (cs.CV); Symbolic Computation (cs.SC)
Comments: Accepted for publication in the proceedings of the 27th International Workshop on Computer Algebra in Scientific Computing (CASC 2024)

Abstract:Convolutional neural networks (CNNs) are a popular choice of model for tasks in computer vision. When CNNs are made with many layers, resulting in a deep neural network, skip connections may be added to create an easier gradient optimization problem while retaining model expressiveness. In this paper, we show that arbitrarily complex, trained, linear CNNs with skip connections can be simplified into a single-layer model, resulting in greatly reduced computational requirements during prediction time. We also present a method for training nonlinear models with skip connections that are gradually removed throughout training, giving the benefits of skip connections without requiring computational overhead during prediction time. These results are demonstrated with practical examples on Residual Networks (ResNet) architecture.
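
The claim that a linear CNN with skip connections collapses into a single layer can be checked on a toy case. The sketch below is not the paper's algebraic method: it assumes 1D single-channel signals, circular padding, no biases, and no activation functions, and the names `circ_conv`, `residual_stack`, and `collapse` are hypothetical. Under those assumptions each residual layer is linear and shift-invariant, so the whole stack equals one convolution with its impulse response.

```python
import random

def circ_conv(x, k):
    """Circular 1D convolution: y[i] = sum_j k[j] * x[(i - j) mod n]."""
    n = len(x)
    return [sum(k[j] * x[(i - j) % n] for j in range(len(k))) for i in range(n)]

def residual_stack(x, kernels):
    """Apply linear residual layers y = x + conv(x, k), one after another."""
    for k in kernels:
        conv = circ_conv(x, k)
        x = [xi + ci for xi, ci in zip(x, conv)]
    return x

def collapse(kernels, n):
    """Return one length-n kernel equivalent to the whole stack.

    The stack is linear and shift-invariant, so pushing a unit impulse
    through it yields the kernel of the equivalent single layer.
    """
    impulse = [1.0] + [0.0] * (n - 1)
    return residual_stack(impulse, kernels)

rng = random.Random(0)
signal = [rng.uniform(-1, 1) for _ in range(16)]
kernels = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)]

deep = residual_stack(signal, kernels)                        # four layers
shallow = circ_conv(signal, collapse(kernels, len(signal)))   # one layer
max_gap = max(abs(a - b) for a, b in zip(deep, shallow))
print(f"max difference between deep and collapsed output: {max_gap:.2e}")
```

The collapsed kernel is computed once after training, so prediction costs a single convolution regardless of how many residual layers the original stack had.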

[CV-63] An Efficient and Explanatory Image and Text Clustering System with Multimodal Autoencoder Architecture

[CV-64] Cropper: Vision-Language Model for Image Cropping through In-Context Learning

Link: https://arxiv.org/abs/2408.07790
Authors: Seung Hyun Lee, Junjie Ke, Yinxiao Li, Junfeng He, Steven Hickson, Katie Datsenko, Sangpil Kim, Ming-Hsuan Yang, Irfan Essa, Feng Yang
Keywords: identify visually appealing, visually appealing crops, identify visually, visually appealing, cropping
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The goal of image cropping is to identify visually appealing crops within an image. Conventional methods rely on specialized architectures trained on specific datasets, which struggle to be adapted to new requirements. Recent breakthroughs in large vision-language models (VLMs) have enabled visual in-context learning without explicit training. However, effective strategies for vision downstream tasks with VLMs remain largely unclear and underexplored. In this paper, we propose an effective approach to leverage VLMs for better image cropping. First, we propose an efficient prompt retrieval mechanism for image cropping to automate the selection of in-context examples. Second, we introduce an iterative refinement strategy to iteratively enhance the predicted crops. The proposed framework, named Cropper, is applicable to a wide range of cropping tasks, including free-form cropping, subject-aware cropping, and aspect ratio-aware cropping. Extensive experiments and a user study demonstrate that Cropper significantly outperforms state-of-the-art methods across several benchmarks.

[CV-65] NeuroPapyri: A Deep Attention Embedding Network for Handwritten Papyri Retrieval

Link: https://arxiv.org/abs/2408.07785
Authors: Giuseppe De Gregorio, Simon Perrin, Rodrigo C. G. Pena, Isabelle Marthot-Santaniello, Harold Mouchère
Keywords: advancing historical research, machine learning approaches, machine learning, intersection of computer, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Abstract:The intersection of computer vision and machine learning has emerged as a promising avenue for advancing historical research, facilitating a more profound exploration of our past. However, the application of machine learning approaches in historical palaeography is often met with criticism due to their perceived "black box" nature. In response to this challenge, we introduce NeuroPapyri, an innovative deep learning-based model specifically designed for the analysis of images containing ancient Greek papyri. To address concerns related to transparency and interpretability, the model incorporates an attention mechanism. This attention mechanism not only enhances the model’s performance but also provides a visual representation of the image regions that significantly contribute to the decision-making process. Specifically calibrated for processing images of papyrus documents with lines of handwritten text, the model utilizes individual attention maps to inform the presence or absence of specific characters in the input image. This paper presents the NeuroPapyri model, including its architecture and training methodology. Results from the evaluation demonstrate NeuroPapyri’s efficacy in document retrieval, showcasing its potential to advance the analysis of historical manuscripts.

[CV-66] A Guide to Similarity Measures

Link: https://arxiv.org/abs/2408.07706
Authors: Avivit Levy, B. Riva Shalom, Michal Chalamish
Keywords: data science application, Similarity measures play, science application domains, Similarity measures, play a central
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 27 pages

Abstract:Similarity measures play a central role in various data science application domains for a wide assortment of tasks. This guide describes a comprehensive set of prevalent similarity measures to serve both non-experts and professionals. Non-experts who wish to understand the motivation for a measure as well as how to use it may find a friendly and detailed exposition of the formulas of the measures, whereas experts may find a glimpse of the principles of designing similarity measures and ideas for a better way to measure similarity for their desired task in a given application domain.

[CV-67] What Color Scheme is More Effective in Assisting Readers to Locate Information in a Color-Coded Article?

Link: https://arxiv.org/abs/2408.06494
Authors: Ho Yin Ng, Zeyu He, Ting-Hao ‘Kenneth’ Huang
Keywords: human cognitive activities, aiding human cognitive, Large Language Models, cluster information types, assigning specific colors
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

[CV-68] Rethinking Medical Anomaly Detection in Brain MRI: An Image Quality Assessment Perspective

Link: https://arxiv.org/abs/2408.08228
Authors: Zixuan Pan, Jun Xia, Zheyu Yan, Guoyue Xu, Yawen Wu, Zhenge Jia, Jianxu Chen, Yiyu Shi
Keywords: brain MRI, image quality assessment, perform anomaly detection, Reconstruction-based methods, Structural Similarity Index
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Reconstruction-based methods, particularly those leveraging autoencoders, have been widely adopted to perform anomaly detection in brain MRI. While most existing works try to improve detection accuracy by proposing new model structures or algorithms, we tackle the problem through image quality assessment, an underexplored perspective in the field. We propose a fusion quality loss function that combines Structural Similarity Index Measure loss with l1 loss, offering a more comprehensive evaluation of reconstruction quality. Additionally, we introduce a data pre-processing strategy that enhances the average intensity ratio (AIR) between normal and abnormal regions, further improving the distinction of anomalies. By fusing the aforementioned two methods, we devise the image quality assessment (IQA) approach. The proposed IQA approach achieves significant improvements (10%) in terms of Dice coefficient (DICE) and Area Under the Precision-Recall Curve (AUPRC) on the BraTS21 (T2, FLAIR) and MSULB datasets when compared with state-of-the-art methods. These results highlight the importance of invoking the comprehensive image quality assessment in medical anomaly detection and provide a new perspective for future research in this field.
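
A loss that combines an SSIM term with an l1 term, as the abstract above describes, might look roughly like the following sketch. Several things here are assumptions rather than the paper's definition: SSIM is computed once over the whole image instead of over local windows, images are flat lists of intensities in [0, 1], and the mixing parameter `weight` and the function names are hypothetical.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over two equally sized flat images (simplified)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Standard SSIM stabilizing constants.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def l1_loss(x, y):
    """Mean absolute error between two flat images."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def fusion_loss(x, y, weight=0.5):
    """Weighted sum of (1 - SSIM) and l1; `weight` is an assumed parameter."""
    return weight * (1.0 - global_ssim(x, y)) + (1.0 - weight) * l1_loss(x, y)

image = [0.1, 0.5, 0.9, 0.3]
perfect_reconstruction = list(image)
poor_reconstruction = [0.1, 0.5, 0.2, 0.3]

print(fusion_loss(image, perfect_reconstruction))
print(fusion_loss(image, poor_reconstruction))
```

In a reconstruction-based detector, regions where this loss stays high after training on healthy scans would be flagged as candidate anomalies.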

[CV-69] Learned Multimodal Compression for Autonomous Driving

Link: https://arxiv.org/abs/2408.08211
Authors: Hadi Hadizadeh, Ivan V. Bajić
Keywords: Autonomous driving sensors, driving sensors generate, Autonomous driving, amount of data, sensors generate
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, IEEE MMSP 2024

Abstract:Autonomous driving sensors generate an enormous amount of data. In this paper, we explore learned multimodal compression for autonomous driving, specifically targeted at 3D object detection. We focus on camera and LiDAR modalities and explore several coding approaches. One approach involves joint coding of fused modalities, while others involve coding one modality first, followed by conditional coding of the other modality. We evaluate the performance of these coding schemes on the nuScenes dataset. Our experimental results indicate that joint coding of fused modalities yields better results compared to the alternatives.

[CV-70] PI-Att: Topology Attention for Segmentation Networks through Adaptive Persistence Image Representation

Link: https://arxiv.org/abs/2408.08038
Authors: Mehmet Bahadir Erden, Sinan Unver, Ilke Ali Gurses, Rustu Turkay, Cigdem Gunduz-Demir
Keywords: Segmenting multiple objects, Segmenting multiple, multiple objects, simultaneously quantifies, quantifies the shape
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Segmenting multiple objects (e.g., organs) in medical images often requires an understanding of their topology, which simultaneously quantifies the shape of the objects and their positions relative to each other. This understanding is important for segmentation networks to generalize better with limited training data, which is common in medical image analysis. However, many popular networks were trained to optimize only pixel-wise performance, ignoring the topological correctness of the segmentation. In this paper, we introduce a new topology-aware loss function, which we call PI-Att, that explicitly forces the network to minimize the topological dissimilarity between the ground truth and prediction maps. We quantify the topology of each map by the persistence image representation, for the first time in the context of a segmentation network loss. Besides, we propose a new mechanism to adaptively calculate the persistence image at the end of each epoch based on the network’s performance. This adaptive calculation enables the network to learn topology outline in the first epochs, and then topology details towards the end of training. The effectiveness of the proposed PI-Att loss is demonstrated on two different datasets for aorta and great vessel segmentation in computed tomography images.

[CV-71] Conditional Brownian Bridge Diffusion Model for VHR SAR to Optical Image Translation


[CV-72] MobileMEF: Fast and Efficient Method for Multi-Exposure Fusion

Link: https://arxiv.org/abs/2408.07932
Authors: Lucas Nedel Kirsten, Zhicheng Fu, Nikhil Ambha Madhusudhana
Keywords: Recent advances, imaging technology, technology have enabled, Recent, high-quality images
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Recent advances in camera design and imaging technology have enabled the capture of high-quality images using smartphones. However, due to the limited dynamic range of digital cameras, the quality of photographs captured in environments with highly imbalanced lighting often results in poor-quality images. To address this issue, most devices capture multi-exposure frames and then use some multi-exposure fusion method to merge those frames into a final fused image. Nevertheless, most traditional and current deep learning approaches are unsuitable for real-time applications on mobile devices due to their heavy computational and memory requirements. We propose a new method for multi-exposure fusion based on an encoder-decoder deep learning architecture with efficient building blocks tailored for mobile devices. This efficient design makes our model capable of processing 4K resolution images in less than 2 seconds on mid-range smartphones. Our method outperforms state-of-the-art techniques regarding full-reference quality measures and computational efficiency (runtime and memory usage), making it ideal for real-time applications on hardware-constrained devices. Our code is available at: this https URL.

[CV-73] Deep Joint Denoising and Detection for Enhanced Intracellular Particle Analysis

Link: https://arxiv.org/abs/2408.07903
Authors: Yao Yao, Ihor Smal, Ilya Grigoriev, Anna Akhmanova, Erik Meijering
Keywords: Reliable analysis, intracellular dynamic processes, images requires complete, detection, particle detection
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures, 4 tables

Abstract:Reliable analysis of intracellular dynamic processes in time-lapse fluorescence microscopy images requires complete and accurate tracking of all small particles in all time frames of the image sequences. A fundamental first step towards this goal is particle detection. Given the small size of the particles, their detection is greatly affected by image noise. Recent studies have shown that applying image denoising as a preprocessing step indeed improves particle detection and their subsequent tracking. Deep learning based particle detection methods have shown superior results compared to traditional detection methods. However, they do not explicitly aim to remove noise from the images to facilitate detection. Thus we hypothesize that their performance could be further improved. In this paper, we propose a new deep neural network, called DENODET (denoising-detection network), which performs image denoising and particle detection simultaneously. We show that integrative denoising and detection yields more accurate detection results. Our method achieves superior results compared to state-of-the-art particle detection methods on the particle tracking challenge dataset and our own real fluorescence microscopy image data.

[CV-74] A Novel Generative Artificial Intelligence Method for Interference Study on Multiplex Brightfield Immunohistochemistry Images

Link: https://arxiv.org/abs/2408.07860
Authors: Satarupa Mukherjee, Jim Martin, Yao Nie
Keywords: multiple consecutive slides, single biomarker labeling, brightfield imaging offers, simultaneously analyzing multiple, single slide
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multiplex brightfield imaging offers the advantage of simultaneously analyzing multiple biomarkers on a single slide, as opposed to single biomarker labeling on multiple consecutive slides. To accurately analyze multiple biomarkers localized at the same cellular compartment, two representative biomarker sets were selected as assay models - cMET-PDL1-EGFR and CD8-LAG3-PDL1, where all three biomarkers can co-localize on the cell membrane. One of the most crucial preliminary stages for analyzing such assay is identifying each unique chromogen on individual cells. This is a challenging problem due to the co-localization of membrane stains from all the three biomarkers. It requires advanced color unmixing for creating the equivalent singleplex images from each triplex image for each biomarker. In this project, we developed a cycle-Generative Adversarial Network (cycle-GAN) method for unmixing the triplex images generated from the above-mentioned assays. Three different models were designed to generate the singleplex image for each of the three stains Tamra (purple), QM-Dabsyl (yellow) and Green. A notable novelty of our approach was that the input to the network were images in the optical density domain instead of conventionally used RGB images. The use of the optical density domain helped in reducing the blurriness of the synthetic singleplex images, which was often observed when the network was trained on RGB images. The cycle-GAN models were validated on 10,800 lung, gastric and colon images for the cMET-PDL1-EGFR assay and 3600 colon images for the CD8-LAG3-PDL1 assay. Visual as well as quantified assessments demonstrated that the proposed method is effective and efficient when compared with the manual reviewing results and is readily applicable to various multiplex assays. 

[CV-75] Perspectives: Comparison of Deep Learning Segmentation Models on Biophysical and Biomedical Data

Link: https://arxiv.org/abs/2408.07786
Authors: J Shepard Bryan IV, Meyam Tavakoli, Steve Presse
Keywords: Deep learning based, learning based approaches, tasks including image, including image segmentation, feature selection
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Biological Physics (physics.bio-ph)

Abstract:Deep learning based approaches are now widely used across biophysics to help automate a variety of tasks including image segmentation, feature selection, and deconvolution. However, the presence of multiple competing deep learning architectures, each with its own unique advantages and disadvantages, makes it challenging to select an architecture best suited for a specific application. As such, we present a comprehensive comparison of common models. Here, we focus on the task of segmentation assuming the typically small training dataset sizes available from biophysics experiments and compare the following four commonly used architectures: convolutional neural networks, U-Nets, vision transformers, and vision state space models. In doing so, we establish criteria for determining optimal conditions under which each model excels, thereby offering practical guidelines for researchers and practitioners in the field.

Machine Learning

[LG-0] Can Large Language Models Understand Symbolic Graphics Programs?

Link: https://arxiv.org/abs/2408.08313
Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
Keywords: symbolic graphics programs, Assessing the capabilities, symbolic graphics, graphics content, graphics
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report v1 (44 pages, 23 figures, project page: this https URL )

[LG-1] Understanding the Local Geometry of Generative Model Manifolds


[LG-2] Benchmarking the Capabilities of Large Language Models in Transportation System Engineering: Accuracy, Consistency, and Reasoning Behaviors

Link: https://arxiv.org/abs/2408.08302
Authors: Usman Syed, Ethan Light, Xingang Guo, Huan Zhang, Lianhui Qin, Yanfeng Ouyang, Bin Hu
Keywords: large language models, transportation engineering problems, selected undergraduate-level transportation, undergraduate-level transportation engineering, transportation engineering
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

[LG-3] HELP: Hierarchical Embeddings-based Log Parsing

Link: https://arxiv.org/abs/2408.08300
Authors: Andy Xu, Arno Gau
Keywords: Log, failure diagnosis, first-hand source, source of information, information for software
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Logs are a first-hand source of information for software maintenance and failure diagnosis. Log parsing, which converts semi-structured log messages into structured templates, is a prerequisite for automated log analysis tasks such as anomaly detection, troubleshooting, and root cause analysis. However, existing log parsers fail in real-world systems for three main reasons. First, traditional heuristics-based parsers require handcrafted features and domain knowledge, which are difficult to generalize at scale. Second, existing large language model-based parsers rely on periodic offline processing, limiting their effectiveness in real-time use cases. Third, existing online parsing algorithms are susceptible to log drift, where slight log changes create false positives that drown out real anomalies. To address these challenges, we propose HELP, a Hierarchical Embeddings-based Log Parser. HELP is the first online semantic-based parser to leverage LLMs for performant and cost-effective log parsing. We achieve this through a novel hierarchical embeddings module, which fine-tunes a text embedding model to cluster logs before parsing, reducing querying costs by multiple orders of magnitude. To combat log drift, we also develop an iterative rebalancing module, which periodically updates existing log groupings. We evaluate HELP extensively on 14 public large-scale datasets, showing that HELP achieves significantly higher F1-weighted grouping and parsing accuracy than current state-of-the-art online log parsers. We also implement HELP into Iudex’s production observability platform, confirming HELP’s practicality in a production environment. Our results show that HELP is effective and efficient for high-throughput real-world log parsing.

[LG-4] SLCA: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training ICCV23


[LG-5] Absence of Closed-Form Descriptions for Gradient Flow in Two-Layer Narrow Networks

Link: https://arxiv.org/abs/2408.08286
Authors: Yeachan Park
Keywords: intricate training dynamics, neural networks poses, training dynamics, dynamics, network training dynamics
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:In the field of machine learning, comprehending the intricate training dynamics of neural networks poses a significant challenge. This paper explores the training dynamics of neural networks, particularly whether these dynamics can be expressed in a general closed-form solution. We demonstrate that the dynamics of the gradient flow in two-layer narrow networks is not an integrable system. Integrable systems are characterized by trajectories confined to submanifolds defined by level sets of first integrals (invariants), facilitating predictable and reducible dynamics. In contrast, non-integrable systems exhibit complex behaviors that are difficult to predict. To establish the non-integrability, we employ differential Galois theory, which focuses on the solvability of linear differential equations. We demonstrate that under mild conditions, the identity component of the differential Galois group of the variational equations of the gradient flow is non-solvable. This result confirms the system’s non-integrability and implies that the training dynamics cannot be represented by Liouvillian functions, precluding a closed-form solution for describing these dynamics. Our findings highlight the necessity of employing numerical methods to tackle optimization problems within neural networks. The results contribute to a deeper understanding of neural network training dynamics and their implications for machine learning optimization strategies.

[LG-6] Autonomous Behavior Planning For Humanoid Loco-manipulation Through Grounded Language Model IROS2024


[LG-7] BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts

Link: https://arxiv.org/abs/2408.08274
Authors: Qizhen Zhang, Nikolas Gritsch, Dwaraknath Gnaneshwar, Simon Guo, David Cairuz, Bharat Venkitesh, Jakob Foerster, Phil Blunsom, Sebastian Ruder, Ahmet Ustun, Acyr Locatelli
Keywords: large language models, language models due, large language, Experts, attention
Categories: Machine Learning (cs.LG)

Abstract:The Mixture of Experts (MoE) framework has become a popular architecture for large language models due to its superior performance over dense models. However, training MoEs from scratch in a large-scale regime is prohibitively expensive. Existing methods mitigate this by pre-training multiple dense expert models independently and using them to initialize an MoE. This is done by using experts’ feed-forward network (FFN) to initialize the MoE’s experts while merging other parameters. However, this method limits the reuse of dense model parameters to only the FFN layers, thereby constraining the advantages when “upcycling” these models into MoEs. We propose BAM (Branch-Attend-Mix), a simple yet effective method that addresses this shortcoming. BAM makes full use of specialized dense models by not only using their FFN to initialize the MoE layers but also leveraging experts’ attention parameters fully by initializing them into a soft-variant of Mixture of Attention (MoA) layers. We explore two methods for upcycling attention parameters: 1) initializing separate attention experts from dense models including all attention parameters for the best model performance; and 2) sharing key and value parameters across all experts to facilitate for better inference efficiency. To further improve efficiency, we adopt a parallel attention transformer architecture to MoEs, which allows the attention experts and FFN experts to be computed concurrently. Our experiments on seed models ranging from 590 million to 2 billion parameters demonstrate that BAM surpasses baselines in both perplexity and downstream task performance, within the same computational and data constraints.

[LG-8] Is Knowledge Power? On the (Im)possibility of Learning from Strategic Interaction

Link: https://arxiv.org/abs/2408.08272
Authors: Nivasini Ananthakrishnan, Nika Haghtalab, Chara Podimata, Kunhe Yang
Keywords: Stackelberg, achieved absent, key question, game, achieve her Stackelberg
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:When learning in strategic environments, a key question is whether agents can overcome uncertainty about their preferences to achieve outcomes they could have achieved absent any uncertainty. Can they do this solely through interactions with each other? We focus this question on the ability of agents to attain the value of their Stackelberg optimal strategy and study the impact of information asymmetry. We study repeated interactions in fully strategic environments where players’ actions are decided based on learning algorithms that take into account their observed histories and knowledge of the game. We study the pure Nash equilibria (PNE) of a meta-game where players choose these algorithms as their actions. We demonstrate that if one player has perfect knowledge about the game, then any initial informational gap persists. That is, while there is always a PNE in which the informed agent achieves her Stackelberg value, there is a game where no PNE of the meta-game allows the partially informed player to achieve her Stackelberg value. On the other hand, if both players start with some uncertainty about the game, the quality of information alone does not determine which agent can achieve her Stackelberg value. In this case, the concept of information asymmetry becomes nuanced and depends on the game’s structure. Overall, our findings suggest that repeated strategic interactions alone cannot facilitate learning effectively enough to earn an uninformed player her Stackelberg value.

[LG-9] InVAErt networks for amortized inference and identifiability analysis of lumped parameter hemodynamic models


[LG-10] GSVD-NMF: Recovering Missing Features in Non-negative Matrix Factorization

Link: https://arxiv.org/abs/2408.08260
Authors: Youdong Guo, Timothy E. Holy
Keywords: Non-negative matrix factorization, separate mixed sources, Non-negative matrix, important tool, tool in signal
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Non-negative matrix factorization (NMF) is an important tool in signal processing and widely used to separate mixed sources into their components. However, NMF is NP-hard and thus may fail to discover the ideal factorization; moreover, the number of components may not be known in advance and thus features may be missed or incompletely separated. To recover missing components from under-complete NMF, we introduce GSVD-NMF, which proposes new components based on the generalized singular value decomposition (GSVD) between preliminary NMF results and the SVD of the original matrix. Simulation and experimental results demonstrate that GSVD-NMF often recovers missing features from under-complete NMF and helps NMF achieve better local optima.

[LG-11] Snuffy: Efficient Whole Slide Image Classifier ECCV2024


[LG-12] Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding


[LG-13] A Conflicts-free Speed-lossless KAN-based Reinforcement Learning Decision System for Interactive Driving in Roundabouts


[LG-14] Explaining an Agent's Future Beliefs through Temporally Decomposing Future Reward Estimators ECAI2024


[LG-15] Enhancing Sharpness-Aware Minimization by Learning Perturbation Radius KDD2024 ECML

Link: https://arxiv.org/abs/2408.08222
Authors: Xuehao Wang, Weisen Jiang, Shuai Fu, Yu Zhang
Keywords: perturbation radius, improve model generalization, SAM, SAM update consists, loss landscape
Categories: Machine Learning (cs.LG)
Comments: Accepted by ECML PKDD 2024

Abstract:Sharpness-aware minimization (SAM) aims to improve model generalization by searching for flat minima in the loss landscape. The SAM update consists of one step for computing the perturbation and the other for computing the update gradient. Within the two steps, the choice of the perturbation radius is crucial to the performance of SAM, but finding an appropriate perturbation radius is challenging. In this paper, we propose a bilevel optimization framework called LEarning the perTurbation radiuS (LETS) to learn the perturbation radius for sharpness-aware minimization algorithms. Specifically, in the proposed LETS method, the upper-level problem aims at seeking a good perturbation radius by minimizing the squared generalization gap between the training and validation losses, while the lower-level problem is the SAM optimization problem. Moreover, the LETS method can be combined with any variant of SAM. Experimental results on various architectures and benchmark datasets in computer vision and natural language processing demonstrate the effectiveness of the proposed LETS method in improving the performance of SAM.
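
The two-step SAM update mentioned in the abstract (compute a perturbation, then compute the update gradient at the perturbed point) can be sketched on a toy quadratic loss. This is a minimal illustration with a fixed perturbation radius `rho`; it does not implement LETS, which learns that radius. The loss, the learning rate, and all names here are assumptions chosen for the example.

```python
import numpy as np

# Toy quadratic loss 0.5 * sum(a_i * w_i^2); the curvatures are arbitrary.
A_DIAG = np.array([1.0, 10.0])

def loss(w):
    return 0.5 * np.sum(A_DIAG * w ** 2)

def grad(w):
    return A_DIAG * w

def sam_step(w, lr=0.1, rho=0.05):
    """One SAM update with a fixed (not learned) perturbation radius rho."""
    g = grad(w)
    # Step 1: perturb the weights along the normalised gradient direction.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Step 2: take the gradient at the perturbed point and apply it at w.
    g_sam = grad(w + eps)
    return w - lr * g_sam

w = np.array([1.0, 1.0])
for _ in range(50):
    w = sam_step(w)
print(loss(w))
```

Because the gradient is always evaluated a distance `rho` away from the current weights, the iterates settle in a small neighbourhood of the minimum rather than exactly on it, which is one way to see why the choice of radius matters.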

[LG-16] RED-CT: A Systems Design Methodology for Using LLM-labeled Data to Train and Deploy Edge Classifiers for Computational Social Science

Link: https://arxiv.org/abs/2408.08217
Authors: David Farr, Nico Manzonelli, Iain Cruickshank, Jevin West
Keywords: Large language models, classify unstructured natural, unstructured natural language, Large language, natural language data
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)

Abstract:Large language models (LLMs) have enhanced our ability to rapidly analyze and classify unstructured natural language data. However, concerns regarding cost, network limitations, and security constraints have posed challenges for their integration into work processes. In this study, we adopt a systems design approach to employing LLMs as imperfect data annotators for downstream supervised learning tasks, introducing novel system intervention measures aimed at improving classification performance. Our methodology outperforms LLM-generated labels in seven of eight tests, demonstrating an effective strategy for incorporating LLMs into the design and deployment of specialized, supervised learning models present in many industry use cases.

[LG-17] Moving Healthcare AI-Support Systems for Visually Detectable Diseases onto Constrained Devices


[LG-18] Federated Fairness Analytics: Quantifying Fairness in Federated Learning


[LG-19] Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models

Link: https://arxiv.org/abs/2408.08210
Authors: Javier González, Aditya V. Nori
Keywords: resemble human thinking, solve complex problems, Recent advances, large language models, human thinking
Categories: Machine Learning (cs.LG)

Abstract:Recent advances in AI have been significantly driven by the capabilities of large language models (LLMs) to solve complex problems in ways that resemble human thinking. However, there is an ongoing debate about the extent to which LLMs are capable of actual reasoning. Central to this debate are two key probabilistic concepts that are essential for connecting causes to their effects: the probability of necessity (PN) and the probability of sufficiency (PS). This paper introduces a framework that is both theoretical and practical, aimed at assessing how effectively LLMs are able to replicate real-world reasoning mechanisms using these probabilistic measures. By viewing LLMs as abstract machines that process information through a natural language interface, we examine the conditions under which it is possible to compute suitable approximations of PN and PS. Our research marks an important step towards gaining a deeper understanding of when LLMs are capable of reasoning, as illustrated by a series of math examples.
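
For readers unfamiliar with PN and PS, the sketch below computes both from two conditional probabilities. It uses the textbook identities that hold under exogeneity and monotonicity, PN = (P(y|x) - P(y|x')) / P(y|x) and PS = (P(y|x) - P(y|x')) / (1 - P(y|x')). Those assumptions, the function name, and the example numbers are choices made here for illustration and are not taken from the paper, whose framework may estimate these quantities differently.

```python
def probabilities_of_causation(p_y_given_x, p_y_given_not_x):
    """Return (PN, PS) assuming exogeneity and monotonicity.

    p_y_given_x:     P(Y=1 | X=1)
    p_y_given_not_x: P(Y=1 | X=0)
    """
    if not (0.0 < p_y_given_x <= 1.0 and 0.0 <= p_y_given_not_x < 1.0):
        raise ValueError("probabilities out of range for these formulas")
    if p_y_given_x < p_y_given_not_x:
        raise ValueError("monotonicity requires P(y|x) >= P(y|x')")
    diff = p_y_given_x - p_y_given_not_x
    pn = diff / p_y_given_x              # probability of necessity
    ps = diff / (1.0 - p_y_given_not_x)  # probability of sufficiency
    return pn, ps

# Hypothetical numbers: the outcome occurs 80% of the time with the
# cause present and 20% of the time without it.
pn, ps = probabilities_of_causation(0.8, 0.2)
print(round(pn, 3), round(ps, 3))
```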

[LG-20] Stochastic Semi-Gradient Descent for Learning Mean Field Games with Population-Aware Function Approximation

Link: https://arxiv.org/abs/2408.08192
Authors: Chenyu Zhang, Xu Chen, Xuan Di
Keywords: large-population multi-agent system, population distribution, model the interactions, large-population multi-agent, multi-agent system
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Optimization and Control (math.OC)

Abstract:Mean field games (MFGs) model the interactions within a large-population multi-agent system using the population distribution. Traditional learning methods for MFGs are based on fixed-point iteration (FPI), which calculates best responses and induced population distribution separately and sequentially. However, FPI-type methods suffer from inefficiency and instability, due to oscillations caused by the forward-backward procedure. This paper considers an online learning method for MFGs, where an agent updates its policy and population estimates simultaneously and fully asynchronously, resulting in a simple stochastic gradient descent (SGD) type method called SemiSGD. Not only does SemiSGD exhibit numerical stability and efficiency, but it also provides a novel perspective by treating the value function and population distribution as a unified parameter. We theoretically show that SemiSGD directs this unified parameter along a descent direction to the mean field equilibrium. Motivated by this perspective, we develop a linear function approximation (LFA) for both the value function and the population distribution, resulting in the first population-aware LFA for MFGs on continuous state-action space. Finite-time convergence and approximation error analysis are provided for SemiSGD equipped with population-aware LFA.

[LG-21] Not Every Image is Worth a Thousand Words: Quantifying Originality in Stable Diffusion ICML2024


[LG-22] Machine learning empowered Modulation detection for OFDM-based signals

Link: https://arxiv.org/abs/2408.08179
Authors: Ali Pourranjbar, Georges Kaddoum, Verdier Assoume Mba, Sahil Garg, Satinder Singh
Keywords: ML-based modulation detection, blind ML-based modulation, modulation detection, blind ML-based, modulation
Categories: Machine Learning (cs.LG)

Abstract:We propose a blind ML-based modulation detection for OFDM-based technologies. Unlike previous works that assume an ideal environment with precise knowledge of subcarrier count and cyclic prefix location, we consider blind modulation detection while accounting for realistic environmental parameters and imperfections. Our approach employs a ResNet network to simultaneously detect the modulation type and accurately locate the cyclic prefix. Specifically, after eliminating the environmental impact from the signal and accurately extracting the OFDM symbols, we convert these symbols into scatter plots. Due to their unique shapes, these scatter plots are then classified using ResNet. As a result, our proposed modulation classification method can be applied to any OFDM-based technology without prior knowledge of the transmitted signal. We evaluate its performance across various modulation schemes and subcarrier numbers. Simulation results show that our method achieves a modulation detection accuracy exceeding 80% at an SNR of 10 dB and 95% at an SNR of 25 dB.

[LG-23] Towards flexible perception with visual memory


[LG-24] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

Link: https://arxiv.org/abs/2408.08152
Authors: Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, Chong Ruan
Keywords: open-source language model, language model designed, inference processes, optimizing both training, training and inference
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

[LG-25] P/D-Serve: Serving Disaggregated Large Language Model at Scale

Link: https://arxiv.org/abs/2408.08147
Authors: Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, Yipeng Ma, Yicheng Shan, Zhengfan Yuan, Cailong Li, Yajing Sun, Tiandeng Wu, Xing Chu, Ruizhi Huan, Li Ma, Xiao You, Wenting Zhou, Yunpeng Ye, Wen Liu, Xiangkun Xu, Yongsheng Zhang, Tiantian Dong, Jiawei Zhu, Zhe Wang, Xijian Ju, Jianxun Song, Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang
Keywords: Serving disaggregated large, faces multiple challenges, disaggregated large language, Serving disaggregated, large language models
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)

[LG-26] Impact of Comprehensive Data Preprocessing on Predictive Modelling of COVID-19 Mortality

Link: https://arxiv.org/abs/2408.08142
Authors: Sangita Das, Subhrajyoti Maji
Keywords: crucial for analysing, mortality trends, mortality, test RMSE, test R-squared
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 5 figures, 2 tables

Abstract:Accurate predictive models are crucial for analysing COVID-19 mortality trends. This study evaluates the impact of a custom data preprocessing pipeline on ten machine learning models predicting COVID-19 mortality using data from Our World in Data (OWID). Our pipeline differs from a standard preprocessing pipeline through four key steps. Firstly, it transforms weekly reported totals into daily updates, correcting reporting biases and providing more accurate estimates. Secondly, it uses localised outlier detection and processing to preserve data variance and enhance accuracy. Thirdly, it utilises computational dependencies among columns to ensure data consistency. Finally, it incorporates an iterative feature selection process to optimise the feature set and improve model performance. Results show a significant improvement with the custom pipeline: the MLP Regressor achieved a test RMSE of 66.556 and a test R-squared of 0.991, surpassing the DecisionTree Regressor from the standard pipeline, which had a test RMSE of 222.858 and a test R-squared of 0.817. These findings highlight the importance of tailored preprocessing techniques in enhancing predictive modelling accuracy for COVID-19 mortality. Although specific to this study, these methodologies offer valuable insights into diverse datasets and domains, improving predictive performance across various contexts.

[LG-27] Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attribution Explainability

Link: https://arxiv.org/abs/2408.08137
Authors: Joakim Edin, Andreas Geert Motzfeldt, Casper L. Christensen, Tuukka Ruotsalo, Lars Maaløe, Maria Maistro
Keywords: Deep neural network, Deep neural, neural network predictions, AOPC, notoriously difficult
Categories: Machine Learning (cs.LG)

Abstract:Deep neural network predictions are notoriously difficult to interpret. Feature attribution methods aim to explain these predictions by identifying the contribution of each input feature. Faithfulness, often evaluated using the area over the perturbation curve (AOPC), reflects feature attributions’ accuracy in describing the internal mechanisms of deep neural networks. However, many studies rely on AOPC to compare faithfulness across different models, which we show can lead to false conclusions about models’ faithfulness. Specifically, we find that AOPC is sensitive to variations in the model, resulting in unreliable cross-model comparisons. Moreover, AOPC scores are difficult to interpret in isolation without knowing the model-specific lower and upper limits. To address these issues, we propose a normalization approach, Normalized AOPC (NAOPC), enabling consistent cross-model evaluations and more meaningful interpretation of individual scores. Our experiments demonstrate that this normalization can radically change AOPC results, questioning the conclusions of earlier studies and offering a more robust framework for assessing feature attribution faithfulness.
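
To make the idea of model-specific lower and upper limits concrete, the sketch below computes AOPC for a toy linear model and rescales it by the best and worst scores over all feature orderings. The min-max rescaling, the brute-force search over permutations (feasible only for a handful of features), the zero baseline, and every name in the code are assumptions for illustration; the paper's exact definition of NAOPC may differ, and AOPC conventions vary across the literature.

```python
from itertools import permutations

import numpy as np

WEIGHTS = np.array([3.0, 1.0, 2.0])  # hypothetical toy model weights

def linear_model(x):
    return float(WEIGHTS @ x)

def aopc(model, x, order, baseline=0.0):
    """Mean drop in model output as features are removed in the given order."""
    x_pert = np.asarray(x, dtype=float).copy()
    original = model(x_pert)
    drops = []
    for idx in order:
        x_pert[idx] = baseline  # "remove" the feature
        drops.append(original - model(x_pert))
    return float(np.mean(drops))

def normalized_aopc(model, x, order):
    """Rescale AOPC to [0, 1] using the limits over all orderings."""
    scores = [aopc(model, x, p) for p in permutations(range(len(x)))]
    lower, upper = min(scores), max(scores)
    if upper == lower:
        return 0.0  # every ordering scores the same; nothing to rank
    return (aopc(model, x, order) - lower) / (upper - lower)

x = np.ones(3)
print(aopc(linear_model, x, (0, 2, 1)), normalized_aopc(linear_model, x, (0, 2, 1)))
```

On this toy model the ordering that removes the largest-weight feature first reaches the upper limit, so its normalized score is 1, while the raw AOPC value depends on the scale of the weights.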

[LG-28] EXPLAIN AGREE LEARN: Scaling Learning for Neural Probabilistic Logic

[LG-29] The Unreasonable Effectiveness of Solving Inverse Problems with Neural Networks

Link: https://arxiv.org/abs/2408.08119
Authors: Philipp Holl,Nils Thuerey
Keywords (EN): Finding model parameters, science and engineering, plasma control, essential task, task in science
Categories: Machine Learning (cs.LG)
Notes: Source code to follow soon: this https URL

Abstract:Finding model parameters from data is an essential task in science and engineering, from weather and climate forecasts to plasma control. Previous works have employed neural networks to greatly accelerate finding solutions to inverse problems. Of particular interest are end-to-end models which utilize differentiable simulations in order to backpropagate feedback from the simulated process to the network weights and enable roll-out of multiple time steps. So far, it has been assumed that, while model inference is faster than classical optimization, this comes at the cost of a decrease in solution accuracy. We show that this is generally not true. In fact, neural networks trained to learn solutions to inverse problems can find better solutions than classical optimizers even on their training set. To demonstrate this, we perform both a theoretical analysis as well as an extensive empirical evaluation on challenging problems involving local minima, chaos, and zero-gradient regions. Our findings suggest an alternative use for neural networks: rather than generalizing to new data for fast inference, they can also be used to find better solutions on known data.

[LG-30] Hearing Your Blood Sugar: Non-Invasive Glucose Measurement Through Simple Vocal Signals Transforming any Speech into a Sensor with Machine Learning

Link: https://arxiv.org/abs/2408.08109
Authors: Nihat Ahmadli,Mehmet Ali Sarsil,Onur Ergen
Keywords (EN): blood glucose levels, Effective diabetes management, management relies heavily, Effective diabetes, traditionally achieved
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: 5 figures and 5 tables. This manuscript is a pre-print to be submitted to a journal or/and a conference. arXiv admin note: substantial text overlap with arXiv:2402.13812

Abstract:Effective diabetes management relies heavily on the continuous monitoring of blood glucose levels, traditionally achieved through invasive and uncomfortable methods. While various non-invasive techniques have been explored, such as optical, microwave, and electrochemical approaches, none have effectively supplanted these invasive technologies due to issues related to complexity, accuracy, and cost. In this study, we present a transformative and straightforward method that utilizes voice analysis to predict blood glucose levels. Our research investigates the relationship between fluctuations in blood glucose and vocal characteristics, highlighting the influence of blood vessel dynamics during voice production. By applying advanced machine learning algorithms, we analyzed vocal signal variations and established a significant correlation with blood glucose levels. We developed a predictive model using artificial intelligence, based on voice recordings and corresponding glucose measurements from participants, utilizing logistic regression and Ridge regularization. Our findings indicate that voice analysis may serve as a viable non-invasive alternative for glucose monitoring. This innovative approach not only has the potential to streamline and reduce the costs associated with diabetes management but also aims to enhance the quality of life for individuals living with diabetes by providing a painless and user-friendly method for monitoring blood sugar levels.

[LG-31] Adaptation of uncertainty-penalized Bayesian information criterion for parametric partial differential equation discovery

Link: https://arxiv.org/abs/2408.08106
Authors: Pongpisit Thanasutives,Ken-ichi Fukui
Keywords (EN): partial differential equations, deriving governing physics, Data-driven discovery, partial differential, promising approach
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Notes: 17 pages, 10 figures

Abstract:Data-driven discovery of partial differential equations (PDEs) has emerged as a promising approach for deriving governing physics when domain knowledge about observed data is limited. Despite recent progress, the identification of governing equations and their parametric dependencies using conventional information criteria remains challenging in noisy situations, as the criteria tend to select overly complex PDEs. In this paper, we introduce an extension of the uncertainty-penalized Bayesian information criterion (UBIC), which is adapted to solve parametric PDE discovery problems efficiently without requiring computationally expensive PDE simulations. This extended UBIC uses quantified PDE uncertainty over different temporal or spatial points to prevent overfitting in model selection. The UBIC is computed with data transformation based on power spectral densities to discover the governing parametric PDE that truly captures qualitative features in frequency space with a few significant terms and their parametric dependencies (i.e., the varying PDE coefficients), evaluated with confidence intervals. Numerical experiments on canonical PDEs demonstrate that our extended UBIC can identify the true number of terms and their varying coefficients accurately, even in the presence of noise. The code is available at this https URL.

[LG-32] An Efficient Replay for Class-Incremental Learning with Pre-trained Models

[LG-33] Independent Policy Mirror Descent for Markov Potential Games: Scaling to Large Number of Players

Link: https://arxiv.org/abs/2408.08075
Authors: Pragnya Alatur,Anas Barakat,Niao He
Keywords (EN): Markov Potential Games, Markov Potential, Markov games, Potential Games, model multi-agent reinforcement
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Notes: 16 pages, CDC 2024

Abstract:Markov Potential Games (MPGs) form an important sub-class of Markov games, which are a common framework to model multi-agent reinforcement learning problems. In particular, MPGs include as a special case the identical-interest setting where all the agents share the same reward function. Scaling the performance of Nash equilibrium learning algorithms to a large number of agents is crucial for multi-agent systems. To address this important challenge, we focus on the independent learning setting where agents can only have access to their local information to update their own policy. In prior work on MPGs, the iteration complexity for obtaining $\epsilon$-Nash regret scales linearly with the number of agents $N$. In this work, we investigate the iteration complexity of an independent policy mirror descent (PMD) algorithm for MPGs. We show that PMD with KL regularization, also known as natural policy gradient, enjoys a better $\sqrt{N}$ dependence on the number of agents, improving over PMD with Euclidean regularization and prior work. Furthermore, the iteration complexity is also independent of the sizes of the agents’ action spaces.

[LG-34] A Survey on Integrated Sensing Communication and Computation

[LG-35] Extracting Sentence Embeddings from Pretrained Transformer Models

Link: https://arxiv.org/abs/2408.08073
Authors: Lukas Stankevičius,Mantas Lukoševičius
Keywords (EN): Pre-trained transformer models, Pre-trained transformer, natural language processing, transformer models shine, expected to bear
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

[LG-36] Universality of Real Minimal Complexity Reservoir

Link: https://arxiv.org/abs/2408.08071
Authors: Robert Simon Fong,Boyu Li,Peter Tiňo
Keywords (EN): non-trainable input layer, recurrent neural networks, dynamically coupled reservoir, static readout layer, subclass of recurrent
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Notes: 19 pages, 5 figures

Abstract:Reservoir Computing (RC) models, a subclass of recurrent neural networks, are distinguished by their fixed, non-trainable input layer and dynamically coupled reservoir, with only the static readout layer being trained. This design circumvents the issues associated with backpropagating error signals through time, thereby enhancing both stability and training efficiency. RC models have been successfully applied across a broad range of application domains. Crucially, they have been demonstrated to be universal approximators of time-invariant dynamic filters with fading memory, under various settings of approximation norms and input driving sources. Simple Cycle Reservoirs (SCR) represent a specialized class of RC models with a highly constrained reservoir architecture, characterized by uniform ring connectivity and binary input-to-reservoir weights with an aperiodic sign pattern. For linear reservoirs, given the reservoir size, the reservoir construction has only one degree of freedom – the reservoir cycle weight. Such architectures are particularly amenable to hardware implementations without significant performance degradation in many practical tasks. In this study we endow these observations with solid theoretical foundations by proving that SCRs operating in real domain are universal approximators of time-invariant dynamic filters with fading memory. Our results supplement recent research showing that SCRs in the complex domain can approximate, to arbitrary precision, any unrestricted linear reservoir with a non-linear readout. We furthermore introduce a novel method to drastically reduce the number of SCR units, making such highly constrained architectures natural candidates for low-complexity hardware implementations. Our findings are supported by empirical studies on real-world time series datasets. 
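
The constrained architecture described above can be illustrated with a minimal sketch of a linear Simple Cycle Reservoir, assuming a scalar input and zero initial state. The function names `scr_step` and `run_scr`, the reservoir size, the cycle weight, and the sign pattern below are hypothetical choices for illustration, not values from the paper. The trained readout layer is omitted.

```python
def scr_step(state, u, cycle_weight, input_scale, signs):
    """One linear reservoir update with uniform ring connectivity.

    Unit i receives the previous state of unit i-1 (wrapping around),
    scaled by the single cycle weight, plus the input scaled by a
    weight of fixed magnitude whose sign comes from `signs`.
    """
    n = len(state)
    return [cycle_weight * state[(i - 1) % n] + signs[i] * input_scale * u
            for i in range(n)]


def run_scr(inputs, n_units, cycle_weight, input_scale, signs):
    """Drive the reservoir with a scalar input sequence; return all states."""
    state = [0.0] * n_units
    states = []
    for u in inputs:
        state = scr_step(state, u, cycle_weight, input_scale, signs)
        states.append(state)
    return states


# Hypothetical configuration: 4 units, cycle weight 0.5, unit input scale.
signs = [1, -1, 1, 1]          # illustrative sign pattern (assumption)
impulse = [1.0, 0.0, 0.0]      # a single input pulse followed by silence
states = run_scr(impulse, n_units=4, cycle_weight=0.5,
                 input_scale=1.0, signs=signs)
```

With a cycle weight below one in magnitude, the response to the pulse rotates around the ring and shrinks by that factor at every step, which is a simple picture of fading memory. The cycle weight is the only free parameter of the reservoir once its size and sign pattern are fixed.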

[LG-37] Maximally Permissive Reward Machines ECAI

[LG-38] Navigating Data Scarcity using Foundation Models: A Benchmark of Few-Shot and Zero-Shot Learning Approaches in Medical Imaging MICCAI2024

[LG-39] DATTA: Towards Diversity Adaptive Test-Time Adaptation in Dynamic Wild World

Link: https://arxiv.org/abs/2408.08056
Authors: Chuyang Ye,Dongyan Wei,Zhendong Liu,Yuanyi Pang,Yixi Lin,Jiarong Liao,Qinting Jiang,Xianghua Fu,Qing Li,Jingyan Jiang
Keywords (EN): addresses distribution shifts, Adaptive Test-Time Adaptation, Diversity Adaptive, Diversity Adaptive Test-Time, diversity
Categories: Machine Learning (cs.LG)
Notes: 16 pages, 2 figures

Abstract:Test-time adaptation (TTA) effectively addresses distribution shifts between training and testing data by adjusting models on test samples, which is crucial for improving model inference in real-world applications. However, traditional TTA methods typically follow a fixed pattern to address the dynamic data patterns (low-diversity or high-diversity patterns), often leading to performance degradation and consequently a decline in Quality of Experience (QoE). The primary issues we observed are: Different scenarios require different normalization methods (e.g., Instance Normalization is optimal in mixed domains but not in static domains). Model fine-tuning can potentially harm the model and waste time. Hence, it is crucial to design strategies for effectively measuring and managing distribution diversity to minimize its negative impact on model performance. Based on these observations, this paper proposes a new general method, named Diversity Adaptive Test-Time Adaptation (DATTA), aimed at improving QoE. DATTA dynamically selects the best batch normalization methods and fine-tuning strategies by leveraging the Diversity Score to differentiate between high and low diversity score batches. It features three key components: Diversity Discrimination (DD) to assess batch diversity, Diversity Adaptive Batch Normalization (DABN) to tailor normalization methods based on DD insights, and Diversity Adaptive Fine-Tuning (DAFT) to selectively fine-tune the model. Experimental results show that our method achieves up to a 21% increase in accuracy compared to state-of-the-art methodologies, indicating that our method maintains good model performance while demonstrating its robustness. Our code will be released soon.

[LG-40] COTODE: COntinuous Trajectory neural Ordinary Differential Equations for modelling event sequences

[LG-41] An Efficient Continuous Control Perspective for Reinforcement-Learning-based Sequential Recommendation

Link: https://arxiv.org/abs/2408.08047
Authors: Jun Wang,Likang Wu,Qi Liu,Yu Yang
Keywords (EN): sequential historical behaviors, sequential historical, historical behaviors, recommender systems, dynamically inferred
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Notes:

Abstract:Sequential recommendation, where user preference is dynamically inferred from sequential historical behaviors, is a critical task in recommender systems (RSs). To further optimize long-term user engagement, offline reinforcement-learning-based RSs have become a mainstream technique as they provide an additional advantage in avoiding global explorations that may harm online users’ experiences. However, previous studies mainly focus on discrete action and policy spaces, which might have difficulties in handling dramatically growing items efficiently. To mitigate this issue, in this paper, we aim to design an algorithmic framework applicable to continuous policies. To facilitate the control in the low-dimensional but dense user preference space, we propose an Efficient Continuous Control framework (ECoC). Based on a statistically tested assumption, we first propose the novel unified action representation abstracted from normalized user and item spaces. Then, we develop the corresponding policy evaluation and policy improvement procedures. During this process, strategic exploration and directional control in terms of unified actions are carefully designed and crucial to final recommendation decisions. Moreover, benefiting from unified actions, the conservatism regularization for policies and value functions are combined and perfectly compatible with the continuous framework. The resulting dual regularization ensures the successful offline training of RL-based recommendation policies. Finally, we conduct extensive experiments to validate the effectiveness of our framework. The results show that compared to the discrete baselines, our ECoC is trained far more efficiently. Meanwhile, the final policies outperform baselines in both capturing the offline data and gaining long-term rewards.

[LG-42] The Clever Hans Effect in Unsupervised Learning

[LG-43] Adaptive User Journeys in Pharma E-Commerce with Reinforcement Learning: Insights from SwipeRx KDD KDD2024

[LG-44] Causal Discovery from Time-Series Data with Short-Term Invariance-Based Convolutional Neural Networks

[LG-45] Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization

[LG-46] Asteroid: Resource-Efficient Hybrid Pipeline Parallelism for Collaborative DNN Training on Heterogeneous Edge Devices

[LG-47] Inversion-DeepONet: A Novel DeepONet-Based Network with Encoder-Decoder for Full Waveform Inversion

Link: https://arxiv.org/abs/2408.08005
Authors: Zekai Guo,Lihui Chai,Shengjun Huang,Ye Li
Keywords (EN): Full waveform inversion, Full waveform, plays a crucial, field of geophysics, crucial role
Categories: Machine Learning (cs.LG)
Notes:

Abstract:Full waveform inversion (FWI) plays a crucial role in the field of geophysics. There has been lots of research about applying deep learning (DL) methods to FWI. The success of DL-FWI relies significantly on the quantity and diversity of the datasets. Nevertheless, existing FWI datasets, like OpenFWI, where sources have fixed locations or identical frequencies, provide limited information and do not represent the complex real-world scene. For instance, low frequencies help in resolving larger-scale structures. High frequencies allow for more detailed subsurface features. We consider that simultaneously using sources with different frequencies, instead of performing inversion using low-frequency data and then gradually introducing higher-frequency data, has rationale and potential advantages. Hence, we develop three enhanced datasets based on OpenFWI where each source has varying locations, frequencies or both. Moreover, we propose a novel deep operator network (DeepONet) architecture Inversion-DeepONet for FWI. We utilize a convolutional neural network (CNN) to extract the features from seismic data in the branch net. Source parameters, such as locations and frequencies, are fed to the trunk net. Then another CNN is employed as the decoder of DeepONet to reconstruct the velocity models more effectively. Through experiments, we confirm the superior performance on accuracy and generalization ability of our network, compared with existing data-driven FWI methods.

[LG-48] Experimental evaluation of offline reinforcement learning for HVAC control in buildings

Link: https://arxiv.org/abs/2408.07986
Authors: Jun Wang,Linyan Li,Qi Liu,Yu Yang
Keywords (EN): dynamic HVAC control, dynamic HVAC, HVAC control, HVAC, RL-based HVAC
Categories: Machine Learning (cs.LG)
Notes:

Abstract:Reinforcement learning (RL) techniques have been increasingly investigated for dynamic HVAC control in buildings. However, most studies focus on exploring solutions in online or off-policy scenarios without discussing in detail the implementation feasibility or effectiveness of dealing with purely offline datasets or trajectories. The lack of these works limits the real-world deployment of RL-based HVAC controllers, especially considering the abundance of historical data. To this end, this paper comprehensively evaluates the strengths and limitations of state-of-the-art offline RL algorithms by conducting analytical and numerical studies. The analysis is conducted from two perspectives: algorithms and dataset characteristics. As a prerequisite, the necessity of applying offline RL algorithms is first confirmed in two building environments. The ability of observation history modeling to reduce violations and enhance performance is subsequently studied. Next, the performance of RL-based controllers under datasets with different qualitative and quantitative conditions is investigated, including constraint satisfaction and power consumption. Finally, the sensitivity of certain hyperparameters is also evaluated. The results indicate that datasets of a certain suboptimality level and relatively small scale can be utilized to effectively train a well-performing RL-based HVAC controller. Specifically, such controllers can reduce violation ratios of indoor temperatures by up to 28.5% and achieve up to 12.1% power savings compared to the baseline controller. In summary, this paper presents our well-structured investigations and new findings when applying offline reinforcement learning to building HVAC systems.

[LG-49] Analytical Uncertainty-Based Loss Weighting in Multi-Task Learning

[LG-50] Coupling without Communication and Drafter-Invariant Speculative Decoding

Link: https://arxiv.org/abs/2408.07978
Authors: Majid Daliri,Christopher Musco,Ananda Theertha Suresh
Keywords (EN): Suppose Alice, Alice and Bob, Bob, Alice, Suppose
Categories: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 16 pages

[LG-51] Addressing Skewed Heterogeneity via Federated Prototype Rectification with Personalization

Link: https://arxiv.org/abs/2408.07966
Authors: Shunxin Guo,Hongsong Wang,Shuxia Lin,Zhiqiang Kou,Xin Geng
Keywords (EN): efficient framework designed, facilitate collaborative model, collaborative model training, Federated learning, user data privacy
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes:

Abstract:Federated learning is an efficient framework designed to facilitate collaborative model training across multiple distributed devices while preserving user data privacy. A significant challenge of federated learning is data-level heterogeneity, i.e., skewed or long-tailed distribution of private data. Although various methods have been proposed to address this challenge, most of them assume that the underlying global data is uniformly distributed across all clients. This paper investigates data-level heterogeneity federated learning with a brief review and redefines a more practical and challenging setting called Skewed Heterogeneous Federated Learning (SHFL). Accordingly, we propose a novel Federated Prototype Rectification with Personalization which consists of two parts: Federated Personalization and Federated Prototype Rectification. The former aims to construct balanced decision boundaries between dominant and minority classes based on private data, while the latter exploits both inter-class discrimination and intra-class consistency to rectify empirical prototypes. Experiments on three popular benchmarks show that the proposed approach outperforms current state-of-the-art methods and achieves balanced performance in both personalization and generalization.

[LG-52] Meta SAC-Lag: Towards Deployable Safe Reinforcement Learning via MetaGradient-based Hyperparameter Tuning IROS

[LG-53] RandomNet: Clustering Time Series Using Untrained Deep Neural Networks

[LG-54] A Single Channel-Based Neonatal Sleep-Wake Classification using Hjorth Parameters and Improved Gradient Boosting

Link: https://arxiv.org/abs/2408.07925
Authors: Muhammad Arslan,Muhammad Mubeen,Saadullah Farooq Abbasi,Muhammad Shahbaz Khan,Wadii Boulila,Jawad Ahmad
Keywords (EN): Intensive Care Unit, Neonatal Intensive Care, plays a crucial, crucial role, Care Unit
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Notes: 8 pages, 5 figures, 3 tables, International Polydisciplinary Conference on Artificial Intelligence and New Technologies

Abstract:Sleep plays a crucial role in neonatal development. Monitoring the sleep patterns in neonates in a Neonatal Intensive Care Unit (NICU) is imperative for understanding the maturation process. While polysomnography (PSG) is considered the best practice for sleep classification, its expense and reliance on human annotation pose challenges. Existing research often relies on multichannel EEG signals; however, concerns arise regarding the vulnerability of neonates and the potential impact on their sleep quality. This paper introduces a novel approach to neonatal sleep stage classification using a single-channel gradient boosting algorithm with Hjorth features. The gradient boosting parameters are fine-tuned using random search cross-validation (randomsearchCV), achieving an accuracy of 82.35% for neonatal sleep-wake classification. Validation is conducted through 5-fold cross-validation. The proposed algorithm not only enhances existing neonatal sleep algorithms but also opens avenues for broader applications.
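
For readers unfamiliar with Hjorth features, the following is a minimal sketch of the three standard Hjorth parameters (activity, mobility, complexity) computed from a single-channel signal using first differences. It is not the authors' pipeline: the function names and the synthetic sine-wave input are assumptions for illustration, and the gradient boosting and random search steps are omitted.

```python
import math


def variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def diff(values):
    """First difference, a discrete stand-in for the time derivative."""
    return [b - a for a, b in zip(values, values[1:])]


def hjorth_parameters(signal):
    """Return (activity, mobility, complexity) for one signal segment.

    activity   = var(x)
    mobility   = sqrt(var(x') / var(x))
    complexity = mobility(x') / mobility(x)
    """
    d1 = diff(signal)
    d2 = diff(d1)
    activity = variance(signal)
    var_d1 = variance(d1)
    if activity == 0.0 or var_d1 == 0.0:
        raise ValueError("signal is too flat for Hjorth mobility/complexity")
    mobility = math.sqrt(var_d1 / activity)
    complexity = math.sqrt(variance(d2) / var_d1) / mobility
    return activity, mobility, complexity


# Synthetic stand-in for one EEG epoch: 10 full sine cycles in 1000 samples.
n_samples, n_cycles = 1000, 10
epoch = [math.sin(2 * math.pi * n_cycles * k / n_samples)
         for k in range(n_samples)]
activity, mobility, complexity = hjorth_parameters(epoch)
```

For a pure sinusoid the complexity is close to 1, and it grows as the waveform departs from a sine shape. In a classifier such as the one the abstract describes, these three numbers per epoch would form the feature vector.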

[LG-55] A Deep Features-Based Approach Using Modified ResNet50 and Gradient Boosting for Visual Sentiments Classification

[LG-56] Physics-Informed Neural Network for Predicting Out-of-Training-Range TCAD Solution with Minimized Domain Expertise

Link: https://arxiv.org/abs/2408.07921
Authors: Albert Lu,Yu Foon Chau,Hiu Yung Wong
Keywords (EN): technology computer-aided design, assisting technology computer-aided, prolonged simulation time, computer-aided design, prolonged simulation
Categories: Machine Learning (cs.LG)
Notes:

Abstract:Machine learning (ML) is promising in assisting technology computer-aided design (TCAD) simulations to alleviate difficulty in convergence and prolonged simulation time. While ML is widely used in TCAD, existing approaches either require access to the internal solver, require extensive domain expertise, are only trained by terminal quantities such as currents and voltages, and/or lack out-of-training-range prediction capability. In this paper, using Si nanowire as an example, we demonstrate that it is possible to use a physics-informed neural network (PINN) to predict out-of-training-range TCAD solutions without accessing the internal solver and with minimal domain expertise. The machine not only can predict a 2.5 times larger range than the training but also can predict the inversion region by only being trained with subthreshold region data. The physics-informed module is also trained with data without the need for human-coded equations, making the approach easier to extend to more sophisticated systems.

[LG-57] CEGRL-TKGR: A Causal Enhanced Graph Representation Learning Framework for Improving Temporal Knowledge Graph Extrapolation Reasoning

[LG-58] KAN versus MLP on Irregular or Noisy Functions

[LG-59] The Nah Bandit: Modeling User Non-compliance in Recommendation Systems

Link: https://arxiv.org/abs/2408.07897
Authors: Tianyue Zhou,Jung-Hoon Cho,Cathy Wu
Keywords (EN): Recommendation systems, ranging from advertising, advertising to entertainment, Recommendation, pervade the digital
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Notes: 12 pages, 8 figures, under review

Abstract:Recommendation systems now pervade the digital world, ranging from advertising to entertainment. However, it remains challenging to implement effective recommendation systems in the physical world, such as in mobility or health. This work focuses on a key challenge: in the physical world, it is often easy for the user to opt out of taking any recommendation if it is not to her liking, and to fall back to her baseline behavior. It is thus crucial in cyber-physical recommendation systems to operate with an interaction model that is aware of such user behavior, lest the user abandon the recommendations altogether. This paper thus introduces the Nah Bandit, a tongue-in-cheek reference to describe a Bandit problem where users can say 'nah' to the recommendation and opt for their preferred option instead. As such, this problem lies in between a typical bandit setup and supervised learning. We model the user non-compliance by parameterizing an anchoring effect of recommendations on users. We then propose the Expert with Clustering (EWC) algorithm, a hierarchical approach that incorporates feedback from both recommended and non-recommended options to accelerate user preference learning. In a recommendation scenario with $N$ users, $T$ rounds per user, and $K$ clusters, EWC achieves a regret bound of $O(N\sqrt{T\log K} + NT)$, achieving superior theoretical performance in the short term compared to the LinUCB algorithm. Experimental results also highlight that EWC outperforms both supervised learning and traditional contextual bandit approaches. This advancement reveals that effective use of non-compliance feedback can accelerate preference learning and improve recommendation accuracy. This work lays the foundation for future research in Nah Bandit, providing a robust framework for more effective recommendation systems.

[LG-60] System States Forecasting of Microservices with Dynamic Spatio-Temporal Data

Link: https://arxiv.org/abs/2408.07894
Authors: Yifei Xu,Jingguo Ge,Haina Tang,Shuai Ding,Tong Li,Hui Li
Keywords (EN): Artificial Intelligence, accurately forecasting system, Artificial, Operations, forecasting system states
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Notes:

Abstract:In the AIOps (Artificial Intelligence for IT Operations) era, accurately forecasting system states is crucial. In microservices systems, this task encounters the challenge of dynamic and complex spatio-temporal relationships among microservice instances, primarily due to dynamic deployments, diverse call paths, and cascading effects among instances. Current time-series forecasting methods, which focus mainly on intrinsic patterns, are insufficient in environments where spatial relationships are critical. Similarly, spatio-temporal graph approaches often neglect the nature of temporal trends, concentrating mostly on message passing between nodes. Moreover, current research in the microservices domain frequently underestimates the importance of network metrics and topological structures in capturing the evolving dynamics of systems. This paper introduces STMformer, a model tailored for forecasting system states in microservices environments, capable of handling multi-node and multivariate time series. Our method leverages dynamic network connection data and topological information to assist in modeling the intricate spatio-temporal relationships within the system. Additionally, we integrate the PatchCrossAttention module to compute the impact of cascading effects globally. We have developed a dataset based on a microservices system and conducted comprehensive experiments with STMformer against leading methods. In both short-term and long-term forecasting tasks, our model consistently achieved an 8.6% reduction in MAE (Mean Absolute Error) and a 2.2% reduction in MSE (Mean Squared Error). The source code is available at this https URL.

[LG-61] Quantum-inspired Interpretable Deep Learning Architecture for Text Sentiment Analysis

[LG-62] IReCa: Intrinsic Reward-enhanced Context-aware Reinforcement Learning for Human-AI Coordination

[LG-63] Incremental Structure Discovery of Classification via Sequential Monte Carlo

Link: https://arxiv.org/abs/2408.07875
Authors: Changze Huang,Di Wang
Keywords (EN): Bayesian non-parametric learning, Gaussian Processes, Bayesian non-parametric, provide a powerful, non-parametric learning
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

Abstract:Gaussian Processes (GPs) provide a powerful framework for making predictions and understanding uncertainty for classification with kernels and Bayesian non-parametric learning. Building such models typically requires strong prior knowledge to preselect kernels, which could be ineffective for online applications of classification that sequentially process data because features of data may shift during the process. To alleviate the requirement of prior knowledge used in GPs and learn new features from data that arrive successively, this paper presents a novel method to automatically discover models of classification on complex data with little prior knowledge. Our method adapts a recently proposed technique for GP-based time-series structure discovery, which integrates GPs and Sequential Monte Carlo (SMC). We extend the technique to handle extra latent variables in GP classification, such that our method can effectively and adaptively learn a-priori unknown structures of classification from continuous input. In addition, our method adapts to new batches of data with updated structures of models. Our experiments show that our method is able to automatically incorporate various features of kernels on synthesized data and real-world data for classification. In the experiments on real-world data, our method outperforms various classification methods in both online and offline settings, achieving a 10% accuracy improvement on one benchmark.

[LG-64] A Systematic Evaluation of Generated Time Series and Their Effects in Self-Supervised Pretraining CIKM2024

Link: https://arxiv.org/abs/2408.07869
Authors: Audrey Der,Chin-Chia Michael Yeh,Xin Dai,Huiyuan Chen,Yan Zheng,Yujie Fan,Zhongfang Zhuang,Vivian Lai,Junpeng Wang,Liang Wang,Wei Zhang,Eamonn Keogh
Keywords (EN): Self-supervised Pretrained Models, language processing tasks, natural language processing, Pretrained Models, Self-supervised Pretrained
Categories: Machine Learning (cs.LG)
Notes: To appear in CIKM 2024 as a short paper; the version here is the self-contained version that includes the non-mandatory supplementary material available on the paper’s companion website

Abstract:Self-supervised Pretrained Models (PTMs) have demonstrated remarkable performance in computer vision and natural language processing tasks. These successes have prompted researchers to design PTMs for time series data. In our experiments, most self-supervised time series PTMs were surpassed by simple supervised models. We hypothesize this undesired phenomenon may be caused by data scarcity. In response, we test six time series generation methods, use the generated data in pretraining in lieu of the real data, and examine the effects on classification performance. Our results indicate that replacing a real-data pretraining set with a greater volume of only generated samples produces noticeable improvement.

[LG-65] CON-FOLD – Explainable Machine Learning with Confidence

[LG-66] Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability

Link: https://arxiv.org/abs/2408.07852
Authors: Jiri Hron,Laura Culp,Gamaleldin Elsayed,Rosanne Liu,Ben Adlam,Maxwell Bileschi,Bernd Bohnet,JD Co-Reyes,Noah Fiedel,C. Daniel Freeman,Izzeddin Gur,Kathleen Kenealy,Jaehoon Lee,Peter J. Liu,Gaurav Mishra,Igor Mordatch,Azade Nova,Roman Novak,Aaron Parisi,Jeffrey Pennington,Alex Rizkowsky,Isabelle Simpson,Hanie Sedghi,Jascha Sohl-dickstein,Kevin Swersky,Sharad Vikram,Tris Warkentin,Lechao Xiao,Kelvin Xu,Jasper Snoek,Simon Kornblith
Keywords: increased training budget, capabilities of language, training budget, fully understood, increased training
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at COLM 2024. 16 pages, 11 figures

[LG-67] Enhancing Equitable Access to AI in Housing and Homelessness System of Care through Federated Learning AAAI

[LG-68] SustainDC – Benchmarking for Sustainable Data Center Control NEURIPS2024

[LG-69] CarbonClipper: Optimal Algorithms for Carbon-Aware Spatiotemporal Workload Management

Link: https://arxiv.org/abs/2408.07831
Authors: Adam Lechowicz,Nicolas Christianson,Bo Sun,Noman Bashir,Mohammad Hajiesmaili,Adam Wierman,Prashant Shenoy
Keywords: growing environmental impact, carbon-aware spatiotemporal workload, spatiotemporal workload management, study carbon-aware spatiotemporal, seeks to address
Categories: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 50 pages, 21 figures

Abstract:We study carbon-aware spatiotemporal workload management, which seeks to address the growing environmental impact of data centers. We formalize this as an online problem called spatiotemporal online allocation with deadline constraints (SOAD), in which an online player completes a workload (e.g., a batch compute job) by moving and scheduling the workload across a network subject to a deadline T. At each time step, a service cost function is revealed, representing, e.g., the carbon intensity of servicing a workload at each location, and the player must irrevocably decide the current allocation. Furthermore, whenever the player moves the allocation, it incurs a movement cost defined by a metric space (X, d) that captures, e.g., the overhead of migrating a compute job. SOAD formalizes the open problem of combining general metrics and deadline constraints in the online algorithms literature, unifying problems such as metrical task systems and online search. We propose a competitive algorithm for SOAD along with a matching lower bound that proves it is optimal. Our main algorithm, CarbonClipper, is a learning-augmented algorithm that takes advantage of predictions (e.g., carbon intensity forecasts) and achieves an optimal consistency-robustness trade-off. We evaluate our proposed algorithms for carbon-aware spatiotemporal workload management on a simulated global data center network, showing that CarbonClipper significantly improves performance compared to baseline methods and delivers meaningful carbon reductions.

[LG-70] Regularized Contrastive Partial Multi-view Outlier Detection

[LG-71] Differentiating Policies for Non-Myopic Bayesian Optimization

Link: https://arxiv.org/abs/2408.07812
Authors: Darian Nwankwo,David Bindel
Keywords: acquisition function derived, acquisition functions, Bayesian optimization, points by optimizing, methods choose sample
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Bayesian optimization (BO) methods choose sample points by optimizing an acquisition function derived from a statistical model of the objective. These acquisition functions are chosen to balance sampling regions with predicted good objective values against exploring regions where the objective is uncertain. Standard acquisition functions are myopic, considering only the impact of the next sample, but non-myopic acquisition functions may be more effective. In principle, one could model the sampling by a Markov decision process, and optimally choose the next sample by maximizing an expected reward computed by dynamic programming; however, this is infeasibly expensive. More practical approaches, such as rollout, consider a parametric family of sampling policies. In this paper, we show how to efficiently estimate rollout acquisition functions and their gradients, enabling stochastic gradient-based optimization of sampling policies.
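
The myopic baseline that the abstract contrasts against can be made concrete with expected improvement (EI), a standard one-step acquisition function. The sketch below is illustrative only: it is not the rollout estimator from the paper, and the candidate names, posterior means, and standard deviations are made-up assumptions standing in for a fitted statistical model.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """EI for minimization: E[max(best - f, 0)] with f ~ N(mean, std^2)."""
    if std <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(best - mean, 0.0)
    z = (best - mean) / std
    return (best - mean) * normal_cdf(z) + std * normal_pdf(z)


# Hypothetical posterior (mean, std) at three candidate points.
best_observed = 1.0
candidates = {
    "exploit": (0.9, 0.05),  # slightly better mean, low uncertainty
    "explore": (1.2, 0.5),   # worse mean, high uncertainty
    "poor": (1.5, 0.01),     # confidently bad
}
scores = {
    name: expected_improvement(mean, std, best_observed)
    for name, (mean, std) in candidates.items()
}
chosen = max(scores, key=scores.get)
print(chosen, scores)
```

EI scores each candidate by the gain from the next sample alone, which is exactly the myopia the abstract describes; a non-myopic policy would also account for what later samples could achieve.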

[LG-72] Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference

Link: https://arxiv.org/abs/2408.07802
Authors: Rohan Baskar Prabhakar,Hengrui Zhang,David Wentlzaff
Keywords: Large Transformer networks, Large Transformer, enable new applications, networks are increasingly, settings where low
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Large Transformer networks are increasingly used in settings where low inference latency can improve the end-user experience and enable new applications. However, autoregressive inference is resource intensive and requires parallelism for efficiency. Parallelism introduces collective communication that is both expensive and represents a phase when hardware resources are underutilized. Towards mitigating this, Kraken is an evolution of the standard Transformer architecture that is designed to complement existing tensor parallelism schemes for efficient inference on multi-device systems. By introducing a fixed degree of intra-layer model parallelism, the architecture allows collective operations to be overlapped with compute, decreasing latency and increasing hardware utilization. When trained on OpenWebText, Kraken models reach a similar perplexity as standard Transformers while also preserving their language modeling capabilities when evaluated on the SuperGLUE benchmark. Importantly, when tested on multi-GPU systems using TensorRT-LLM engines, Kraken speeds up Time To First Token by a mean of 35.6% across a range of model sizes, context lengths, and degrees of tensor parallelism.

[LG-73] An Efficient and Explanatory Image and Text Clustering System with Multimodal Autoencoder Architecture

[LG-74] Knowledge-based Neural Ordinary Differential Equations for Cosserat Rod-based Soft Robots

Link: https://arxiv.org/abs/2408.07776
Authors: Tom Z. Jiahao,Ryan Adolf,Cynthia Sung,M. Ani Hsieh
Keywords: Soft robots, Soft, passive nature, advantages over rigid, compliant and passive
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 11 figures, 4 tables

Abstract:Soft robots have many advantages over rigid robots thanks to their compliant and passive nature. However, it is generally challenging to model the dynamics of soft robots due to their high spatial dimensionality, making it difficult to use model-based methods to accurately control soft robots. It often requires direct numerical simulation of partial differential equations to simulate soft robots. This not only requires an accurate numerical model, but also makes soft robot modeling slow and expensive. Deep learning algorithms have shown promise in data-driven modeling of soft robots. However, these algorithms usually require a large amount of data, which is difficult to obtain in either simulation or real-world experiments of soft robots. In this work, we propose KNODE-Cosserat, a framework that combines first-principle physics models and neural ordinary differential equations. We leverage the best from both worlds – the generalization ability of physics-based models and the fast speed of deep learning methods. We validate our framework in both simulation and real-world experiments. In both cases, we show that the robot model significantly improves over the baseline models under different metrics.

[LG-75] MedTsLLM: Leveraging LLMs for Multimodal Medical Time Series Analysis

Link: https://arxiv.org/abs/2408.07773
Authors: Nimeesha Chan,Felix Parker,William Bennett,Tianyi Wu,Mung Yao Jia,James Fackler,Kimia Ghobadi
Keywords: real-world applications pose, applications pose significant, pose significant challenges, signal processing techniques, traditional machine learning
Categories: Machine Learning (cs.LG)
Comments: published in Proceedings of Machine Learning Research, MLHC 2024

Abstract:The complexity and heterogeneity of data in many real-world applications pose significant challenges for traditional machine learning and signal processing techniques. For instance, in medicine, effective analysis of diverse physiological signals is crucial for patient monitoring and clinical decision-making and yet highly challenging. We introduce MedTsLLM, a general multimodal large language model (LLM) framework that effectively integrates time series data and rich contextual information in the form of text to analyze physiological signals, performing three tasks with clinical relevance: semantic segmentation, boundary detection, and anomaly detection in time series. These critical tasks enable deeper analysis of physiological signals and can provide actionable insights for clinicians. We utilize a reprogramming layer to align embeddings of time series patches with a pretrained LLM’s embedding space and make effective use of raw time series, in conjunction with textual context. Given the multivariate nature of medical datasets, we develop methods to handle multiple covariates. We additionally tailor the text prompt to include patient-specific information. Our model outperforms state-of-the-art baselines, including deep learning models, other LLMs, and clinical methods across multiple medical domains, specifically electrocardiograms and respiratory waveforms. MedTsLLM presents a promising step towards harnessing the power of LLMs for medical time series analysis that can elevate data-driven tools for clinicians and improve patient outcomes.

[LG-76] Out-of-Distribution Learning with Human Feedback

Link: https://arxiv.org/abs/2408.07772
Authors: Haoyue Bai,Xuefeng Du,Katie Rainey,Shibin Parameswaran,Yixuan Li
Keywords: real-world deployment environments, addressing multifaceted challenges, OOD, hindering their efficacy, deployment environments
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Out-of-distribution (OOD) learning often relies heavily on statistical approaches or predefined assumptions about OOD data distributions, hindering their efficacy in addressing multifaceted challenges of OOD generalization and OOD detection in real-world deployment environments. This paper presents a novel framework for OOD learning with human feedback, which can provide invaluable insights into the nature of OOD shifts and guide effective model adaptation. Our framework capitalizes on the freely available unlabeled data in the wild that captures the environmental test-time OOD distributions under both covariate and semantic shifts. To harness such data, our key idea is to selectively provide human feedback and label a small number of informative samples from the wild data distribution, which are then used to train a multi-class classifier and an OOD detector. By exploiting human feedback, we enhance the robustness and reliability of machine learning models, equipping them with the capability to handle OOD scenarios with greater precision. We provide theoretical insights on the generalization error bounds to justify our algorithm. Extensive experiments show the superiority of our method, outperforming the current state-of-the-art by a significant margin.

[LG-77] How to Solve Contextual Goal-Oriented Problems with Offline Datasets?

Link: https://arxiv.org/abs/2408.07753
Authors: Ying Fan,Jingling Li,Adith Swaminathan,Aditya Modi,Ching-An Cheng
Keywords: goal-Oriented Data Augmentation, Contextual goal-Oriented Data, solve Contextual Goal-Oriented, Contextual goal-Oriented, Data Augmentation
Categories: Machine Learning (cs.LG)

Abstract:We present a novel method, Contextual goal-Oriented Data Augmentation (CODA), which uses commonly available unlabeled trajectories and context-goal pairs to solve Contextual Goal-Oriented (CGO) problems. By carefully constructing an action-augmented MDP that is equivalent to the original MDP, CODA creates a fully labeled transition dataset under training contexts without additional approximation error. We conduct a novel theoretical analysis to demonstrate CODA’s capability to solve CGO problems in the offline data setup. Empirical results also showcase the effectiveness of CODA, which outperforms other baseline methods across various context-goal relationships of CGO problems. This approach offers a promising direction for solving CGO problems using offline datasets.

[LG-78] Enhancing Model Interpretability with Local Attribution over Global Exploration

[LG-79] Enhancing Adversarial Attacks via Parameter Adaptive Adversarial Attack

Link: https://arxiv.org/abs/2408.07733
Authors: Zhibo Jin,Jiayu Zhang,Zhiyu Zhu,Chenyu Zhang,Jiahao Huang,Jianlong Zhou,Fang Chen
Keywords: captured widespread attention, model parameters, fine-tuning model parameters, adversarial attacks, DSP
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:In recent times, the swift evolution of adversarial attacks has captured widespread attention, particularly concerning their transferability and other performance attributes. These techniques are primarily executed at the sample level, frequently overlooking the intrinsic parameters of models. Such neglect suggests that the perturbations introduced in adversarial samples might have the potential for further reduction. Given the essence of adversarial attacks is to impair model integrity with minimal noise on original samples, exploring avenues to maximize the utility of such perturbations is imperative. Against this backdrop, we have delved into the complexities of adversarial attack algorithms, dissecting the adversarial process into two critical phases: the Directional Supervision Process (DSP) and the Directional Optimization Process (DOP). While DSP determines the direction of updates based on the current samples and model parameters, it has been observed that existing model parameters may not always be conducive to adversarial attacks. The impact of models on adversarial efficacy is often overlooked in current research, leading to the neglect of DSP. For the first time, we propose that under certain conditions, fine-tuning model parameters can significantly improve the quality of the DSP. We provide rigorous mathematical definitions and proofs for these conditions, and introduce multiple methods for fine-tuning model parameters within DSP. Our extensive experiments substantiate the effectiveness of the proposed P3A method. Our code is accessible at: https://anonymous.4open.science/r/P3A-A12C/

[LG-80] Graph neural network surrogate for strategic transport planning

[LG-81] “Normalized Stress” is Not Normalized: How to Interpret Stress Correctly

Link: https://arxiv.org/abs/2408.07724
Authors: Kiran Smelser,Jacob Miller,Stephen Kobourov
Keywords: high dimensional data, high dimensional, optimization criteria, dimensional data, quality metrics
Categories: Machine Learning (cs.LG)

Abstract:Stress is among the most commonly employed quality metrics and optimization criteria for dimension reduction projections of high dimensional data. Complex, high dimensional data is ubiquitous across many scientific disciplines, including machine learning, biology, and the social sciences. One of the primary methods of visualizing these datasets is with two dimensional scatter plots that visually capture some properties of the data. Because visually determining the accuracy of these plots is challenging, researchers often use quality metrics to measure projection accuracy or faithfulness to the full data. One of the most commonly employed metrics, normalized stress, is sensitive to uniform scaling of the projection, despite this act not meaningfully changing anything about the projection. We investigate the effect of scaling on stress and other distance based quality metrics analytically and empirically by showing just how much the values change and how this affects dimension reduction technique evaluations. We introduce a simple technique to make normalized stress scale invariant and show that it accurately captures expected behavior on a small benchmark.
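
The scale sensitivity described in the abstract is easy to reproduce. The sketch below uses a common definition of normalized stress (sum of squared distance errors divided by the sum of squared input distances) and one natural way to remove the scale dependence: evaluating the stress at the scale factor that minimizes it. Both the toy data and the function names are assumptions for illustration, and the fix shown is not necessarily the exact technique the authors introduce.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def normalized_stress(high_d, low_d):
    """Sum of squared distance errors divided by the sum of squared input distances."""
    error = sum((d - e) ** 2 for d, e in zip(high_d, low_d))
    return error / sum(d ** 2 for d in high_d)


def scale_normalized_stress(high_d, low_d):
    """Normalized stress evaluated at the least-squares optimal scale factor."""
    # Minimizing sum((d - alpha * e)^2) over alpha gives the closed form below.
    alpha = sum(d * e for d, e in zip(high_d, low_d)) / sum(e ** 2 for e in low_d)
    return normalized_stress(high_d, [alpha * e for e in low_d])


# Toy 3D data and a 2D projection that simply drops the last coordinate.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 3.0)]
projection = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]

high = pairwise_distances(data)
low = pairwise_distances(projection)
# The same picture, uniformly enlarged by a factor of 10.
scaled = pairwise_distances([(10 * x, 10 * y) for x, y in projection])

print("normalized stress:", normalized_stress(high, low), normalized_stress(high, scaled))
print("scale-invariant:  ", scale_normalized_stress(high, low), scale_normalized_stress(high, scaled))
```

The two normalized stress values differ even though the scatter plots are visually identical up to zoom, while the scale-optimized variant agrees for both.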

[LG-82] Operator Feature Neural Network for Symbolic Regression

[LG-83] Impact of Inaccurate Contamination Ratio on Robust Unsupervised Anomaly Detection NEURIPS2024

[LG-84] An Introduction to Reinforcement Learning: Fundamental Concepts and Practical Applications

[LG-85] A Guide to Similarity Measures

[LG-86] Natural Language Outlines for Code: Literate Programming in the LLM Era

Link: https://arxiv.org/abs/2408.04820
Authors: Kensen Shi,Deniz Altınbüken,Saswat Anand,Mihai Christodorescu,Katja Grünwedel,Alexa Koenings,Sai Naidu,Anurag Pathak,Marc Rasi,Fredde Ribeiro,Brandon Ruffin,Siddhant Sanyam,Maxim Tabachnyk,Sara Toth,Roy Tu,Tobias Welp,Pengcheng Yin,Manzil Zaheer,Satish Chandra,Charles Sutton
Keywords: software development process, natural language outlines, development process, natural language, modality and interaction
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

[LG-87] Aliasing and Label-Independent Decomposition of Risk: Beyond the bias-variance trade-off

Link: https://arxiv.org/abs/2408.08294
Authors: Mark K. Transtrum,Gus L. W. Hart,Tyler J. Jarvis,Jared P. Whitehead
Keywords: potentially noisy samples, unseen inputs, potentially noisy, unknown function, predict function
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Mathematical Physics (math-ph); Machine Learning (stat.ML)

Abstract:A central problem in data science is to use potentially noisy samples of an unknown function to predict function values for unseen inputs. In classical statistics, the predictive error is understood as a trade-off between the bias and the variance that balances model simplicity with its ability to fit complex functions. However, over-parameterized models exhibit counter-intuitive behaviors, such as “double descent” in which models of increasing complexity exhibit decreasing generalization error. We introduce an alternative paradigm called the generalized aliasing decomposition. We explain the asymptotically small error of complex models as a systematic “de-aliasing” that occurs in the over-parameterized regime. In the limit of large models, the contribution due to aliasing vanishes, leaving an expression for the asymptotic total error we call the invertibility failure of very large models on few training points. Because the generalized aliasing decomposition can be explicitly calculated from the relationship between model class and samples without seeing any data labels, it can answer questions related to experimental design and model selection before collecting data or performing experiments. We demonstrate this approach using several examples, including classical regression problems and a cluster expansion model used in materials science.

[LG-88] Accurate and efficient structure elucidation from routine one-dimensional NMR spectra using multitask machine learning

Link: https://arxiv.org/abs/2408.08284
Authors: Frank Hu,Michael S. Chen,Grant M. Rotskoff,Matthew W. Kanan,Thomas E. Markland
Keywords: greatly accelerate workflows, Rapid determination, NMR spectra, greatly accelerate, accelerate workflows
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)

Abstract:Rapid determination of molecular structures can greatly accelerate workflows across many chemical disciplines. However, elucidating structure using only one-dimensional (1D) NMR spectra, the most readily accessible data, remains an extremely challenging problem because of the combinatorial explosion of the number of possible molecules as the number of constituent atoms is increased. Here, we introduce a multitask machine learning framework that predicts the molecular structure (formula and connectivity) of an unknown compound solely based on its 1D 1H and/or 13C NMR spectra. First, we show how a transformer architecture can be constructed to efficiently solve the task, traditionally performed by chemists, of assembling large numbers of molecular fragments into molecular structures. Integrating this capability with a convolutional neural network (CNN), we build an end-to-end model for predicting structure from spectra that is fast and accurate. We demonstrate the effectiveness of this framework on molecules with up to 19 heavy (non-hydrogen) atoms, a size for which there are trillions of possible structures. Without relying on any prior chemical knowledge such as the molecular formula, we show that our approach predicts the exact molecule 69.6% of the time within the first 15 predictions, reducing the search space by up to 11 orders of magnitude.

[LG-89] The Z-Gromov-Wasserstein Distance

Link: https://arxiv.org/abs/2408.08233
Authors: Martin Bauer,Facundo Mémoli,Tom Needham,Mao Nishino
Keywords: found broad applications, machine learning, powerful tool, found broad, data science
Categories: Metric Geometry (math.MG); Machine Learning (cs.LG)

Abstract:The Gromov-Wasserstein (GW) distance is a powerful tool for comparing metric measure spaces which has found broad applications in data science and machine learning. Driven by the need to analyze datasets whose objects have increasingly complex structure (such as node and edge-attributed graphs), several variants of GW distance have been introduced in the recent literature. With a view toward establishing a general framework for the theory of GW-like distances, this paper considers a vast generalization of the notion of a metric measure space: for an arbitrary metric space Z, we define a Z-network to be a measure space endowed with a kernel valued in Z. We introduce a method for comparing Z-networks by defining a generalization of GW distance, which we refer to as Z-Gromov-Wasserstein (Z-GW) distance. This construction subsumes many previously known metrics and offers a unified approach to understanding their shared properties. The paper demonstrates that the Z-GW distance defines a metric on the space of Z-networks which retains desirable properties of Z, such as separability, completeness, and geodesicity. Many of these properties were unknown for existing variants of GW distance that fall under our framework. Our focus is on foundational theory, but our results also include computable lower bounds and approximations of the distance which will be useful for practical applications.

[LG-90] Data-driven identification of latent port-Hamiltonian systems

Link: https://arxiv.org/abs/2408.08185
Authors: Johannes Rettberg,Jonas Kneifl,Julius Herb,Patrick Buchfink,Jörg Fehr,Bernard Haasdonk
Keywords: Conventional physics-based modeling, involve high effort, physics-based modeling techniques, modeling techniques involve, techniques involve high
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG)
Comments: 33 pages, 8 figures

Abstract:Conventional physics-based modeling techniques involve high effort, e.g., time and expert knowledge, while data-driven methods often lack interpretability, structure, and sometimes reliability. To mitigate this, we present a data-driven system identification framework that derives models in the port-Hamiltonian (pH) formulation. This formulation is suitable for multi-physical systems while guaranteeing the useful system theoretical properties of passivity and stability. Our framework combines linear and nonlinear reduction with structured, physics-motivated system identification. In this process, high-dimensional state data obtained from possibly nonlinear systems serves as input for an autoencoder, which then performs two tasks: (i) nonlinearly transforming and (ii) reducing this data onto a low-dimensional latent space. In this space, a linear pH system, that satisfies the pH properties per construction, is parameterized by the weights of a neural network. The mathematical requirements are met by defining the pH matrices through Cholesky factorizations. The neural networks that define the coordinate transformation and the pH system are identified in a joint optimization process to match the dynamics observed in the data while defining a linear pH system in the latent space. The learned, low-dimensional pH system can describe even nonlinear systems and is rapidly computable due to its small size. The method is exemplified by a parametric mass-spring-damper and a nonlinear pendulum example, as well as the high-dimensional model of a disc brake with linear thermoelastic behavior.
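
To see why Cholesky-style factors make the pH properties hold by construction, consider the textbook linear pH form dx/dt = (J - R) Q x without inputs, where J is skew-symmetric, R is positive semidefinite, and Q is positive definite. The sketch below is an assumption-laden illustration, not the authors' code: the abstract does not give the exact parameterization, random numbers stand in for neural network outputs, and all function names are hypothetical. It builds R and Q as L Lᵀ from unconstrained lower-triangular factors and checks that the energy H(x) = ½ xᵀ Q x never increases.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def skew_symmetric(params, n):
    """Build J = -J^T from n(n-1)/2 unconstrained parameters."""
    values = iter(params)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            v = next(values)
            mat[i][j], mat[j][i] = v, -v
    return mat


def lower_triangular(params, n, diag_shift=0.0):
    """Build a Cholesky-like factor L; a positive diag_shift keeps L invertible."""
    values = iter(params)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            v = next(values)
            mat[i][j] = abs(v) + diag_shift if i == j else v
    return mat


random.seed(0)
n = 3
n_tri = n * (n + 1) // 2
# Random draws stand in for the unconstrained outputs of a network.
J = skew_symmetric([random.gauss(0, 1) for _ in range(n * (n - 1) // 2)], n)
L_R = lower_triangular([random.gauss(0, 1) for _ in range(n_tri)], n)
L_Q = lower_triangular([random.gauss(0, 1) for _ in range(n_tri)], n, diag_shift=1e-3)
R = matmul(L_R, transpose(L_R))  # positive semidefinite for any L_R
Q = matmul(L_Q, transpose(L_Q))  # positive definite because L_Q is invertible


def energy_rate(x):
    """dH/dt along dx/dt = (J - R) Q x, where H(x) = 0.5 * x^T Q x."""
    grad = [sum(Q[i][k] * x[k] for k in range(n)) for i in range(n)]
    dx = [sum((J[i][k] - R[i][k]) * grad[k] for k in range(n)) for i in range(n)]
    return sum(g * d for g, d in zip(grad, dx))


print(energy_rate([1.0, -2.0, 0.5]))
```

Because the skew-symmetric part contributes nothing to dH/dt and R can only dissipate, any parameter values yield a passive, stable system, so an optimizer can search freely without leaving the pH class.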

[LG-91] Learned denoising with simulated and experimental low-dose CT data

Link: https://arxiv.org/abs/2408.08115
Authors: Maximilian B. Kiss,Ander Biguri,Carola-Bibiane Schönlieb,K. Joost Batenburg,Felix Lucka
Keywords: experimental noisy data, noisy data, developing machine learning, machine learning, computational imaging
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Like in many other research fields, recent developments in computational imaging have focused on developing machine learning (ML) approaches to tackle its main challenges. To improve the performance of computational imaging algorithms, machine learning methods are used for image processing tasks such as noise reduction. Generally, these ML methods heavily rely on the availability of high-quality data on which they are trained. This work explores the application of ML methods, specifically convolutional neural networks (CNNs), in the context of noise reduction for computed tomography (CT) imaging. We utilize a large 2D computed tomography dataset for machine learning to carry out for the first time a comprehensive study on the differences between the observed performances of algorithms trained on simulated noisy data and on real-world experimental noisy data. The study compares the performance of two common CNN architectures, U-Net and MSD-Net, that are trained and evaluated on both simulated and experimental noisy data. The results show that while sinogram denoising performed better with simulated noisy data if evaluated in the sinogram domain, the performance did not carry over to the reconstruction domain where training on experimental noisy data shows a higher performance in denoising experimental noisy data. Training the algorithms in an end-to-end fashion from sinogram to reconstruction significantly improved model performance, emphasizing the importance of matching raw measurement data to high-quality CT reconstructions. The study furthermore suggests the need for more sophisticated noise simulation approaches to bridge the gap between simulated and real-world data in CT image denoising applications and gives insights into the challenges and opportunities in leveraging simulated data for machine learning in computational imaging.

[LG-92] BINDy – Bayesian identification of nonlinear dynamics with reversible-jump Markov-chain Monte-Carlo

Link: https://arxiv.org/abs/2408.08062
Authors: Max D. Champneys,Timothy J. Rogers
Keywords: cognitive bias, prevent over-fitting, data-driven modelling, modelling that aids, aids interpretability
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS)

Abstract:Model parsimony is an important cognitive bias in data-driven modelling that aids interpretability and helps to prevent over-fitting. Sparse identification of nonlinear dynamics (SINDy) methods are able to learn sparse representations of complex dynamics directly from data, given a basis of library functions. In this work, a novel Bayesian treatment of dictionary learning system identification, as an alternative to SINDy, is envisaged. The proposed method – Bayesian identification of nonlinear dynamics (BINDy) – is distinct from previous approaches in that it targets the full joint posterior distribution over both the terms in the library and their parameterisation in the model. This formulation confers the advantage that an arbitrary prior may be placed over the model structure to produce models that are sparse in the model space rather than in parameter space. Because this posterior is defined over parameter vectors that can change in dimension, the inference cannot be performed by standard techniques. Instead, a Gibbs sampler based on reversible-jump Markov-chain Monte-Carlo is proposed. BINDy is shown to compare favourably to ensemble SINDy in three benchmark case-studies. In particular, it is seen that the proposed method is better able to assign high probability to correct model terms.

[LG-93] Hessian QM9: A quantum chemistry database of molecular Hessians in implicit solvents

Link: https://arxiv.org/abs/2408.08006
Authors: Nicholas J. Williams,Lara Kabalan,Ljiljana Stojanovic,Viktor Zolyomi,Edward O. Pyzer-Knapp
Keywords: potential energy surface, approximations that accelerate, methods while preserving, preserving accuracy, significant challenge
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments: 7 pages, 2 figures

Abstract:A significant challenge in computational chemistry is developing approximations that accelerate ab initio methods while preserving accuracy. Machine learning interatomic potentials (MLIPs) have emerged as a promising solution for constructing atomistic potentials that can be transferred across different molecular and crystalline systems. Most MLIPs are trained only on energies and forces in vacuum, while an improved description of the potential energy surface could be achieved by including the curvature of the potential energy surface. We present Hessian QM9, the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the ωB97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as water, tetrahydrofuran, and toluene using an implicit solvation model. To demonstrate the utility of this dataset, we show that incorporating second derivatives of the potential energy surface into the loss function of a MLIP significantly improves the prediction of vibrational frequencies in all solvent environments, thus making this dataset extremely useful for studying organic molecules in realistic solvent environments for experimental characterization.

[LG-94] Robust Offline Active Learning on Graphs

Link: https://arxiv.org/abs/2408.07941
Authors: Yuanchen Wu,Yubai Yuan
Keywords: proposed method, active learning, crucial applications, labeling node responses, real-world networks
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We consider the problem of active learning on graphs, which has crucial applications in many real-world networks where labeling node responses is expensive. In this paper, we propose an offline active learning method that selects nodes to query by explicitly incorporating information from both the network structure and node covariates. Building on graph signal recovery theories and the random spectral sparsification technique, the proposed method adopts a two-stage biased sampling strategy that takes both informativeness and representativeness into consideration for node querying. Informativeness refers to the complexity of graph signals that are learnable from the responses of queried nodes, while representativeness refers to the capacity of queried nodes to control generalization errors given noisy node-level information. We establish a theoretical relationship between generalization error and the number of nodes selected by the proposed method. Our theoretical results demonstrate the trade-off between informativeness and representativeness in active learning. Extensive numerical experiments show that the proposed method is competitive with existing graph-based active learning methods, especially when node covariates and responses contain noises. Additionally, the proposed method is applicable to both regression and classification tasks on graphs.

[LG-95] MobileMEF: Fast and Efficient Method for Multi-Exposure Fusion

[LG-96] Local Causal Discovery with Background Knowledge

Link: https://arxiv.org/abs/2408.07890
Authors: Qingyuan Zheng,Yue Liu,Yangbo He
Keywords: Causality plays, fields of study, plays a pivotal, pivotal role, causal
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Causality plays a pivotal role in various fields of study. Based on the framework of causal graphical models, previous works have proposed identifying whether a variable is a cause or non-cause of a target in every Markov equivalent graph solely by learning a local structure. However, the presence of prior knowledge, often represented as a partially known causal graph, is common in many causal modeling applications. Leveraging this prior knowledge allows for the further identification of causal relationships. In this paper, we first propose a method for learning the local structure using all types of causal background knowledge, including direct causal information, non-ancestral information and ancestral information. Then we introduce criteria for identifying causal relationships based solely on the local structure in the presence of prior knowledge. We also apply our method to fair machine learning, and experiments involving local structure learning, causal relationship identification, and fair machine learning demonstrate that our method is both effective and efficient.

[LG-97] Capturing the Complexity of Human Strategic Decision-Making with Machine Learning

Link: https://arxiv.org/abs/2408.07865
Authors: Jian-Qiao Zhu,Joshua C. Peterson,Benjamin Enke,Thomas L. Griffiths
Keywords: make decisions based, long-standing problem, strategic settings, make decisions, decisions based
Categories: General Economics (econ.GN); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:Understanding how people behave in strategic settings–where they make decisions based on their expectations about the behavior of others–is a long-standing problem in the behavioral sciences. We conduct the largest study to date of strategic decision-making in the context of initial play in two-player matrix games, analyzing over 90,000 human decisions across more than 2,400 procedurally generated games that span a much wider space than previous datasets. We show that a deep neural network trained on these data predicts people’s choices better than leading theories of strategic behavior, indicating that there is systematic variation that is not explained by those theories. We then modify the network to produce a new, interpretable behavioral model, revealing what the original network learned about people: their ability to optimally respond and their capacity to reason about others are dependent on the complexity of individual games. This context-dependence is critical in explaining deviations from the rational Nash equilibrium, response times, and uncertainty in strategic decisions. More broadly, our results demonstrate how machine learning can be applied beyond prediction to further help generate novel explanations of complex human behavior.

[LG-98] Time-inversion of spatiotemporal beam dynamics using uncertainty-aware latent evolution reversal

Link: https://arxiv.org/abs/2408.07847
Authors: Mahindra Rautela,Alan Williams,Alexander Scheinker
Keywords: challenging spatiotemporal problem, charged particle beam, Charged particle, Charged particle dynamics, influence of electromagnetic
Categories: Accelerator Physics (physics.acc-ph); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2403.13858

Abstract:Charged particle dynamics under the influence of electromagnetic fields is a challenging spatiotemporal problem. Many high performance physics-based simulators for predicting behavior in a charged particle beam are computationally expensive, limiting their utility for solving inverse problems online. The problem of estimating upstream six-dimensional phase space given downstream measurements of charged particles in an accelerator is an inverse problem of growing importance. This paper introduces a reverse Latent Evolution Model (rLEM) designed for temporal inversion of forward beam dynamics. In this two-step self-supervised deep learning framework, we utilize a Conditional Variational Autoencoder (CVAE) to project 6D phase space projections of a charged particle beam into a lower-dimensional latent distribution. Subsequently, we autoregressively learn the inverse temporal dynamics in the latent space using a Long Short-Term Memory (LSTM) network. The coupled CVAE-LSTM framework can predict 6D phase space projections across all upstream accelerating sections based on single or multiple downstream phase space measurements as inputs. The proposed model also captures the aleatoric uncertainty of the high-dimensional input data within the latent space. This uncertainty, which reflects potential uncertain measurements at a given module, is propagated through the LSTM to estimate uncertainty bounds for all upstream predictions, demonstrating the robustness of the LSTM against in-distribution variations in the input data.

[LG-99] Exploration of LLMs EEG and behavioral data to measure and support attention and sleep

[LG-100] Ranking and Combining Latent Structured Predictive Scores without Labeled Data

Link: https://arxiv.org/abs/2408.07796
Authors: Shiva Afshar,Yinghan Chen,Shizhong Han,Ying Lin
Keywords: Combining multiple predictors, Combining multiple, achieve enhanced performance, multiple predictors obtained, distributed data sources
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Combining multiple predictors obtained from distributed data sources to an accurate meta-learner is promising to achieve enhanced performance in lots of prediction problems. As the accuracy of each predictor is usually unknown, integrating the predictors to achieve better performance is challenging. Conventional ensemble learning methods assess the accuracy of predictors based on extensive labeled data. In practical applications, however, the acquisition of such labeled data can prove to be an arduous task. Furthermore, the predictors under consideration may exhibit high degrees of correlation, particularly when similar data sources or machine learning algorithms were employed during their model training. In response to these challenges, this paper introduces a novel structured unsupervised ensemble learning model (SUEL) to exploit the dependency between a set of predictors with continuous predictive scores, rank the predictors without labeled data and combine them to an ensembled score with weights. Two novel correlation-based decomposition algorithms are further proposed to estimate the SUEL model, constrained quadratic optimization (SUEL.CQO) and matrix-factorization-based (SUEL.MF) approaches. The efficacy of the proposed methods is rigorously assessed through both simulation studies and real-world application of risk genes discovery. The results compellingly demonstrate that the proposed methods can efficiently integrate the dependent predictors to an ensemble model without the need of ground truth data.

[LG-101] Data Clustering and Visualization with Recursive Goemans-Williamson MaxCut Algorithm

Link: https://arxiv.org/abs/2408.07763
Authors: An Ly,Raj Sawhney,Marina Chugunova
Keywords: Goemans-Williamson MaxCut algorithm, offering improved performance, classical Goemans-Williamson MaxCut, data clustering tasks, vectorized data clustering
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: Published in the IEEE Conference, CSCI 2023 (Winter Session)

Abstract:In this article, we introduce a novel recursive modification to the classical Goemans-Williamson MaxCut algorithm, offering improved performance in vectorized data clustering tasks. Focusing on the clustering of medical publications, we employ recursive iterations in conjunction with a dimension relaxation method to significantly enhance density of clustering results. Furthermore, we propose a unique vectorization technique for articles, leveraging conditional probabilities for more effective clustering. Our methods provide advantages in both computational efficiency and clustering accuracy, substantiated through comprehensive experiments.

[LG-102] Pretrained-Guided Conditional Diffusion Models for Microbiome Data Analysis

Link: https://arxiv.org/abs/2408.07709
Authors: Xinyuan Shi,Fangfang Zhu,Wenwen Min
Keywords: Emerging evidence, forming an inseparable, inseparable connection, intricately linked, Emerging
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)

Abstract:Emerging evidence indicates that human cancers are intricately linked to human microbiomes, forming an inseparable connection. However, due to limited sample sizes and significant data loss during collection for various reasons, some machine learning methods have been proposed to address the issue of missing data. These methods have not fully utilized the known clinical information of patients to enhance the accuracy of data imputation. Therefore, we introduce mbVDiT, a novel pre-trained conditional diffusion model for microbiome data imputation and denoising, which uses the unmasked data and patient metadata as conditional guidance for imputing missing values. It also uses a VAE to integrate other public microbiome datasets to enhance model performance. The results on the microbiome datasets from three different cancer types demonstrate the performance of our methods in comparison with existing methods.

Information Retrieval

[IR-0] DaRec: A Disentangled Alignment Framework for Large Language Model and Recommender System

Link: https://arxiv.org/abs/2408.08231
Authors: Xihong Yang,Heming Jing,Zixing Zhang,Jindong Wang,Huakang Niu,Shuaiqiang Wang,Yu Lu,Junfeng Wang,Dawei Yin,Xinwang Liu,En Zhu,Defu Lian,Erxue Min
Keywords: Large language models, strong reasoning capabilities, Large language, demonstrated remarkable performance, collaborative models
Categories: Information Retrieval (cs.IR)

Abstract:Benefiting from their strong reasoning capabilities, large language models (LLMs) have demonstrated remarkable performance in recommender systems. Various efforts have been made to distill knowledge from LLMs to enhance collaborative models, employing techniques like contrastive learning for representation alignment. In this work, we prove that directly aligning the representations of LLMs and collaborative models is sub-optimal for enhancing downstream recommendation task performance, based on the information theorem. Consequently, the challenge of effectively aligning semantic representations between collaborative models and LLMs remains unresolved. Inspired by this viewpoint, we propose a novel plug-and-play alignment framework for LLMs and collaborative models. Specifically, we first disentangle the latent representations of both LLMs and collaborative models into specific and shared components via projection layers and representation regularization. Subsequently, we perform both global and local structure alignment on the shared representations to facilitate knowledge transfer. Additionally, we theoretically prove that the specific and shared representations contain more pertinent and less irrelevant information, which can enhance the effectiveness of downstream recommendation tasks. Extensive experimental results on benchmark datasets demonstrate that our method is superior to existing state-of-the-art algorithms.

[IR-1] Modeling Domain and Feedback Transitions for Cross-Domain Sequential Recommendation

Link: https://arxiv.org/abs/2408.08209
Authors: Changshuo Zhang,Teng Shi,Xiao Zhang,Qi Liu,Ruobing Xie,Jun Xu,Ji-Rong Wen
Keywords: recommender systems encompass, user behaviors transitioning, recommender systems, systems encompass, Nowadays
Categories: Information Retrieval (cs.IR)

Abstract:Nowadays, many recommender systems encompass various domains to cater to users’ diverse needs, leading to user behaviors transitioning across different domains. In fact, user behaviors across different domains reveal changes in preference toward recommended items. For instance, a shift from negative feedback to positive feedback indicates improved user satisfaction. However, existing cross-domain sequential recommendation methods typically model user interests by focusing solely on information about domain transitions, often overlooking the valuable insights provided by users’ feedback transitions. In this paper, we propose Transition^2, a novel method to model transitions across both domains and types of user feedback. Specifically, Transition^2 introduces a transition-aware graph encoder based on user history, assigning different weights to edges according to the feedback type. This enables the graph encoder to extract historical embeddings that capture the transition information between different domains and feedback types. Subsequently, we encode the user history using a cross-transition multi-head self-attention, incorporating various masks to distinguish different types of transitions. Finally, we integrate these modules to make predictions across different domains. Experimental results on two public datasets demonstrate the effectiveness of Transition^2.

[IR-2] LLM4DSR: Leveraing Large Language Model for Denoising Sequential Recommendation

[IR-3] From Clicks to Carbon: The Environmental Toll of Recommender Systems

Link: https://arxiv.org/abs/2408.08203
Authors: Tobias Vente,Lukas Wegmeth,Alan Said,Joeran Beel
Keywords: recommender systems research, global warming soars, recommender systems, systems research, systems research papers
Categories: Information Retrieval (cs.IR)
Comments: Accepted for presentation at the 18th ACM Conference on Recommender Systems in the Reproducibility Track

Abstract:As global warming soars, evaluating the environmental impact of research is more critical now than ever before. However, we find that few to no recommender systems research papers document their impact on the environment. Consequently, in this paper, we conduct a comprehensive analysis of the environmental impact of recommender system research by reproducing a characteristic recommender systems experimental pipeline. We focus on estimating the carbon footprint of recommender systems research papers, highlighting the evolution of the environmental impact of recommender systems research experiments over time. We thoroughly evaluated all 79 full papers from the ACM RecSys conference in the years 2013 and 2023 to analyze representative experimental pipelines for papers utilizing traditional, so-called good old-fashioned AI algorithms and deep learning algorithms, respectively. We reproduced these representative experimental pipelines, measured electricity consumption using a hardware energy meter, and converted the measured energy consumption into CO2 equivalents to estimate the environmental impact. Our results show that a recommender systems research paper utilizing deep learning algorithms emits approximately 42 times more CO2 equivalents than a paper utilizing traditional algorithms. Furthermore, on average, such a paper produces 3,297 kilograms of CO2 equivalents, which is more than one person produces by flying from New York City to Melbourne or the amount one tree sequesters in 300 years.
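The conversion step mentioned in this abstract (measured electricity consumption turned into CO2 equivalents) can be sketched as below. This is only an illustration and not the authors' pipeline: the function name `co2e_kg` and the grid intensity of 400 gCO2e per kWh are hypothetical placeholders, since the abstract does not state the conversion factors used.

```python
def co2e_kg(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Convert electricity use (kWh) into kilograms of CO2 equivalents.

    intensity_g_per_kwh is the carbon intensity of the electricity grid in
    grams of CO2e per kWh. It varies by region and year, so it is an input
    here rather than a built-in constant.
    """
    if energy_kwh < 0 or intensity_g_per_kwh < 0:
        raise ValueError("inputs must be non-negative")
    return energy_kwh * intensity_g_per_kwh / 1000.0


# Hypothetical example values, not taken from the paper.
measured_kwh = 10.0      # energy reading from a hardware meter
grid_intensity = 400.0   # assumed gCO2e per kWh
print(co2e_kg(measured_kwh, grid_intensity))  # 4.0
```

Keeping the grid intensity as an explicit argument makes it easy to re-run the same measurement under different regional assumptions.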

[IR-4] KGV: Integrating Large Language Models with Knowledge Graphs for Cyber Threat Intelligence Credibility Assessment

Link: https://arxiv.org/abs/2408.08088
Authors: Zongzong Wu,Fengxiao Tang,Ming Zhao,Yufeng Li
Keywords: weaponized cyber attacks, Cyber threat intelligence, threat intelligence, Cyber threat, cyber attacks
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Abstract:Cyber threat intelligence is a critical tool that many organizations and individuals use to protect themselves from sophisticated, organized, persistent, and weaponized cyber attacks. However, few studies have focused on the quality assessment of threat intelligence provided by intelligence platforms, and this work still requires manual analysis by cybersecurity experts. In this paper, we propose a knowledge graph-based verifier, a novel Cyber Threat Intelligence (CTI) quality assessment framework that combines knowledge graphs and Large Language Models (LLMs). Our approach introduces LLMs to automatically extract OSCTI key claims to be verified and utilizes a knowledge graph consisting of paragraphs for fact-checking. This method differs from the traditional way of constructing complex knowledge graphs with entities as nodes. By constructing knowledge graphs with paragraphs as nodes and semantic similarity as edges, it effectively enhances the semantic understanding ability of the model and simplifies labeling requirements. Additionally, to fill the gap in the research field, we created and made public the first dataset for threat intelligence assessment from heterogeneous sources. To the best of our knowledge, this work is the first to create a dataset on threat intelligence reliability verification, providing a reference for future research. Experimental results show that KGV (Knowledge Graph Verifier) significantly improves the performance of LLMs in intelligence quality assessment. Compared with traditional methods, we reduce a large amount of data annotation while the model still exhibits strong reasoning capabilities. Finally, our method can achieve XXX accuracy in network threat assessment.

[IR-5] Extracting Sentence Embeddings from Pretrained Transformer Models

Link: https://arxiv.org/abs/2408.08073
Authors: Lukas Stankevičius,Mantas Lukoševičius
Keywords: Pre-trained transformer models, Pre-trained transformer, natural language processing, transformer models shine, expected to bear
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)

[IR-6] Mamba Retriever: Utilizing Mamba for Effective and Efficient Dense Retrieval

Link: https://arxiv.org/abs/2408.08066
Authors: Hanqi Zhang,Chong Chen,Lang Mei,Qi Liu,Jiaxin Mao
Keywords: Mamba Retriever, deep learning techniques, Mamba, Retriever, Transformer-based PLMs
Categories: Information Retrieval (cs.IR)

Abstract:In the information retrieval (IR) area, dense retrieval (DR) models use deep learning techniques to encode queries and passages into embedding space to compute their semantic relations. It is important for DR models to balance both efficiency and effectiveness. Pre-trained language models (PLMs), especially Transformer-based PLMs, have been proven to be effective encoders of DR models. However, the self-attention component in Transformer-based PLM results in a computational complexity that grows quadratically with sequence length, and thus exhibits a slow inference speed for long-text retrieval. Some recently proposed non-Transformer PLMs, especially the Mamba architecture PLMs, have demonstrated not only comparable effectiveness to Transformer-based PLMs on generative language tasks but also better efficiency due to linear time scaling in sequence length. This paper implements the Mamba Retriever to explore whether Mamba can serve as an effective and efficient encoder of DR model for IR tasks. We fine-tune the Mamba Retriever on the classic short-text MS MARCO passage ranking dataset and the long-text LoCoV0 dataset. Experimental results show that (1) on the MS MARCO passage ranking dataset and BEIR, the Mamba Retriever achieves comparable or better effectiveness compared to Transformer-based retrieval models, and the effectiveness grows with the size of the Mamba model; (2) on the long-text LoCoV0 dataset, the Mamba Retriever can extend to longer text length than its pre-trained length after fine-tuning on retrieval task, and it has comparable or better effectiveness compared to other long-text retrieval models; (3) the Mamba Retriever has superior inference speed for long-text retrieval. In conclusion, Mamba Retriever is both effective and efficient, making it a practical model, especially for long-text retrieval.

[IR-7] An Efficient Continuous Control Perspective for Reinforcement-Learning-based Sequential Recommendation

[IR-8] AIE: Auction Information Enhanced Framework for CTR Prediction in Online Advertising

Link: https://arxiv.org/abs/2408.07907
Authors: Yang Yang,Bo Chen,Chenxu Zhu,Menghui Zhu,Xinyi Dai,Huifeng Guo,Muyu Zhang,Zhenhua Dong,Ruiming Tang
Keywords: competitive auction process, Click-Through Rate, complex online competitive, CTR prediction, auction information
Categories: Information Retrieval (cs.IR)

Abstract:Click-Through Rate (CTR) prediction is a fundamental technique for online advertising recommendation and the complex online competitive auction process also brings many difficulties to CTR optimization. Recent studies have shown that introducing posterior auction information contributes to the performance of CTR prediction. However, existing work doesn’t fully capitalize on the benefits of auction information and overlooks the data bias brought by the auction, leading to biased and suboptimal results. To address these limitations, we propose Auction Information Enhanced Framework (AIE) for CTR prediction in online advertising, which delves into the problem of insufficient utilization of auction signals and first reveals the auction bias. Specifically, AIE introduces two pluggable modules, namely Adaptive Market-price Auxiliary Module (AM2) and Bid Calibration Module (BCM), which work collaboratively to excavate the posterior auction signals better and enhance the performance of CTR prediction. Furthermore, the two proposed modules are lightweight, model-agnostic, and friendly to inference latency. Extensive experiments are conducted on a public dataset and an industrial dataset to demonstrate the effectiveness and compatibility of AIE. Besides, a one-month online A/B test in a large-scale advertising platform shows that AIE improves the base model by 5.76% and 2.44% in terms of eCPM and CTR, respectively.

[IR-9] The Nah Bandit: Modeling User Non-compliance in Recommendation Systems

[IR-10] SWaT: Statistical Modeling of Video Watch Time through User Behavior Analysis

Link: https://arxiv.org/abs/2408.07759
Authors: Shentao Yang,Haichuan Yang,Linna Du,Adithya Ganesh,Bo Peng,Boying Liu,Serena Li,Ji Liu
Keywords: video watch time, mainstream social media, watch time, estimating video watch, video watch
Categories: Information Retrieval (cs.IR)

Abstract:The significance of estimating video watch time has been highlighted by the rising importance of (short) video recommendation, which has become a core product of mainstream social media platforms. Modeling video watch time, however, has been challenged by the complexity of user-video interaction, such as different user behavior modes in watching the recommended videos and varying watching probabilities over the video horizon. Despite the importance and challenges, existing literature on modeling video watch time mostly focuses on relatively black-box mechanical enhancement of the classical regression/classification losses, without factoring in user behavior in a principled manner. In this paper, we for the first time take on a user-centric perspective to model video watch time, from which we propose a white-box statistical framework that directly translates various user behavior assumptions in watching (short) videos into statistical watch time models. These behavior assumptions are portrayed by our domain knowledge on users’ behavior modes in video watching. We further employ bucketization to cope with user’s non-stationary watching probability over the video horizon, which additionally helps to respect the constraint of video length and facilitate the practical compatibility between the continuous regression event of watch time and other binary classification events. We test our models extensively on two public datasets, a large-scale offline industrial dataset, and an online A/B test on a short video platform with hundreds of millions of daily-active users. On all experiments, our models perform competitively against strong relevant baselines, demonstrating the efficacy of our user-centric perspective and proposed framework.
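The bucketization idea in this abstract can be illustrated with a small sketch. It assumes equal-width buckets over the video length and a simple "watched past bucket k" binary target. Both choices, along with the names `watch_time_bucket` and `passed_bucket_labels`, are hypothetical and are not taken from the paper, whose exact scheme the abstract does not specify.

```python
from typing import List


def watch_time_bucket(watch_s: float, video_s: float, num_buckets: int) -> int:
    """Map a watch time to a bucket index in [0, num_buckets - 1].

    The watch time is clipped to [0, video_s], so the result always
    respects the video-length constraint.
    """
    if video_s <= 0 or num_buckets <= 0:
        raise ValueError("video_s and num_buckets must be positive")
    clipped = min(max(watch_s, 0.0), video_s)
    index = int(clipped / video_s * num_buckets)
    # A fully watched video lands exactly on the upper edge; keep it in range.
    return min(index, num_buckets - 1)


def passed_bucket_labels(bucket: int, num_buckets: int) -> List[int]:
    """Binary targets: label k is 1 if the user watched past bucket k."""
    return [1 if k < bucket else 0 for k in range(num_buckets)]


# Hypothetical example: 15 s watched of a 60 s video, 4 buckets.
b = watch_time_bucket(15.0, 60.0, 4)
print(b)                           # 1
print(passed_bucket_labels(b, 4))  # [1, 0, 0, 0]
```

Turning a continuous watch time into per-bucket binary events is one way a regression target can be trained alongside other binary classification events.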

[IR-11] A Guide to Similarity Measures

[IR-12] Enhancing Supply Chain Visibility with Knowledge Graphs and Large Language Models

Link: https://arxiv.org/abs/2408.07705
Authors: Sara AlMahri,Liming Xu,Alexandra Brintrup
Keywords: today globalized economy, supply chain, Large Language Models, comprehensive supply chain, supply
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

[IR-13] Empathic Responding for Digital Interpersonal Emotion Regulation via Content Recommendation
