本篇博文展示 2024-10-31 从 arXiv.org 获取的最新论文列表,每日自动更新,按照 NLP、CV、ML、AI、IR 五个大方向分类。若需要通过邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示:如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱地址。

概览 (2024-10-31)

今日共更新466篇论文,其中(同一篇论文可能同时属于多个类目,故各类目篇数之和大于总数):

  • 自然语言处理67篇(Computation and Language (cs.CL))
  • 人工智能139篇(Artificial Intelligence (cs.AI))
  • 计算机视觉113篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习177篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

【速读】: 该论文试图解决现有视频生成模型在处理长视频时,由于忽视快速学习阶段而导致的时间上远距离帧之间不一致的问题。解决方案的关键在于引入SlowFast-VGen,这是一个双速学习系统,结合了慢速学习的世界动态建模和基于时间LoRA模块的快速学习策略。慢速学习通过掩码条件视频扩散模型进行,而快速学习则在推理时更新其时间LoRA参数,以高效存储情景记忆。此外,提出的慢-快学习循环算法将内部快速学习循环无缝集成到外部慢速学习循环中,使得模型能够回忆先前的多情景经验,从而实现上下文感知的技能学习。通过收集包含语言动作注释的大规模视频数据集,并进行广泛的实验验证,SlowFast-VGen在动作驱动的视频生成任务中显著优于基线模型。

链接: https://arxiv.org/abs/2410.23277
作者: Yining Hong,Beide Liu,Maxine Wu,Yuanhao Zhai,Kai-Wei Chang,Lingjie Li,Kevin Lin,Chung-Ching Lin,Jianfeng Wang,Zhengyuan Yang,Yingnian Wu,Lijuan Wang
关键词-EN: learning, slow learning, episodic memory, episodic memory storage, fast learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model’s context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm significantly enhances performances on long-horizon planning tasks as well. Project Website: this https URL
摘要:人类拥有一种互补的学习系统,它将通用世界动态的缓慢学习与从新经验中快速存储的情景记忆相结合。然而,以往的视频生成模型主要通过在大规模数据上进行预训练来专注于缓慢学习,忽略了对于情景记忆存储至关重要的快速学习阶段。这种疏忽导致在生成较长视频时,时间上相距较远的帧之间出现不一致性,因为这些帧超出了模型的上下文窗口。为此,我们提出了SlowFast-VGen,这是一种用于动作驱动长视频生成的新型双速学习系统。我们的方法结合了用于世界动态缓慢学习的掩码条件视频扩散模型,以及基于时间LoRA模块的推理时快速学习策略。具体而言,快速学习过程根据局部输入和输出更新其时间LoRA参数,从而在其参数中高效地存储情景记忆。我们进一步提出了一种慢-快学习循环算法,该算法将内部快速学习循环无缝集成到外部缓慢学习循环中,使得能够回忆先前的多情景经验,以实现上下文感知的技能学习。为了促进近似世界模型的缓慢学习,我们收集了一个包含20万条带有语言动作注释的视频的大规模数据集,涵盖了广泛的场景。广泛的实验表明,SlowFast-VGen在各种动作驱动视频生成指标上优于基线,FVD分数达到514(基线为782),并且在较长视频中保持一致性,平均场景切换次数为0.37(基线为0.89)。慢-快学习循环算法显著提升了长期规划任务的性能。项目网站:此https URL。
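
上文"快学习只更新时间 LoRA 参数、将情景记忆存入低秩参数"的思路,可以用下面的极简示意代码说明(纯属示意:层维度、秩 r 等均为假设值,并非论文的官方实现):

```python
import numpy as np

# 示意:LoRA 低秩适配的参数结构。慢学习得到的基础权重 W 在推理时冻结,
# 快学习只更新低秩矩阵 A、B,有效权重为 W + B @ A。
d, r = 512, 8                           # 隐层维度与 LoRA 秩(假设值)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # 基础权重,冻结
A = np.zeros((r, d))                    # LoRA 下投影,快学习时更新(初始为零)
B = rng.standard_normal((d, r)) * 0.01  # LoRA 上投影,快学习时更新

def forward(x):
    # 有效权重 W_eff = W + B @ A,情景记忆存于 (A, B) 中
    return x @ (W + B @ A).T

# 快学习只需存储/更新 LoRA 参数,规模远小于全量权重
full_params = W.size
lora_params = A.size + B.size
print(full_params, lora_params)  # 262144 8192
```

可以看到,情景记忆只需存储远小于全量权重的低秩参数,这是此类快学习策略能够在推理时高效进行的直观原因。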

[NLP-1] TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

【速读】: 该论文试图解决现有基准测试对多模态基础模型 (Multimodal Foundation Models, MFMs) 在视频理解中视觉时间推理能力的过高估计问题。研究指出,许多现有基准测试中的问题可以通过使用单帧、少数帧或乱序帧来解决,从而导致对模型真正的时间推理能力的误判。解决方案的关键在于提出了三个评估原则及其相应的指标:(1) 多帧增益 (Multi-Frame Gain),(2) 帧序敏感性 (Frame Order Sensitivity),和 (3) 帧信息差异 (Frame Information Disparity)。基于这些原则,论文引入了新的基准测试 TOMATO (Temporal Reasoning Multimodal Evaluation),旨在严格评估MFMs在视频理解中的时间推理能力。TOMATO包含1,484个精心设计的人工标注问题,涵盖六个任务,应用于1,417个视频,包括805个自录和自生成的视频,涉及以人为中心、现实世界和模拟场景。通过全面评估,发现最佳模型与人类表现之间存在57.3%的差距,并揭示了当前MFMs在解释连续帧序列方面的根本性局限。

链接: https://arxiv.org/abs/2410.23266
作者: Ziyao Shangguan,Chuhan Li,Yuxuan Ding,Yanan Zheng,Yilun Zhao,Tesca Fitzgerald,Arman Cohan
关键词-EN: Multimodal Foundation Models, leveraging temporal context, Multimodal Foundation, Foundation Models, temporal reasoning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do the models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated as many questions can be solved by using a single, few, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity. Following these principles, we introduce TOMATO, Temporal Reasoning Multimodal Evaluation, a novel benchmark crafted to rigorously assess MFMs’ temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks (i.e., action count, direction, rotation, shape trend, velocity frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and -generated videos, that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. While they can accurately recognize events in isolated frames, they fail to interpret these frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating the next-generation MFMs and as a call to the community to develop AI systems capable of comprehending human world dynamics through the video modality.
摘要:现有的基准测试通常强调最先进的多模态基础模型 (Multimodal Foundation Models, MFMs)在利用时间上下文进行视频理解方面取得的显著性能。然而,这些模型在视觉时间推理方面的表现究竟如何?我们对现有基准测试的研究表明,MFMs的这一能力可能被高估了,因为许多问题可以通过使用单个、少数或乱序的帧来解决。为了系统地考察当前的视觉时间推理任务,我们提出了三个原则及其相应的指标:(1) 多帧增益 (Multi-Frame Gain),(2) 帧序敏感性 (Frame Order Sensitivity),和 (3) 帧信息差异 (Frame Information Disparity)。基于这些原则,我们引入了时间推理多模态评估 (Temporal Reasoning Multimodal Evaluation, TOMATO),这是一个新颖的基准测试,旨在严格评估MFMs在视频理解中的时间推理能力。TOMATO包含1,484个精心挑选、人工标注的问题,涵盖六个任务(即动作计数、方向、旋转、形状趋势、速度频率和视觉线索),应用于1,417个视频,其中包括805个自录制和生成的视频,涵盖以人为中心、现实世界和模拟场景。我们的全面评估显示,最佳表现模型与人类表现之间存在57.3%的差距。此外,我们的深入分析揭示了当前MFMs在理解连续帧序列方面存在更根本的局限性,尽管它们能够准确识别孤立帧中的事件,但无法将这些帧解释为连续序列。我们相信TOMATO将成为评估下一代MFMs的关键测试平台,并呼吁社区开发能够通过视频模态理解人类世界动态的AI系统。
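
前两条原则(多帧增益、帧序敏感性)的一种可能计算方式可用如下小例子说明(指标的精确定义以论文为准,此处的准确率数值均为虚构示例):

```python
# 示意:用"完整帧序列 vs 单帧 / 乱序帧"的准确率差衡量时间推理能力
def multi_frame_gain(acc_multi, acc_single):
    # 多帧增益:使用完整帧序列相对单帧的准确率提升;
    # 若接近 0,说明题目靠单帧即可解出,无法考察时间推理
    return acc_multi - acc_single

def frame_order_sensitivity(acc_ordered, acc_shuffled):
    # 帧序敏感性:打乱帧顺序后准确率的下降幅度;
    # 若接近 0,说明题目不依赖帧的先后关系
    return acc_ordered - acc_shuffled

acc = {"single": 0.55, "multi": 0.72, "shuffled": 0.60}  # 虚构数据
gain = multi_frame_gain(acc["multi"], acc["single"])
sensitivity = frame_order_sensitivity(acc["multi"], acc["shuffled"])
print(round(gain, 2), round(sensitivity, 2))  # 0.17 0.12
```

两项指标都显著大于零,才说明该题目真正需要跨帧、且对帧序敏感的时间推理。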

[NLP-2] EMMA: End-to-End Multimodal Model for Autonomous Driving

【速读】: 该论文试图解决自动驾驶系统中多模态数据处理和任务协同的问题。解决方案的关键在于引入了一个端到端的多模态模型EMMA (End-to-end Multimodal Model for Autonomous driving),该模型基于多模态大型语言模型,能够直接将原始摄像头传感器数据映射到各种驾驶相关的输出,如规划轨迹、感知对象和道路图元素。EMMA的核心创新在于将所有非传感器输入(如导航指令和自车状态)和输出(如轨迹和3D位置)表示为自然语言文本,从而在统一的语言空间中联合处理多种驾驶任务,并通过任务特定的提示生成相应输出。这种方法在nuScenes运动规划任务上取得了最先进的性能,在3D物体检测任务上取得了有竞争力的结果,还展示了在多个任务上协同训练的潜力,但也存在处理图像帧数量有限、未整合3D传感模态(如LiDAR或雷达)以及计算成本高等局限性。

链接: https://arxiv.org/abs/2410.23262
作者: Jyh-Jing Hwang,Runsheng Xu,Hubert Lin,Wei-Chih Hung,Jingwei Ji,Kristy Choi,Di Huang,Tong He,Paul Covington,Benjamin Sapp,James Guo,Dragomir Anguelov,Mingxing Tan
关键词-EN: Multimodal Model, EMMA, Waymo Open Dataset, Waymo Open, Waymo Open Motion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Blog post: this https URL

点击查看摘要

Abstract:We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA’s effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA’s potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small amount of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.
摘要:我们介绍了 EMMA,一种端到端的自动驾驶多模态模型。基于多模态大语言模型的基础,EMMA 能够直接将原始摄像头传感器数据映射为多种驾驶专用输出,包括规划轨迹、感知对象和道路图元素。EMMA 通过将所有非传感器输入(如导航指令和自车状态)以及输出(如轨迹和三维位置)表示为自然语言文本,最大化利用了预训练大语言模型的世界知识。这种方法使得 EMMA 能够在统一的语言空间中联合处理多种驾驶任务,并使用任务特定的提示生成每个任务的输出。实证结果显示,EMMA 在 nuScenes 数据集上的运动规划任务中达到了最先进的性能,同时在 Waymo Open Motion Dataset (WOMD) 上也取得了有竞争力的结果。此外,EMMA 在 Waymo Open Dataset (WOD) 上的摄像头主导的三维物体检测任务中也表现出色。我们展示了通过联合训练 EMMA 在规划轨迹、物体检测和道路图任务上,能够在这三个领域中均取得改进,突显了 EMMA 作为自动驾驶应用通用模型的潜力。然而,EMMA 也存在一些局限性:它只能处理少量的图像帧,未整合如 LiDAR 或雷达等精确的三维感知模态,并且计算成本较高。我们希望这些结果能够激发进一步的研究,以解决这些问题并进一步推动自动驾驶模型架构的发展。
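
"将轨迹等非传感器输出表示为自然语言文本"这一核心思路,可以用下面的极简往返示例说明(文本格式为假设,EMMA 实际使用的表示以论文为准):

```python
import re

# 示意:把数值航点序列化为文本(便于语言模型在统一语言空间中生成),
# 再从文本解析回数值
def trajectory_to_text(waypoints):
    # waypoints: [(x, y), ...],单位米(假设)
    pts = "; ".join(f"({x:.1f}, {y:.1f})" for x, y in waypoints)
    return f"未来轨迹航点: {pts}"

def text_to_trajectory(text):
    # 从文本中解析回数值航点
    return [(float(a), float(b))
            for a, b in re.findall(r"\(([-\d.]+), ([-\d.]+)\)", text)]

traj = [(0.0, 0.0), (1.2, 0.1), (2.5, 0.3)]
text = trajectory_to_text(traj)
print(text_to_trajectory(text) == traj)  # True:文本表示可无损还原
```

统一的文本接口意味着规划、检测、道路图等任务只需换一个任务提示即可复用同一个模型。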

[NLP-3] $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources

【速读】: 该论文试图解决学术研究人员在计算资源有限的情况下能否进行模型预训练的问题。解决方案的关键在于通过调查学术研究人员的可用计算资源,并进行实证研究来测量在这些资源上复制模型的所需时间。论文引入了一个基准测试,用于测量在给定GPU上预训练模型的时间,并确定了最大化训练速度的理想设置。通过在多种模型和学术GPU上进行实验,花费了2000 GPU-小时,研究结果表明,学术预训练的前景比通常假设的要乐观,例如,Pythia-1B模型原本在64个GPU上训练3天(共192 GPU·天),而在4个GPU上训练18天(共72 GPU·天)即可复制。论文还进行了成本效益分析,以帮助研究人员权衡价格与预训练时间的关系,并公开了代码库以支持学术研究。

链接: https://arxiv.org/abs/2410.23261
作者: Apoorv Khandelwal,Tian Yun,Nihal V. Nayak,Jack Merullo,Stephen H. Bach,Chen Sun,Ellie Pavlick
关键词-EN: notoriously under-resourced, notoriously compute-intensive, notoriously, models, academic
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pre-training is notoriously compute-intensive and academic researchers are notoriously under-resourced. It is, therefore, commonly assumed that academics can’t pre-train models. In this paper, we seek to clarify this assumption. We first survey academic researchers to learn about their available compute and then empirically measure the time to replicate models on such resources. We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed. We run our benchmark on a range of models and academic GPUs, spending 2,000 GPU-hours on our experiments. Our results reveal a brighter picture for academic pre-training: for example, although Pythia-1B was originally trained on 64 GPUs for 3 days, we find it is also possible to replicate this model (with the same hyper-parameters) in 3x fewer GPU-days: i.e. on 4 GPUs in 18 days. We conclude with a cost-benefit analysis to help clarify the trade-offs between price and pre-training time. We believe our benchmark will help academic researchers conduct experiments that require training larger models on more data. We fully release our codebase at: this https URL.
摘要:预训练过程以计算密集著称,而学术研究者通常资源匮乏。因此,普遍认为学术界无法进行模型预训练。本文旨在澄清这一假设。我们首先对学术研究者进行调查,了解他们可用的计算资源,然后通过实验测量在这些资源上复制模型所需的时间。我们引入了一个基准测试,用于测量在给定GPU上预训练模型所需的时间,并确定了最大化训练速度的最佳设置。我们在一系列模型和学术GPU上运行了基准测试,实验共计消耗了2,000 GPU小时。我们的结果揭示了学术预训练的乐观前景:例如,尽管Pythia-1B最初在64个GPU上训练了3天,但我们发现也可以用少3倍的GPU·天复制该模型(使用相同的超参数),即在4个GPU上训练18天。最后,我们进行了成本效益分析,以帮助澄清价格与预训练时间之间的权衡。我们相信,我们的基准测试将有助于学术研究者进行需要在大规模数据上训练更大模型的实验。我们已在以下链接完全公开了我们的代码库:this https URL。
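
摘要中"少 3 倍 GPU·天"的说法可以按给出的数字做一个简单核算(严格算约为 2.7 倍):

```python
# 按摘要中的数字核算 GPU·天:
# 原始训练:64 GPU × 3 天;复现配置:4 GPU × 18 天
original = 64 * 3     # 192 GPU·天
replicated = 4 * 18   # 72 GPU·天
ratio = original / replicated
print(original, replicated, round(ratio, 2))  # 192 72 2.67
```

即复现配置的 GPU·天总量约为原始训练的三分之一,但墙上时间从 3 天拉长到 18 天,这正是论文所说的价格与预训练时间之间的权衡。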

[NLP-4] Evaluating Cultural and Social Awareness of LLM Web Agents

【速读】: 该论文试图解决大型语言模型(LLMs)在作为现实世界应用的代理时,其对文化和社会规范的敏感性评估不足的问题。解决方案的关键在于引入CASA基准,该基准通过评估LLM代理在在线购物和社交讨论论坛两个网络任务中的表现,来衡量其对文化和社会规范的敏感性。具体方法包括评估代理对规范违反用户查询的检测和适当响应能力,以及在面对误导性网络内容时的违规率。论文还提出了一个综合评估框架,涵盖意识覆盖率、用户查询管理的有用性以及违规率。实验结果表明,当前LLMs在非代理环境下表现优于网络代理环境,因此论文探索了提示(prompting)和微调(fine-tuning)两种方法,发现结合这两种方法可以互补优势,特别是微调文化特定数据集能显著提升代理的跨区域泛化能力,而提示则增强了代理处理复杂任务的能力。

链接: https://arxiv.org/abs/2410.23252
作者: Haoyi Qiu,Alexander R. Fabbri,Divyansh Agarwal,Kung-Hsiang Huang,Sarah Tan,Nanyun Peng,Chien-Sheng Wu
关键词-EN: traditional NLP tasks, large language models, traditional NLP, NLP tasks, language models
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:As large language models (LLMs) expand into performing as agents for real-world applications beyond traditional NLP tasks, evaluating their robustness becomes increasingly important. However, existing benchmarks often overlook critical dimensions like cultural and social awareness. To address these, we introduce CASA, a benchmark designed to assess LLM agents’ sensitivity to cultural and social norms across two web-based tasks: online shopping and social discussion forums. Our approach evaluates LLM agents’ ability to detect and appropriately respond to norm-violating user queries and observations. Furthermore, we propose a comprehensive evaluation framework that measures awareness coverage, helpfulness in managing user queries, and the violation rate when facing misleading web content. Experiments show that current LLMs perform significantly better in non-agent than in web-based agent environments, with agents achieving less than 10% awareness coverage and over 40% violation rates. To improve performance, we explore two methods: prompting and fine-tuning, and find that combining both methods can offer complementary advantages – fine-tuning on culture-specific datasets significantly enhances the agents’ ability to generalize across different regions, while prompting boosts the agents’ ability to navigate complex tasks. These findings highlight the importance of constantly benchmarking LLM agents’ cultural and social awareness during the development cycle.
摘要:随着大语言模型(LLMs)扩展到作为智能体执行现实世界应用,超越传统的自然语言处理(NLP)任务,评估其鲁棒性变得愈发重要。然而,现有的基准测试往往忽视了文化和社会意识等关键维度。为此,我们引入了CASA,这是一个旨在评估LLM智能体对文化和社交规范敏感性的基准测试,涵盖了两个基于网络的任务:在线购物和社交讨论论坛。我们的方法评估了LLM智能体检测并适当回应违反规范的用户查询和观察的能力。此外,我们提出了一种综合评估框架,该框架测量了意识覆盖率、在处理用户查询时的有用性以及面对误导性网络内容时的违规率。实验表明,当前的LLMs在非智能体环境中表现显著优于基于网络的智能体环境,智能体的意识覆盖率不足10%,违规率超过40%。为了提升性能,我们探索了两种方法:提示(prompting)和微调(fine-tuning),并发现结合这两种方法可以提供互补优势——在特定文化数据集上的微调显著增强了智能体在不同地区间的泛化能力,而提示则提升了智能体处理复杂任务的能力。这些发现强调了在开发周期中持续基准测试LLM智能体的文化和社会意识的重要性。
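
"意识覆盖率"与"违规率"两项指标的一种可能统计方式如下(指标的精确定义以 CASA 论文为准,交互记录为虚构示例):

```python
# 示意:从带标注的智能体交互记录中统计两项指标
episodes = [  # 虚构数据:每条记录标注了是否存在规范违反、智能体是否察觉/自身违规
    {"norm_violation_present": True,  "agent_flagged": True,  "agent_violated": False},
    {"norm_violation_present": True,  "agent_flagged": False, "agent_violated": True},
    {"norm_violation_present": True,  "agent_flagged": False, "agent_violated": False},
    {"norm_violation_present": False, "agent_flagged": False, "agent_violated": False},
]

with_violation = [e for e in episodes if e["norm_violation_present"]]
# 意识覆盖率:存在规范违反的交互中,智能体成功察觉并作出回应的比例
awareness_coverage = sum(e["agent_flagged"] for e in with_violation) / len(with_violation)
# 违规率:智能体自身做出违规行为的交互比例
violation_rate = sum(e["agent_violated"] for e in episodes) / len(episodes)
print(round(awareness_coverage, 2), violation_rate)  # 0.33 0.25
```

论文报告的"意识覆盖率不足 10%、违规率超过 40%"即是此类比例统计在其基准上的结果。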

[NLP-5] COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences

【速读】: 该论文试图解决现有对齐方法(如基于人类反馈的强化学习 (RLHF))在捕捉广泛人类偏好方面的不足,特别是依赖于Bradley-Terry奖励假设的局限性。解决方案的关键在于将对齐问题建模为两玩家零和博弈,并通过提出的元算法——收敛元对齐算法 (Convergent Meta Alignment Algorithm, COMAL) 来寻找纳什均衡策略。该算法不仅理论上证明了在最后一次迭代中收敛到精确的纳什策略,而且具有简单性,能够与现有的RLHF和偏好优化方法无缝集成,实验结果显示其与现有方法结合时的有效性。

链接: https://arxiv.org/abs/2410.23223
作者: Yixin Liu,Argyris Oikonomou,Weiqiang Zheng,Yang Cai,Arman Cohan
关键词-EN: including reinforcement learning, Bradley-Terry reward assumption, general human preferences, Nash policy, human feedback
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is insufficient to capture the full range of general human preferences. To achieve robust alignment with general preferences, we model the alignment problem as a two-player zero-sum game, where the Nash equilibrium policy guarantees a 50% win rate against any competing policy. However, previous algorithms for finding the Nash policy either diverge or converge to a Nash policy in a modified game, even in a simple synthetic setting, thereby failing to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. Theoretically, we prove that our meta-algorithm converges to an exact Nash policy in the last iterate. Additionally, our meta-algorithm is simple and can be integrated with many existing methods designed for RLHF and preference optimization with minimal changes. Experimental results demonstrate the effectiveness of the proposed framework when combined with existing preference policy optimization methods.
摘要:许多对齐方法,包括基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF),依赖于Bradley-Terry奖励假设,这一假设不足以全面捕捉人类偏好的广泛范围。为了实现与普遍偏好的稳健对齐,我们将对齐问题建模为一个两人零和博弈,其中纳什均衡策略保证了对任何竞争策略的50%胜率。然而,即使在简单的合成环境中,先前用于寻找纳什策略的算法也要么发散,要么收敛到修改后博弈中的纳什策略,从而无法维持对所有其他策略的50%胜率保证。我们提出了一种元算法,即收敛元对齐算法(Convergent Meta Alignment Algorithm, COMAL),用于大语言模型与普遍偏好的对齐,灵感来源于博弈论中的收敛算法。理论上,我们证明了我们的元算法在最后一次迭代中收敛到精确的纳什策略。此外,我们的元算法简单易行,可以在改动极小的情况下与许多现有的RLHF和偏好优化方法集成。实验结果表明,当与现有的偏好策略优化方法结合时,所提出的框架具有显著的有效性。
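
"纳什均衡策略保证对任何竞争策略 50% 胜率"可以用一个非传递偏好的小例子直观验证(3 个候选回答 a>b、b>c、c>a,类似石头剪刀布;偏好概率为虚构示例,仅用于说明概念):

```python
# 示意:一般偏好下的两人零和博弈与纳什策略的 50% 胜率保证
P = [  # P[i][j]:回答 i 被偏好于回答 j 的概率,满足 P[i][j] + P[j][i] = 1
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]
nash = [1 / 3, 1 / 3, 1 / 3]  # 该博弈的纳什均衡策略:均匀混合

def win_rate(policy, opponent):
    # policy 与 opponent 均为候选回答上的概率分布
    return sum(policy[i] * opponent[j] * P[i][j]
               for i in range(3) for j in range(3))

# 纳什策略对任意纯策略的胜率恰为 50%——这正是论文想要保持的性质
for j in range(3):
    pure = [1.0 if k == j else 0.0 for k in range(3)]
    print(round(win_rate(nash, pure), 6))  # 每行输出 0.5
```

注意该偏好是非传递的,无法被任何 Bradley-Terry 标量奖励表示,这也正是论文转向博弈论建模的动机。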

[NLP-6] OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

【速读】: 该论文试图解决在构建图形用户界面(GUI)代理时,开源视觉语言模型(VLM)在性能上与闭源商业模型(如GPT-4o和GeminiProVision)存在显著差距的问题,特别是在GUI定位和分布外(OOD)场景中的表现。解决方案的关键在于开发了OS-Atlas——一个基础的GUI动作模型,通过数据和建模方面的创新,显著提升了GUI定位和OOD任务的能力。论文的核心贡献包括:1) 开发了一个开源工具包,用于跨多个平台(Windows、Linux、MacOS、Android和Web)合成GUI定位数据;2) 发布了迄今为止最大的开源跨平台GUI定位语料库,包含超过1300万个GUI元素;3) 通过模型训练的创新,使OS-Atlas能够理解和泛化到未见过的界面。这些创新使得OS-Atlas在多个基准测试中显著超越了先前的最先进模型。

链接: https://arxiv.org/abs/2410.23218
作者: Zhiyong Wu,Zhenyu Wu,Fangzhi Xu,Yian Wang,Qiushi Sun,Chengyou Jia,Kanzhi Cheng,Zichen Ding,Liheng Chen,Paul Pu Liang,Yu Qiao
关键词-EN: agents heavily rely, robust commercial Vision-Language, building GUI agents, GUI agents heavily, GUI grounding
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas - a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, MacOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.
摘要:现有的构建图形用户界面(GUI)智能体的努力在很大程度上依赖于强大的商用视觉语言模型(Vision-Language Models, VLMs),如 GPT-4o 和 GeminiProVision。由于开源 VLMs 在 GUI 定位和分布外(Out-Of-Distribution, OOD)场景中的性能显著落后于闭源模型,从业者往往不愿意使用开源 VLMs。为了促进该领域的未来研究,我们开发了 OS-Atlas——一个在 GUI 定位和 OOD 智能体任务中表现出色的基础 GUI 动作模型,这得益于我们在数据和建模方面的创新。我们投入了大量工程努力,开发了一个开源工具包,用于跨多个平台(包括 Windows、Linux、MacOS、Android 和 Web)合成 GUI 定位数据。利用这一工具包,我们发布了迄今为止最大的开源跨平台 GUI 定位语料库,其中包含超过 1300 万个 GUI 元素。该数据集与模型训练中的创新相结合,为 OS-Atlas 理解 GUI 截图并泛化到未见过的界面提供了坚实的基础。通过在涵盖移动、桌面和 Web 三个不同平台的六个基准测试中的广泛评估,OS-Atlas 展示了相对于之前最先进模型的显著性能提升。我们的评估还揭示了持续改进和扩展开源 VLMs 智能体能力的宝贵见解。
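
跨平台 GUI 定位数据的合成,核心是把界面元素转成"指称表达 → 归一化坐标"的样本;下面是一个极简示意(字段名与文本格式均为假设,OS-Atlas 实际数据格式以论文和开源工具包为准):

```python
# 示意:把一个 GUI 元素合成为定位训练样本,坐标归一化以便跨分辨率、跨平台复用
def make_grounding_sample(element, screen_w, screen_h):
    x1, y1, x2, y2 = element["bbox"]  # 像素坐标
    norm = [round(x1 / screen_w, 3), round(y1 / screen_h, 3),
            round(x2 / screen_w, 3), round(y2 / screen_h, 3)]
    return {
        "instruction": f'点击"{element["text"]}"',  # 指称表达(假设的模板)
        "bbox_norm": norm,                          # 归一化到 [0, 1]
        "platform": element["platform"],
    }

sample = make_grounding_sample(
    {"text": "设置", "bbox": (880, 40, 960, 80), "platform": "web"},
    screen_w=1920, screen_h=1080)
print(sample["bbox_norm"])  # [0.458, 0.037, 0.5, 0.074]
```

归一化坐标使得同一条样本格式可以覆盖 Windows、Android、Web 等分辨率各异的平台。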

[NLP-7] Reliability of Topic Modeling

【速读】: 该论文试图解决主题模型(Topic Models)在不同初始化、采样过程随机性或噪声数据影响下的可靠性问题。解决方案的关键在于引入并验证了三种新的可靠性评估指标,特别是McDonald’s ω,该指标在合成数据和真实世界数据中表现出了最佳的可靠性评估能力。这一发现强调了在基于主题模型的研究中,将可靠性验证作为标准流程的重要性。

链接: https://arxiv.org/abs/2410.23186
作者: Kayla Schroeder,Zach Wood-Doughty
关键词-EN: extract latent factors, downstream statistical analyses, Topic models, extract latent, latent factors
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Topic models allow researchers to extract latent factors from text data and use those variables in downstream statistical analyses. However, these methodologies can vary significantly due to initialization differences, randomness in sampling procedures, or noisy data. Reliability of these methods is of particular concern as many researchers treat learned topic models as ground truth for subsequent analyses. In this work, we show that the standard practice for quantifying topic model reliability fails to capture essential aspects of the variation in two widely-used topic models. Drawing from a extensive literature on measurement theory, we provide empirical and theoretical analyses of three other metrics for evaluating the reliability of topic models. On synthetic and real-world data, we show that McDonald’s \omega provides the best encapsulation of reliability. This metric provides an essential tool for validation of topic model methodologies that should be a standard component of any topic model-based research.
摘要:主题模型使研究人员能够从文本数据中提取潜在因素,并将这些变量用于下游的统计分析。然而,由于初始化差异、采样过程中的随机性或数据噪声,这些方法的实现可能会有显著差异。这些方法的可靠性尤为重要,因为许多研究人员将学习到的主题模型视为后续分析的基准事实。在本研究中,我们展示了当前用于量化主题模型可靠性的标准做法未能捕捉到两种广泛使用的主题模型中的关键变异方面。借鉴测量理论的广泛文献,我们提供了对三种其他评估主题模型可靠性的指标的实证和理论分析。在合成数据和真实世界数据上,我们发现 McDonald’s ω 提供了对可靠性的最佳封装。这一指标为验证主题模型方法提供了关键工具,应成为任何基于主题模型的研究的标准组成部分。
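
单因子模型下 McDonald's ω 的标准公式为 ω = (Σλ)² / ((Σλ)² + Σθ),其中 λ 为因子载荷、θ 为误差方差;如何把它套用到主题模型可靠性上,以论文中的构造为准。下面用虚构载荷演示公式本身:

```python
# 示意:McDonald's ω 的标准计算(单因子、标准化指标情形)
def mcdonalds_omega(loadings, error_variances):
    # loadings: 各指标的因子载荷 λ_i;error_variances: 误差方差 θ_i
    s = sum(loadings)
    return s ** 2 / (s ** 2 + sum(error_variances))

loadings = [0.8, 0.7, 0.6, 0.75]          # 虚构示例
errors = [1 - l ** 2 for l in loadings]   # 标准化指标下 θ_i = 1 - λ_i²
omega = mcdonalds_omega(loadings, errors)
print(round(omega, 3))  # 0.807
```

ω 取值在 0 到 1 之间,越接近 1 表示各次运行(或各指标)越一致,可靠性越高。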

[NLP-8] ProTransformer: Robustify Transformers via Plug-and-Play Paradigm

【速读】: 该论文试图解决基于Transformer架构的模型在面对各种攻击机制时的鲁棒性问题。解决方案的关键在于引入了一种新颖的鲁棒注意力机制(robust attention mechanism),该机制可以作为即插即用层(plug-and-play layer)集成到现有的Transformer模型中,显著提升其鲁棒性,而无需额外的训练或微调。通过广泛的实验和消融研究,论文展示了ProTransformer在多种预测任务、攻击机制、骨干架构和数据领域中显著增强了Transformer模型的鲁棒性。

链接: https://arxiv.org/abs/2410.23182
作者: Zhichao Hou,Weizhi Gao,Yuchen Shen,Feiyi Wang,Xiaorui Liu
关键词-EN: Transformer-based architectures, recent years, dominated various areas, areas of machine, machine learning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Transformer-based architectures have dominated various areas of machine learning in recent years. In this paper, we introduce a novel robust attention mechanism designed to enhance the resilience of transformer-based architectures. Crucially, this technique can be integrated into existing transformers as a plug-and-play layer, improving their robustness without the need for additional training or fine-tuning. Through comprehensive experiments and ablation studies, we demonstrate that our ProTransformer significantly enhances the robustness of transformer models across a variety of prediction tasks, attack mechanisms, backbone architectures, and data domains. Notably, without further fine-tuning, the ProTransformer consistently improves the performance of vanilla transformers by 19.5%, 28.3%, 16.1%, and 11.4% for BERT, ALBERT, DistilBERT, and RoBERTa, respectively, under the classical TextFooler attack. Furthermore, ProTransformer shows promising resilience in large language models (LLMs) against prompting-based attacks, improving the performance of T5 and LLaMA by 24.8% and 17.8%, respectively, and enhancing Vicuna by an average of 10.4% against the Jailbreaking attack. Beyond the language domain, ProTransformer also demonstrates outstanding robustness in both vision and graph domains.
摘要:近年来,基于 Transformer 的架构在机器学习的各个领域占据了主导地位。本文介绍了一种新颖的鲁棒注意力机制,旨在增强基于 Transformer 架构的韧性。关键在于,该技术可以作为即插即用的层集成到现有的 Transformer 中,从而在不需额外训练或微调的情况下提升其鲁棒性。通过全面的实验和消融研究,我们证明 ProTransformer 显著增强了 Transformer 模型在多种预测任务、攻击机制、骨干架构和数据领域中的鲁棒性。值得注意的是,在无需进一步微调的情况下,ProTransformer 在经典 TextFooler 攻击下,分别将 BERT、ALBERT、DistilBERT 和 RoBERTa 的性能提升了 19.5%、28.3%、16.1% 和 11.4%。此外,ProTransformer 在大语言模型 (LLM) 中显示出对基于提示攻击的显著韧性,分别将 T5 和 LLaMA 的性能提升了 24.8% 和 17.8%,并在 Jailbreaking 攻击下将 Vicuna 的性能平均提升了 10.4%。除了语言领域,ProTransformer 在视觉和图领域也展现了卓越的鲁棒性。
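
注意力输出本质上是对 value 向量的加权平均,而加权平均对异常 token 很敏感;下面用"逐维中位数"这一稳健统计量作对比,说明"稳健聚合"这一思想(仅为示意,并非 ProTransformer 的具体估计器):

```python
import numpy as np

# 示意:标准注意力(加权平均)会被单个异常 token 拉偏,稳健聚合则不会
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

V = np.vstack([np.full((7, 4), 1.0),
               np.full((1, 4), 100.0)])  # 8 个 value 向量,第 8 个是异常值(模拟对抗扰动)
scores = np.zeros(8)                     # 假设注意力得分均匀

attn_out = softmax(scores) @ V       # 加权平均:每维 = (7*1 + 100)/8 = 13.375,被严重拉偏
robust_out = np.median(V, axis=0)    # 稳健聚合示意:逐维中位数,每维 = 1.0,不受单点异常影响
print(attn_out[0], robust_out[0])    # 13.375 1.0
```

"即插即用"的含义即在于:这种替换只改动注意力层内部的聚合方式,无需重新训练整个模型。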

[NLP-9] SciPIP: An LLM -based Scientific Paper Idea Proposer

【速读】: 该论文试图解决研究人员在知识爆炸和跨学科研究复杂性增加背景下所面临的信息过载和创新想法探索困难的问题。解决方案的关键在于提出了一个科学论文想法生成器 (SciPIP),通过结合文献数据库检索和大型语言模型 (LLMs) 的能力,生成既新颖又可行的研究想法。具体来说,SciPIP 首先构建了一个包含多维度信息的文献检索数据库,并提出了一种基于语义、实体和引用共现的文献检索方法,以从多个角度检索与用户提供背景相关的文献。随后,SciPIP 采用双路径想法生成策略,一条路径从检索到的文献中推导解决方案,另一条路径通过模型头脑风暴生成原创想法,最终将两者结合以平衡可行性和原创性。实验结果表明,SciPIP 能够生成与顶级会议论文相似的引用和原创性想法,验证了其方法的有效性。

链接: https://arxiv.org/abs/2410.23166
作者: Wenxiao Wang,Lihui Gu,Liye Zhang,Yunxiang Luo,Yi Dai,Chen Shen,Liang Xie,Binbin Lin,Xiaofei He,Jieping Ye
关键词-EN: pose significant challenges, including information overload, interdisciplinary research pose, research pose significant, challenges for researchers
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 25 pages, 5 figures, 19 tables

点击查看摘要

Abstract:The exponential growth of knowledge and the increasing complexity of interdisciplinary research pose significant challenges for researchers, including information overload and difficulties in exploring novel ideas. The advancements in large language models (LLMs), such as GPT-4, have shown great potential in enhancing idea proposals, but how to effectively utilize large models for reasonable idea proposal has not been thoroughly explored. This paper proposes a scientific paper idea proposer (SciPIP). Based on a user-provided research background, SciPIP retrieves helpful papers from a literature database while leveraging the capabilities of LLMs to generate more novel and feasible ideas. To this end, 1) we construct a literature retrieval database, extracting lots of papers’ multi-dimension information for fast access. Then, a literature retrieval method based on semantics, entity, and citation co-occurrences is proposed to search relevant literature from multiple aspects based on the user-provided background. 2) After literature retrieval, we introduce dual-path idea proposal strategies, where one path infers solutions from the retrieved literature and the other path generates original ideas through model brainstorming. We then combine the two to achieve a good balance between feasibility and originality. Through extensive experiments on the natural language processing (NLP) field, we demonstrate that SciPIP can retrieve citations similar to those of existing top conference papers and generate many ideas consistent with them. Additionally, we evaluate the originality of other ideas generated by SciPIP using large language models, further validating the effectiveness of our proposed method. The code and the database are released at this https URL.
摘要:知识量的指数级增长和跨学科研究复杂性的增加,给研究人员带来了重大挑战,包括信息过载和探索新思想的困难。大语言模型(LLMs)如GPT-4的进步,展示了在增强思想提案方面的巨大潜力,但如何有效利用这些大模型进行合理的思想提案尚未得到深入探讨。本文提出了一种科学论文思想提案器(SciPIP)。基于用户提供的研究背景,SciPIP从文献数据库中检索有帮助的论文,同时利用LLMs的能力生成更创新和可行的思想。为此,1)我们构建了一个文献检索数据库,提取大量论文的多维度信息以实现快速访问。然后,提出了一种基于语义、实体和引用共现的文献检索方法,根据用户提供的背景从多方面搜索相关文献。2)在文献检索之后,我们引入了双路径思想提案策略,其中一条路径从检索到的文献中推断解决方案,另一条路径通过模型头脑风暴生成原创思想。我们将两者结合,以实现可行性与原创性之间的良好平衡。通过对自然语言处理(NLP)领域的广泛实验,我们证明了SciPIP能够检索到与现有顶级会议论文相似的引用,并生成许多与之一致的思想。此外,我们使用大语言模型评估了SciPIP生成的其他思想的原创性,进一步验证了我们提出方法的有效性。代码和数据库已在以下链接发布:https URL。
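
多角度检索中"实体共现"这一路,可以用实体集合的 Jaccard 相似度做一个极简示意(SciPIP 实际还结合语义与引用共现,打分方式以论文为准;论文数据为虚构示例):

```python
# 示意:按"与用户背景的实体重合度"给文献打分并排序
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

papers = {  # 虚构的文献实体标注
    "P1": {"LoRA", "LLM", "fine-tuning"},
    "P2": {"retrieval", "LLM", "RAG"},
    "P3": {"diffusion", "video"},
}
background_entities = {"LLM", "retrieval"}  # 从用户提供的研究背景中抽取的实体

ranked = sorted(papers, key=lambda p: jaccard(background_entities, papers[p]),
                reverse=True)
print(ranked)  # ['P2', 'P1', 'P3']:P2 与背景实体重合度最高
```

检索到的文献随后进入双路径之一(从文献推导方案),与模型头脑风暴的另一路合并,以平衡可行性与原创性。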

[NLP-10] The Good, the Bad and the Ugly: The Role of AI Quality Disclosure in Lie Detection

【速读】: 该论文试图解决低质量AI顾问在缺乏透明度的情况下如何助长文本谎言传播的问题。解决方案的关键在于揭示AI的真实有效性,并通过实验证明,当参与者了解AI的实际能力后,其真相检测率能够恢复到甚至超过其自身能力水平。此外,高质量的AI顾问无论是否披露其能力,都能显著提升真相检测效果。研究还发现,参与者对AI能力的预期与其对低质量、不透明AI顾问的过度依赖密切相关。

链接: https://arxiv.org/abs/2410.23143
作者: Haimanti Bhattacharya,Subhasish Dugar,Sanchaita Hazra,Bodhisattwa Prasad Majumder
关键词-EN: lacking quality disclosures, spread text-based lies, people detect lies, lacking quality, spread text-based
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Order of the authors are in alphabetical order of their last names. All authors contributed equally. The manuscript is under review. 74 Pages, including appendices and references

点击查看摘要

Abstract:We investigate how low-quality AI advisors, lacking quality disclosures, can help spread text-based lies while seeming to help people detect lies. Participants in our experiment discern truth from lies by evaluating transcripts from a game show that mimicked deceptive social media exchanges on topics with objective truths. We find that when relying on low-quality advisors without disclosures, participants’ truth-detection rates fall below their own abilities, which recovered once the AI’s true effectiveness was revealed. Conversely, high-quality advisor enhances truth detection, regardless of disclosure. We discover that participants’ expectations about AI capabilities contribute to their undue reliance on opaque, low-quality advisors.
摘要:我们研究了低质量的 AI 顾问(缺乏质量披露)如何帮助传播基于文本的谎言,同时看似帮助人们检测谎言。实验中的参与者通过评估一档游戏节目的文字记录来辨别真相与谎言,该节目模仿了围绕具有客观真相的话题展开的欺骗性社交媒体交流。我们发现,当依赖没有披露信息的低质量顾问时,参与者的真相检测率低于他们自身的能力,而一旦 AI 的真实有效性被揭示,这一情况得以恢复。相反,高质量顾问无论是否披露信息,都能提高真相检测率。我们发现,参与者对 AI 能力的预期导致了他们对不透明、低质量顾问的过度依赖。

[NLP-11] Crowdsourcing Lexical Diversity

【速读】: 该论文试图解决词汇语义资源(Lexical-semantic resources, LSRs)中存在的偏见问题,特别是对英语和盎格鲁-撒克逊文化的偏见,以及跨语言词汇空缺(cross-lingual lexical gaps)的缺乏明确指示。解决方案的关键在于提出了一种新颖的众包方法(crowdsourcing methodology),通过LingoGap工具,让众包工作者比较两种语言中的词汇,重点关注词汇多样性丰富的领域,如亲属关系或食物。该方法通过微任务(microtasks)识别等价词、语言特定词和跨语言词汇空缺,从而减少LSRs中的偏见。通过在英语与阿拉伯语以及标准印尼语与班查尔语的食物相关术语上的实验,验证了该方法的有效性和工具的可用性。

链接: https://arxiv.org/abs/2410.23133
作者: Hadi Khalilia,Jahna Otterbacher,Gabor Bella,Rusma Noortyani,Shandy Darma,Fausto Giunchiglia
关键词-EN: language processing applications, Lexical-semantic resources, natural language processing, processing applications, fundamental for natural
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Lexical-semantic resources (LSRs), such as online lexicons or wordnets, are fundamental for natural language processing applications. In many languages, however, such resources suffer from quality issues: incorrect entries, incompleteness, but also, the rarely addressed issue of bias towards the English language and Anglo-Saxon culture. Such bias manifests itself in the absence of concepts specific to the language or culture at hand, the presence of foreign (Anglo-Saxon) concepts, as well as in the lack of an explicit indication of untranslatability, also known as cross-lingual \emphlexical gaps, when a term has no equivalent in another language. This paper proposes a novel crowdsourcing methodology for reducing bias in LSRs. Crowd workers compare lexemes from two languages, focusing on domains rich in lexical diversity, such as kinship or food. Our LingoGap crowdsourcing tool facilitates comparisons through microtasks identifying equivalent terms, language-specific terms, and lexical gaps across languages. We validated our method by applying it to two case studies focused on food-related terminology: (1) English and Arabic, and (2) Standard Indonesian and Banjarese. These experiments identified 2,140 lexical gaps in the first case study and 951 in the second. The success of these experiments confirmed the usability of our method and tool for future large-scale lexicon enrichment tasks.
摘要:词汇语义资源 (Lexical-semantic resources, LSRs),如在线词典或词汇网络,是自然语言处理应用的基础。然而,在许多语言中,这些资源存在质量问题:条目错误、不完整,以及较少被提及的偏向英语和盎格鲁-撒克逊文化的偏见问题。这种偏见表现为缺乏特定语言或文化中的概念,存在外来(盎格鲁-撒克逊)概念,以及缺乏对不可翻译性的明确指示,即跨语言词汇空缺 (cross-lingual lexical gaps),当一个术语在另一种语言中没有对应词时。本文提出了一种新颖的众包方法,用于减少 LSRs 中的偏见。众包工作者比较两种语言的词汇,重点关注词汇多样性丰富的领域,如亲属关系或食物。我们的 LingoGap 众包工具通过微任务促进跨语言的词汇比较,识别等价词、特定语言词汇以及词汇空缺。我们通过两个案例研究验证了这种方法,这两个研究均聚焦于食物相关术语:(1) 英语和阿拉伯语,以及 (2) 标准印尼语和班查尔语。这些实验在第一个案例研究中识别出 2,140 个词汇空缺,在第二个案例研究中识别出 951 个。这些实验的成功证实了我们的方法和工具在未来的大规模词典丰富任务中的可用性。
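在方法层面,"等价词 / 语言特有词 / 词汇空缺"的划分本质上是两个概念集合的比较。下面用一段假设性的 Python 代码做最简示意(数据与函数名均为虚构,并非 LingoGap 的实际实现;真实流程依赖众包工作者的人工判断):

```python
def find_lexical_gaps(lexicon_a, lexicon_b):
    """比较两种语言词典的概念集合:
    共有概念对应等价词;某一侧独有的概念即另一侧的候选词汇空缺 (lexical gap)。"""
    concepts_a, concepts_b = set(lexicon_a), set(lexicon_b)
    shared = concepts_a & concepts_b
    gaps_in_a = concepts_b - concepts_a  # B 有而 A 无:A 语言侧的候选空缺
    gaps_in_b = concepts_a - concepts_b  # A 有而 B 无:B 语言侧的候选空缺
    return shared, gaps_in_a, gaps_in_b

# 假设性的食物相关概念集合(仅作示意)
english = {"bread", "pressed_dates", "butter"}
arabic = {"bread", "pressed_dates", "aged_ghee"}
shared, gaps_en, gaps_ar = find_lexical_gaps(english, arabic)
print(sorted(shared))   # ['bread', 'pressed_dates']
print(sorted(gaps_ar))  # ['butter']:阿拉伯语侧的候选空缺
```

实际系统中,"概念是否相同"本身就是众包微任务要回答的问题,而非简单的字符串相等。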

[NLP-12] On Memorization of Large Language Models in Logical Reasoning

【速读】: 该论文试图解决大型语言模型(LLMs)在推理任务中表现出色但仍可能犯基本推理错误的问题,特别是理解LLMs推理能力的机制。论文的关键解决方案是通过量化测量记忆化在推理任务中的作用,使用基于骑士与无赖(Knights and Knaves, KK)谜题的动态生成逻辑推理基准进行系统研究。研究发现,尽管LLMs在微调后能够近乎完美地解决训练谜题(插值),但在谜题稍作扰动时会失败,表明模型严重依赖记忆化。然而,微调虽然导致记忆化,但也持续提升泛化性能。通过深入分析扰动测试、跨难度级别迁移性、模型内部探查以及使用错误答案进行微调,论文揭示了LLMs在KK谜题上学习推理的能力,尽管存在训练数据记忆化。这一现象表明LLMs在记忆化和真正推理能力之间存在复杂的相互作用。最终,通过每个样本的记忆化评分分析,论文阐明了LLMs在解决逻辑谜题时如何在推理和记忆化之间切换。

链接: https://arxiv.org/abs/2410.23123
作者: Chulin Xie,Yangsibo Huang,Chiyuan Zhang,Da Yu,Xinyun Chen,Bill Yuchen Lin,Bo Li,Badih Ghazi,Ravi Kumar
关键词-EN: Large language models, Large language, achieve good performance, basic reasoning mistakes, make basic reasoning
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs’ reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (KK) puzzles. We found that LLMs could interpolate the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet fail when those puzzles are slightly perturbed, suggesting that the models heavily rely on memorization to solve those training puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. In-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers suggest that the LLMs learn to reason on KK puzzles despite training data memorization. This phenomenon indicates that LLMs exhibit a complex interplay between memorization and genuine reasoning abilities. Finally, our analysis with per-sample memorization score sheds light on how LLMs switch between reasoning and memorization in solving logical puzzles. Our code and data are available at this https URL.
摘要:大语言模型(LLMs)在复杂的推理基准测试中表现出色,但同时也可能犯下基本的推理错误。这种矛盾的行为在理解大语言模型推理能力的背后机制时显得颇为费解。一种假设认为,在常见推理基准测试中表现越来越高的近乎饱和的性能可能是由于对相似问题的记忆。本文中,我们系统地研究了这一假设,通过使用基于骑士与无赖(Knights and Knaves, KK)谜题的动态生成的逻辑推理基准,对推理任务中的记忆进行了定量测量。我们发现,经过微调后,大语言模型能够对训练中的谜题进行插值(达到近乎完美的准确率),但在这些谜题稍作扰动时却会失败,这表明模型在解决这些训练谜题时严重依赖于记忆。另一方面,我们展示了尽管微调导致大量记忆,但它也持续提升了泛化性能。通过扰动测试、跨难度级别迁移性、探查模型内部结构以及使用错误答案进行微调的深入分析,我们发现,尽管存在训练数据的记忆,大语言模型在KK谜题上学会了推理。这一现象表明,大语言模型在记忆与真正的推理能力之间展现出复杂的相互作用。最后,我们通过每个样本的记忆分数分析,揭示了大语言模型在解决逻辑谜题时如何在推理和记忆之间切换。我们的代码和数据可在以下链接获取:https URL。
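论文中"解对原题、解错扰动题即为依赖记忆"的判据,可以用如下玩具代码示意(model、perturb 均为假设接口,并非论文实现;真实评测基于 KK 谜题与微调后的 LLM 输出):

```python
def memorization_score(model, puzzle, perturb, n=5):
    """局部记忆分数示意:模型解对原题但解错其扰动版本,提示依赖记忆。
    model(p) 返回答案是否正确 (bool);perturb(p, i) 生成第 i 个扰动题。"""
    solved_original = float(model(puzzle))
    solved_perturbed = sum(model(perturb(puzzle, i)) for i in range(n)) / n
    return solved_original - solved_perturbed  # 接近 1 → 强记忆;接近 0 → 泛化

# 玩具示例:一个"只会背原题"的模型
train_set = {"puzzle-1", "puzzle-2"}
model = lambda p: p in train_set
perturb = lambda p, i: f"{p}-v{i}"
print(memorization_score(model, "puzzle-1", perturb))  # 1.0
```

一个真正泛化的模型在原题与扰动题上表现接近,该分数趋于 0。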

[NLP-13] Teaching a Language Model to Distinguish Between Similar Details using a Small Adversarial Training Set

【速读】: 该论文试图解决语言模型在自然语言推理任务(NLI)中对人工创建的对抗性示例(adversarial examples)表现不佳的问题。解决方案的关键在于通过在少量人工创建的对抗性训练集上进行微调(fine-tuning),帮助语言模型学会区分数据中相似的词语和短语,从而提高其在对抗性测试集上的准确率(+13%),同时保持其在原始NLI任务上的良好表现。此外,该方法还使得模型在SNLI测试集中最相似的矛盾句上的准确率从91.2%提升至92.9%。

链接: https://arxiv.org/abs/2410.23118
作者: Chris Achard
关键词-EN: Natural Language Inference, manually created adversarial, Stanford Natural Language, natural language tasks, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Language models can achieve high accuracy on natural language tasks such as NLI, but performance suffers on manually created adversarial examples. We investigate the performance of a language model trained on the Stanford Natural Language Inference (SNLI) corpus on a manually created adversarial test set. We then improve the model’s performance by fine tuning the model on a small, manually created adversarial training set, designed to help the language model to learn to differentiate between similar words and phrases in the data. We show an increase in accuracy on the adversarial test set (+ 13%) while still maintaining good performance on the original NLI task. We also show an increase in accuracy from 91.2% to 92.9% on the most similar contradictions in the SNLI test set (as judged by cosine similarity).
摘要:语言模型在自然语言推理(NLI)等自然语言任务上可以实现高准确率,但在手动创建的对抗性示例上表现不佳。我们研究了在斯坦福自然语言推理(SNLI)语料库上训练的语言模型在手动创建的对抗性测试集上的表现。随后,我们通过在小型手动创建的对抗性训练集上微调模型,以帮助语言模型学习区分数据中相似的词语和短语,从而提升了模型的表现。我们在对抗性测试集上的准确率提高了13%,同时在原始NLI任务上仍保持良好表现。此外,我们在SNLI测试集中最相似的矛盾项上的准确率从91.2%提升至92.9%(根据余弦相似度判断)。
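"最相似矛盾对"的筛选依赖句向量余弦相似度,以下是一个最简示意(向量为虚构的玩具数据,实际应使用句子编码器的输出):

```python
import numpy as np

def cosine_similarity(u, v):
    """余弦相似度:用于衡量前提与假设句向量的相近程度。"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar_pairs(premise_vecs, hypothesis_vecs, top_k=2):
    """按前提/假设向量的余弦相似度降序返回样本下标。"""
    sims = [cosine_similarity(p, h) for p, h in zip(premise_vecs, hypothesis_vecs)]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]

# 玩具向量:第 0 对几乎同向,相似度最高
premises = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]
hypotheses = [np.array([0.9, 0.1]), np.array([-1.0, 1.0])]
print(most_similar_pairs(premises, hypotheses, top_k=1))  # [0]
```

对抗性训练集正是围绕这类"表面高度相似、标签却为矛盾"的样本构造的。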

[NLP-14] Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

【速读】: 该论文试图解决大视觉语言模型(Large Vision-Language Models, LVLMs)在生成内容时可能出现的幻觉问题,特别是对象间关系(relation)的幻觉问题。解决方案的关键在于设计了一个统一的框架,通过评估从模型响应中提取的(对象, 关系, 对象)三元组((object, relation, object) triplets)来同时测量对象和关系的幻觉。基于此框架,论文进一步引入了Tri-HE(Triplet-level Hallucination Evaluation)基准,用于同时研究对象和关系的幻觉问题。研究发现,现有LVLMs中关系幻觉问题比对象幻觉更为严重,并提出了一种简单且无需训练的方法来减轻幻觉,该方法在Tri-HE基准上超越了所有开源的竞争对手,达到了与GPT-4V相当的性能。

链接: https://arxiv.org/abs/2410.23114
作者: Junjie Wu,Tsz Ting Chung,Kai Chen,Dit-Yan Yeung
关键词-EN: Large Vision-Language Models, generate hallucinated contents, Large Vision-Language, Vision-Language Models, hallucination
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 8 figures

点击查看摘要

Abstract:Despite the outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated contents that do not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluate the object-related hallucinations. However, the potential hallucination on the relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy that, in this paper we design a unified framework to measure object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to conduct hallucination evaluation on (object, relation, object) triplets extracted from LVLMs’ responses, and thus, could be easily generalized to different vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. We conduct comprehensive evaluations on Tri-HE and observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple yet effective training-free approach to mitigate hallucinations for LVLMs, with which, we exceed all open-sourced counterparts on Tri-HE, achieving comparable performance with the powerful GPT-4V. Our dataset and code for the reproduction of our experiments are available publicly at this https URL.
摘要:尽管大视觉语言模型 (Large Vision-Language Models, LVLMs) 在视觉语言推理方面表现出色,但它们可能会生成与给定图像不符的幻觉内容。大多数现有的 LVLM 幻觉基准测试仅限于评估与对象相关的幻觉。然而,两个对象之间关系的潜在幻觉,即关系幻觉,仍缺乏研究。为此,本文设计了一个统一的框架,以同时测量 LVLMs 中的对象和关系幻觉。该框架的核心思想是对从 LVLMs 响应中提取的 (对象, 关系, 对象) 三元组进行幻觉评估,因此可以轻松推广到不同的视觉语言任务。基于此框架,我们进一步引入了 Tri-HE,这是一个新颖的三元组级幻觉评估基准,可用于同时研究对象和关系幻觉。我们对 Tri-HE 进行了全面的评估,并观察到现有 LVLMs 中的关系幻觉问题比对象幻觉更为严重,突显了在构建可靠 LVLMs 时一个先前被忽视的问题。此外,基于我们的发现,我们设计了一种简单而有效的无训练方法来减轻 LVLMs 的幻觉,通过这种方法,我们在 Tri-HE 上超越了所有开源的同类模型,达到了与强大的 GPT-4V 相当的性能。我们的数据集和代码已公开,可用于复现实验,详见此 https URL。
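按该框架的思路,三元组级幻觉评估可以粗略示意为"预测三元组与真值集合的比对"(以下为简化的假设性实现,论文中的判定由更复杂的流程完成):

```python
def triplet_hallucination_rate(predicted, ground_truth):
    """三元组级幻觉评估示意:统计响应中 (object, relation, object)
    三元组未出现在真值集合中的比例(真实判定远比精确匹配复杂)。"""
    if not predicted:
        return 0.0
    hallucinated = [t for t in predicted if t not in ground_truth]
    return len(hallucinated) / len(predicted)

gt = {("cat", "on", "mat"), ("dog", "beside", "cat")}
pred = [("cat", "on", "mat"), ("dog", "on", "mat")]  # 第二条的关系为幻觉
print(triplet_hallucination_rate(pred, gt))  # 0.5
```

对象幻觉与关系幻觉分别对应三元组中实体位与关系位的错误,这正是该统一框架能同时度量两者的原因。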

[NLP-15] Comparative Analysis of Demonstration Selection Algorithms for LLM In-Context Learning

【速读】: 该论文试图解决在大语言模型(LLMs)中,如何通过上下文学习(in-context learning)有效选择演示示例(demonstration examples)以优化模型性能的问题。解决方案的关键在于评估和比较现有的六种演示选择算法,从效率和效果两个角度出发,分析它们在不同任务和数据集上的表现。研究结果表明,算法性能在不同任务间存在显著差异,某些方法在特定场景下甚至不如随机选择。此外,增加演示数量并不总能提升性能,且在准确性和计算效率之间存在权衡。通过公开代码,该研究为未来改进演示选择算法提供了实验基础和参考。

链接: https://arxiv.org/abs/2410.23099
作者: Dong Shu,Mengnan Du
关键词-EN: Large Language Models, Language Models, Large Language, additional training, In-context learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 4 figures

点击查看摘要

Abstract:In-context learning can help Large Language Models (LLMs) to adapt new tasks without additional training. However, this performance heavily depends on the quality of the demonstrations, driving research into effective demonstration selection algorithms to optimize this process. These algorithms assist users in selecting the best k input-label pairs (demonstration examples) based on a given test input, enabling LLMs to in-context learn the relationship between the provided examples and the test inputs. Despite all the proposed demonstration selection algorithms, their efficiency and effectiveness remain unclear. This lack of clarity make it difficult to apply these algorithms in real-world scenarios and poses challenges for future research aimed at developing improved methods. This paper revisits six proposed algorithms, evaluating them on five datasets from both efficiency and effectiveness perspectives. Our experiments reveal significant variations in algorithm performance across different tasks, with some methods struggling to outperform random selection in certain scenarios. We also find that increasing the number of demonstrations does not always lead to better performance, and that there are often trade-offs between accuracy and computational efficiency. Our code is available at this https URL.
摘要:上下文学习可以帮助大语言模型 (LLMs) 在不进行额外训练的情况下适应新任务。然而,这种性能在很大程度上取决于演示的质量,从而推动了对有效演示选择算法的研究,以优化这一过程。这些算法帮助用户根据给定的测试输入选择最佳的 k 个输入-标签对(演示示例),使 LLMs 能够在上下文中学习提供的示例与测试输入之间的关系。尽管提出了多种演示选择算法,但其效率和有效性仍不明确。这种不明确性使得这些算法在实际应用中难以实施,并为未来旨在开发改进方法的研究带来了挑战。本文重新审视了六种提出的算法,从效率和有效性两个角度对五个数据集进行了评估。我们的实验揭示了不同任务中算法性能的显著差异,某些方法在某些情况下难以超越随机选择。我们还发现,增加演示的数量并不总是能带来更好的性能,并且在准确性和计算效率之间往往存在权衡。我们的代码可在以下链接获取:https URL。
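许多演示选择算法的共同骨架是"按与测试输入的相似度取 top-k",可示意如下(打分函数因算法而异,此处仅用余弦相似度作占位,数据为虚构):

```python
import numpy as np

def select_demonstrations(test_vec, pool_vecs, pool_examples, k=2):
    """基于相似度的演示选择示意:取与测试输入向量最相近的 k 个演示。
    实际被评测的各算法主要差别就在这里的打分函数上。"""
    pool = np.asarray(pool_vecs, dtype=float)
    q = np.asarray(test_vec, dtype=float)
    sims = pool @ q / (np.linalg.norm(pool, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [pool_examples[i] for i in top]

pool_vecs = [[1, 0], [0, 1], [0.9, 0.1]]
pool_examples = ["demo-A", "demo-B", "demo-C"]
print(select_demonstrations([1, 0], pool_vecs, pool_examples, k=2))  # ['demo-A', 'demo-C']
```

论文的结论之一正是:这类打分策略并不总优于随机选择,且 k 越大未必越好。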

[NLP-16] CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation

【速读】: 该论文试图解决现有研究主要集中在单轮对话的检索增强生成(Retrieval-Augmented Generation, RAG)系统,而忽视了现实应用中复杂的多轮对话场景的问题。解决方案的关键在于引入了一个大规模基准测试CORAL,该基准专门设计用于评估RAG系统在真实多轮对话环境中的表现。CORAL通过从维基百科自动提取多样化的信息寻求对话,涵盖了开放领域覆盖、知识密集度、自由形式回复和话题转移等关键挑战,并支持对话RAG的三个核心任务:段落检索、回复生成和引用标注。论文提出了一种统一的框架来标准化各种对话RAG方法,并通过在CORAL上的全面评估,展示了现有方法改进的巨大潜力。

链接: https://arxiv.org/abs/2410.23090
作者: Yiruo Cheng,Kelong Mao,Ziliang Zhao,Guanting Dong,Hongjin Qian,Yongkang Wu,Tetsuya Sakai,Ji-Rong Wen,Zhicheng Dou
关键词-EN: large language models, enhancing large language, language models, powerful paradigm, paradigm for enhancing
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has become a powerful paradigm for enhancing large language models (LLMs) through external knowledge retrieval. Despite its widespread attention, existing academic research predominantly focuses on single-turn RAG, leaving a significant gap in addressing the complexities of multi-turn conversations found in real-world applications. To bridge this gap, we introduce CORAL, a large-scale benchmark designed to assess RAG systems in realistic multi-turn conversational settings. CORAL includes diverse information-seeking conversations automatically derived from Wikipedia and tackles key challenges such as open-domain coverage, knowledge intensity, free-form responses, and topic shifts. It supports three core tasks of conversational RAG: passage retrieval, response generation, and citation labeling. We propose a unified framework to standardize various conversational RAG methods and conduct a comprehensive evaluation of these methods on CORAL, demonstrating substantial opportunities for improving existing approaches.
摘要:检索增强生成 (Retrieval-Augmented Generation, RAG) 已成为通过外部知识检索增强大语言模型 (Large Language Models, LLMs) 的一种强大范式。尽管其受到广泛关注,现有学术研究主要集中在单轮 RAG 上,未能充分解决现实应用中多轮对话的复杂性问题。为填补这一空白,我们引入了 CORAL,这是一个大规模基准测试,旨在评估 RAG 系统在现实多轮对话环境中的表现。CORAL 包含了从维基百科自动导出的多样化信息寻求对话,并应对开放领域覆盖、知识密集度、自由形式回复和话题转换等关键挑战。它支持对话 RAG 的三个核心任务:段落检索、回复生成和引用标注。我们提出了一种统一框架,以标准化各种对话 RAG 方法,并在 CORAL 上对这些方法进行了全面评估,展示了改进现有方法的巨大潜力。

[NLP-17] BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

【速读】: 该论文试图解决大型语言模型(LLMs)在自然语言处理中面临的推理速度和计算效率问题,特别是在实时部署中的限制。解决方案的关键是提出了一种名为BUZZ的新型键值(KV)缓存算法,该算法通过利用结构化的上下文信息来最小化缓存内存使用,同时增强推理速度。BUZZ采用蜂窝结构的稀疏缓存,结合滑动窗口捕捉近期信息,并动态地将历史标记分块,以优先处理局部邻域中的重要标记。实验结果表明,BUZZ在减少缓存内存使用的同时,保持了高准确性,并在多文档问答任务中超越了现有技术水平。

链接: https://arxiv.org/abs/2410.23079
作者: Junqi Zhao,Zhijin Fang,Shu Li,Shaohui Yang,Shichao He
关键词-EN: Large language models, limiting real-time deployment, natural language processing, Large language, natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are essential in natural language processing but often struggle with inference speed and computational efficiency, limiting real-time deployment. The key-value (KV) cache mechanism reduces computational overhead in transformer models, but challenges in maintaining contextual understanding remain. In this paper, we propose BUZZ, a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and dynamically segmenting historical tokens into chunks to prioritize important tokens in local neighborhoods. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA. Our results demonstrate that BUZZ (1) reduces cache memory usage by 2.5× in LLM inference while maintaining over 99% accuracy in long-text summarization, and (2) surpasses state-of-the-art performance in multi-document question answering by 7.69% under the same memory limit, where full cache methods encounter out-of-memory issues. Additionally, BUZZ achieves significant inference speedup with an O(log n) time complexity. The code is available at this https URL.
摘要:大语言模型(LLMs)在自然语言处理中至关重要,但在推理速度和计算效率方面常常面临挑战,限制了实时部署的可能性。键值(KV)缓存机制虽然减少了Transformer模型中的计算开销,但在维持上下文理解方面仍存在问题。本文提出了一种名为BUZZ的新型KV缓存算法,该算法利用结构化的上下文信息来最小化缓存内存使用,同时提升推理速度。BUZZ采用蜂巢结构的稀疏缓存,结合滑动窗口捕捉近期信息,并动态地将历史Token分段为块,以优先处理局部邻域中的重要Token。我们在四个真实世界的数据集上评估了BUZZ:CNN/Daily Mail、XSUM、Wikitext和10-QA。结果表明,BUZZ在LLM推理中将缓存内存使用减少了2.5倍,同时在长文本摘要中保持了超过99%的准确率;在相同内存限制下,BUZZ在多文档问答中的表现超越了最先进的方法7.69%,而全缓存方法在此情况下会遇到内存不足的问题。此外,BUZZ实现了显著的推理加速,时间复杂度为O(log n)。代码可在以下链接获取:https URL。
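BUZZ 的"滑动窗口 + 历史分块取重要 Token"的选择逻辑,可用如下简化代码示意(仅演示下标选择,真实实现作用于每层每头的 KV 张量;窗口与分块参数均为假设值):

```python
def select_kv_tokens(scores, window=4, chunk=4, keep_per_chunk=1):
    """蜂巢式稀疏 KV 缓存的极简示意:保留最近 window 个 token,
    并将更早的历史按 chunk 分块、每块保留注意力得分最高的 token。"""
    n = len(scores)
    keep = set(range(max(0, n - window), n))       # 滑动窗口:近期信息
    history = list(range(0, max(0, n - window)))
    for start in range(0, len(history), chunk):    # 历史分块
        block = history[start:start + chunk]
        block.sort(key=lambda i: scores[i], reverse=True)
        keep.update(block[:keep_per_chunk])        # 每块的"重要 token"
    return sorted(keep)

scores = [0.9, 0.1, 0.2, 0.8, 0.3, 0.1, 0.4, 0.2, 0.5, 0.6]
print(select_kv_tokens(scores))  # [0, 4, 6, 7, 8, 9]
```

这样缓存大小随上下文增长是次线性的,对应摘要中 2.5 倍的内存节省与对数级的选择开销。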

[NLP-18] Multi-Programming Language Sandbox for LLMs

【速读】: 该论文试图解决大型语言模型(LLM)在生成代码时缺乏统一和全面的编译器及分析工具反馈的问题。解决方案的关键是引入MPLSandbox,这是一个开箱即用的多编程语言沙箱,能够自动识别代码的编程语言,并在隔离的子沙箱中编译和执行代码,以确保安全性和稳定性。MPLSandbox还集成了传统和基于LLM的代码分析工具,提供对生成代码的全面分析。通过无缝集成到LLM的训练和部署中,MPLSandbox旨在提高生成代码的质量和正确性,同时简化研究人员在LLM相关代码任务中的工作流程,降低开发成本。

链接: https://arxiv.org/abs/2410.23074
作者: Shihan Dou,Jiazheng Zhang,Jianxiang Zang,Yunbo Tao,Haoxiang Jia,Shichun Liu,Yuming Yang,Shenxi Wu,Shaoqing Zhang,Muling Wu,Changze Lv,Limao Xiong,Wenyu Zhan,Lin Zhang,Rongxiang Weng,Jingang Wang,Xunliang Cai,Yueming Wu,Ming Wen,Rui Zheng,Tao Ji,Yixin Cao,Tao Gui,Xipeng Qiu,Qi Zhang,Xuanjing Huang
关键词-EN: Large Language Models, multi-programming language sandbox, language sandbox designed, Language Models, Large Language
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注: 25 pages, 14 figures

点击查看摘要

Abstract:We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs). It can automatically identify the programming language of the code, compiling and executing it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox also integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. MPLSandbox can be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of their generated code. It also helps researchers streamline their workflows for various LLM-based code-related tasks, reducing the development cost. To validate the effectiveness of MPLSandbox, we integrate it into training and deployment approaches, and also employ it to optimize workflows for a wide range of real-world code-related tasks. Our goal is to enhance researcher productivity on LLM-based code-related tasks by simplifying and automating workflows through delegation to MPLSandbox.
摘要:我们介绍了 MPLSandbox,这是一个开箱即用的多编程语言沙箱,旨在为大语言模型 (LLM) 提供来自编译器和分析工具的统一且全面的反馈。MPLSandbox 能够自动识别代码的编程语言,并在隔离的子沙箱中进行编译和执行,以确保安全性和稳定性。此外,MPLSandbox 还集成了传统和基于 LLM 的代码分析工具,提供对生成代码的全面分析。MPLSandbox 可以轻松集成到 LLM 的训练和部署中,以提高其生成代码的质量和正确性。它还帮助研究人员简化各种基于 LLM 的代码相关任务的工作流程,降低开发成本。为了验证 MPLSandbox 的有效性,我们将其集成到训练和部署方法中,并使用它来优化广泛的实际代码相关任务的工作流程。我们的目标是简化并自动化工作流程,通过委托给 MPLSandbox 来提高研究人员在基于 LLM 的代码相关任务中的生产力。
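"在隔离子进程中执行代码并回收编译/运行反馈"这一环节,可用 Python 标准库粗略示意(仅覆盖 Python 单一语言,函数名为假设;真实的 MPLSandbox 还包含语言自动识别、资源限制、文件系统隔离与多种分析工具):

```python
import os
import subprocess
import sys
import tempfile

def run_in_subprocess(code, timeout=5):
    """隔离执行示意:将代码写入临时文件,在子进程中运行并回收 stdout/stderr。"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode, proc.stdout, proc.stderr
    finally:
        os.unlink(path)

rc, out, err = run_in_subprocess("print(6 * 7)")
print(rc, out.strip())  # 0 42
```

返回码、标准输出与标准错误正是可以回灌给 LLM 的"统一反馈",用于迭代修正生成的代码。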

[NLP-19] Dont Just Pay Attention PLANT It: Transfer L2R Models to Fine-tune Attention in Extreme Multi-Label Text Classification

【速读】: 该论文试图解决极端多标签文本分类 (Extreme Multi-Label Text Classification, XMTC) 模型中获取最优注意力权重 (attention weights) 的挑战,这一过程既复杂又资源密集。解决方案的关键在于引入了一种名为 PLANT (Pretrained and Leveraged AtteNTion) 的新型迁移学习策略,用于微调 XMTC 解码器。PLANT 的核心创新包括:利用预训练的排序学习模型 (Learning-to-Rank model) 作为植入的注意力层,结合互信息增益 (mutual-information gain) 来增强注意力,引入一种“不注意”机制 (inattention mechanism),以及实现一个状态解码器 (stateful-decoder) 以维持上下文。这些技术共同作用,使得 PLANT 在多个数据集上超越了现有的最先进方法,特别是在少样本 (few-shot) 场景下表现尤为突出。

链接: https://arxiv.org/abs/2410.23066
作者: Debjyoti Saharoy,Javed A. Aslam,Virgil Pavlu
关键词-EN: Extreme Multi-Label Text, Multi-Label Text Classification, Text Classification, Extreme Multi-Label, optimal attention weights
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:State-of-the-art Extreme Multi-Label Text Classification (XMTC) models rely heavily on multi-label attention layers to focus on key tokens in input text, but obtaining optimal attention weights is challenging and resource-intensive. To address this, we introduce PLANT – Pretrained and Leveraged AtteNTion – a novel transfer learning strategy for fine-tuning XMTC decoders. PLANT surpasses existing state-of-the-art methods across all metrics on mimicfull, mimicfifty, mimicfour, eurlex, and wikiten datasets. It particularly excels in few-shot scenarios, outperforming previous models specifically designed for few-shot scenarios by over 50 percentage points in F1 scores on mimicrare and by over 36 percentage points on mimicfew, demonstrating its superior capability in handling rare codes. PLANT also shows remarkable data efficiency in few-shot scenarios, achieving precision comparable to traditional models with significantly less data. These results are achieved through key technical innovations: leveraging a pretrained Learning-to-Rank model as the planted attention layer, integrating mutual-information gain to enhance attention, introducing an inattention mechanism, and implementing a stateful-decoder to maintain context. Comprehensive ablation studies validate the importance of these contributions in realizing the performance gains.
摘要:当前最先进的极多标签文本分类 (Extreme Multi-Label Text Classification, XMTC) 模型严重依赖多标签注意力层来聚焦于输入文本中的关键 Token,但获取最优注意力权重既具挑战性又资源密集。为此,我们提出了 PLANT——预训练与利用注意力 (Pretrained and Leveraged AtteNTion)——一种用于微调 XMTC 解码器的新型迁移学习策略。PLANT 在 mimicfull、mimicfifty、mimicfour、eurlex 和 wikiten 数据集上的所有指标均超越了现有的最先进方法。特别是在少样本场景下,PLANT 表现尤为突出,其在 mimicrare 数据集上的 F1 分数比专门为少样本场景设计的模型高出 50 个百分点以上,在 mimicfew 数据集上高出 36 个百分点以上,展示了其在处理稀有代码方面的卓越能力。PLANT 在少样本场景下也表现出显著的数据效率,能够以显著较少的数据量达到与传统模型相当的精度。这些成果的实现得益于关键的技术创新:利用预训练的排序学习模型作为植入的注意力层,整合互信息增益以增强注意力,引入非注意力机制,并实施状态解码器以维持上下文。全面的消融研究验证了这些贡献在实现性能提升中的重要性。
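多标签注意力层的基本形式是"每个标签向量对 Token 表征做注意力池化";PLANT 的关键在于用预训练的排序学习模型来初始化这层权重。下面给出该层本身的最简 NumPy 示意(随机玩具数据,非论文实现):

```python
import numpy as np

def labelwise_attention(H, L):
    """多标签注意力池化示意:每个标签向量对 token 表征加权,
    得到逐标签的文档表征。H: (seq_len, hidden),L: (num_labels, hidden)。"""
    scores = L @ H.T                           # (num_labels, seq_len)
    scores -= scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)  # 沿 token 维做 softmax
    return alpha @ H                           # (num_labels, hidden)

H = np.random.randn(6, 8)   # 6 个 token,隐藏维度 8
L = np.random.randn(3, 8)   # 3 个标签
out = labelwise_attention(H, L)
print(out.shape)  # (3, 8)
```

获取好的 L(即注意力权重)正是摘要所说"资源密集"的部分,迁移预训练模型避免了从零学习。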

[NLP-20] Controlling Language and Diffusion Models by Transporting Activations

【速读】: 该论文试图解决大型生成模型(如大型语言模型 (LLMs) 和文本到图像扩散模型 (T2Is))在可靠性和安全性方面的问题,特别是模型输出的不可控性和潜在的滥用风险。解决方案的关键在于引入了一种名为激活传输 (Activation Transport, AcT) 的通用框架,该框架基于最优传输理论,能够通过调整模型激活来引导生成过程,从而实现对模型行为的精细控制。AcT 具有以下特点:1) 模态无关性,适用于多种生成模型;2) 提供细粒度的控制能力,能够在不影响模型性能的前提下,有效减少输出中的有害内容、引入特定概念或增强输出的真实性;3) 计算开销极小,几乎不影响模型的运行效率。通过实验,论文展示了 AcT 在 LLMs 和 T2Is 中的有效性和多功能性。

链接: https://arxiv.org/abs/2410.23054
作者: Pau Rodriguez,Arno Blaas,Michal Klein,Luca Zappella,Nicholas Apostoloff,Marco Cuturi,Xavier Suau
关键词-EN: potential misuse, increasing capabilities, widespread deployment, deployment have raised, raised concerns
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.
摘要:随着大型生成式模型能力的不断提升及其广泛部署,其可靠性、安全性以及潜在的滥用问题引起了广泛关注。为应对这些问题,近期研究提出了通过调控模型激活来控制生成内容的方法,以有效引导或阻止特定概念或行为在输出中的出现。本文介绍了激活传输(Activation Transport, AcT),这是一个基于最优传输理论的通用框架,能够推广多种先前的激活调控方法。AcT 具有模态无关性,能够在几乎不增加计算开销的情况下,对模型行为进行精细控制,同时对模型能力的影响最小。我们通过实验展示了该方法在应对大语言模型(LLMs)和文本到图像扩散模型(T2Is)中的关键挑战时的有效性和多样性。对于 LLMs,我们展示了 AcT 在缓解毒性、引导任意概念生成以及提高输出真实性方面的有效性。在 T2Is 中,我们展示了 AcT 如何实现精细的风格控制和概念否定。
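AcT 所依据的最优传输在一维高斯情形下有封闭解:T(x) = μ_t + (σ_t/σ_s)(x − μ_s)。以下用玩具数据示意这种"把源激活分布搬运到目标分布"的仿射映射(仅为概念演示,非论文实现):

```python
import numpy as np

def gaussian_transport_map(source, target):
    """一维高斯最优传输的仿射映射:T(x) = mu_t + (sd_t/sd_s)(x - mu_s)。"""
    mu_s, sd_s = source.mean(), source.std()
    mu_t, sd_t = target.mean(), target.std()
    return lambda x: mu_t + (sd_t / sd_s) * (x - mu_s)

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, 10000)   # 源激活
tgt = rng.normal(3.0, 0.5, 10000)   # 目标激活(例如"无毒"概念方向上的分布)
T = gaussian_transport_map(src, tgt)
moved = T(src)
print(round(moved.mean(), 1), round(moved.std(), 1))  # 3.0 0.5
```

由于映射是仿射的,推理时对激活做这样的变换几乎不增加计算开销,这对应摘要中"开销可忽略"的说法。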

[NLP-21] Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

【速读】: 该论文试图解决强化学习(RL)中从自然语言描述自动合成密集奖励的问题,特别是在稀疏奖励问题、开放式探索和分层技能设计中的应用。论文提出的解决方案之关键是ONI,一种分布式架构,它同时学习RL策略和内在奖励函数,利用大型语言模型(LLM)的反馈。ONI通过异步LLM服务器注释代理收集的经验,并将其提炼成内在奖励模型。该方法探索了不同复杂度的奖励建模算法选择,包括哈希、分类和排序模型,并研究了它们在稀疏奖励问题中的相对权衡。ONI在NetHack学习环境中的一系列具有挑战性的稀疏奖励任务上实现了最先进的性能,仅使用代理收集的经验,无需外部数据集或源代码。

链接: https://arxiv.org/abs/2410.23022
作者: Qinqing Zheng,Mikael Henaff,Amy Zhang,Aditya Grover,Brandon Amos
关键词-EN: Automatically synthesizing dense, Automatically synthesizing, natural language descriptions, synthesizing dense rewards, hierarchical skill design
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples; or are limited to reward functions expressible by compact code, which may require source code and have difficulty capturing nuanced semantics; or require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent’s collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. By studying their relative tradeoffs, we shed light on questions regarding intrinsic reward design for sparse reward problems. Our approach achieves state-of-the-art performance across a range of challenging, sparse reward tasks from the NetHack Learning Environment in a simple unified process, solely using the agent’s gathered experience, without requiring external datasets nor source code. We make our code available at this https URL (coming soon).
摘要:自动从自然语言描述中合成密集奖励是强化学习(Reinforcement Learning, RL)中一个有前景的范式,适用于稀疏奖励问题、开放式探索以及分层技能设计。近期的工作通过利用大语言模型(Large Language Models, LLMs)的先验知识取得了显著进展。然而,这些方法存在重要局限性:它们要么无法扩展到需要数十亿环境样本的问题;要么局限于可通过紧凑代码表达的奖励函数,这可能需要源代码,并且难以捕捉微妙的语义;或者需要多样化的离线数据集,而这种数据集可能不存在或难以收集。在本研究中,我们通过算法和系统层面的贡献来解决这些局限性。我们提出了ONI,一种分布式架构,它同时学习RL策略和基于LLM反馈的内在奖励函数。我们的方法通过异步LLM服务器对智能体收集的经验进行标注,然后将其提炼成内在奖励模型。我们探索了一系列不同复杂度的奖励建模算法选择,包括哈希、分类和排序模型。通过研究它们的相对权衡,我们揭示了关于稀疏奖励问题内在奖励设计的相关问题。我们的方法在NetHack学习环境中一系列具有挑战性的稀疏奖励任务上实现了最先进的性能,整个过程简单统一,仅使用智能体收集的经验,无需外部数据集或源代码。我们将在此 https URL(即将上线)提供我们的代码。
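ONI 中"总奖励 = 外在奖励 + 系数 × 内在奖励"的组合,以及哈希型奖励建模的"新颖即奖励"思想,可用如下玩具类示意(真实系统中内在奖励来自 LLM 标注再蒸馏出的奖励模型,此处以观测去重近似"新颖性",类名与参数均为假设):

```python
class HashingIntrinsicReward:
    """哈希型内在奖励的极简示意:新观测得 1,重复观测得 0。"""
    def __init__(self, coef=0.5):
        self.coef = coef
        self.seen = set()

    def __call__(self, obs, extrinsic):
        novelty = 0.0 if obs in self.seen else 1.0
        self.seen.add(obs)
        return extrinsic + self.coef * novelty  # 总奖励 = 外在 + 系数 × 内在

r = HashingIntrinsicReward(coef=0.5)
print(r("state-A", 0.0))  # 0.5:首次见到,获得内在奖励
print(r("state-A", 0.0))  # 0.0:重复观测,无内在奖励
```

分类型与排序型奖励模型替换的只是 novelty 这一项的来源,组合方式不变。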

[NLP-22] Long2RAG: Evaluating Long-Context Long-Form Retrieval-Augmented Generation with Key Point Recall EMNLP2024

【速读】: 该论文试图解决当前检索增强生成 (Retrieval-augmented Generation, RAG) 系统评估基准的两个主要缺陷:(1) 缺乏能够充分反映检索文档特性的长上下文检索数据集,导致无法准确评估大语言模型 (Large Language Models, LLMs) 处理长上下文检索的能力;(2) 缺乏全面的评估方法来衡量 LLMs 生成利用检索信息的长篇回复的能力。解决方案的关键在于引入 Long2RAG 基准和 Key Point Recall (KPR) 指标。Long2RAG 包含 280 个跨 10 个领域和 8 种问题类别的问答对,每个问题附带 5 个平均长度为 2,444 字的检索文档。KPR 指标通过评估 LLMs 在生成回复时对检索文档中关键点的整合程度,提供了一种更为细致的评估方法,从而更有效地衡量 LLMs 利用检索信息的能力。

链接: https://arxiv.org/abs/2410.23000
作者: Zehan Qi,Rongwu Xu,Zhijiang Guo,Cunxiang Wang,Hao Zhang,Wei Xu
关键词-EN: large language models, Retrieval-augmented generation, language models, promising approach, limitations of fixed
类目: Computation and Language (cs.CL)
备注: Our paper has been accepted for EMNLP 2024

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in large language models (LLMs). However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs’ capability in handling long-context retrieval due to a lack of datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs’ ability to generate long-form responses that effectively exploit retrieved information. To address these shortcomings, we introduce the Long2RAG benchmark and the Key Point Recall (KPR) metric. Long2RAG comprises 280 questions spanning 10 domains and across 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information. Our dataset and scripts are available at this https URL.
摘要:检索增强生成 (Retrieval-augmented Generation, RAG) 是一种有前景的方法,用于解决大语言模型 (Large Language Models, LLMs) 中固定知识的局限性。然而,当前用于评估 RAG 系统的基准存在两个关键缺陷:(1) 它们未能充分衡量 LLMs 处理长上下文检索的能力,因为缺乏反映检索文档特征的数据集;(2) 它们缺乏一种全面的评估方法,用于评估 LLMs 生成利用检索信息的长篇响应的能力。为了解决这些不足,我们引入了 Long2RAG 基准和关键点召回 (KPR) 指标。Long2RAG 包含 280 个问题,涵盖 10 个领域和 8 种问题类别,每个问题关联 5 个平均长度为 2,444 字的检索文档。KPR 评估 LLMs 在生成响应时整合从检索文档中提取的关键点的程度,从而提供对其利用检索信息能力的更细致评估。我们的数据集和脚本可在以下链接获取:https URL。
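KPR 指标的核心是"响应覆盖的关键点占比",可示意如下(covered 判定在论文中由 LLM 完成,此处用子串匹配作占位,数据为虚构):

```python
def key_point_recall(response, key_points, covered):
    """KPR 示意:关键点召回 = 响应覆盖的关键点数 / 检索文档中的关键点总数。
    covered(response, kp) 判断响应是否涵盖某关键点。"""
    if not key_points:
        return 0.0
    hits = sum(covered(response, kp) for kp in key_points)
    return hits / len(key_points)

contains = lambda resp, kp: kp in resp  # 占位判定:实际由 LLM 判断语义覆盖
resp = "RAG 将检索与生成结合,可缓解知识过时问题。"
kps = ["检索与生成结合", "缓解知识过时", "降低幻觉"]
print(round(key_point_recall(resp, kps, contains), 2))  # 0.67:覆盖 3 个关键点中的 2 个
```

与逐词重叠类指标相比,这种按关键点计分的方式更贴近"是否利用了检索到的信息"这一问题。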

[NLP-23] VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning

【速读】: 该论文试图解决大型语言模型(LLMs)和大型多模态模型(LMMs)在处理视觉信息辅助的数学问题解决(MPS)过程中存在的不足,特别是模型在视觉辅助推理过程中的幻觉问题。解决方案的关键在于提出了VisAidMath基准,这是一个用于评估视觉信息辅助MPS过程的基准。该基准通过严格的自动化和手动标注数据处理流程,确保了数据的质量和可靠性,包含了从多种来源收集的1,200个具有挑战性的数学问题。基于此基准,论文对十个主流的LLMs和LMMs进行了全面评估,揭示了模型在视觉辅助推理任务中的不足,例如GPT-4V在该任务中的准确率仅为45.33%,并指出幻觉是导致这些不足的主要原因,为未来的研究指明了方向。

链接: https://arxiv.org/abs/2410.22995
作者: Jingkun Ma,Runzhe Zhan,Derek F. Wong,Yang Li,Di Sun,Hou Pong Chan,Lidia S. Chao
关键词-EN: large language models, large multi-modal models, problem-solving remains insufficient, explored mathematical problem-solving, systematically explored mathematical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 58 pages, 28 figures

点击查看摘要

Abstract:Although previous research on large language models (LLMs) and large multi-modal models (LMMs) has systematically explored mathematical problem-solving (MPS) within visual contexts, the analysis of how these models process visual information during problem-solving remains insufficient. To address this gap, we present VisAidMath, a benchmark for evaluating the MPS process related to visual information. We follow a rigorous data curation pipeline involving both automated processes and manual annotations to ensure data quality and reliability. Consequently, this benchmark includes 1,200 challenging problems from various mathematical branches, vision-aid formulations, and difficulty levels, collected from diverse sources such as textbooks, examination papers, and Olympiad problems. Based on the proposed benchmark, we conduct comprehensive evaluations on ten mainstream LLMs and LMMs, highlighting deficiencies in the visual-aided reasoning process. For example, GPT-4V only achieves 45.33% accuracy in the visual-aided reasoning task, even with a drop of 2 points when provided with golden visual aids. In-depth analysis reveals that the main cause of deficiencies lies in hallucination regarding the implicit visual reasoning process, shedding light on future research directions in the visual-aided MPS process.
摘要:尽管先前关于大语言模型(LLMs)和大多模态模型(LMMs)的研究已经系统性地探索了在视觉情境下的数学问题解决(MPS),但对于这些模型在问题解决过程中如何处理视觉信息的分析仍显不足。为了填补这一空白,我们提出了VisAidMath,一个用于评估与视觉信息相关的MPS过程的基准。我们遵循了一个严格的数据整理流程,包括自动化处理和人工标注,以确保数据的质量和可靠性。因此,该基准包含了从各种数学分支、视觉辅助表达和难度级别中收集的1,200个具有挑战性的问题,这些问题的来源包括教科书、考试试卷和奥林匹克竞赛题目。基于所提出的基准,我们对十个主流的LLMs和LMMs进行了全面的评估,突显了在视觉辅助推理过程中的不足。例如,GPT-4V在视觉辅助推理任务中仅达到45.33%的准确率,即使在提供黄金视觉辅助的情况下,准确率也下降了2个百分点。深入分析表明,主要缺陷在于对隐含视觉推理过程的幻觉现象,这为未来在视觉辅助MPS过程中的研究方向提供了启示。

[NLP-24] Bonafide at LegalLens 2024 Shared Task: Using Lightweight DeBERTa Based Encoder For Legal Violation Detection and Resolution

【速读】: 该论文旨在解决非结构化文本数据中法律违规行为的检测及其与潜在受影响个体的关联问题。解决方案的关键在于提出了两个基于DeBERTa的轻量级编码器系统:命名实体识别系统 (Named Entity Resolution, NER) 和自然语言推理系统 (Natural Language Inference, NLI)。NER系统用于识别文本中的违规行为,在LegalLens挑战的Subtask A中达到了60.01%的F1分数,排名第六;NLI系统用于将这些违规行为与已有的集体诉讼案件中的法律投诉进行匹配,在Subtask B中达到了84.73%的F1分数,排名第五。这两个系统均优于大型语言模型 (LLM) 基线,并公开了训练模型和推理脚本。

链接: https://arxiv.org/abs/2410.22977
作者: Shikha Bordia
关键词-EN: Named Entity Resolution, Natural Language Inference, Named Entity, Entity Resolution, Natural Language
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this work, we present two systems – Named Entity Resolution (NER) and Natural Language Inference (NLI) – for detecting legal violations within unstructured textual data and for associating these violations with potentially affected individuals, respectively. Both these systems are lightweight DeBERTa based encoders that outperform the LLM baselines. The proposed NER system achieved an F1 score of 60.01% on Subtask A of the LegalLens challenge, which focuses on identifying violations. The proposed NLI system achieved an F1 score of 84.73% on Subtask B of the LegalLens challenge, which focuses on resolving these violations by matching them with pre-existing legal complaints of class action cases. Our NER system ranked sixth and NLI system ranked fifth on the LegalLens leaderboard. We release the trained models and inference scripts.
摘要:在本研究中,我们提出了两个系统——命名实体解析 (Named Entity Resolution, NER) 和自然语言推理 (Natural Language Inference, NLI),分别用于检测非结构化文本数据中的法律违规行为,并将这些违规行为与可能受影响的个人关联起来。这两个系统均基于轻量级的 DeBERTa 编码器,其性能优于大语言模型 (Large Language Model, LLM) 的基准。所提出的 NER 系统在 LegalLens 挑战赛的子任务 A 中达到了 60.01% 的 F1 分数,该子任务专注于识别违规行为。所提出的 NLI 系统在 LegalLens 挑战赛的子任务 B 中达到了 84.73% 的 F1 分数,该子任务专注于通过匹配这些违规行为与现有的集体诉讼案件中的预存法律投诉来解决这些违规行为。我们的 NER 系统在 LegalLens 排行榜上排名第六,NLI 系统排名第五。我们发布了训练好的模型和推理脚本。

[NLP-25] Private Synthetic Text Generation with Diffusion Models

【速读】: 该论文试图解决在差分隐私(differential privacy)条件下,扩散模型(diffusion models)生成合成文本的能力问题。解决方案的关键在于通过广泛的实验,重新评估并实现先前关于使用大语言模型(LLMs)生成私有文本的研究,揭示其中可能存在的未满足的假设,这些假设可能导致差分隐私保证被违反。研究结果部分反驳了先前的非隐私条件下的发现,并表明在隐私保护机制下,完全开源的LLMs在生成合成文本方面优于扩散模型。

链接: https://arxiv.org/abs/2410.22971
作者: Sebastian Ochs,Ivan Habernal
关键词-EN: generating synthetic texts, generating synthetic data, diffusion models, differential privacy
类目: Computation and Language (cs.CL)
备注:

Abstract:How capable are diffusion models of generating synthetic texts? Recent research shows their strengths, with performance reaching that of auto-regressive LLMs. But are they also good at generating synthetic data if the training was under differential privacy? Here the evidence is missing, yet the promises from private image generation look strong. In this paper we address this open question by extensive experiments. At the same time, we critically assess (and reimplement) previous works on synthetic private text generation with LLMs and reveal some unmet assumptions that might have led to violating the differential privacy guarantees. Our results partly contradict previous non-private findings and show that fully open-source LLMs outperform diffusion models in the privacy regime. Our complete source code, datasets, and experimental setup are publicly available to foster future research.
摘要:扩散模型在生成合成文本方面的能力如何?最近的研究表明,它们的优势明显,性能已达到自回归大语言模型的水平。但如果训练是在差分隐私条件下进行的,它们在生成合成数据方面是否同样出色?目前这方面的证据尚不充分,但私有图像生成的潜力看起来很强。本文通过广泛的实验来探讨这一开放性问题。同时,我们批判性地评估(并重新实现了)之前关于使用大语言模型生成私有合成文本的研究,揭示了一些未满足的假设,这些假设可能导致违反差分隐私的保障。我们的部分结果与之前的非私有研究结论相矛盾,并表明在隐私保护机制下,完全开源的大语言模型优于扩散模型。我们完整的源代码、数据集和实验设置均公开,以促进未来研究。

[NLP-26] Focus On This Not That! Steering LLM s With Adaptive Feature Specification

【速读】: 该论文试图解决大型语言模型(LLMs)在执行用户指定任务时,由于训练数据中的虚假或偏见特征而导致的不良行为问题。解决方案的关键是引入聚焦指令调优(Focus Instruction Tuning, FIT),通过训练LLMs在生成响应时聚焦于特定特征并忽略其他特征,从而根据指定的特征表现出不同的行为。实验结果表明,聚焦调优的模型在推理时可以通过聚焦于不同的特征来动态调整行为,例如通过聚焦于任务因果特征并忽略虚假特征来提高鲁棒性,或通过忽略人口统计类别来减少社会偏见。此外,FIT还能在新环境中引导行为,适应分布偏移并在推理时处理未见过的特征,从而在实际应用中实现更鲁棒、公平和可控的LLM应用。

链接: https://arxiv.org/abs/2410.22944
作者: Tom A. Lamb,Adam Davies,Alasdair Paren,Philip H.S. Torr,Francesco Pinto
关键词-EN: Focus Instruction Tuning, arbitrary user-specified tasks, training large language, Instruction Tuning, perform arbitrary user-specified
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 28pages, 14 figures

Abstract:Despite the success of Instruction Tuning (IT) in training large language models (LLMs) to perform arbitrary user-specified tasks, these models often still leverage spurious or biased features learned from their training data, leading to undesired behaviours when deploying them in new contexts. In this work, we introduce Focus Instruction Tuning (FIT), which trains LLMs to condition their responses by focusing on specific features whilst ignoring others, leading to different behaviours based on what features are specified. Across several experimental settings, we show that focus-tuned models can be adaptively steered by focusing on different features at inference-time: for instance, robustness can be improved by focusing on task-causal features and ignoring spurious features, and social bias can be mitigated by ignoring demographic categories. Furthermore, FIT can steer behaviour in new contexts, generalising under distribution shift and to new unseen features at inference time, and thereby facilitating more robust, fair, and controllable LLM applications in real-world environments.
摘要:尽管指令调优 (Instruction Tuning) 在训练大语言模型 (LLMs) 以执行任意用户指定的任务方面取得了成功,但这些模型在部署到新环境中时,往往仍然会利用从训练数据中学到的虚假或偏见特征,导致不期望的行为。在本研究中,我们引入了聚焦指令调优 (Focus Instruction Tuning, FIT),该方法训练 LLMs 在响应时聚焦于特定特征而忽略其他特征,从而根据所指定的特征产生不同的行为。在多个实验设置中,我们展示了聚焦调优的模型可以通过在推理时聚焦于不同特征来适应性地调整行为:例如,通过聚焦于任务因果特征并忽略虚假特征,可以提高鲁棒性;通过忽略人口统计类别,可以减轻社会偏见。此外,FIT 能够在新的环境中引导行为,在分布偏移和推理时对未见特征进行泛化,从而在现实世界环境中促进更鲁棒、公平和可控的 LLM 应用。
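FIT 的核心是在指令中显式声明"应关注的特征"与"应忽略的特征"。下面给出一个极简示意,展示这种聚焦指令在推理时如何拼装;提示词的具体措辞为本文假设,并非论文的原始模板:

```python
def build_focus_prompt(task: str, focus_on: list, ignore: list) -> str:
    # 按 FIT 的思路,在任务说明前显式列出应关注与应忽略的特征
    lines = [
        "Focus on the following features: " + "; ".join(focus_on) + ".",
        "Ignore the following features: " + "; ".join(ignore) + ".",
        "Task: " + task,
    ]
    return "\n".join(lines)

prompt = build_focus_prompt(
    task="Classify the sentiment of: 'The plot was thin but the acting was superb.'",
    focus_on=["task-causal cues (sentiment-bearing words)"],
    ignore=["spurious cues (review length, movie genre)"],
)
print(prompt)
```

推理时只需替换 focus_on / ignore 列表,即可在同一模型上实现"聚焦任务因果特征以提升鲁棒性"或"忽略人口统计类别以减轻偏见"两种行为。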

[NLP-27] Multi-Agent Large Language Models for Conversational Task-Solving

【速读】: 该论文试图解决多智能体系统在对话任务解决中的局限性问题,特别是关于对话范式和个体智能体的影响。解决方案的关键在于系统性地评估多智能体系统在不同讨论范式下的表现,并提出一个分类框架来部署多智能体大型语言模型(LLMs)。论文通过实验展示了多智能体系统在复杂推理任务中的优势,但也揭示了三个主要挑战:1) 长讨论虽增强推理,但导致任务偏离;2) 长时间讨论可能引发对齐崩溃,带来新的安全问题;3) 讨论垄断问题,影响任务如总结的决策公平性。这些发现为未来研究提供了改进多智能体系统效率、性能和安全性的方向。

链接: https://arxiv.org/abs/2410.22932
作者: Jonas Becker
关键词-EN: single large language, large language models, intelligence for years, large language, dominated the landscape
类目: Computation and Language (cs.CL)
备注:

Abstract:In an era where single large language models have dominated the landscape of artificial intelligence for years, multi-agent systems arise as new protagonists in conversational task-solving. While previous studies have showcased their potential in reasoning tasks and creative endeavors, an analysis of their limitations concerning the conversational paradigms and the impact of individual agents is missing. It remains unascertained how multi-agent discussions perform across tasks of varying complexity and how the structure of these conversations influences the process. To fill that gap, this work systematically evaluates multi-agent systems across various discussion paradigms, assessing their strengths and weaknesses in both generative tasks and question-answering tasks. Alongside the experiments, I propose a taxonomy of 20 multi-agent research studies from 2022 to 2024, followed by the introduction of a framework for deploying multi-agent LLMs in conversational task-solving. I demonstrate that while multi-agent systems excel in complex reasoning tasks, outperforming a single model by leveraging expert personas, they fail on basic tasks. Concretely, I identify three challenges that arise: 1) While longer discussions enhance reasoning, agents fail to maintain conformity to strict task requirements, which leads to problem drift, making shorter conversations more effective for basic tasks. 2) Prolonged discussions risk alignment collapse, raising new safety concerns for these systems. 3) I showcase discussion monopolization through long generations, posing the problem of fairness in decision-making for tasks like summarization. This work uncovers both the potential and challenges that arise with multi-agent interaction and varying conversational paradigms, providing insights into how future research could improve the efficiency, performance, and safety of multi-agent LLMs.
摘要:在单一大语言模型长期主导人工智能领域的时代,多智能体系统作为对话任务解决的新主角崭露头角。尽管先前的研究展示了它们在推理任务和创造性工作中的潜力,但对于其在对话范式中的局限性以及个体智能体的影响分析尚显不足。目前尚不清楚多智能体讨论在不同复杂度任务中的表现如何,以及这些对话结构如何影响任务解决过程。为了填补这一空白,本文系统地评估了多智能体系统在多种讨论范式下的表现,分析了其在生成任务和问答任务中的优缺点。在实验过程中,我提出了一份从2022年到2024年的20项多智能体研究分类,并引入了一个用于部署多智能体大语言模型进行对话任务解决的框架。实验表明,尽管多智能体系统在复杂推理任务中表现出色,通过利用专家角色超越了单一模型,但在基础任务上却表现不佳。具体来说,我识别出三个主要挑战:1) 虽然较长的讨论能提升推理能力,但智能体难以保持对严格任务要求的符合性,导致任务偏移,使得较短的对话在基础任务中更为有效。2) 长时间的讨论可能导致对齐崩溃,为这些系统带来了新的安全问题。3) 我展示了通过长时间生成导致的讨论垄断现象,提出了在诸如摘要等任务中决策公平性的问题。本研究揭示了多智能体交互和不同对话范式带来的潜力与挑战,为未来研究如何提升多智能体大语言模型的效率、性能和安全性提供了见解。

[NLP-28] Explainable Behavior Cloning: Teaching Large Language Model Agents through Learning by Demonstration

【速读】: 该论文试图解决移动应用复杂性增加背景下,开发能够有效导航和交互的智能代理的挑战。解决方案的关键在于提出了一个名为Explainable Behavior Cloning LLM Agent (EBC-LLMAgent)的新方法,该方法结合了大型语言模型 (LLMs) 和行为克隆技术,通过学习演示来创建智能且可解释的代理。EBC-LLMAgent的核心模块包括演示编码、代码生成和UI映射,这些模块协同工作以捕捉用户演示、生成可执行代码并建立代码与UI元素之间的准确对应关系。论文还引入了行为克隆链融合技术,以增强代理的泛化能力。实验结果表明,EBC-LLMAgent在多个领域的流行移动应用中表现优异,能够高效完成任务、泛化到未见场景并生成有意义的解释。

链接: https://arxiv.org/abs/2410.22916
作者: Yanchu Guan,Dong Wang,Yan Wang,Haiqing Wang,Renen Sun,Chenyi Zhuang,Jinjie Gu,Zhixuan Chu
关键词-EN: Autonomous mobile app, mobile app interaction, Behavior Cloning, increasingly important, important with growing
类目: Computation and Language (cs.CL)
备注: 20 pages

Abstract:Autonomous mobile app interaction has become increasingly important with growing complexity of mobile applications. Developing intelligent agents that can effectively navigate and interact with mobile apps remains a significant challenge. In this paper, we propose an Explainable Behavior Cloning LLM Agent (EBC-LLMAgent), a novel approach that combines large language models (LLMs) with behavior cloning by learning demonstrations to create intelligent and explainable agents for autonomous mobile app interaction. EBC-LLMAgent consists of three core modules: Demonstration Encoding, Code Generation, and UI Mapping, which work synergistically to capture user demonstrations, generate executable codes, and establish accurate correspondence between code and UI elements. We introduce the Behavior Cloning Chain Fusion technique to enhance the generalization capabilities of the agent. Extensive experiments on five popular mobile applications from diverse domains demonstrate the superior performance of EBC-LLMAgent, achieving high success rates in task completion, efficient generalization to unseen scenarios, and the generation of meaningful explanations.
摘要:随着移动应用程序复杂性的不断增加,自主移动应用交互变得越来越重要。开发能够有效导航和与移动应用交互的智能代理仍然是一个重大挑战。本文提出了一种可解释行为克隆大语言模型代理 (Explainable Behavior Cloning LLM Agent, EBC-LLMAgent),这是一种结合了大语言模型 (Large Language Models, LLMs) 和行为克隆的新方法,通过学习演示来创建智能且可解释的代理,用于自主移动应用交互。EBC-LLMAgent 由三个核心模块组成:演示编码 (Demonstration Encoding)、代码生成 (Code Generation) 和 UI 映射 (UI Mapping),这些模块协同工作,以捕捉用户演示、生成可执行代码,并建立代码与 UI 元素之间的准确对应关系。我们引入了行为克隆链融合技术 (Behavior Cloning Chain Fusion) 来增强代理的泛化能力。在来自不同领域的五个流行移动应用程序上的广泛实验表明,EBC-LLMAgent 在任务完成的成功率、对未见场景的高效泛化以及生成有意义的解释方面表现出色。
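EBC-LLMAgent 的三个模块(演示编码、代码生成、UI 映射)可以用一个玩具流程示意:把一次录制的用户演示逐步翻译为可执行语句,并用一张"元素名到选择器"的映射表充当 UI 映射模块。其中 demo 轨迹、UI_MAP 表与 driver 接口均为本文假设,仅用于说明数据流向:

```python
# 假设的演示轨迹:(动作, UI 元素, 输入值)
demo = [
    ("tap", "search_box", None),
    ("type", "search_box", "noise-cancelling headphones"),
    ("tap", "search_button", None),
]

# 假设的 UI 映射表:元素名 -> 界面选择器(对应论文的 UI Mapping 模块)
UI_MAP = {"search_box": "id=search_input", "search_button": "id=search_go"}

def encode_demo(trace):
    # 演示编码 + 代码生成:将轨迹逐条翻译为可执行语句(此处以字符串表示)
    code = []
    for action, elem, value in trace:
        selector = UI_MAP[elem]
        if action == "tap":
            code.append(f"driver.click('{selector}')")
        elif action == "type":
            code.append(f"driver.send_keys('{selector}', {value!r})")
    return code

for line in encode_demo(demo):
    print(line)
```

真实系统中,代码生成由 LLM 完成,UI 映射也需处理界面变化与元素歧义,这正是论文各模块要解决的难点。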

[NLP-29] From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

【速读】: 该论文试图解决的问题是如何评估基于音素(phoneme)的训练对语言模型的影响,特别是在大多数基准测试都是基于正字法(orthographic)的情况下。解决方案的关键在于开发了一个将文本数据集转换为连续音素流的管道(pipeline),并将其应用于BabyLM挑战的1亿字预训练数据集以及标准语言和语法基准测试中。通过这种方式,研究者能够预训练并评估使用音素输入表示的模型,结果表明虽然音素基础训练在传统语言理解任务上略微降低了性能,但它提供了宝贵的分析和实际应用优势。

链接: https://arxiv.org/abs/2410.22906
作者: Zébulon Goriely,Richard Diehl Martinez,Andrew Caines,Lisa Beinborn,Paula Buttery
关键词-EN: default orthographic form, typically trained, trained on large, large corpora, orthographic form
类目: Computation and Language (cs.CL)
备注:

Abstract:Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
摘要:语言模型通常在其默认的正字法形式上训练于大量的文本语料库。然而,这并非唯一的选择;将数据表示为音素的流可以带来独特的优势,从更深入的音韵语言习得洞察到在基于声音的任务上性能的提升。挑战在于评估基于音素训练的影响,因为大多数基准测试也是正字法的。为了解决这一问题,我们开发了一个将文本数据集转换为连续音素流的管道。我们将此管道应用于BabyLM挑战中的1亿字预训练数据集,以及标准的语言和语法基准测试,使我们能够使用音素输入表示进行预训练和评估模型。我们的结果表明,尽管基于音素的训练在传统语言理解任务上略微降低了性能,但它提供了宝贵的分析和实际效益。
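文本到音素流的转换可以用一个极简的字素-音素(G2P)查找表来示意;实际管道需要完整的 G2P 模型或发音词典(如 CMUdict),此处的表项、回退策略以及"去掉词边界"的做法都是示意性假设:

```python
# 极简 G2P 查找表(类 ARPAbet 记音),仅作示意
G2P = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

def to_phoneme_stream(text):
    # 将句子压平为连续音素流,去掉词边界(一种可能的设计选择)
    stream = []
    for word in text.lower().split():
        stream.extend(G2P.get(word, list(word)))  # 查不到则退回逐字母
    return stream

print(to_phoneme_stream("The cat sat"))
# -> ['DH', 'AH', 'K', 'AE', 'T', 'S', 'AE', 'T']
```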

[NLP-30] Combining psychoanalysis and computer science: an empirical study of the relationship between emotions and the Lacanian discourses

【速读】: 该论文试图解决的问题是如何在拉康理论(Lacanian theory)中系统地理解和应用情感(emotions),并将其与计算机科学方法相结合,以提高在文本中识别拉康话语(Lacanian discourses)的效率和准确性。解决方案的关键在于开发了一种名为拉康话语发现(Lacanian Discourse Discovery, LDD)的方法,该方法通过系统化和统计分析,揭示了情感与拉康话语之间的基本关系,并确定了每个话语中最具特征的情感。这一方法不仅在理论上丰富了拉康理论对情感的理解,还通过人工智能(AI)技术实现了在互动数字系统中自动化识别情感和相应话语的应用。

链接: https://arxiv.org/abs/2410.22895
作者: Minas Gadalla,Sotiris Nikoletseas,José Roberto de A. Amazonas
关键词-EN: mutually beneficial exchange, suggesting a mutually, beneficial exchange, Lacanian discourses, explores the interdisciplinary
类目: Computation and Language (cs.CL)
备注:

Abstract:This research explores the interdisciplinary interaction between psychoanalysis and computer science, suggesting a mutually beneficial exchange. Indeed, psychoanalytic concepts can enrich technological applications involving unconscious, elusive aspects of the human factor, such as social media and other interactive digital platforms. Conversely, computer science, especially Artificial Intelligence (AI), can contribute quantitative concepts and methods to psychoanalysis, identifying patterns and emotional cues in human expression. In particular, this research aims to apply computer science methods to establish fundamental relationships between emotions and Lacanian discourses. Such relations are discovered in our approach via empirical investigation and statistical analysis, and are eventually validated in a theoretical (psychoanalytic) way. It is worth noting that, although emotions have been sporadically studied in Lacanian theory, to the best of our knowledge a systematic, detailed investigation of their role is missing. Such fine-grained understanding of the role of emotions can also make the identification of Lacanian discourses more effective and easy in practise. In particular, our methods indicate the emotions with highest differentiation power in terms of corresponding discourses; conversely, we identify for each discourse the most characteristic emotions it admits. As a matter of fact, we develop a method which we call Lacanian Discourse Discovery (LDD), that simplifies (via systematizing) the identification of Lacanian discourses in texts. Although the main contribution of this paper is inherently theoretical (psychoanalytic), it can also facilitate major practical applications in the realm of interactive digital systems. Indeed, our approach can be automated through Artificial Intelligence methods that effectively identify emotions (and corresponding discourses) in texts.
摘要:本研究探讨了精神分析与计算机科学之间的跨学科互动,提出了一种互惠互利的交流模式。精神分析的概念能够丰富涉及人类无意识、难以捉摸方面的技术应用,如社交媒体和其他互动数字平台。相反,计算机科学,特别是人工智能 (AI),可以为精神分析贡献定量概念和方法,识别人类表达中的模式和情感线索。特别是,本研究旨在应用计算机科学方法来建立情感与拉康话语之间的基本关系。这些关系通过实证研究和统计分析在我们的方法中被发现,并最终在理论(精神分析)层面上得到验证。值得注意的是,尽管拉康理论中对情感的研究时有出现,但据我们所知,对其作用的系统、详细调查尚属空白。这种对情感作用的细致理解也能使拉康话语的识别在实践中更加有效和简便。特别是,我们的方法指出了在相应话语中具有最高区分力的情感;反之,我们为每种话语识别出最具特征的情感。事实上,我们开发了一种称为拉康话语发现 (Lacanian Discourse Discovery, LDD) 的方法,通过系统化简化了文本中拉康话语的识别。尽管本文的主要贡献本质上是理论性的(精神分析),但它也能促进互动数字系统领域中的重大实际应用。实际上,我们的方法可以通过人工智能方法自动化,有效地识别文本中的情感(及相应的话语)。

[NLP-31] VPO: Leveraging the Number of Votes in Preference Optimization

【速读】: 该论文试图解决在直接偏好优化 (Direct Preference Optimization, DPO) 中如何更有效地利用用户投票数据以适应多样化的主观偏好问题。解决方案的关键在于引入了一种基于投票的偏好优化 (Vote-based Preference Optimization, VPO) 框架,该框架利用贝叶斯最小均方误差 (Bayesian Minimum Mean Square Error, Bayesian MMSE) 估计器来建模生成文本之间的偏好概率。通过将投票数据纳入优化过程,VPO 能够区分争议性生成对和明显偏好生成对,从而提升生成质量。此外,论文还展示了如何将这一框架应用于现有的 DPO 和身份偏好优化 (Identity Preference Optimization, IPO) 算法,形成 VDPO 和 VIPO,并通过实验验证了这些新算法在性能上的优越性。

链接: https://arxiv.org/abs/2410.22891
作者: Jae Hyeon Cho,Minkyung Park,Byung-Jun Lee
关键词-EN: Direct Preference Optimization, Reinforcement Learning, explicit reward modeling, reward modeling phase, phase of Reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets are typically created by selecting preferred sentences through a voting process involving multiple individuals, as opinions can vary due to the subjective nature of human preferences. While the number of votes offers insight into whether a sentence pair is clearly preferable or controversial, current methods do not fully leverage this information. In this paper, we introduce a technique that leverages user voting data to better align with diverse subjective preferences. We employ the Bayesian Minimum Mean Square Error (Bayesian MMSE) estimator to model the probability that one generation is preferable to another. Using this estimated probability as a target, we develop the Vote-based Preference Optimization (VPO) framework, which incorporates the number of votes on both sides to distinguish between controversial and obvious generation pairs. We show that previous algorithms, such as DPO and Identity Preference Optimization (IPO), can be extended using the proposed framework, termed VDPO and VIPO. Our experiments demonstrate that these proposed algorithms outperform various existing methods, including their base algorithms.
摘要:直接偏好优化 (Direct Preference Optimization, DPO) 利用人类偏好数据训练语言模型,绕过了基于人类反馈的强化学习 (RLHF) 中的显式奖励建模阶段。通过迭代处理偏好数据集中的句子对,DPO 提高生成偏好句子相对于较不受欢迎句子的概率,从而提升生成质量。偏好数据集通常通过多人投票过程选择偏好句子来创建,因为人类偏好的主观性可能导致意见分歧。尽管投票数量提供了关于句子对是否明显偏好或存在争议的洞察,但现有方法并未充分利用这些信息。本文中,我们提出了一种利用用户投票数据来更好地与多样主观偏好对齐的技术。我们采用贝叶斯最小均方误差 (Bayesian Minimum Mean Square Error, Bayesian MMSE) 估计器来建模一个生成结果相对于另一个生成结果的偏好概率。利用这一估计概率作为目标,我们开发了基于投票的偏好优化 (Vote-based Preference Optimization, VPO) 框架,该框架结合了两边的投票数量,以区分有争议和明显的生成结果对。我们展示了先前的算法,如 DPO 和身份偏好优化 (Identity Preference Optimization, IPO),可以通过提出的框架进行扩展,分别称为 VDPO 和 VIPO。我们的实验表明,这些提出的算法在各种现有方法中表现优异,包括其基础算法。
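摘要未给出估计器的细节;作为示意,若对"偏好概率"取 Beta(a, b) 先验,则贝叶斯 MMSE 估计即后验均值,可把两边票数直接转化为软目标(先验选择为本文假设):

```python
def mmse_preference(n_wins, n_losses, a=1.0, b=1.0):
    # Beta(a, b) 先验下,观测 n_wins 票胜 / n_losses 票负后的后验均值,
    # 即均方误差最小的点估计;Beta(1, 1) 为均匀先验,属示意性选择
    return (n_wins + a) / (n_wins + n_losses + a + b)

# 票数悬殊的"明显"句对与票数接近的"争议"句对得到不同的软目标:
print(round(mmse_preference(9, 1), 3))  # 0.833
print(round(mmse_preference(3, 2), 3))  # 0.571
```

在 VDPO/VIPO 这样的扩展中,这一概率可以替代 DPO 损失里"偏好方必胜"的硬标签,从而弱化争议样本对训练的影响。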

[NLP-32] Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector

【速读】: 该论文旨在解决视觉语言模型 (Visual Language Models, VLMs) 在面对对抗性图像攻击时的脆弱性问题。解决方案的关键在于构建了一个新的、大规模的对抗性图像数据集 RADAR (laRge-scale Adervsarial images dataset with Diverse hArmful Responses),并开发了一种基于实时嵌入的对抗性图像检测方法 NEARSIDE (iN-time Embedding-based AdveRSarial Image DEtection)。NEARSIDE 方法通过提取 VLM 隐藏状态中的单一向量,即攻击方向,来实现对抗性图像与良性图像的区分,从而有效检测对抗性攻击。实验结果表明,该方法在多个 VLM 模型上具有良好的效果、效率和跨模型可迁移性。

链接: https://arxiv.org/abs/2410.22888
作者: Youcheng Huang,Fengbin Zhu,Jingkun Tang,Pan Zhou,Wenqiang Lei,Jiancheng Lv,Tat-Seng Chua
关键词-EN: Visual Language Models, Visual Language, Language Models, Diverse hArmful Responses, under-explored in literature
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

Abstract:Visual Language Models (VLMs) are vulnerable to adversarial attacks, especially those from adversarial images, which is however under-explored in literature. To facilitate research on this critical safety problem, we first construct a new laRge-scale Adervsarial images dataset with Diverse hArmful Responses (RADAR), given that existing datasets are either small-scale or only contain limited types of harmful responses. With the new RADAR dataset, we further develop a novel and effective iN-time Embedding-based AdveRSarial Image DEtection (NEARSIDE) method, which exploits a single vector that distilled from the hidden states of VLMs, which we call the attacking direction, to achieve the detection of adversarial images against benign ones in the input. Extensive experiments with two victim VLMs, LLaVA and MiniGPT-4, well demonstrate the effectiveness, efficiency, and cross-model transferrability of our proposed method. Our code is available at this https URL
摘要:视觉语言模型 (Visual Language Models, VLMs) 容易受到对抗攻击,特别是来自对抗图像的攻击,然而这一问题在文献中尚未得到充分研究。为了促进对此关键安全问题的研究,我们首先构建了一个新的、大规模的对抗图像数据集,名为“多样化有害响应对抗图像数据集 (RADAR)”,因为现有的数据集要么规模较小,要么仅包含有限类型的有害响应。基于新的 RADAR 数据集,我们进一步开发了一种新颖且有效的“实时嵌入式对抗图像检测 (NEARSIDE)”方法,该方法利用从 VLMs 的隐藏状态中提取的单一向量,我们称之为“攻击方向”,来实现对抗图像与良性图像在输入中的检测。通过在两个受害者 VLMs(LLaVA 和 MiniGPT-4)上进行的广泛实验,充分证明了我们提出的方法在有效性、效率以及跨模型可转移性方面的优势。我们的代码可在以下链接获取:https URL。
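NEARSIDE 的核心是:隐藏状态在单一"攻击方向"向量上的投影即可区分对抗与良性输入。下面用合成数据示意这一几何直觉,其中"方向取两类均值之差、阈值取两类平均投影的中点"均为示意性假设,并非论文的确切提取方式:

```python
import random

random.seed(0)
DIM = 8

def sample(shift=0.0):
    # 合成的隐藏状态向量:对抗样本在第 0 维整体偏移 shift
    v = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    v[0] += shift
    return v

benign = [sample() for _ in range(200)]
adv = [sample(2.0) for _ in range(200)]

def mean_vec(vs):
    return [sum(col) / len(vs) for col in zip(*vs)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# 单一向量"攻击方向":两类隐藏状态均值之差
mb, ma = mean_vec(benign), mean_vec(adv)
direction = [x - y for x, y in zip(ma, mb)]
thr = 0.5 * (dot(ma, direction) + dot(mb, direction))

def is_adversarial(h):
    return dot(h, direction) > thr  # 推理时一次内积即可完成检测

acc = (sum(is_adversarial(h) for h in adv)
       + sum(not is_adversarial(h) for h in benign)) / 400
print(round(acc, 2))
```

这也解释了该方法"高效"的来源:检测开销只是一次向量内积加一次阈值比较。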

[NLP-33] Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies EMNLP2024

【速读】: 该论文试图解决的问题是如何通过更精细的课程学习策略(Curriculum Learning)来提升小规模语言模型(SSLMs)在BabyLM挑战中的认知合理性。解决方案的关键在于利用语言习得理论(linguistic acquisition theories)来设计更细粒度的跨语言课程学习策略。具体来说,研究者创建了四个语系相距较远的儿童导向语料库(Child-Directed Speech),并实施了三种基于习得理论的课程(Growing, Inwards和MMM),这些课程精确地复制了习得理论的预测。实验结果表明,这种细粒度的习得启发式课程能够超越非课程基线模型,并证明了通过指定精确复现语言习得理论的语言特定课程,可以有效提升SSLMs的性能。

链接: https://arxiv.org/abs/2410.22886
作者: Suchir Salhan,Richard Diehl Martinez,Zébulon Goriely,Paula Buttery
关键词-EN: BabyLM Challenge, Small-Scale Language Models, Curriculum Learning, popular strategy, strategy to improve
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: BabyLM Shared Task 2024 (Accepted, Poster), co-located in EMNLP 2024

Abstract:Curriculum Learning has been a popular strategy to improve the cognitive plausibility of Small-Scale Language Models (SSLMs) in the BabyLM Challenge. However, it has not led to considerable improvements over non-curriculum models. We assess whether theoretical linguistic acquisition theories can be used to specify more fine-grained curriculum learning strategies, creating age-ordered corpora of Child-Directed Speech for four typologically distant language families to implement SSLMs and acquisition-inspired curricula cross-lingually. Comparing the success of three objective curricula (Growing, Inwards and MMM) that precisely replicate the predictions of acquisition theories on a standard SSLM architecture, we find fine-grained acquisition-inspired curricula can outperform non-curriculum baselines and performance benefits of curricula strategies in SSLMs can be derived by specifying fine-grained language-specific curricula that precisely replicate language acquisition theories.
摘要:课程学习(Curriculum Learning)在提升小规模语言模型(Small-Scale Language Models, SSLMs)在BabyLM挑战中的认知合理性方面已成为一种流行策略。然而,它并未带来显著优于非课程模型的改进。我们评估了是否可以利用理论语言习得理论来制定更精细的课程学习策略,创建按年龄排序的儿童导向语言语料库,用于在四种语系差异较大的语言中实施SSLMs和受习得启发的跨语言课程。通过比较三种目标课程(Growing、Inwards和MMM)在标准SSLM架构上的成功率,这些课程精确复制了习得理论的预测,我们发现精细化的受习得启发的课程可以超越非课程基线,并且通过制定精确复制语言习得理论的细粒度语言特定课程,可以推导出课程策略在SSLMs中的性能优势。

[NLP-34] Stealing User Prompts from Mixture of Experts

【速读】: 该论文旨在揭示混合专家模型(Mixture-of-Experts, MoE)中存在的安全漏洞,即攻击者可以通过安排其查询与受害者的查询出现在同一批次中,利用专家选择路由(Expert-Choice-Routing)机制完全泄露受害者的提示信息。解决方案的关键在于识别并利用MoE模型在处理相同批次查询时的路由行为,特别是CUDA实现中的平局处理机制。通过这种方式,攻击者可以在O(VM^2)查询次数内(其中V为词汇量大小,M为提示长度)或平均每个提示词100次查询的情况下,提取出完整的提示信息。这是首次针对LLM架构缺陷进行攻击以提取用户提示的研究,揭示了LLM面临的新一类安全威胁。

链接: https://arxiv.org/abs/2410.22884
作者: Itay Yona,Ilia Shumailov,Jamie Hayes,Nicholas Carlini
关键词-EN: dense language models, improve the efficiency, efficiency and scalability, scalability of dense, dense language
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Mixture-of-Experts (MoE) models improve the efficiency and scalability of dense language models by routing each token to a small number of experts in each layer. In this paper, we show how an adversary that can arrange for their queries to appear in the same batch of examples as a victim’s queries can exploit Expert-Choice-Routing to fully disclose a victim’s prompt. We successfully demonstrate the effectiveness of this attack on a two-layer Mixtral model, exploiting the tie-handling behavior of the this http URL CUDA implementation. Our results show that we can extract the entire prompt using O(VM^2) queries (with vocabulary size V and prompt length M ) or 100 queries on average per token in the setting we consider. This is the first attack to exploit architectural flaws for the purpose of extracting user prompts, introducing a new class of LLM vulnerabilities.
摘要:混合专家模型(Mixture-of-Experts, MoE)通过在每一层将每个 Token 路由到少数专家,提高了密集语言模型的效率和可扩展性。本文展示了如何利用专家选择路由(Expert-Choice-Routing),使攻击者能够在同一批次示例中安排其查询与受害者的查询同时出现,从而完全披露受害者的提示信息。我们在一个两层 Mixtral 模型上成功演示了这种攻击的有效性,利用了该 http URL CUDA 实现的平局处理行为。我们的结果表明,可以在考虑的设置中使用 O(VM^2) 次查询(词汇表大小为 V,提示长度为 M)或平均每个 Token 100 次查询来提取整个提示信息。这是首次利用架构缺陷来提取用户提示的攻击,引入了大语言模型(LLM)的新一类漏洞。
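按摘要给出的复杂度,可以粗算这一攻击的查询预算;下面的 V、M 取值仅为示意(不同分词器的词汇量各异):

```python
V = 32_000          # 词汇表大小(示意值)
M = 20              # 受害者提示的 token 数(示意值)

worst_case = V * M ** 2   # 论文给出的 O(V * M^2) 查询上界
avg_case = 100 * M        # 论文报告的平均约每 token 100 次查询
print(worst_case, avg_case)  # 12800000 2000
```

两个数字的量级差说明:虽然理论上界很大,但平均情形下提取一条 20 token 的提示只需约两千次查询,攻击在实践中是可行的。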

[NLP-35] Eliciting Critical Reasoning in Retrieval-Augmented Language Models via Contrastive Explanations

【速读】: 该论文试图解决检索增强生成 (Retrieval-augmented Generation, RAG) 机制在大型语言模型 (Large Language Models, LLMs) 中应用时,由于检索到的上下文可能包含噪声,导致模型难以进行批判性分析,从而引发错误推断和幻觉的问题。解决方案的关键在于提出对比性检索增强生成框架 (Contrastive-RAG, C-RAG),该框架通过以下步骤实现:(i) 根据查询检索相关文档,(ii) 选择并举例相关段落,(iii) 生成对比性解释以明确段落的相关性,(iv) 支持最终答案的生成。C-RAG 通过构建对比性推理示范,指导较小模型进行检索增强任务,显著减少了所需的提示和示范数量,并增强了模型对检索文档中扰动的鲁棒性。

链接: https://arxiv.org/abs/2410.22874
作者: Leonardo Ranaldi,Marco Valentino,Andrè Freitas
关键词-EN: Large Language Models, support Large Language, Large Language, systematically accessing richer, accessing richer factual
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Retrieval-augmented generation (RAG) has emerged as a critical mechanism in contemporary NLP to support Large Language Models(LLMs) in systematically accessing richer factual context. However, the integration of RAG mechanisms brings its inherent challenges, as LLMs need to deal with potentially noisy contexts. Recent studies have shown that LLMs still struggle to critically analyse RAG-based in-context information, a limitation that may lead to incorrect inferences and hallucinations. In this paper, we investigate how to elicit critical reasoning in RAG via contrastive explanations. In particular, we propose Contrastive-RAG (C-RAG), a framework that (i) retrieves relevant documents given a query, (ii) selects and exemplifies relevant passages, and (iii) generates explanations that explicitly contrast the relevance of the passages to (iv) support the final answer. We show the impact of C-RAG building contrastive reasoning demonstrations from LLMs to instruct smaller models for retrieval-augmented tasks. Extensive experiments demonstrate that C-RAG improves state-of-the-art RAG models while (a) requiring significantly fewer prompts and demonstrations and (b) being robust to perturbations in the retrieved documents.
摘要:检索增强生成 (Retrieval-augmented Generation, RAG) 已成为当代自然语言处理 (NLP) 中的关键机制,支持大语言模型 (Large Language Models, LLMs) 系统性地访问更丰富的实证上下文。然而,RAG 机制的整合带来了其固有的挑战,因为 LLMs 需要处理可能存在噪声的上下文。最近的研究表明,LLMs 在批判性分析基于 RAG 的上下文信息方面仍存在困难,这一局限可能导致错误的推断和幻觉现象。本文探讨了如何通过对比解释来激发 RAG 中的批判性推理。具体而言,我们提出了对比-RAG (Contrastive-RAG, C-RAG) 框架,该框架 (i) 根据查询检索相关文档,(ii) 选择并举例相关段落,(iii) 生成明确对比段落相关性的解释,(iv) 支持最终答案的生成。我们展示了 C-RAG 通过构建对比推理示范,指导较小模型进行检索增强任务的效果。广泛的实验表明,C-RAG 在提升现有最先进 RAG 模型的同时,(a) 显著减少了所需的提示和示范数量,(b) 对检索文档中的扰动具有鲁棒性。
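C-RAG 的 (ii)-(iv) 步可以示意为一条提示的组织方式:列举检索段落、要求模型先对各段落的相关性做对比性解释、再给出最终答案。以下措辞为本文假设,并非论文的原始对比示范:

```python
def build_crag_prompt(query, passages):
    lines = [f"Question: {query}", "Retrieved passages:"]
    for i, p in enumerate(passages, 1):
        lines.append(f"[{i}] {p}")
    # 先对比解释相关性,再作答:这是 C-RAG 区别于普通 RAG 提示的关键
    lines.append("Step 1: Contrast the passages above: explain which are "
                 "relevant to the question, which are not, and why.")
    lines.append("Step 2: Answer the question using only the relevant passages.")
    return "\n".join(lines)

print(build_crag_prompt(
    "Who wrote 'Hamlet'?",
    ["Shakespeare wrote Hamlet around 1600.", "A hamlet is a small village."],
))
```

强制模型显式对比段落,正是其对检索噪声更鲁棒的直观原因:无关段落会在 Step 1 被点名排除,而不是混入答案。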

[NLP-36] Danoliteracy of Generative Large Language Models ALT

【速读】: 该论文试图解决生成式大型语言模型(GLLMs)在低资源语言如丹麦语中的能力评估问题。由于缺乏适用的评估语料库,这些模型在丹麦语中的能力难以通过定量方法验证。论文的关键解决方案是提出了一个GLLM基准,用于评估丹麦语和文化素养(Danoliteracy),涵盖了八个不同的场景,如丹麦公民测试和社交媒体问答。该基准通过与人类反馈的相关性(相关系数约为0.8)验证了其鲁棒性,并发现GPT-4和Claude Opus模型在此基准上表现最佳。此外,分析结果显示,模型在不同场景中的表现差异有95%可由一个“g因子”解释,该因子反映了模型在语言适应性上的内在一致性。

链接: https://arxiv.org/abs/2410.22839
作者: Søren Vejlgaard Holm,Lars Kai Hansen,Martin Carsten Nielsen
关键词-EN: technology moonshot moment, Large Language Models, moment of Generative, limited to English, language technology moonshot
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 13 figures, submitted to: NoDaLiDa/Baltic-HLT 2025

Abstract:The language technology moonshot moment of Generative, Large Language Models (GLLMs) was not limited to English: These models brought a surge of technological applications, investments and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were until recently difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate Danoliteracy, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark is found to produce a robust ranking that correlates to human feedback at ρ ∼ 0.8, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining 95% of scenario performance variance for GLLMs in Danish, suggesting a g factor of model consistency in language adaptation.
摘要:生成式大语言模型 (Generative Large Language Models, GLLMs) 的语言技术突破不仅限于英语:这些模型为低资源语言带来了技术应用、投资和炒作的高潮。然而,由于缺乏适用的评估语料库,这些模型在丹麦语等语言中的能力直到最近仍难以通过定性演示之外的方式进行验证。我们提出了一个 GLLM 基准,用于评估丹麦语素养 (Danoliteracy),即丹麦语言和文化能力的衡量标准,涵盖了八个不同的场景,如丹麦公民测试和抽象社交媒体问答。研究发现,这个有限规模的基准能够产生一个与人类反馈高度相关的稳健排名,相关系数约为 ρ ∼ 0.8,其中 GPT-4 和 Claude Opus 模型取得了最高的排名。通过对这些模型在不同场景中的结果进行分析,我们发现一个强有力的潜在因素解释了 GLLMs 在丹麦语中 95% 的场景性能变异,这表明了模型在语言适应性上的一致性因子 (g factor)。
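"单一因素解释 95% 的场景方差"可以理解为"模型×场景"分数矩阵的第一主成分所占方差比例。下面用构造的单因素数据示意该计算(数据与参数均为假设,仅说明方法,纯标准库实现):

```python
import random

random.seed(1)
MODELS, SCEN = 12, 8
ability = [random.gauss(0, 1) for _ in range(MODELS)]          # 潜在"能力"因子
loadings = [random.uniform(0.8, 1.2) for _ in range(SCEN)]     # 各场景载荷
# 单因素模型 + 少量噪声,生成"模型 x 场景"分数矩阵
scores = [[ability[i] * loadings[j] + 0.1 * random.gauss(0, 1)
           for j in range(SCEN)] for i in range(MODELS)]

# 按列中心化
means = [sum(row[j] for row in scores) / MODELS for j in range(SCEN)]
X = [[scores[i][j] - means[j] for j in range(SCEN)] for i in range(MODELS)]

# 幂迭代求协方差矩阵 C = X^T X 的最大特征值,即第一主成分的方差
C = [[sum(X[i][a] * X[i][b] for i in range(MODELS)) for b in range(SCEN)]
     for a in range(SCEN)]
v = [1.0] * SCEN
for _ in range(200):
    w = [sum(C[a][b] * v[b] for b in range(SCEN)) for a in range(SCEN)]
    norm = sum(x * x for x in w) ** 0.5
    v = [x / norm for x in w]
top = sum(v[a] * sum(C[a][b] * v[b] for b in range(SCEN)) for a in range(SCEN))
total = sum(C[a][a] for a in range(SCEN))  # 总方差 = 协方差矩阵的迹
ratio = top / total
print(round(ratio, 3))
```

若各场景分数真由一个共同因子驱动,ratio 会接近 1;论文中 95% 的占比正对应这种强"g 因子"结构。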

[NLP-37] How Well Do Large Language Models Disambiguate Swedish Words?

【速读】: 该论文试图解决瑞典语中的词义消歧问题,并评估了近期大型语言模型在此任务上的表现。解决方案的关键在于比较不同的提示方法,特别是如何表达给定上下文中的可能词义集合。研究结果表明,当在提示中包含人类编写的词义定义时,模型能够达到最高的准确率。

链接: https://arxiv.org/abs/2410.22827
作者: Richard Johansson
关键词-EN: recent large language, disambiguation in Swedish, large language models, word sense disambiguation, evaluate a battery
类目: Computation and Language (cs.CL)
备注: SLTC 2024 extended abstract

Abstract:We evaluate a battery of recent large language models on two benchmarks for word sense disambiguation in Swedish. At present, all current models are less accurate than the best supervised disambiguators in cases where a training set is available, but most models outperform graph-based unsupervised systems. Different prompting approaches are compared, with a focus on how to express the set of possible senses in a given context. The best accuracies are achieved when human-written definitions of the senses are included in the prompts.
摘要:我们评估了一系列近期的大语言模型在瑞典语词义消歧的两个基准测试中的表现。目前,在有训练集可用的情况下,所有现有模型在准确性上均不及最佳的监督式消歧器,但大多数模型优于基于图的无监督系统。我们比较了不同的提示方法,重点关注如何在给定上下文中表达可能的词义集合。当在提示中包含人类编写的词义定义时,模型达到了最高的准确率。
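研究发现"在提示中给出人工撰写的词义定义"准确率最高;下面示意这类提示的组织方式(编号与措辞为本文假设,示例用英语词便于阅读,实际实验针对瑞典语):

```python
def wsd_prompt(word, sentence, sense_definitions):
    # 列出候选词义的人工定义,要求模型选出编号
    lines = [f'Which sense of "{word}" is used in: "{sentence}"?']
    for i, definition in enumerate(sense_definitions, 1):
        lines.append(f"{i}. {definition}")
    lines.append("Answer with the number of the correct sense.")
    return "\n".join(lines)

print(wsd_prompt(
    "bank",
    "She sat on the bank of the river.",
    ["a financial institution", "the sloping land beside a body of water"],
))
```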

[NLP-38] EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations NEURIPS2024

【速读】: 该论文试图解决现有大型语言模型(LLMs)在代码生成评估中的两个主要问题:数据泄露和缺乏领域特定评估。解决方案的关键在于提出了一个新的基准——EvoCodeBench,其核心特点包括:1) 动态更新的数据集,每6个月更新一次以避免数据泄露;2) 基于开源社区统计设计的编程领域分类法,包含10个流行领域,并为每个样本标注领域标签;3) 领域特定的评估方法,除了传统的Pass@k指标外,还计算领域特定改进(DSI),并定义了LLMs的舒适域和陌生域。这些改进帮助从业者在特定编程领域选择更优的LLMs,并揭示现有LLMs的不足。

链接: https://arxiv.org/abs/2410.22821
作者: Jia Li,Ge Li,Xuanming Zhang,Yunfei Zhao,Yihong Dong,Zhi Jin,Binhua Li,Fei Huang,Yongbin Li
关键词-EN: Large Language Models, evaluate Large Language, Language Models, Large Language, code generation remains
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted by the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:How to evaluate Large Language Models (LLMs) in code generation remains an open question. Existing benchmarks have two limitations - data leakage and lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the latter hinders practitioners from selecting superior LLMs for specific programming domains. To address these two limitations, we propose a new benchmark - EvoCodeBench, which has the following advances: (1) Evolving data. EvoCodeBench will be dynamically updated every period (e.g., 6 months) to avoid data leakage. This paper releases the first version - EvoCodeBench-2403, containing 275 samples from 25 repositories. (2) A domain taxonomy and domain labels. Based on the statistics of open-source communities, we design a programming domain taxonomy consisting of 10 popular domains. Based on the taxonomy, we annotate each sample in EvoCodeBench with a domain label. (3) Domain-specific evaluations. Besides the Pass@k, we compute the Domain-Specific Improvement (DSI) and define LLMs’ comfort and strange domains. These evaluations help practitioners select superior LLMs in specific domains and discover the shortcomings of existing LLMs. We evaluate 8 popular LLMs (e.g., gpt-4, DeepSeek Coder) on EvoCodeBench and summarize some insights. EvoCodeBench reveals the actual abilities of these LLMs in real-world repositories. For example, the highest Pass@1 of gpt-4 on EvoCodeBench-2403 is only 20.74%. Besides, we evaluate LLMs in different domains and discover their comfort and strange domains. For example, gpt-4 performs best in most domains but falls behind others in the Internet domain. StarCoder 2-15B unexpectedly performs well in the Database domain and even outperforms 33B LLMs. EvoCodeBench has been released.
摘要:如何评估大语言模型 (LLM) 在代码生成方面的表现仍然是一个开放的问题。现有的基准测试存在两个主要限制:数据泄露和缺乏领域特定的评估。前者损害了基准测试的公平性,后者则阻碍了从业者为特定编程领域选择更优的 LLM。为了解决这两个限制,我们提出了一种新的基准测试——EvoCodeBench,它具有以下优势:(1) 数据动态更新。EvoCodeBench 将每隔一段时间(例如 6 个月)动态更新一次,以避免数据泄露。本文发布了首个版本——EvoCodeBench-2403,包含来自 25 个代码库的 275 个样本。(2) 领域分类和标签。基于开源社区的统计数据,我们设计了一个包含 10 个流行编程领域的分类体系。基于此分类体系,我们为 EvoCodeBench 中的每个样本标注了领域标签。(3) 领域特定的评估。除了 Pass@k 之外,我们还计算了领域特定的改进 (DSI),并定义了 LLM 的舒适领域和陌生领域。这些评估帮助从业者在特定领域选择更优的 LLM,并发现现有 LLM 的不足之处。我们对 8 个流行的大语言模型(例如 gpt-4、DeepSeek Coder)在 EvoCodeBench 上进行了评估,并总结了一些见解。EvoCodeBench 揭示了这些 LLM 在真实世界代码库中的实际能力。例如,gpt-4 在 EvoCodeBench-2403 上的最高 Pass@1 仅为 20.74%。此外,我们对 LLM 在不同领域的表现进行了评估,发现了它们的舒适领域和陌生领域。例如,gpt-4 在大多数领域表现最佳,但在互联网领域落后于其他模型。StarCoder 2-15B 在数据库领域表现出色,甚至超过了 33B 的 LLM。EvoCodeBench 已经发布。
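摘要中反复出现的 Pass@k 指标,其常用的无偏估计(源自 HumanEval/Codex 的做法;EvoCodeBench 是否采用完全相同的实现属于假设)可以用几行 Python 算出:

```python
from math import comb

def pass_at_k(n, c, k):
    """Pass@k 的无偏估计:n 个生成样本中 c 个通过测试时,任取 k 个至少一个通过的概率。"""
    if n - c < k:
        # 不通过的样本不足 k 个,任意 k 个样本中必含通过样本
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 例如 10 个样本中 2 个通过时,Pass@1 即为通过率
print(round(pass_at_k(10, 2, 1), 4))  # → 0.2
```

当 k 远小于 n 时,该估计比"直接抽 k 个算通过率"的朴素做法方差更小。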

[NLP-39] MALoRA: Mixture of Asymmetric Low-Rank Adaptation for Enhanced Multi-Task Learning

【速读】: 该论文试图解决多任务场景下参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在训练不平衡和跷跷板效应(seesaw effect)方面的问题。解决方案的关键是提出了一种名为混合非对称低秩适应(Mixture of Asymmetric Low-Rank Adaptation, MALoRA)的灵活微调框架。MALoRA通过在LoRA专家之间引入非对称优化,减少了可训练参数的数量(减少30%至48%),提高了训练速度(增加1.2倍),并匹配了单任务LoRA模型的计算效率。此外,MALoRA还解决了在高秩配置中常见的过拟合问题,增强了性能稳定性。实验结果表明,MALoRA在跨领域和领域内多任务学习场景中均优于所有基线方法。

链接: https://arxiv.org/abs/2410.22782
作者: Xujia Wang,Haiyan Zhao,Shuo Wang,Hanqing Wang,Zhiyuan Liu
关键词-EN: resource-efficient manner, significantly improved, improved the adaptation, adaptation of LLMs, LLMs to downstream
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have significantly improved the adaptation of LLMs to downstream tasks in a resource-efficient manner. However, in multi-task scenarios, challenges such as training imbalance and the seesaw effect frequently emerge. Mixture-of-LoRA (MoLoRA), which combines LoRA with sparse Mixture-of-Experts, mitigates some of these issues by promoting task-specific learning across experts. Despite this, MoLoRA remains inefficient in terms of training speed, parameter utilization, and overall multi-task performance. In this paper, we propose Mixture of Asymmetric Low-Rank Adaptation (MALoRA), a flexible fine-tuning framework that leverages asymmetric optimization across LoRA experts. MALoRA reduces the number of trainable parameters by 30% to 48%, increases training speed by 1.2x, and matches the computational efficiency of single-task LoRA models. Additionally, MALoRA addresses overfitting issues commonly seen in high-rank configurations, enhancing performance stability. Extensive experiments across diverse multi-task learning scenarios demonstrate that MALoRA consistently outperforms all baseline methods in both inter-domain and intra-domain tasks.
摘要:参数高效微调 (Parameter-Efficient Fine-Tuning, PEFT) 方法如 LoRA,显著提升了大语言模型 (LLM) 对下游任务的适应能力,同时保持了资源的高效利用。然而,在多任务场景中,训练不平衡和跷跷板效应等问题频繁出现。Mixture-of-LoRA (MoLoRA) 结合了 LoRA 与稀疏的 Mixture-of-Experts,通过促进专家间的任务特定学习,缓解了部分问题。尽管如此,MoLoRA 在训练速度、参数利用率以及整体多任务性能方面仍显不足。本文提出了一种灵活的微调框架——非对称低秩适应混合模型 (Mixture of Asymmetric Low-Rank Adaptation, MALoRA),该框架利用 LoRA 专家间的非对称优化。MALoRA 通过减少 30% 至 48% 的可训练参数,将训练速度提升 1.2 倍,并达到了单任务 LoRA 模型的计算效率。此外,MALoRA 解决了高秩配置中常见的过拟合问题,增强了性能的稳定性。在多种多任务学习场景中的广泛实验表明,MALoRA 在跨领域和领域内任务中均持续优于所有基线方法。
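摘要中"减少 30% 至 48% 可训练参数"的量级,可以用一个简化的参数计数来感受。以下假设 MoLoRA 的每个专家各自持有 LoRA 的下投影 A (d×r) 和上投影 B (r×d),而非对称方案让所有专家共享下投影(具体共享哪一侧属于示意性假设,并非论文的原始设定):

```python
def molora_params(d, r, experts):
    # MoLoRA:每个专家各自持有下投影 A (d×r) 与上投影 B (r×d)
    return experts * (d * r + r * d)

def asymmetric_params(d, r, experts):
    # 示意性的非对称方案:所有专家共享一个下投影 A,上投影 B 仍按专家独立
    return d * r + experts * r * d

d, r, e = 4096, 8, 8
saving = 1 - asymmetric_params(d, r, e) / molora_params(d, r, e)
print(f"可训练参数减少约 {saving:.0%}")  # → 可训练参数减少约 44%
```

在这组假设的典型超参数下,节省比例恰好落在摘要给出的 30% 至 48% 区间内。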

[NLP-40] InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

【速读】: 该论文试图解决大语言模型(LLMs)在面对提示注入攻击(Prompt Injection Attacks)时,现有的提示防护模型(Prompt Guard Models)存在的过度防御问题,即误将良性输入标记为恶意输入。解决方案的关键在于提出了一个名为NotInject的评估数据集,用于系统性地测量不同提示防护模型的过度防御情况,并在此基础上提出了一个新的提示防护模型InjecGuard。InjecGuard通过引入一种新的训练策略——“免费缓解过度防御”(Mitigating Over-defense for Free, MOF),显著减少了触发词偏差,从而在多个基准测试中表现出色,超越了现有的最佳模型,提供了一个鲁棒且开源的解决方案来检测提示注入攻击。

链接: https://arxiv.org/abs/2410.22770
作者: Hao Li,Xiaogeng Liu,Chaowei Xiao
关键词-EN: enabling goal hijacking, large language models, Prompt injection attacks, data leakage, Prompt guard
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense – falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose InjecGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. InjecGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, offering a robust and open-source solution for detecting prompt injection attacks. The code and datasets are released at this https URL.
摘要:提示注入攻击对大语言模型(LLM)构成了严重的威胁,可能导致目标劫持和数据泄露。尽管提示防护模型在防御方面有效,但存在过度防御的问题——由于触发词偏见,错误地将良性输入标记为恶意。为解决这一问题,我们引入了 NotInject,这是一个评估数据集,系统地衡量了各种提示防护模型的过度防御情况。NotInject 包含 339 个带有常见提示注入攻击触发词的良性样本,支持细粒度的评估。我们的研究结果表明,最先进的模型存在过度防御问题,准确率接近随机猜测水平(60%)。为缓解这一问题,我们提出了 InjecGuard,这是一种新型的提示防护模型,采用了新的训练策略——免费缓解过度防御(MOF),显著减少了触发词的偏见。InjecGuard 在包括 NotInject 在内的多个基准测试中展示了最先进的性能,比现有最佳模型高出 30.8%,为检测提示注入攻击提供了强大且开源的解决方案。代码和数据集已在此 https URL 发布。
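NotInject 衡量的"过度防御"本质上是在全部良性的样本上的误报率;下面给出一个示意性计算(样本与预测均为虚构):

```python
def over_defense_rate(predictions, labels):
    """在全部良性 (label=0) 的样本上,被判为恶意 (pred=1) 的比例,即误报率。"""
    benign = [p for p, l in zip(predictions, labels) if l == 0]
    if not benign:
        return 0.0
    return sum(p == 1 for p in benign) / len(benign)

# 5 个含触发词的良性样本中有 2 个被误报,对应准确率只有 60%
print(over_defense_rate([1, 0, 1, 0, 0], [0, 0, 0, 0, 0]))  # → 0.4
```

误报率 0.4 即摘要所说"准确率接近随机猜测水平 (60%)"的另一面。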

[NLP-41] Beyond Ontology in Dialogue State Tracking for Goal-Oriented Chatbot

【速读】: 该论文试图解决目标导向型聊天机器人中对话状态跟踪 (Dialogue State Tracking, DST) 的适应性问题,特别是在开放领域对话中,现有方法依赖于固定的本体和手动编译的槽值,限制了其灵活性。解决方案的关键在于利用指令微调 (instruction tuning) 和先进的提示策略 (prompt strategies),使大型语言模型 (Large Language Model, LLM) 能够在没有预定义本体的情况下推断对话状态。该方法通过精心设计的提示和反幻觉机制 (anti-hallucination mechanism) 确保在多样化的对话上下文中进行准确跟踪,并采用变分图自编码器 (Variational Graph Auto-Encoder, VGAE) 来建模和预测后续用户意图。该方法在开放领域真实对话中表现出色,达到了最先进的性能,JGA 达到 42.57%。

链接: https://arxiv.org/abs/2410.22767
作者: Sejin Lee,Dongha Kim,Min Song
关键词-EN: making restaurant reservations, automating user tasks, restaurant reservations, essential for automating, booking flights
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: There are 10 chapters, including references, and 2 figures used. To be presented at the 15th IEEE International Conference on Knowledge Graphs (ICKG2024)

点击查看摘要

Abstract:Goal-oriented chatbots are essential for automating user tasks, such as booking flights or making restaurant reservations. A key component of these systems is Dialogue State Tracking (DST), which interprets user intent and maintains the dialogue state. However, existing DST methods often rely on fixed ontologies and manually compiled slot values, limiting their adaptability to open-domain dialogues. We propose a novel approach that leverages instruction tuning and advanced prompt strategies to enhance DST performance, without relying on any predefined ontologies. Our method enables Large Language Model (LLM) to infer dialogue states through carefully designed prompts and includes an anti-hallucination mechanism to ensure accurate tracking in diverse conversation contexts. Additionally, we employ a Variational Graph Auto-Encoder (VGAE) to model and predict subsequent user intent. Our approach achieved state-of-the-art with a JGA of 42.57% outperforming existing ontology-less DST models, and performed well in open-domain real-world conversations. This work presents a significant advancement in creating more adaptive and accurate goal-oriented chatbots.
摘要:面向目标的聊天机器人对于自动化用户任务(如预订航班或餐厅预订)至关重要。这些系统的关键组件之一是对话状态跟踪 (Dialogue State Tracking, DST),它用于解释用户意图并维护对话状态。然而,现有的 DST 方法通常依赖于固定的本体论和手动编译的槽位值,这限制了它们在开放领域对话中的适应性。我们提出了一种新颖的方法,利用指令调优和先进的提示策略来增强 DST 性能,而无需依赖任何预定义的本体论。我们的方法使大语言模型 (Large Language Model, LLM) 能够通过精心设计的提示推断对话状态,并包含一个反幻觉机制,以确保在多样化的对话上下文中进行准确的跟踪。此外,我们采用变分图自编码器 (Variational Graph Auto-Encoder, VGAE) 来建模和预测后续用户意图。我们的方法以 42.57% 的 JGA 达到了最先进 (state-of-the-art) 水平,超过了现有的无本体论 DST 模型,并且在开放领域的实际对话中表现出色。这项工作在创建更具适应性和准确性的面向目标的聊天机器人方面取得了显著进展。
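摘要中的 JGA(Joint Goal Accuracy,联合目标准确率)是 DST 的标准指标:只有当某一轮预测的完整对话状态与标注逐槽位完全一致时,该轮才计为正确。一个最小示意(槽位名为虚构示例):

```python
def joint_goal_accuracy(pred_states, gold_states):
    """每轮对话状态是 {slot: value} 字典;仅当整轮状态完全匹配才计为正确。"""
    correct = sum(p == g for p, g in zip(pred_states, gold_states))
    return correct / len(gold_states)

gold = [{"food": "chinese"}, {"food": "chinese", "area": "north"}]
pred = [{"food": "chinese"}, {"food": "chinese"}]  # 第二轮漏掉了 area 槽位
print(joint_goal_accuracy(pred, gold))  # → 0.5
```

可见 JGA 是"全对才得分"的严格指标,这也是 42.57% 这一数值看似不高却已是该设定下最优的原因之一。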

[NLP-42] Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model

【速读】: 该论文试图解决非英语语言(如日语)在视觉语言模型 (Visual Language Models, VLMs) 训练中缺乏多模态资源的问题。解决方案的关键在于提出了一种从零开始快速创建日语多模态数据集的方法。具体步骤包括从网络档案中收集日语图像-文本对和交错数据,并利用现有的VLM直接从图像生成日语指令数据。实验结果表明,基于这些本地数据集训练的VLM性能优于依赖机器翻译内容训练的模型。

链接: https://arxiv.org/abs/2410.22736
作者: Keito Sasagawa,Koki Maeda,Issa Sugiura,Shuhei Kurita,Naoaki Okazaki,Daisuke Kawahara
关键词-EN: Visual Language Models, develop high-performing Visual, high-performing Visual Language, high-performing Visual, Language Models
类目: Computation and Language (cs.CL)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:To develop high-performing Visual Language Models (VLMs), it is essential to prepare multimodal resources, such as image-text pairs, interleaved data, and instruction data. While multimodal resources for English are abundant, there is a significant lack of corresponding resources for non-English languages, such as Japanese. To address this problem, we take Japanese as a non-English language and propose a method for rapidly creating Japanese multimodal datasets from scratch. We collect Japanese image-text pairs and interleaved data from web archives and generate Japanese instruction data directly from images using an existing VLM. Our experimental results show that a VLM trained on these native datasets outperforms those relying on machine-translated content.
摘要:为了开发高性能的视觉语言模型 (Visual Language Models, VLMs),准备多模态资源(如图像-文本对、交错数据和指令数据)是至关重要的。尽管英语的多模态资源丰富,但对于日语等非英语语言,相应的资源却严重缺乏。为了解决这一问题,我们以日语为例,提出了一种从零开始快速创建日语多模态数据集的方法。我们从网络档案中收集日语图像-文本对和交错数据,并利用现有的 VLM 直接从图像生成日语指令数据。我们的实验结果表明,基于这些本土数据集训练的 VLM 优于依赖机器翻译内容训练的模型。

[NLP-43] Improving Uncertainty Quantification in Large Language Models via Semantic Embeddings

【速读】: 该论文试图解决在大语言模型 (LLMs) 中准确量化语义不确定性 (semantic uncertainty) 的问题,尤其是在高风险应用中的可靠性部署。当前最先进的方法依赖于多个生成响应之间的严格双向蕴含标准以及序列似然性,但这些方法往往由于对细微措辞差异、额外正确信息和不重要词汇的敏感性而高估不确定性。论文提出的解决方案关键在于利用语义嵌入 (semantic embeddings) 来实现更平滑和稳健的语义不确定性估计。通过捕捉语义相似性而不依赖于序列似然性,该方法自然减少了由答案中无关词汇引入的偏差。此外,通过在联合概率模型中将语义显式建模为潜在变量,引入了一种摊销版本的方法,使得在嵌入空间中的不确定性估计只需一次前向传递,显著降低了计算开销。实验结果表明,基于嵌入的方法在多个问答数据集和前沿 LLMs 中提供了比传统方法更准确和细致的不确定性量化。

链接: https://arxiv.org/abs/2410.22685
作者: Yashvir S. Grewal,Edwin V. Bonilla,Thang D. Bui
关键词-EN: Accurately quantifying uncertainty, Accurately quantifying, large language models, reliable deployment, high-stakes applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurately quantifying uncertainty in large language models (LLMs) is crucial for their reliable deployment, especially in high-stakes applications. Current state-of-the-art methods for measuring semantic uncertainty in LLMs rely on strict bidirectional entailment criteria between multiple generated responses and also depend on sequence likelihoods. While effective, these approaches often overestimate uncertainty due to their sensitivity to minor wording differences, additional correct information, and non-important words in the sequence. We propose a novel approach that leverages semantic embeddings to achieve smoother and more robust estimation of semantic uncertainty in LLMs. By capturing semantic similarities without depending on sequence likelihoods, our method inherently reduces any biases introduced by irrelevant words in the answers. Furthermore, we introduce an amortised version of our approach by explicitly modelling semantics as latent variables in a joint probabilistic model. This allows for uncertainty estimation in the embedding space with a single forward pass, significantly reducing computational overhead compared to existing multi-pass methods. Experiments across multiple question-answering datasets and frontier LLMs demonstrate that our embedding-based methods provide more accurate and nuanced uncertainty quantification than traditional approaches.
摘要:在大语言模型 (LLM) 中准确量化不确定性对于其可靠部署至关重要,尤其是在高风险应用中。当前最先进的方法用于测量 LLM 中的语义不确定性依赖于多个生成响应之间的严格双向蕴涵标准,并且还依赖于序列似然性。尽管这些方法有效,但由于它们对细微的措辞差异、额外正确信息以及序列中不重要的词语的敏感性,往往高估了不确定性。我们提出了一种利用语义嵌入的新方法,以实现对 LLM 中语义不确定性的更平滑和更稳健的估计。通过捕捉语义相似性而不依赖于序列似然性,我们的方法本质上减少了由答案中无关词语引入的任何偏差。此外,我们通过在联合概率模型中将语义显式建模为潜在变量,引入了我们方法的摊销版本。这使得在嵌入空间中进行不确定性估计只需一次前向传递,与现有的多遍方法相比,显著减少了计算开销。在多个问答数据集和前沿 LLM 上的实验表明,我们的基于嵌入的方法比传统方法提供了更准确和细致的不确定性量化。
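基于嵌入的语义不确定性可以有一个最简化的理解(打分方式为示意性假设,并非论文的完整概率模型):对同一问题采样多个回答,对各回答的嵌入两两计算余弦相似度,平均相似度越低,语义不确定性越高。

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def semantic_uncertainty(embeddings):
    """对 n 个回答的嵌入两两计算余弦相似度;平均相似度越低,语义不确定性越高。"""
    n = len(embeddings)
    sims = [cosine(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - sum(sims) / len(sims)

# 三个方向一致的回答 → 不确定性为 0;语义分散的回答得分会升高
print(semantic_uncertainty([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]))  # → 0.0
```

这一打分只依赖嵌入间的几何关系而不依赖序列似然,正对应摘要中"减少无关词语引入偏差"的思路。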

[NLP-44] Automated Trustworthiness Oracle Generation for Machine Learning Text Classifiers

【速读】: 该论文试图解决机器学习(ML)文本分类模型在实际应用中的可信度问题。传统评估指标如模型置信度和准确率不足以建立人类对ML模型的信任,因为这些模型在训练过程中可能学习到虚假的相关性,在实际应用中表现不佳。解决方案的关键在于提出了TOKI,这是一种自动化的可信度预言生成方法,通过解释方法和词嵌入技术自动检查预测相关词是否与预测类别相关。TOKI不仅提高了模型预测的可信度评估准确性(比基于模型置信度的基线方法高出142%),还引入了一种新的对抗攻击方法,该方法在减少扰动的情况下比现有的最先进(SOTA)方法A2T更为有效。

链接: https://arxiv.org/abs/2410.22663
作者: Lam Nguyen Tung,Steven Cho,Xiaoning Du,Neelofar Neelofar,Valerio Terragni,Stefano Ruberto,Aldeida Aleti
关键词-EN: Machine learning, chatbot consulting, toxicity detection, review analysis, adversarial attack method
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Machine learning (ML) for text classification has been widely used in various domains, such as toxicity detection, chatbot consulting, and review analysis. These applications can significantly impact ethics, economics, and human behavior, raising serious concerns about trusting ML decisions. Several studies indicate that traditional metrics, such as model confidence and accuracy, are insufficient to build human trust in ML models. These models often learn spurious correlations during training and predict based on them during inference. In the real world, where such correlations are absent, their performance can deteriorate significantly. To avoid this, a common practice is to test whether predictions are reasonable. Along with this, a challenge known as the trustworthiness oracle problem has been introduced. Due to the lack of automated trustworthiness oracles, the assessment requires manual validation of the decision process disclosed by explanation methods, which is time-consuming and not scalable. We propose TOKI, the first automated trustworthiness oracle generation method for text classifiers, which automatically checks whether the prediction-contributing words are related to the predicted class using explanation methods and word embeddings. To demonstrate its practical usefulness, we introduce a novel adversarial attack method targeting trustworthiness issues identified by TOKI. We compare TOKI with a naive baseline based solely on model confidence using human-created ground truths of 6,000 predictions. We also compare TOKI-guided adversarial attack method with A2T, a SOTA adversarial attack method. Results show that relying on prediction uncertainty cannot distinguish between trustworthy and untrustworthy predictions, TOKI achieves 142% higher accuracy than the naive baseline, and TOKI-guided adversarial attack method is more effective with fewer perturbations than A2T.
摘要:文本分类的机器学习 (ML) 在多个领域中得到了广泛应用,如毒性检测、聊天机器人咨询和评论分析。这些应用对伦理、经济和人类行为有着重大影响,引发了人们对信任 ML 决策的严重担忧。多项研究表明,传统的评估指标,如模型置信度和准确率,不足以建立人类对 ML 模型的信任。这些模型在训练过程中常常学习到虚假的相关性,并在推理时基于这些相关性进行预测。在现实世界中,这些相关性不存在时,模型的性能会显著下降。为了避免这种情况,一种常见的做法是测试预测是否合理。随之而来的是一个被称为“信任度预言问题”的挑战。由于缺乏自动化的信任度预言,评估需要通过解释方法手动验证决策过程,这既耗时又不可扩展。我们提出了 TOKI,这是首个用于文本分类器的自动化信任度预言生成方法,它通过解释方法和词嵌入自动检查预测贡献词是否与预测类别相关。为了展示其实际效用,我们引入了一种针对 TOKI 识别的信任度问题的新型对抗攻击方法。我们通过人工创建的 6,000 个预测的真实值,将 TOKI 与仅基于模型置信度的朴素基线进行比较。我们还比较了 TOKI 引导的对抗攻击方法与 SOTA 对抗攻击方法 A2T。结果显示,仅依赖预测不确定性无法区分可信和不可信的预测,TOKI 的准确率比朴素基线高出 142%,并且 TOKI 引导的对抗攻击方法在比 A2T 更少的扰动下更为有效。
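TOKI 的核心判断是"预测贡献词是否与预测类别相关"。下面用几组虚构的玩具词向量做一个示意(词向量、阈值与函数名均为假设;真实实现应使用预训练词嵌入,并从解释方法获得贡献词):

```python
from math import sqrt

# 假设的玩具词向量;真实实现应使用预训练词嵌入
EMB = {
    "toxic": [0.9, 0.1],
    "hate":  [0.85, 0.2],
    "pizza": [0.05, 0.95],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def is_trustworthy(contributing_words, class_word, threshold=0.8):
    """若存在贡献词与预测类别词的相似度不低于阈值,则认为该预测的决策过程可信。"""
    sims = [cosine(EMB[w], EMB[class_word])
            for w in contributing_words if w in EMB]
    return bool(sims) and max(sims) >= threshold

print(is_trustworthy(["hate"], "toxic"))   # → True
print(is_trustworthy(["pizza"], "toxic"))  # → False
```

贡献词与类别语义无关(如依据 "pizza" 判定 "toxic")正是摘要所说模型学到虚假相关性的信号。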

[NLP-45] Linguistics Theory Meets LLM : Code-Switched Text Generation via Equivalence Constrained Large Language Models

【速读】: 该论文试图解决自然语言处理 (NLP) 中代码转换 (code-switching) 文本生成的问题,特别是在现有研究多集中于句法约束或神经生成的情况下,如何将语言学理论与大型语言模型 (LLMs) 结合以生成自然且符合语言学规则的代码转换文本。解决方案的关键在于引入了一个名为 EZSwitch 的新框架,该框架结合了等价约束理论 (Equivalence Constraint Theory, ECT) 与 LLMs,以生成既符合语言学规则又流畅的代码转换文本。通过结合语言学约束和 LLMs,EZSwitch 显著提升了生成文本的质量,并通过人工评价和自动指标评估验证了其有效性。此外,论文还创建了 CSPref 数据集,用于分析模型在不同难度示例上的表现,进一步证明了语言学约束对生成文本的鲁棒性和与人类偏好的一致性的重要性。

链接: https://arxiv.org/abs/2410.22660
作者: Garry Kuwanto,Chaitanya Agarwal,Genta Indra Winata,Derry Tanti Wijaya
关键词-EN: Natural Language Processing, presents unique challenges, Language Processing, single conversation, presents unique
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Code-switching, the phenomenon of alternating between two or more languages in a single conversation, presents unique challenges for Natural Language Processing (NLP). Most existing research focuses on either syntactic constraints or neural generation, with few efforts to integrate linguistic theory with large language models (LLMs) for generating natural code-switched text. In this paper, we introduce EZSwitch, a novel framework that combines Equivalence Constraint Theory (ECT) with LLMs to produce linguistically valid and fluent code-switched text. We evaluate our method using both human judgments and automatic metrics, demonstrating a significant improvement in the quality of generated code-switching sentences compared to baseline LLMs. To address the lack of suitable evaluation metrics, we conduct a comprehensive correlation study of various automatic metrics against human scores, revealing that current metrics often fail to capture the nuanced fluency of code-switched text. Additionally, we create CSPref, a human preference dataset based on human ratings and analyze model performance across hard and easy examples. Our findings indicate that incorporating linguistic constraints into LLMs leads to more robust and human-aligned generation, paving the way for scalable code-switching text generation across diverse language pairs.
摘要:代码转换(Code-switching),即在单次对话中交替使用两种或多种语言的现象,为自然语言处理(Natural Language Processing, NLP)带来了独特的挑战。现有的大多数研究要么集中在句法约束上,要么集中在神经生成上,很少有研究将语言学理论与大语言模型(Large Language Models, LLMs)结合,以生成自然的代码转换文本。本文介绍了EZSwitch,这是一种新颖的框架,它将等价约束理论(Equivalence Constraint Theory, ECT)与LLMs结合,以生成语言学上有效且流畅的代码转换文本。我们通过人类判断和自动度量两种方式评估了我们的方法,结果显示,与基线LLMs相比,生成的代码转换句子的质量有了显著提高。为了解决缺乏合适的评估度量的问题,我们进行了各种自动度量与人类评分之间的全面相关性研究,发现当前的度量方法往往无法捕捉代码转换文本的细微流畅性。此外,我们创建了CSPref,一个基于人类评分的偏好数据集,并分析了模型在“困难”和“简单”示例中的表现。我们的研究结果表明,将语言学约束融入LLMs中可以导致更稳健且与人类一致的生成,为跨多种语言对的代码转换文本生成铺平了道路。

[NLP-46] Prove Your Point!: Bringing Proof-Enhancement Principles to Argumentative Essay Generation EMNLP2024

【速读】: 该论文试图解决生成式议论文 (Argumentative Essay Generation, AEG) 中逻辑混乱的问题,即生成的观点之间缺乏高层次的逻辑连接,导致论证无效。解决方案的关键在于提出了一个统一的两阶段框架:证明增强与自我注释 (Proof-Enhancement and Self-Annotation, PESA)。具体来说,首先利用大型语言模型为逻辑信息(如主张和论据)构建伪标签,然后通过引入证明原则的树规划方法确保逻辑一致性。实验结果表明,PESA框架生成的议论文在逻辑有效性和说服力方面优于强基线模型。

链接: https://arxiv.org/abs/2410.22642
作者: Ruiyu Xiao,Lei Wu,Yuhang Gou,Weinan Zhang,Ting Liu
关键词-EN: specific controversial topics, generate complete texts, topics or debates, complete texts, texts on specific
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2024

点击查看摘要

Abstract:Argumentative essay generation (AEG) aims to generate complete texts on specific controversial topics or debates. Although current AEG methods can generate individual opinions, they often overlook the high-level connections between these opinions. This often leads to the generated results being mired in logical confusion, unable to proof their own arguments effectively. The generated essay may present evidence that contradicts the claims or they may fail to assemble the claims into logical flow. In this paper, we present a unified two-stage framework: Proof-Enhancement and Self-Annotation (PESA) for AEG with a focus on logical enhancement. Specifically, we first construct pseudo-labels for logical information,claims and grounds, using a large language model. We then propose a tree planning approach that introduces proof principles and ensures logical consistency. Extensive experimental results show that, benefiting from proof principle guidance, PESA generates argumentative essays with better logical validity and persuasiveness than strong baseline models.
摘要:论说文生成 (Argumentative Essay Generation, AEG) 旨在针对特定争议性话题或辩论生成完整的文本。尽管当前的 AEG 方法能够生成独立的观点,但它们往往忽视了这些观点之间的高层次逻辑联系。这常常导致生成的结果陷入逻辑混乱,无法有效证明其论点。生成的文章可能会提供与主张相矛盾的证据,或者未能将主张组织成逻辑流畅的结构。本文提出了一种统一的二阶段框架:证明增强与自我标注 (Proof-Enhancement and Self-Annotation, PESA),专注于逻辑增强。具体而言,我们首先利用大语言模型为逻辑信息、主张和依据构建伪标签。随后,我们提出了一种树形规划方法,引入证明原则并确保逻辑一致性。广泛的实验结果表明,得益于证明原则的指导,PESA 生成的论说文在逻辑有效性和说服力方面优于强大的基线模型。

[NLP-47] Characterizing the Role of Similarity in the Property Inferences of Language Models

【速读】: 该论文试图解决的问题是关于属性继承(property inheritance)在语言模型(LMs)中的实现机制,即这种能力是源于显式存储的分类知识(taxonomic knowledge)还是基于心理表征之间的简单相似性计算(simple computations of similarity between mental representations)。解决方案的关键在于通过行为和因果表征分析实验,揭示语言模型在属性继承行为中如何同时利用分类关系和类别相似性。研究发现,语言模型在属性继承时,更倾向于将新属性从高层次类别传递到低层次类别,当这两个类别在分类上相关且相似度高时,这种传递更为显著。这一发现不仅揭示了语言模型的概念结构,还为未来针对人类受试者的心理语言学实验提供了新的方向。

链接: https://arxiv.org/abs/2410.22590
作者: Juan Diego Rodriguez,Aaron Mueller,Kanishka Misra
关键词-EN: higher level categories, level categories, higher level, lower level, projected from higher
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Property inheritance – a phenomenon where novel properties are projected from higher level categories (e.g., birds) to lower level ones (e.g., sparrows) – provides a unique window into how humans organize and deploy conceptual knowledge. It is debated whether this ability arises due to explicitly stored taxonomic knowledge vs. simple computations of similarity between mental representations. How are these mechanistic hypotheses manifested in contemporary language models? In this work, we investigate how LMs perform property inheritance with behavioral and causal representational analysis experiments. We find that taxonomy and categorical similarities are not mutually exclusive in LMs’ property inheritance behavior. That is, LMs are more likely to project novel properties from one category to the other when they are taxonomically related and at the same time, highly similar. Our findings provide insight into the conceptual structure of language models and may suggest new psycholinguistic experiments for human subjects.
摘要:属性继承——一种将新属性从较高层级类别(如鸟类)投射到较低层级类别(如麻雀)的现象——为研究人类如何组织和运用概念知识提供了独特的视角。关于这种能力是源于显式存储的分类知识还是心理表征之间简单相似性计算的争论一直存在。当代语言模型中这些机制假设是如何体现的呢?在本研究中,我们通过行为和因果表征分析实验探讨了大语言模型(LMs)如何进行属性继承。我们发现,分类学和类别相似性在大语言模型的属性继承行为中并非互斥。也就是说,当两个类别在分类学上相关且同时高度相似时,大语言模型更有可能将新属性从一个类别投射到另一个类别。我们的研究结果为大语言模型的概念结构提供了洞见,并可能为人类受试者的心理语言学实验提出新的方向。

[NLP-48] oxicity of the Commons: Curating Open-Source Pre-Training Data

【速读】: 该论文试图解决开源大型语言模型在训练数据中可能包含的有害内容问题。解决方案的关键在于提出一个全新的开源数据毒性过滤管道,具体包括:1) 创建一个自定义训练数据集 ToxicCommons,该数据集涵盖了五个不同维度的有害内容分类(种族/起源、性别/性取向、宗教、能力歧视和暴力);2) 使用 ToxicCommons 数据集训练一个定制的分类器 Celadon,用于更高效、更大规模地检测开源数据中的有害内容;3) 描述了一种平衡的内容过滤方法,以优化安全过滤与可用于训练的过滤数据之间的关系。

链接: https://arxiv.org/abs/2410.22587
作者: Catherine Arnett,Eliot Jones,Ivan P. Yamshchikov,Pierre-Carl Langlais
关键词-EN: Open-source large language, large language models, models, data, Optical Character Recognition
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Open-source large language models are becoming increasingly available and popular among researchers and practitioners. While significant progress has been made on open-weight models, open training data is a practice yet to be adopted by the leading open-weight models creators. At the same time, researchers are working to make language models safer. We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data. There are unique challenges to working with public domain data, as these sources differ from web text in both form and content. Many sources are historical documents and are the result of Optical Character Recognition (OCR). Consequently, current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models. In this paper, we introduce a new fully open-source pipeline for open-data toxicity filtering. Our contributions are threefold. We create a custom training dataset, ToxicCommons, which is composed of texts which have been classified across five different dimensions (racial/origin-based, gender/sex-based, religious, ability-based discrimination, and violence). We use this dataset to train a custom classifier, Celadon, that can be used to detect toxic content in open data more efficiently at a larger scale. Finally, we describe the balanced approach to content filtration that optimizes safety filtering with respect to the filtered data available for training.
摘要:开源大语言模型正变得越来越普及,受到研究人员和从业者的青睐。尽管在开源权重模型方面取得了显著进展,但领先的开放权重模型创建者尚未采用开放训练数据的实践。与此同时,研究人员正在努力使语言模型更加安全。我们提出了一种数据筛选流程,以减少模型在公共领域数据上训练时产生的有害输出。处理公共领域数据面临独特的挑战,因为这些数据源在形式和内容上都与网络文本不同。许多数据源是历史文档,并且是光学字符识别(OCR)的结果。因此,当前最先进的毒性过滤方法往往对开放数据模型不可行或不适用。在本文中,我们介绍了一种全新的全开源开放数据毒性过滤流程。我们的贡献主要有三点。我们创建了一个自定义训练数据集,称为ToxicCommons,该数据集由在五个不同维度(种族/基于起源、性别/基于性别、宗教、基于能力歧视和暴力)上分类的文本组成。我们使用这个数据集训练了一个自定义分类器Celadon,该分类器可以更高效地在大规模上检测开放数据中的有毒内容。最后,我们描述了一种平衡的内容过滤方法,该方法在可用于训练的过滤数据方面优化了安全过滤。
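按照摘要的设定,过滤流程可以示意为:对每条文本在五个维度上打分,任一维度超过阈值即剔除(阈值、维度键名与分类器接口均为示意性假设):

```python
DIMENSIONS = ["racial", "gender", "religious", "ability", "violence"]

def filter_corpus(texts, classify, threshold=0.5):
    """classify(text) 返回各维度毒性分数的 dict;任一维度达到阈值即剔除该文本。"""
    kept, flagged = [], []
    for t in texts:
        scores = classify(t)
        if any(scores.get(d, 0.0) >= threshold for d in DIMENSIONS):
            flagged.append(t)
        else:
            kept.append(t)
    return kept, flagged

# 玩具分类器:仅当文本包含 "fight" 时给出高暴力分
toy_classify = lambda t: {"violence": 0.9} if "fight" in t else {}
kept, flagged = filter_corpus(["a fight", "a flower"], toy_classify)
print(kept, flagged)  # → ['a flower'] ['a fight']
```

摘要强调的"平衡的内容过滤"即在此类阈值上权衡:阈值越低越安全,但可供训练的 kept 数据也越少。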

[NLP-49] BENCHAGENTS : Automated Benchmark Creation with Agent Interaction

【速读】: 该论文试图解决现有评估方法受限于基准数据可用性的问题,特别是在模型不断进化的情况下,需要创建新的基准来衡量生成能力的新进展。解决方案的关键在于引入BENCHAGENTS框架,该框架利用大型语言模型(LLMs)自动化复杂能力的基准创建过程,同时确保数据和指标的质量。BENCHAGENTS将基准创建过程分解为规划、生成、数据验证和评估四个阶段,每个阶段由一个LLM代理执行,并通过与基准开发者的反馈循环来提升数据多样性和质量。

链接: https://arxiv.org/abs/2410.22584
作者: Natasha Butt,Varun Chandrasekaran,Neel Joshi,Besmira Nushi,Vidhisha Balachandran
关键词-EN: benchmark availability, BENCHAGENTS, benchmark, benchmarks, create benchmarks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluations are limited by benchmark availability. As models evolve, there is a need to create benchmarks that can measure progress on new generative capabilities. However, creating new benchmarks through human annotations is slow and expensive, restricting comprehensive evaluations for any capability. We introduce BENCHAGENTS, a framework that methodically leverages large language models (LLMs) to automate benchmark creation for complex capabilities while inherently ensuring data and metric quality. BENCHAGENTS decomposes the benchmark creation process into planning, generation, data verification, and evaluation, each of which is executed by an LLM agent. These agents interact with each other and utilize human-in-the-loop feedback from benchmark developers to explicitly improve and flexibly control data diversity and quality. We use BENCHAGENTS to create benchmarks to evaluate capabilities related to planning and constraint satisfaction during text generation. We then use these benchmarks to study seven state-of-the-art models and extract new insights on common failure modes and model differences.
摘要:评估受限于基准测试的可用性。随着模型的演进,需要创建能够衡量新生成能力进展的基准测试。然而,通过人工注释创建新基准测试既缓慢又昂贵,限制了任何能力的全面评估。我们引入了 BENCHAGENTS,这是一个框架,系统地利用大语言模型 (LLM) 来自动化复杂能力的基准测试创建,同时内在地确保数据和指标的质量。BENCHAGENTS 将基准测试创建过程分解为规划、生成、数据验证和评估,每个步骤都由一个 LLM 智能体执行。这些智能体相互交互,并利用基准测试开发者的反馈,以显式方式改进并灵活控制数据多样性和质量。我们使用 BENCHAGENTS 创建了评估文本生成过程中规划和约束满足能力的基准测试。然后,我们使用这些基准测试研究了七个最先进的模型,并提取了关于常见失败模式和模型差异的新见解。
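BENCHAGENTS 的"规划、生成、验证、评估"四阶段流程可以抽象为如下骨架(各阶段以一个可调用对象示意对应的 LLM 智能体;具体交互与人类反馈机制为示意性假设):

```python
def run_benchmark_pipeline(planner, generator, verifier, evaluator):
    """规划 → 生成 → 验证 → 评估;各阶段以一个可调用对象示意对应的 LLM 智能体。"""
    plan = planner()
    candidates = generator(plan)
    verified = [c for c in candidates if verifier(c)]  # 剔除未通过数据验证的样本
    return evaluator(verified)

# 玩具用例:规划生成 3 个样本,验证只保留偶数,评估统计保留数量
result = run_benchmark_pipeline(
    planner=lambda: 3,
    generator=lambda n: list(range(n)),
    verifier=lambda x: x % 2 == 0,
    evaluator=lambda xs: len(xs),
)
print(result)  # → 2
```

将每个阶段解耦为独立可替换的组件,也便于在验证阶段接入摘要所说的人类反馈回路。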

[NLP-50] Auto-Intent: Automated Intent Discovery and Self-Exploration for Large Language Model Web Agents EMNLP2024

【速读】: 该论文试图解决在无需直接微调的情况下,将预训练的大型语言模型 (LLM) 适应于特定领域(特别是网页导航任务)的问题。解决方案的关键在于提出了Auto-Intent方法,该方法通过无监督地从目标领域演示中提取潜在意图(以高度紧凑的形式,最多三个词),并训练意图预测器来根据代理的过去观察和行动预测下一个意图。特别地,论文提出了一种自探索方法,通过向预训练的LLM代理提供前k个可能的意图预测作为提示,从而增强其决策能力。这种方法显著提升了GPT-3.5、GPT-4、Llama-3.1-70B和Llama-405B在Mind2Web和WebArena等大规模真实网站导航基准测试中的表现,并展示了跨基准的泛化能力。

链接: https://arxiv.org/abs/2410.22552
作者: Jaekyeom Kim,Dong-Ki Kim,Lajanugen Logeswaran,Sungryull Sohn,Honglak Lee
关键词-EN: large language model, pre-trained large language, language model, direct fine-tuning, target domain
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: EMNLP 2024 Findings

点击查看摘要

Abstract:In this paper, we introduce Auto-Intent, a method to adapt a pre-trained large language model (LLM) as an agent for a target domain without direct fine-tuning, where we empirically focus on web navigation tasks. Our approach first discovers the underlying intents from target domain demonstrations unsupervisedly, in a highly compact form (up to three words). With the extracted intents, we train our intent predictor to predict the next intent given the agent’s past observations and actions. In particular, we propose a self-exploration approach where top-k probable intent predictions are provided as a hint to the pre-trained LLM agent, which leads to enhanced decision-making capabilities. Auto-Intent substantially improves the performance of GPT-3.5, 4 and Llama-3.1-70B, 405B agents on the large-scale real-website navigation benchmarks from Mind2Web and online navigation tasks from WebArena with its cross-benchmark generalization from Mind2Web.
摘要:本文介绍了一种名为 Auto-Intent 的方法,该方法旨在将预训练的大语言模型 (LLM) 作为目标领域的智能体,而无需直接进行微调,我们在此实证研究中专注于网页导航任务。我们的方法首先通过无监督的方式从目标领域的演示中挖掘出潜在的意图,并以高度紧凑的形式(最多三个词)呈现。基于提取的意图,我们训练意图预测器,以根据智能体的过去观察和行动预测下一个意图。特别地,我们提出了一种自探索方法,其中前 k 个可能的意图预测作为提示提供给预训练的 LLM 智能体,从而增强了其决策能力。Auto-Intent 显著提升了 GPT-3.5、4 以及 Llama-3.1-70B、405B 智能体在大规模真实网站导航基准测试(来自 Mind2Web)和在线导航任务(来自 WebArena)中的表现,并展示了从 Mind2Web 到其他基准的跨基准泛化能力。
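摘要中"将前 k 个可能的意图预测作为提示 (hint) 提供给 LLM 智能体"这一步,可以示意为从意图预测器的概率分布中取前 k 项并拼接成提示文本(提示措辞与意图字符串均为假设):

```python
def build_intent_hint(intent_probs, k=3):
    """从意图预测器的概率分布中取 top-k 意图,拼接成附加给 LLM 智能体的提示文本。"""
    top = sorted(intent_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    hints = ", ".join(intent for intent, _ in top)
    return f"可能的下一步意图(按概率降序): {hints}"

probs = {"click search": 0.5, "type query": 0.3, "scroll down": 0.15, "go back": 0.05}
print(build_intent_hint(probs, k=2))
```

提示中只保留最多三个词的紧凑意图,与摘要所述"高度紧凑形式"的意图表示一致。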

[NLP-51] Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models

【速读】: 该论文试图解决大型语言模型(LLMs)在处理模糊比较提示时产生的偏见问题。解决方案的关键在于通过分析注意力机制来识别和减轻模型内部的偏见。具体来说,论文提出了一种名为ATLAS(Attention-based Targeted Layer Analysis and Scaling)的技术,通过量化模型对不同实体的偏好,定位偏见集中的特定层,并通过在这些层中调整注意力权重来减少偏见。实验结果表明,偏见主要集中在模型的后三分之一层,而ATLAS方法在有效减少偏见的同时,仅轻微增加了0.82%的困惑度,并在所有数据集上平均提高了0.28点的偏见评分。

链接: https://arxiv.org/abs/2410.22517
作者: Rishabh Adiga,Besmira Nushi,Varun Chandrasekaran
关键词-EN: ambiguous comparative prompts, providing clear context, large language models, comparative prompts, inputs that compare
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We explore the internal mechanisms of how bias emerges in large language models (LLMs) when provided with ambiguous comparative prompts: inputs that compare or enforce choosing between two or more entities without providing clear context for preference. Most approaches for bias mitigation focus on either post-hoc analysis or data augmentation. However, these are transient solutions, without addressing the root cause: the model itself. Numerous prior works show the influence of the attention module towards steering generations. We believe that analyzing attention is also crucial for understanding bias, as it provides insight into how the LLM distributes its focus across different entities and how this contributes to biased decisions. To this end, we first introduce a metric to quantify the LLM’s preference for one entity over another. We then propose ATLAS (Attention-based Targeted Layer Analysis and Scaling), a technique to localize bias to specific layers of the LLM by analyzing attention scores and then reduce bias by scaling attention in these biased layers. To evaluate our method, we conduct experiments across 3 datasets (BBQ, Crows-Pairs, and WinoGender) using GPT-2 XL (1.5B), GPT-J (6B), LLaMA-2 (7B) and LLaMA-3 (8B). Our experiments demonstrate that bias is concentrated in the later layers, typically around the last third. We also show how ATLAS effectively mitigates bias through targeted interventions without compromising downstream performance and an average increase of only 0.82% in perplexity when the intervention is applied. We see an average improvement of 0.28 points in the bias score across all the datasets.
摘要:我们探讨了大语言模型(LLM)在面对模糊比较性提示时,偏见产生的内部机制:这些输入在没有提供明确偏好上下文的情况下,比较或强制选择两个或多个实体。大多数偏见缓解方法侧重于事后分析或数据增强,但这些方法只是暂时的解决方案,并未触及问题的根源:模型本身。许多先前的工作表明,注意力模块对生成结果有显著影响。我们认为,分析注意力对于理解偏见同样至关重要,因为它揭示了LLM如何在不同实体之间分配其关注点,以及这种分配如何导致偏见决策。为此,我们首先引入了一种度量方法,用于量化LLM对某一实体的偏好程度。接着,我们提出了ATLAS(基于注意力的目标层分析与缩放)技术,通过分析注意力分数,将偏见定位到LLM的特定层,并通过在这些偏见层中缩放注意力来减少偏见。为了评估我们的方法,我们在3个数据集(BBQ、Crows-Pairs和WinoGender)上进行了实验,使用了GPT-2 XL(1.5B)、GPT-J(6B)、LLaMA-2(7B)和LLaMA-3(8B)。实验结果表明,偏见主要集中在模型的后几层,通常在最后三分之一的部分。我们还展示了ATLAS通过有针对性的干预有效缓解偏见,同时不损害下游性能,并且在应用干预时,困惑度仅平均增加了0.82%。我们在所有数据集上的偏见评分平均提高了0.28分。
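"在偏见层缩放注意力"的直观效果可以用一个单头注意力的玩具例子说明:在 softmax 之前,把指向某实体 token 的分数乘以系数 alpha(此处的缩放位置与方式为示意性假设,并非论文的精确实现):

```python
from math import exp

def softmax(xs):
    m = max(xs)
    es = [exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def scaled_attention(scores, entity_positions, alpha):
    """在 softmax 之前,把指向目标实体 token 的注意力分数乘以 alpha(alpha < 1 即削弱偏好)。"""
    adjusted = [s * alpha if i in entity_positions else s
                for i, s in enumerate(scores)]
    return softmax(adjusted)

# 原始分数偏向位置 0 的实体;alpha=0.5 后三个位置的注意力变为均匀分布
weights = scaled_attention([2.0, 1.0, 1.0], {0}, 0.5)
print([round(w, 3) for w in weights])  # → [0.333, 0.333, 0.333]
```

只对定位到的少数偏见层做这种干预,正是摘要中困惑度仅上升 0.82% 的原因:其余层保持原样。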

[NLP-52] Anticipating Future with Large Language Model for Simultaneous Machine Translation

【速读】: 该论文试图解决的是实时机器翻译(Simultaneous Machine Translation, SMT)中翻译质量和延迟之间的权衡问题。解决方案的关键是提出了一个名为“通过预测未来进行翻译(Translation by Anticipating Future, TAF)”的方法。TAF的核心思想是利用大型语言模型(Large Language Model, LLM)来预测未来的源语言词汇,从而在保持低延迟的同时,提高翻译质量。实验结果表明,TAF在四个语言方向上均实现了最佳的翻译质量-延迟权衡,并且在相同延迟(三个词)下,其翻译质量比基线方法高出最多5个BLEU点。

链接: https://arxiv.org/abs/2410.22499
作者: Siqi Ouyang,Oleksii Hrinchuk,Zhehuai Chen,Vitaly Lavrukhin,Jagadeesh Balam,Lei Li,Boris Ginsburg
关键词-EN: produces target text, Simultaneous machine translation, incrementally produces target, Simultaneous machine, streaming input utterances
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Simultaneous machine translation (SMT) takes streaming input utterances and incrementally produces target text. Existing SMT methods only use the partial utterance that has already arrived at the input and the generated hypothesis. Motivated by human interpreters’ technique to forecast future words before hearing them, we propose Translation by Anticipating Future (TAF), a method to improve translation quality while retaining low latency. Its core idea is to use a large language model (LLM) to predict future source words and opportunistically translate without introducing too much risk. We evaluate our TAF and multiple baselines of SMT on four language directions. Experiments show that TAF achieves the best translation quality-latency trade-off and outperforms the baselines by up to 5 BLEU points at the same latency (three words).
摘要:同时机器翻译 (Simultaneous Machine Translation, SMT) 接收流式输入语句并逐步生成目标文本。现有的 SMT 方法仅利用已到达输入端的局部语句和生成的假设文本。受人类口译员在听到未来词汇之前预测其内容的技巧启发,我们提出了通过预测未来 (Translation by Anticipating Future, TAF) 的方法,旨在提高翻译质量的同时保持低延迟。其核心思想是利用大语言模型 (Large Language Model, LLM) 预测未来的源语言词汇,并在不引入过多风险的情况下进行机会性翻译。我们在四个语言方向上评估了 TAF 和多个 SMT 基线方法。实验结果表明,TAF 在翻译质量与延迟的权衡上表现最佳,在相同延迟(三个词)的情况下,其翻译质量比基线方法高出最多 5 个 BLEU 分。
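
TAF 的核心循环(逐词读入源句、预测未来词、仅提交低风险的译文前缀)可以用下面的极简草图说明。其中 `predict_future` 和 `translate` 均为占位函数,代表真实系统中的 LLM 预测器和翻译模型;"只提交带预测与不带预测两种假设一致的前缀"是对论文风险控制思想的一种简化假设。

```python
# 概念示意:TAF 式"预测未来再翻译"的同传循环(示意实现,非论文官方代码)

def taf_stream(source_tokens, predict_future, translate, k=2):
    """逐词读入源句;每步用 LLM 预测未来 k 个源词,
    只提交在"有预测 / 无预测"两种译文中一致的目标前缀,以控制风险。"""
    committed = []
    for t in range(1, len(source_tokens) + 1):
        prefix = source_tokens[:t]
        future = predict_future(prefix, k)      # LLM 预测的未来源词
        hyp_with = translate(prefix + future)   # 带预测的译文假设
        hyp_without = translate(prefix)         # 不带预测的译文假设
        agree = []
        for a, b in zip(hyp_with, hyp_without):
            if a != b:
                break
            agree.append(a)
        if len(agree) > len(committed):
            committed = agree                   # 提交新增的稳定前缀
    return committed

# 占位的"LLM 预测"和"翻译",仅用于演示数据流
def toy_predict(prefix, k):
    vocab = ["a", "b", "c", "d"]
    return vocab[len(prefix):len(prefix) + k]

def toy_translate(tokens):
    return [tok.upper() for tok in tokens]

result = taf_stream(["a", "b", "c"], toy_predict, toy_translate)
```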

[NLP-53] Scaling LLM Inference with Optimized Sample Compute Allocation

【速读】: 该论文试图解决在大语言模型(LLMs)推理过程中,如何在有限的计算资源下高效地进行采样配置的问题。解决方案的关键在于提出了一种名为OSCA(Optimizes Sample Compute Allocation)的算法,该算法通过优化不同推理配置(如模型、温度、语言等)的混合分配,以实现更高的准确性。实验结果表明,OSCA能够在代码生成任务中以128倍的计算量减少达到优于单一配置的准确性,并在4个推理任务中以25倍的计算量减少实现相同效果。此外,OSCA在多轮任务中也表现出色,在SWE-Bench上以3倍的计算量减少实现了更好的准确性。

链接: https://arxiv.org/abs/2410.22480
作者: Kexun Zhang,Shang Zhou,Danqing Wang,William Yang Wang,Lei Li
关键词-EN: large language models, basic operation, compute, large language, Optimizes Sample Compute
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sampling is a basic operation in many inference-time algorithms of large language models (LLMs). To scale up inference efficiently with a limited compute, it is crucial to find an optimal allocation for sample compute budgets: Which sampling configurations (model, temperature, language, etc.) do we use? How many samples do we generate in each configuration? We formulate these choices as a learning problem and propose OSCA, an algorithm that Optimizes Sample Compute Allocation by finding an optimal mix of different inference configurations. Our experiments show that with our learned mixed allocation, we can achieve accuracy better than the best single configuration with 128x less compute on code generation and 25x less compute on 4 reasoning tasks. OSCA is also shown to be effective in agentic workflows beyond single-turn tasks, achieving a better accuracy on SWE-Bench with 3x less compute than the default configuration. Our code and generations are released at this https URL.
摘要:采样是许多大语言模型(LLM)推理算法中的基本操作。为了在有限的计算资源下高效扩展推理,找到样本计算预算的最佳分配至关重要:我们应使用哪些采样配置(模型、温度、语言等)?在每种配置下应生成多少样本?我们将这些选择形式化为一个学习问题,并提出了 OSCA 算法,该算法通过寻找不同推理配置的最佳组合来优化样本计算分配。我们的实验表明,通过我们学习到的混合分配,我们可以在代码生成任务上以 128 倍的计算量获得比最佳单一配置更高的准确性,在 4 个推理任务上以 25 倍的计算量获得更高的准确性。OSCA 还被证明在单轮任务之外的智能体工作流程中同样有效,在 SWE-Bench 上以 3 倍的计算量实现了比默认配置更高的准确性。我们的代码和生成结果已在此 https URL 发布。
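
"在固定预算下为不同采样配置分配样本数"这一优化问题,可以用下面的贪心草图直观说明:假设每种配置的单样本成功率已知(真实 OSCA 中由学习得到),目标是最大化"至少一个样本正确"的概率。由于失败概率 (1-p)^n 的边际收益递减,逐个样本贪心分配即可得到最优解。以下为示意实现,成功率数值为假设数据。

```python
# 概念示意:固定预算下的样本计算分配(OSCA 思想的简化贪心版)

def allocate_samples(success_probs, budget):
    """每次把一个样本加到边际收益最大的配置上。
    配置 i 分到 n_i 个样本时,全部失败的概率为 (1 - p_i)^n_i。"""
    alloc = [0] * len(success_probs)
    for _ in range(budget):
        best_i, best_gain = 0, -1.0
        for i, p in enumerate(success_probs):
            fail_now = (1 - p) ** alloc[i]
            gain = fail_now - (1 - p) ** (alloc[i] + 1)  # 再加一个样本的边际收益
            if gain > best_gain:
                best_i, best_gain = i, gain
        alloc[best_i] += 1
    return alloc

def solve_prob(success_probs, alloc):
    """按分配方案,至少一个样本正确的概率。"""
    fail = 1.0
    for p, n in zip(success_probs, alloc):
        fail *= (1 - p) ** n
    return 1 - fail

probs = [0.5, 0.2, 0.05]   # 三种配置(模型/温度/语言组合)的假设单样本成功率
alloc = allocate_samples(probs, 6)
```

可以看到,成功率极低的第三种配置不会被分到任何样本,预算集中在更有希望的配置上。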

[NLP-54] A Pointer Network-based Approach for Joint Extraction and Detection of Multi-Label Multi-Class Intents EMNLP2024

【速读】: 该论文试图解决面向任务的对话系统中复杂查询的多意图检测和意图跨度提取问题,以及缺乏多语言多意图数据集的问题。解决方案的关键在于引入了一个新的多标签多类别意图检测数据集(MLMCID-dataset),并提出了一种基于指针网络的架构(MLMCID),用于从查询中提取意图跨度和检测多个意图,使用粗粒度和细粒度标签的六元组形式。该方法在多个数据集上的准确率和F1-score方面表现优于基线方法。

链接: https://arxiv.org/abs/2410.22476
作者: Ankan Mullick,Sombit Bose,Abhilash Nandy,Gajula Sai Chaitanya,Pawan Goyal
关键词-EN: interpreting user queries, task-oriented dialogue systems, providing appropriate responses, task-oriented dialogue, crucial for interpreting
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at EMNLP 2024 Findings (Long Paper)

点击查看摘要

Abstract:In task-oriented dialogue systems, intent detection is crucial for interpreting user queries and providing appropriate responses. Existing research primarily addresses simple queries with a single intent, lacking effective systems for handling complex queries with multiple intents and extracting different intent spans. Additionally, there is a notable absence of multilingual, multi-intent datasets. This study addresses three critical tasks: extracting multiple intent spans from queries, detecting multiple intents, and developing a multi-lingual multi-label intent dataset. We introduce a novel multi-label multi-class intent detection dataset (MLMCID-dataset) curated from existing benchmark datasets. We also propose a pointer network-based architecture (MLMCID) to extract intent spans and detect multiple intents with coarse and fine-grained labels in the form of sextuplets. Comprehensive analysis demonstrates the superiority of our pointer network-based system over baseline approaches in terms of accuracy and F1-score across various datasets.
摘要:在面向任务的对话系统中,意图检测对于解释用户查询并提供适当的响应至关重要。现有研究主要针对单一意图的简单查询,缺乏有效处理包含多个意图的复杂查询以及提取不同意图范围的系统。此外,多语言、多意图数据集的缺失也是一个显著问题。本研究解决了三个关键任务:从查询中提取多个意图范围、检测多个意图,以及开发多语言多标签意图数据集。我们引入了一个新颖的多标签多类意图检测数据集(MLMCID-dataset),该数据集从现有基准数据集中精心筛选而成。同时,我们提出了一种基于指针网络的架构(MLMCID),用于提取意图范围并检测多个意图,这些意图以六元组形式表示的粗粒度和细粒度标签。综合分析表明,我们的基于指针网络的系统在各种数据集上的准确率和F1分数方面优于基线方法。
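
指针网络抽取意图片段的基本机制(start/end 两个指针各自在查询 token 上打分取最大)可用如下极简草图说明。这里用点积打分代替论文中的网络结构,编码状态与查询向量均为假设数据,并非 MLMCID 的官方实现。

```python
# 概念示意:用指针机制从查询中抽取意图片段(start/end 双指针的极简版)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def point(enc_states, query):
    """返回与 query 点积得分最高的 token 下标,即一个"指针"。"""
    scores = [dot(h, query) for h in enc_states]
    return max(range(len(scores)), key=scores.__getitem__)

def extract_span(enc_states, start_query, end_query):
    start = point(enc_states, start_query)
    # end 指针限制在 start 之后,保证抽出的片段合法
    end_scores = [dot(h, end_query) for h in enc_states[start:]]
    end = start + max(range(len(end_scores)), key=end_scores.__getitem__)
    return start, end

# 假设一个 5-token 查询的编码状态(2 维向量)
H = [[0.1, 0.0], [0.9, 0.1], [0.8, 0.9], [0.2, 0.8], [0.0, 0.1]]
span = extract_span(H, start_query=[1.0, 0.0], end_query=[0.0, 1.0])
```

多意图场景下,可重复运行多组 start/end 指针,再为每个片段配上粗/细粒度标签,组成论文所述的六元组。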

[NLP-55] Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset NEURIPS2024

【速读】: 该论文试图解决自主代理系统在动态环境中处理复杂任务和工具选择的问题。解决方案的关键在于提出了一个先进的代理框架,该框架能够处理多跳查询、生成和执行任务图、选择适当的工具,并适应实时变化。此外,论文引入了新的评估指标,如节点F1分数、结构相似性指数(SSI)和工具F1分数,以全面评估代理系统。通过开发基于AsyncHow的数据集,论文进一步分析了不同任务复杂性下的代理行为。研究结果表明,异步和动态的任务图分解显著增强了系统的响应性和可扩展性,特别是在处理复杂的多步骤任务时。

链接: https://arxiv.org/abs/2410.22457
作者: Adrian Garret Gabriel,Alaa Alameer Ahmad,Shankar Kumar Jeyakumar
关键词-EN: Large Language Models, Advancements in Large, automated tool selection, agentic systems
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024), NeurIPS 2024 Workshop on Open-World Agents

点击查看摘要

Abstract:Advancements in Large Language Models (LLMs) are revolutionizing the development of autonomous agentic systems by enabling dynamic, context-aware task decomposition and automated tool selection. These sophisticated systems possess significant automation potential across various industries, managing complex tasks, interacting with external systems to enhance knowledge, and executing actions independently. This paper presents three primary contributions to advance this field: - Advanced Agentic Framework: A system that handles multi-hop queries, generates and executes task graphs, selects appropriate tools, and adapts to real-time changes. - Novel Evaluation Metrics: Introduction of Node F1 Score, Structural Similarity Index (SSI), and Tool F1 Score to comprehensively assess agentic systems. - Specialized Dataset: Development of an AsyncHow-based dataset for analyzing agent behavior across different task complexities. Our findings reveal that asynchronous and dynamic task graph decomposition significantly enhances system responsiveness and scalability, particularly for complex, multi-step tasks. Detailed analysis shows that structural and node-level metrics are crucial for sequential tasks, while tool-related metrics are more important for parallel tasks. Specifically, the Structural Similarity Index (SSI) is the most significant predictor of performance in sequential tasks, and the Tool F1 Score is essential for parallel tasks. These insights highlight the need for balanced evaluation methods that capture both structural and operational dimensions of agentic systems. Additionally, our evaluation framework, validated through empirical analysis and statistical testing, provides valuable insights for improving the adaptability and reliability of agentic systems in dynamic environments. 
摘要:大语言模型 (LLM) 的进步正在通过实现动态、上下文感知的任务分解和自动化工具选择,彻底改变自主智能体系统的发展。这些复杂的系统在各个行业中具有显著的自动化潜力,能够管理复杂任务,与外部系统交互以增强知识,并独立执行操作。本文提出了三项主要贡献以推动该领域的发展:- 高级智能体框架:一个处理多跳查询、生成并执行任务图、选择适当工具并适应实时变化的系统。- 新型评估指标:引入节点 F1 分数、结构相似性指数 (SSI) 和工具 F1 分数,以全面评估智能体系统。- 专用数据集:开发了一个基于 AsyncHow 的数据集,用于分析不同任务复杂性下的智能体行为。我们的研究结果表明,异步和动态任务图分解显著增强了系统的响应能力和可扩展性,特别是在复杂的多步骤任务中。详细分析显示,结构和节点级指标对于顺序任务至关重要,而工具相关指标对于并行任务更为重要。具体而言,结构相似性指数 (SSI) 是顺序任务性能的最重要预测指标,而工具 F1 分数对于并行任务至关重要。这些见解强调了平衡评估方法的必要性,以捕捉智能体系统的结构和操作维度。此外,我们的评估框架通过实证分析和统计测试验证,为提高智能体系统在动态环境中的适应性和可靠性提供了宝贵的见解。
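
文中的节点 F1 与工具 F1 本质上是预测集合与参考集合之间的 precision/recall 调和平均,可用如下草图计算(论文中的节点匹配规则更细,这里按精确集合重叠作简化假设,示例数据亦为虚构):

```python
# 概念示意:按集合重叠计算任务图的节点 F1 与工具 F1

def f1(pred, gold):
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# 预测任务图多生成了一个多余节点 "email"
pred_nodes = {"search", "summarize", "rank", "email"}
gold_nodes = {"search", "summarize", "rank"}
node_f1 = f1(pred_nodes, gold_nodes)   # 精确率 3/4, 召回率 3/3

pred_tools = {"web_search", "calculator"}
gold_tools = {"web_search"}
tool_f1 = f1(pred_tools, gold_tools)
```

结构相似性指数 (SSI) 则进一步比较任务图的边结构而非仅节点集合,此处从略。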


[NLP-56] Do Large Language Models Align with Core Mental Health Counseling Competencies? NAACL2025

【速读】: 该论文试图解决大语言模型(LLMs)在心理健康咨询领域中与核心咨询能力对齐的问题。解决方案的关键在于引入了一个名为CounselingBench的新基准,该基准基于NCMHCE评估LLMs在五个关键心理健康咨询能力方面的表现。研究发现,前沿模型虽然超过了最低阈值,但在达到专家级水平方面仍有不足,特别是在需要同理心和情境理解的核心咨询属性和专业实践与伦理方面。此外,医学领域的LLMs在准确性上意外地不如通用模型,尽管其解释质量稍高,但情境相关错误更多。这强调了开发专门针对心理健康咨询的LLMs的必要性,这些模型需要严格对齐核心能力,并在实际应用前结合适当的人类监督。

链接: https://arxiv.org/abs/2410.22446
作者: Viet Cuong Nguyen,Mohammad Taher,Dongwan Hong,Vinicius Konkolics Possobom,Vibha Thirunellayi Gopalakrishnan,Ekta Raj,Zihang Li,Heather J. Soled,Michael L. Birnbaum,Srijan Kumar,Munmun De Choudhury
关键词-EN: Large Language Models, Large Language, mental health counseling, offers promising potential, evolution of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 Pages, In Submission to NAACL 2025

点击查看摘要

Abstract:The rapid evolution of Large Language Models (LLMs) offers promising potential to alleviate the global scarcity of mental health professionals. However, LLMs’ alignment with essential mental health counseling competencies remains understudied. We introduce CounselingBench, a novel NCMHCE-based benchmark evaluating LLMs across five key mental health counseling competencies. Testing 22 general-purpose and medical-finetuned LLMs, we find frontier models exceed minimum thresholds but fall short of expert-level performance, with significant variations: they excel in Intake, Assessment & Diagnosis yet struggle with Core Counseling Attributes and Professional Practice & Ethics. Medical LLMs surprisingly underperform generalist models accuracy-wise, while at the same time producing slightly higher-quality justifications but making more context-related errors. Our findings highlight the complexities of developing AI systems for mental health counseling, particularly for competencies requiring empathy and contextual understanding. We found that frontier LLMs perform at a level exceeding the minimal required level of aptitude for all key mental health counseling competencies, but fall short of expert-level performance, and that current medical LLMs do not significantly improve upon generalist models in mental health counseling competencies. This underscores the critical need for specialized, mental health counseling-specific fine-tuned LLMs that rigorously align with core competencies combined with appropriate human supervision before any responsible real-world deployment can be considered.
摘要:大语言模型 (LLM) 的快速发展为缓解全球心理健康专业人员的短缺提供了有前景的潜力。然而,LLM 与基本心理健康咨询能力的对齐仍未得到充分研究。我们引入了 CounselingBench,这是一个基于 NCMHCE 的新型基准,用于评估 LLM 在五个关键心理健康咨询能力方面的表现。测试了 22 个通用和医疗微调的 LLM,我们发现前沿模型超过了最低阈值,但未能达到专家级水平,且存在显著差异:它们在初诊、评估和诊断方面表现出色,但在核心咨询属性和专业实践与伦理方面表现不佳。医疗 LLM 在准确性方面出乎意料地低于通用模型,同时在生成略高质量的解释的同时,犯下了更多与上下文相关的错误。我们的研究结果突显了开发用于心理健康咨询的 AI 系统的复杂性,特别是对于需要同理心和上下文理解的能力。我们发现,前沿 LLM 在所有关键心理健康咨询能力方面的表现均超过了最低要求水平,但未能达到专家级水平,并且当前的医疗 LLM 在心理健康咨询能力方面并未显著优于通用模型。这强调了在考虑任何负责任的实际部署之前,迫切需要专门针对心理健康咨询进行微调的 LLM,这些模型应严格对齐核心能力,并结合适当的人类监督。

[NLP-57] AAAR-1.0: Assessing AIs Potential to Assist Research

【速读】: 该论文试图解决如何评估大型语言模型(LLMs)在科研工作中的表现,特别是针对科研任务如方程推理(EquationInference)、实验设计(ExperimentDesign)、论文弱点识别(PaperWeakness)和评审批评(REVIEWCRITIQUE)等需要深度专业知识的任务。解决方案的关键在于引入AAAR-1.0基准数据集,该数据集专门设计用于评估LLMs在这些科研任务中的表现,强调任务的研究导向性和研究者导向性,即任务设计紧密贴合科研人员的日常工作活动。通过这一数据集,论文揭示了LLMs在复杂科研任务中的潜力与局限性,并计划持续迭代更新AAAR-1.0以适应未来需求。

链接: https://arxiv.org/abs/2410.22394
作者: Renze Lou,Hanzi Xu,Sijia Wang,Jiangshu Du,Ryo Kamoi,Xiaoxin Lu,Jian Xie,Yuxuan Sun,Yusen Zhang,Jihyun Janice Ahn,Hongchao Fang,Zhuoyang Zou,Wenchao Ma,Xi Li,Kai Zhang,Congying Xia,Lifu Huang,Wenpeng Yin
关键词-EN: large language models, creative content generation, Numerous studies, facilitating everyday tasks, question answering
类目: Computation and Language (cs.CL)
备注: Project Webpage: this https URL

点击查看摘要

Abstract:Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in four fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; (iii) PaperWeakness, identifying weaknesses in paper submissions; and (iv) REVIEWCRITIQUE, identifying whether each segment in human reviews is deficient or not. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as limitations in conducting sophisticated research tasks. We will keep iterating AAAR-1.0 to new versions.
摘要:众多研究已经评估了AI系统,特别是大语言模型 (LLMs),在辅助日常任务如撰写邮件、问答和创意内容生成方面的能力。然而,研究人员在利用LLMs进行自己的工作时,如头脑风暴研究想法、设计实验以及撰写或审阅论文,面临着独特的挑战和机遇。在本研究中,我们引入了AAAR-1.0,这是一个基准数据集,旨在评估LLM在四项基础且需要专业知识的研究任务中的表现:(i) 方程推理 (EquationInference),评估基于论文提交中上下文信息的方程正确性;(ii) 实验设计 (ExperimentDesign),设计实验以验证研究想法和解决方案;(iii) 论文弱点识别 (PaperWeakness),识别论文提交中的弱点;以及 (iv) 审阅批评 (REVIEWCRITIQUE),识别人类审阅中每个部分是否存在不足。AAAR-1.0与之前的基准数据集在两个关键方面有所不同:首先,它明确面向研究,任务需要深厚的领域专业知识;其次,它面向研究人员,反映了研究人员日常主要活动的实际情况。对开源和专有LLMs的评估揭示了它们在执行复杂研究任务中的潜力和局限性。我们将持续迭代AAAR-1.0至新版本。

[NLP-58] Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance

【速读】: 该论文试图解决文本到图像生成模型(T2I)在生成罕见概念组合时表现不佳的问题。解决方案的关键在于利用大型语言模型(LLM)的指导,通过在扩散采样过程中暴露与目标罕见概念相关的频繁概念,从而显著增强扩散模型对这些罕见概念的组合生成能力。具体来说,论文提出了一种无需训练的方法,称为R2F,该方法通过LLM的丰富语义知识,在整个扩散推理过程中规划和执行从罕见概念到频繁概念的指导。这种方法不仅灵活适用于任何预训练的扩散模型和LLM,还能与区域引导的扩散方法无缝集成,从而在T2I对齐方面显著优于现有模型,如SD3.0和FLUX。

链接: https://arxiv.org/abs/2410.22376
作者: Dongmin Park,Sebin Kim,Taehong Moon,Minkyu Kim,Kangwook Lee,Jaewoong Cho
关键词-EN: Large Language Model, generate rare compositions, objects with unusual, unusual attributes, struggle to generate
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by the Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-to-frequent concept guidance throughout the diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with the region-guided diffusion approaches. Extensive experiments on three datasets, including our newly proposed benchmark, RareBench, containing various prompts with rare compositions of concepts, R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1%p in T2I alignment. Code is available at this https URL.
摘要:最先进的文本到图像 (Text-to-Image, T2I) 扩散模型在生成罕见概念的组合时常常遇到困难,例如具有不寻常属性的物体。本文展示了,通过大语言模型 (Large Language Model, LLM) 的引导,可以显著增强扩散模型在生成这些罕见概念上的组合能力。我们首先进行了实证和理论分析,证明在扩散采样过程中暴露与目标罕见概念相关的频繁概念,能够更准确地生成概念组合。基于此,我们提出了一种无需训练的方法,称为 R2F,该方法利用 LLM 中丰富的语义知识,在整个扩散推断过程中规划并执行从罕见概念到频繁概念的引导。我们的框架灵活适用于任何预训练的扩散模型和 LLM,并且可以无缝集成到区域引导的扩散方法中。在三个数据集上的广泛实验,包括我们新提出的基准 RareBench,该基准包含各种具有罕见概念组合的提示,R2F 在 T2I 对齐方面显著优于现有的模型,包括 SD3.0 和 FLUX,提升幅度高达 28.1%。代码可在以下链接获取:https URL。
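
R2F 的"由频到稀"引导可以抽象为一个提示调度:前段去噪步用 LLM 给出的相关频繁概念定形,后段切回稀有概念。下面的草图用固定映射代替 LLM 的建议,切换比例 0.4 为假设超参数,仅演示无需训练的调度形式,并非论文官方实现。

```python
# 概念示意:R2F 式"由频到稀"的提示调度(无需训练的示意实现)

FREQUENT_OF = {  # 假设由 LLM 提供:稀有概念 -> 语义相关的频繁概念
    "a hairy frog": "a frog",
    "a glass violin": "a violin",
}

def r2f_prompt_schedule(rare_prompt, total_steps, switch_ratio=0.4):
    """前 switch_ratio 比例的去噪步用频繁概念,其余步骤切回稀有概念。"""
    frequent = FREQUENT_OF.get(rare_prompt, rare_prompt)
    switch_at = int(total_steps * switch_ratio)
    return [frequent if step < switch_at else rare_prompt
            for step in range(total_steps)]

schedule = r2f_prompt_schedule("a hairy frog", total_steps=5)
```

真实系统中,该调度会作为每个去噪步的文本条件喂给扩散模型;切换时机本身也可由 LLM 按概念的罕见程度规划。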

[NLP-59] Rethinking Code Refinement: Learning to Judge Code Efficiency

【速读】: 该论文试图解决的问题是:由大型语言模型(LLMs)生成的代码优化版本并不总是比原始版本更高效,而每次运行两个不同版本的代码并进行比较既不理想又耗时。论文提出的解决方案之关键是:基于代码语言模型开发了一种新方法,该模型经过训练可以判断两个不同代码版本(由人类和机器生成)之间的效率差异,通过分类确定更高效的版本或预测相对改进。该方法在多种编程语言和多步优化中得到了验证,表明其能够有效区分更高效和低效的代码版本。

链接: https://arxiv.org/abs/2410.22375
作者: Minju Seo,Jinheon Baek,Sung Ju Hwang
关键词-EN: Large Language Models, demonstrated impressive capabilities, Large Language, demonstrated impressive, understanding and generating
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in understanding and generating codes. Due to these capabilities, many recent methods are proposed to automatically refine the codes with LLMs. However, we should rethink that the refined codes (from LLMs and even humans) are not always more efficient than their original versions. On the other hand, running two different versions of codes and comparing them every time is not ideal and time-consuming. Therefore, in this work, we propose a novel method based on the code language model that is trained to judge the efficiency between two different codes (generated across humans and machines) by either classifying the superior one or predicting the relative improvement. We validate our method on multiple programming languages with multiple refinement steps, demonstrating that the proposed method can effectively distinguish between more and less efficient versions of code.
摘要:大语言模型 (Large Language Models, LLMs) 在理解和生成代码方面展示了令人印象深刻的能力。由于这些能力,许多近期提出的方法利用 LLMs 来自动优化代码。然而,我们需要重新思考的是,经过 LLMs 甚至人类优化的代码并不总是比原始版本更高效。另一方面,每次运行两个不同版本的代码并进行比较既不理想也耗时。因此,在本研究中,我们提出了一种基于代码语言模型的新方法,该模型经过训练,能够判断两种不同代码(由人类和机器生成)之间的效率,通过分类出更优的代码或预测相对改进来实现。我们在多种编程语言和多个优化步骤上验证了我们的方法,结果表明,所提出的方法能够有效区分更高效和较低效的代码版本。
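
这种"成对比较、判别更高效版本"的接口可以用下面的草图说明。论文的方法是训练代码语言模型来做此判别;这里仅用"最大循环嵌套深度"作占位特征,演示比较式判别的输入输出形式,并非论文实现。

```python
# 概念示意:成对比较两个代码版本效率的判别接口(占位特征,非论文模型)

def max_loop_depth(code):
    """按缩进粗略估计最大循环嵌套深度(假设 4 空格缩进)。"""
    max_depth = 0
    for line in code.splitlines():
        stripped = line.strip()
        indent = (len(line) - len(line.lstrip())) // 4
        if stripped.startswith(("for ", "while ")):
            max_depth = max(max_depth, indent + 1)
    return max_depth

def judge_efficiency(code_a, code_b):
    """返回 'A'、'B' 或 'tie':占位模型认为哪个版本更高效。"""
    da, db = max_loop_depth(code_a), max_loop_depth(code_b)
    if da == db:
        return "tie"
    return "A" if da < db else "B"

quadratic = "for i in xs:\n    for j in xs:\n        pass\n"
linear = "seen = set()\nfor i in xs:\n    seen.add(i)\n"
verdict = judge_efficiency(linear, quadratic)
```

真实系统中,`judge_efficiency` 由在(人类与机器生成的)代码对上训练的语言模型替代,并可进一步预测相对改进幅度。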

[NLP-60] Survey of User Interface Design and Interaction Techniques in Generative AI Applications

【速读】: 该论文试图解决当前人机交互研究中对生成式 AI (Generative AI) 应用的用户界面设计和交互模式缺乏具体分析的问题。解决方案的关键在于通过全面调查用户与 AI 的交互方式,提出一套详尽的分类体系,涵盖用户引导的交互模式,并排除用户隐含信号的交互。论文旨在创建一个交互模式的汇编,为设计师和开发者提供参考,从而降低学习和设计生成式 AI 应用的门槛。

链接: https://arxiv.org/abs/2410.22370
作者: Reuben Luera,Ryan A. Rossi,Alexa Siu,Franck Dernoncourt,Tong Yu,Sungchul Kim,Ruiyi Zhang,Xiang Chen,Hanieh Salehy,Jian Zhao,Samyadeep Basu,Puneet Mathur,Nedim Lipka
关键词-EN: extremely impressive, user, user interaction patterns, generative, user interface designs
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The applications of generative AI have become extremely impressive, and the interplay between users and AI is even more so. Current human-AI interaction literature has taken a broad look at how humans interact with generative AI, but it lacks specificity regarding the user interface designs and patterns used to create these applications. Therefore, we present a survey that comprehensively presents taxonomies of how a human interacts with AI and the user interaction patterns designed to meet the needs of a variety of relevant use cases. We focus primarily on user-guided interactions, surveying interactions that are initiated by the user and do not include any implicit signals given by the user. With this survey, we aim to create a compendium of different user-interaction patterns that can be used as a reference for designers and developers alike. In doing so, we also strive to lower the entry barrier for those attempting to learn more about the design of generative AI applications.
摘要:生成式 AI (Generative AI) 的应用已经变得极为引人注目,而用户与 AI 之间的互动更是如此。当前关于人机交互的文献广泛探讨了人类如何与生成式 AI 互动,但对于用于创建这些应用的用户界面设计和模式缺乏具体性。因此,我们进行了一项调查,全面呈现了人类与 AI 互动的分类以及为满足各种相关用例需求而设计的用户交互模式。我们主要关注用户引导的交互,调查由用户发起的交互,不包括用户提供的任何隐含信号。通过这项调查,我们旨在创建一个不同用户交互模式的汇编,供设计师和开发者参考。同时,我们也努力降低那些希望深入了解生成式 AI 应用设计的人的入门门槛。

[NLP-61] ArxivDIGESTables: Synthesizing Scientific Literature into Tables using Language Models EMNLP2024

【速读】: 该论文试图解决自动生成文献综述表格的问题,解决方案的关键在于利用语言模型 (LMs) 将任务分解为模式生成和值生成两个步骤。具体来说,论文提出了一个框架,通过语言模型来实现这一目标,并解决了两个主要挑战:一是通过创建和发布arxivDIGESTables数据集来弥补高质量数据集的不足;二是开发了DecontextEval自动评估方法,以支持模型生成与人工参考表格之间的可扩展性评估。实验结果表明,额外的上下文信息(如表格标题和文本引用)有助于提高生成质量,并且即使在模型未能完全重构参考表格的情况下,其生成的创新方面仍然具有一定的实用性。

链接: https://arxiv.org/abs/2410.22360
作者: Benjamin Newman,Yoonjoo Lee,Aakanksha Naik,Pao Siangliulue,Raymond Fok,Juho Kim,Daniel S. Weld,Joseph Chee Chang,Kyle Lo
关键词-EN: conducting literature reviews, create literature review, scientists often create, literature review tables, rows are publications
类目: Computation and Language (cs.CL)
备注: EMNLP 2024, 21 pages, 8 figures, 10 tables

点击查看摘要

Abstract:When conducting literature reviews, scientists often create literature review tables - tables whose rows are publications and whose columns constitute a schema, a set of aspects used to compare and contrast the papers. Can we automatically generate these tables using language models (LMs)? In this work, we introduce a framework that leverages LMs to perform this task by decomposing it into separate schema and value generation steps. To enable experimentation, we address two main challenges: First, we overcome a lack of high-quality datasets to benchmark table generation by curating and releasing arxivDIGESTables, a new dataset of 2,228 literature review tables extracted from ArXiv papers that synthesize a total of 7,542 research papers. Second, to support scalable evaluation of model generations against human-authored reference tables, we develop DecontextEval, an automatic evaluation method that aligns elements of tables with the same underlying aspects despite differing surface forms. Given these tools, we evaluate LMs’ abilities to reconstruct reference tables, finding this task benefits from additional context to ground the generation (e.g. table captions, in-text references). Finally, through a human evaluation study we find that even when LMs fail to fully reconstruct a reference table, their generated novel aspects can still be useful.
摘要:在进行文献综述时,科学家通常会创建文献综述表——这些表的行是出版物,列则构成一个模式(schema),即用于比较和对比论文的一组方面。我们能否利用语言模型(LMs)来自动生成这些表格?在本研究中,我们提出了一种框架,通过将任务分解为独立的模式生成和值生成步骤,利用LMs来执行此任务。为了便于实验,我们解决了两个主要挑战:首先,我们通过整理并发布arxivDIGESTables,一个从ArXiv论文中提取的包含2,228个文献综述表的新数据集,解决了高质量数据集缺乏的问题,这些表格总共综合了7,542篇研究论文。其次,为了支持模型生成结果与人工编写的参考表格之间的可扩展评估,我们开发了DecontextEval,这是一种自动评估方法,能够在表面形式不同的情况下,将表格中的元素与相同的底层方面对齐。借助这些工具,我们评估了LMs重建参考表格的能力,发现额外的上下文(例如表格标题、文本中的引用)有助于生成任务。最后,通过一项人类评估研究,我们发现即使LMs未能完全重建参考表格,它们生成的创新方面仍然具有实用性。
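
"先模式、后取值"的两步分解可以用下面的草图说明:第一步归纳对比维度(列),第二步按模式逐篇填值(行)。真实框架中这两步均由语言模型在表格标题、文本引用等上下文条件下生成;这里用字典字段的交集作占位逻辑,数据为虚构示例。

```python
# 概念示意:文献综述表的"先模式、后取值"两步生成(占位逻辑,非论文实现)

def generate_schema(papers):
    """第一步:归纳所有论文共有的字段作为对比维度(列)。"""
    keys = set(papers[0])
    for p in papers[1:]:
        keys &= set(p)
    return sorted(keys - {"title"})

def generate_table(papers):
    """第二步:按模式逐篇填值;行是论文,列是对比维度。"""
    schema = generate_schema(papers)
    rows = {p["title"]: [p[col] for col in schema] for p in papers}
    return schema, rows

papers = [
    {"title": "Paper A", "task": "QA", "model": "BERT"},
    {"title": "Paper B", "task": "NER", "model": "T5", "extra": "..."},
]
schema, rows = generate_table(papers)
```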

[NLP-62] RuleRAG: Rule-guided retrieval-augmented generation with language models for question answering

【速读】: 该论文试图解决现有检索增强生成(Retrieval-augmented Generation, RAG)框架在知识密集型问答(QA)中,仅依赖查询本身进行检索和生成,而未明确指导检索器如何选择相关文档以及生成器如何引用这些文档的问题。解决方案的关键在于提出了一种基于规则的检索增强生成方法(Rule-Guided Retrieval-Augmented Generation with LMs, RuleRAG),通过引入符号规则作为上下文学习的示范(RuleRAG-ICL),指导检索器按规则逻辑方向检索相关文档,并统一指导生成器根据相同规则集生成答案。此外,查询与规则的组合可用于监督微调数据,更新检索器和生成器(RuleRAG-FT),以增强基于规则的指令遵循能力,从而检索更支持的结果并生成更可接受的答案。实验结果表明,无需训练的RuleRAG-ICL在检索质量和生成准确性上显著优于标准RAG,而进一步微调的RuleRAG-FT则持续带来更显著的性能提升。

链接: https://arxiv.org/abs/2410.22353
作者: Zhongwu Chen,Chengjin Xu,Dingmin Wang,Zhen Huang,Yong Dou,Jian Guo
关键词-EN: knowledge-intensive question answering, shown promising potential, retrieving external corpus, framework has shown, question answering
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) framework has shown promising potential in knowledge-intensive question answering (QA) by retrieving external corpus and generating based on augmented context. However, existing approaches only consider the query itself, neither specifying the retrieval preferences for the retrievers nor informing the generators of how to refer to the retrieved documents for the answers, which poses a significant challenge to the QA performance. To address these issues, we propose Rule-Guided Retrieval-Augmented Generation with LMs, which explicitly introduces symbolic rules as demonstrations for in-context learning (RuleRAG-ICL) to guide retrievers to retrieve logically related documents in the directions of rules and uniformly guide generators to generate answers attributed by the guidance of the same set of rules. Moreover, the combination of queries and rules can be further used as supervised fine-tuning data to update retrievers and generators (RuleRAG-FT) to achieve better rule-based instruction following capability, leading to retrieve more supportive results and generate more acceptable answers. To emphasize the attribution of rules, we construct five rule-aware QA benchmarks, including three temporal and two static scenarios, and equip RuleRAG with several kinds of retrievers and generators. Experiments demonstrate that training-free RuleRAG-ICL effectively improves the retrieval quality of +89.2% in Recall@10 scores and generation accuracy of +103.1% in exact match scores over standard RAG on average across the five benchmarks, and further fine-tuned RuleRAG-FT consistently yields more significant performance enhancement. Extensive analyses indicate that RuleRAG scales well with increasing numbers of retrieved documents and exhibits generalization ability for untrained rules.
摘要:检索增强生成 (Retrieval-augmented generation, RAG) 框架在知识密集型问答 (Question Answering, QA) 中展现了巨大的潜力,通过检索外部语料库并基于增强的上下文生成答案。然而,现有方法仅考虑查询本身,既未明确检索器的检索偏好,也未告知生成器如何参考检索到的文档来生成答案,这给 QA 性能带来了显著挑战。为解决这些问题,我们提出了基于规则的检索增强生成与大语言模型 (Rule-Guided Retrieval-Augmented Generation with LMs, RuleRAG),该方法明确引入符号规则作为上下文学习的示范 (RuleRAG-ICL),以指导检索器按照规则逻辑方向检索相关文档,并统一指导生成器根据同一组规则生成答案。此外,查询与规则的组合可进一步用作监督微调数据,以更新检索器和生成器 (RuleRAG-FT),从而实现更好的基于规则的指令遵循能力,进而检索到更支持的结果并生成更可接受的答案。为强调规则的归属,我们构建了五个规则感知的 QA 基准,包括三个时间场景和两个静态场景,并为 RuleRAG 配备了多种检索器和生成器。实验表明,无需训练的 RuleRAG-ICL 在五个基准上的平均 Recall@10 分数中有效提升了 89.2% 的检索质量和 103.1% 的生成准确性,而进一步微调的 RuleRAG-FT 则持续带来更显著的性能提升。广泛的分析表明,RuleRAG 随着检索文档数量的增加具有良好的扩展性,并展现出对未训练规则的泛化能力。
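
RuleRAG 的核心操作之一"把规则拼入查询、引导检索器沿规则逻辑方向取文档"可用如下草图说明。这里用词重叠打分代替真实的稠密检索器,规则与语料均为虚构示例,仅演示查询与规则组合后的检索形式。

```python
# 概念示意:RuleRAG 式"查询 + 规则"联合检索(词重叠打分为占位检索器)

def overlap_score(text, doc):
    return len(set(text.lower().split()) & set(doc.lower().split()))

def rule_guided_retrieve(query, rules, corpus, top_k=2):
    """把规则拼接进查询,使检索沿规则的逻辑方向展开。"""
    guided_query = query + " " + " ".join(rules)
    ranked = sorted(corpus, key=lambda d: overlap_score(guided_query, d),
                    reverse=True)
    return ranked[:top_k]

corpus = [
    "Paris is the capital of France",
    "The capital of a country hosts its government",
    "Bananas are yellow",
]
rules = ["if X is the capital of Y then the government of Y sits in X"]
docs = rule_guided_retrieve("Where does the government of France sit?", rules, corpus)
```

生成阶段可将同一组规则一并写入生成器的提示,使答案显式归因于规则,这正是 RuleRAG-ICL 的做法;RuleRAG-FT 则进一步用(查询, 规则)对微调检索器和生成器。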

[NLP-63] Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses

【Quick Read】: This paper investigates the practical limitations of Large Language Model (LLM)-based generative search engines (answer engines). The key to the approach is identifying 16 major limitations through a user study and automated evaluation, and proposing 16 design recommendations linked to 8 evaluation metrics. Using automated evaluation, the paper quantitatively analyzes three popular answer engines (Google, Bing, BingChat), confirms that these limitations exist, and shows how the engines differ in answer accuracy, citation accuracy, and other respects. Finally, the paper releases the Answer Engine Evaluation benchmark (AEE) to enable transparent evaluation of LLM-based applications.

Link: https://arxiv.org/abs/2410.22349
Authors: Pranav Narayanan Venkit,Philippe Laban,Yilun Zhou,Yixin Mao,Chien-Sheng Wu
Keywords-EN: Large Language Model, Large Language, Language Model, products serving millions, influencing how people
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large Language Model (LLM)-based applications are graduating from research prototypes to products serving millions of users, influencing how people write and consume information. A prominent example is the appearance of Answer Engines: LLM-based generative search engines supplanting traditional search engines. Answer engines not only retrieve relevant sources to a user query but synthesize answer summaries that cite the sources. To understand these systems’ limitations, we first conducted a study with 21 participants, evaluating interactions with answer vs. traditional search engines and identifying 16 answer engine limitations. From these insights, we propose 16 answer engine design recommendations, linked to 8 metrics. An automated evaluation implementing our metrics on three popular engines (this http URL, this http URL, BingChat) quantifies common limitations (e.g., frequent hallucination, inaccurate citation) and unique features (e.g., variation in answer confidence), with results mirroring user study insights. We release our Answer Engine Evaluation benchmark (AEE) to facilitate transparent evaluation of LLM-based applications.

[NLP-64] Efficient Machine Translation with a BiLSTM-Attention Approach

【Quick Read】: This paper addresses the problem of improving machine translation quality while reducing the model's storage requirements. The key to the solution is a novel Seq2Seq model that uses a Bidirectional Long Short-Term Memory network (Bi-LSTM) as the encoder to capture contextual information of the input sequence, and adds an attention mechanism to the decoder to strengthen the model's focus on key information during translation. Compared with the currently mainstream Transformer model, this model performs better on the WMT14 machine translation dataset while being smaller. A series of experiments verifies that the model significantly reduces storage requirements while preserving translation accuracy, which matters for translation applications in resource-constrained scenarios.

Link: https://arxiv.org/abs/2410.22335
Authors: Yuxu Wu,Yiren Xing
Keywords-EN: Natural Language Processing, development of Natural, Language Processing, Natural Language, model
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:With the rapid development of Natural Language Processing (NLP) technology, the accuracy and efficiency of machine translation have become hot topics of research. This paper proposes a novel Seq2Seq model aimed at improving translation quality while reducing the storage space required by the model. The model employs a Bidirectional Long Short-Term Memory network (Bi-LSTM) as the encoder to capture the context information of the input sequence; the decoder incorporates an attention mechanism, enhancing the model’s ability to focus on key information during the translation process. Compared to the current mainstream Transformer model, our model achieves superior performance on the WMT14 machine translation dataset while maintaining a smaller size. The study first introduces the design principles and innovative points of the model architecture, followed by a series of experiments to verify the effectiveness of the model. The experiments include an assessment of the model’s performance on different language pairs, as well as comparative analysis with traditional Seq2Seq models. The results show that while maintaining translation accuracy, our model significantly reduces the storage requirements, which is of great significance for translation applications in resource-constrained scenarios. Our code is available at this https URL. Thanks for the support provided by the MindSpore Community.
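
The attention step described in the abstract can be illustrated in isolation. This is a minimal, framework-free sketch of dot-product attention over encoder hidden states (not the authors' implementation; the toy vectors are arbitrary):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: score each encoder hidden state against the
    decoder state, normalize with softmax, and return the weighted context."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: 3 encoder time steps, hidden size 2.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec = [1.0, 0.0]
weights, context = attention_context(dec, enc)
```

In the paper's setting, the encoder states would come from the Bi-LSTM and the decoder state from the decoder's recurrent cell at each generation step.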

[NLP-65] Sing it, Narrate it: Quality Musical Lyrics Translation

【Quick Read】: This paper addresses the tension between translation quality and singability requirements when translating musical lyrics. The key to the solution is a three-step method: first, create a dataset to train reward models for automatic evaluation of translation quality; second, improve both translation quality and singability through a two-stage training process with filtering techniques; finally, introduce an inference-time optimization framework for translating entire songs. Experimental results show the method significantly outperforms baselines in both translation quality and singability.

Link: https://arxiv.org/abs/2410.22066
Authors: Zhuorui Ye,Jinhan Li,Rongwu Xu
Keywords-EN: presents unique challenges, unique challenges due, musicals presents unique, ensure high translation, translation quality
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Translating lyrics for musicals presents unique challenges due to the need to ensure high translation quality while adhering to singability requirements such as length and rhyme. Existing song translation approaches often prioritize these singability constraints at the expense of translation quality, which is crucial for musicals. This paper aims to enhance translation quality while maintaining key singability features. Our method consists of three main components. First, we create a dataset to train reward models for the automatic evaluation of translation quality. Second, to enhance both singability and translation quality, we implement a two-stage training process with filtering techniques. Finally, we introduce an inference-time optimization framework for translating entire songs. Extensive experiments, including both automatic and human evaluations, demonstrate significant improvements over baseline methods and validate the effectiveness of each component in our approach.

[NLP-66] A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation NEURIPS2024

【Quick Read】: This paper addresses the problem of reconstructing high-quality audio waveforms from coarse neural codec tokens. The key to the solution lies in the choice of learning target and resynthesis approach. The paper studies two strategies based on token prediction and regression, and introduces a new method based on the Schrödinger Bridge. By comparing how different design choices affect machine and human perception, the paper shows that these choices have a significant impact on generated audio quality.

Link: https://arxiv.org/abs/2410.22448
Authors: Alexander H. Liu,Qirui Wang,Yuan Gong,James Glass
Keywords-EN: initially designed, compression technique, Neural Audio Codecs, gained more attention, attention recently
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: NeurIPS 2024 Audio Imagination workshop paper; demo page at this https URL

Abstract:Neural Audio Codecs, initially designed as a compression technique, have gained more attention recently for speech generation. Codec models represent each audio frame as a sequence of tokens, i.e., discrete embeddings. The discrete and low-frequency nature of neural codecs introduced a new way to generate speech with token-based models. As these tokens encode information at various levels of granularity, from coarse to fine, most existing works focus on how to better generate the coarse tokens. In this paper, we focus on an equally important but often overlooked question: How can we better resynthesize the waveform from coarse tokens? We point out that both the choice of learning target and resynthesis approach have a dramatic impact on the generated audio quality. Specifically, we study two different strategies based on token prediction and regression, and introduce a new method based on Schrödinger Bridge. We examine how different design choices affect machine and human perception.

Artificial Intelligence

[AI-0] Provable acceleration for diffusion models under minimal assumptions

Link: https://arxiv.org/abs/2410.23285
Authors: Gen Li,Changxiao Cai
Keywords-EN: exceptional sampling quality, achieved exceptional sampling, high computational burden, score-based diffusion models, score function evaluations
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:While score-based diffusion models have achieved exceptional sampling quality, their sampling speeds are often limited by the high computational burden of score function evaluations. Despite the recent remarkable empirical advances in speeding up the score-based samplers, theoretical understanding of acceleration techniques remains largely limited. To bridge this gap, we propose a novel training-free acceleration scheme for stochastic samplers. Under minimal assumptions – namely, L²-accurate score estimates and a finite second-moment condition on the target distribution – our accelerated sampler provably achieves ε-accuracy in total variation within Õ(d^{5/4}/√ε) iterations, thereby significantly improving upon the Õ(d/ε) iteration complexity of standard score-based samplers. Notably, our convergence theory does not rely on restrictive assumptions on the target distribution or higher-order score estimation guarantees.

[AI-1] A Neural Transformer Framework for Simultaneous Tasks of Segmentation Classification and Caller Identification of Marmoset Vocalization

Link: https://arxiv.org/abs/2410.23279
Authors: Bin Wu,Sakriani Sakti,Shinnosuke Takamichi,Satoshi Nakamura
Keywords-EN: highly vocalized primate, studying social-communicative behavior, popular animal model, vocalized primate, highly vocalized
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Marmoset, a highly vocalized primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanism. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work using a CNN achieved a joint model for call segmentation, classification, and caller identification of marmoset vocalizations. However, CNNs have limitations in modeling long-range acoustic patterns; the Transformer architecture, which has been shown to outperform CNNs, uses a self-attention mechanism that efficiently processes information in parallel over long distances and captures the global structure of marmoset vocalizations. We propose using the Transformer to jointly segment and classify the marmoset calls and identify the callers for each vocalization.

[AI-2] Multi-student Diffusion Distillation for Better One-step Generators

Link: https://arxiv.org/abs/2410.23274
Authors: Yanke Song,Jonathan Lorraine,Weili Nie,Karsten Kreis,James Lucas
Keywords-EN: achieve high-quality sample, multistep inference procedure, models achieve high-quality, lengthy multistep inference, Diffusion models achieve
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Diffusion models achieve high-quality sample generation at the cost of a lengthy multistep inference procedure. To overcome this, diffusion distillation techniques produce student generators capable of matching or surpassing the teacher in a single step. However, the student model’s inference speed is limited by the size of the teacher architecture, preventing real-time generation for computationally heavy applications. In this work, we introduce Multi-Student Distillation (MSD), a framework to distill a conditional teacher diffusion model into multiple single-step generators. Each student generator is responsible for a subset of the conditioning data, thereby obtaining higher generation quality for the same capacity. MSD trains multiple distilled students, allowing smaller sizes and, therefore, faster inference. Also, MSD offers a lightweight quality boost over single-student distillation with the same architecture. We demonstrate MSD is effective by training multiple same-sized or smaller students on single-step distillation using distribution matching and adversarial distillation techniques. With smaller students, MSD gets competitive results with faster inference for single-step generation. Using 4 same-sized students, MSD sets a new state-of-the-art for one-step image generation: FID 1.20 on ImageNet-64x64 and 8.20 on zero-shot COCO2014.
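
The partitioning idea above (each student owns a subset of the conditioning data) can be caricatured with a toy router. The names and the modulo-based partition below are invented for illustration, and the distillation itself is omitted:

```python
# Hypothetical routing for multi-student distillation: each single-step
# student is responsible for a fixed subset of the conditions (here, labels).

def assign_students(num_classes, num_students):
    """Deterministically partition class labels across students."""
    return {c: c % num_students for c in range(num_classes)}

def route(condition, students, table):
    """Pick the student generator responsible for this condition."""
    return students[table[condition]]

# Toy 'students' are strings standing in for distilled one-step generators.
students = ["student-0", "student-1", "student-2", "student-3"]
table = assign_students(num_classes=10, num_students=len(students))
chosen = route(7, students, table)
```

In practice the subsets would likely be chosen to balance data volume per student, and each student would be distilled only on samples whose condition falls in its subset.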

[AI-3] Proportional Fairness in Non-Centroid Clustering NEURIPS2024

Link: https://arxiv.org/abs/2410.23273
Authors: Ioannis Caragiannis,Evi Micha,Nisarg Shah
Keywords-EN: provide group fairness, group fairness guarantees, proportionally fair clustering, recently developed framework, provide group
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: A preliminary version appeared at NeurIPS 2024

Abstract:We revisit the recently developed framework of proportionally fair clustering, where the goal is to provide group fairness guarantees that become stronger for groups of data points (agents) that are large and cohesive. Prior work applies this framework to centroid clustering, where the loss of an agent is its distance to the centroid assigned to its cluster. We expand the framework to non-centroid clustering, where the loss of an agent is a function of the other agents in its cluster, by adapting two proportional fairness criteria – the core and its relaxation, fully justified representation (FJR) – to this setting. We show that the core can be approximated only under structured loss functions, and even then, the best approximation we are able to establish, using an adaptation of the GreedyCapture algorithm developed for centroid clustering [Chen et al., 2019; Micha and Shah, 2020], is unappealing for a natural loss function. In contrast, we design a new (inefficient) algorithm, GreedyCohesiveClustering, which achieves the relaxation FJR exactly under arbitrary loss functions, and show that the efficient GreedyCapture algorithm achieves a constant approximation of FJR. We also design an efficient auditing algorithm, which estimates the FJR approximation of any given clustering solution up to a constant factor. Our experiments on real data suggest that traditional clustering algorithms are highly unfair, whereas GreedyCapture is considerably fairer and incurs only a modest loss in common clustering objectives.
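
To make the greedy-capture idea referenced above concrete, here is a toy sketch in the spirit of that algorithm family: repeatedly capture the tightest group of about n/k unassigned points around some candidate center. Tie-breaking and the exact ball-growing formulation are simplified, so this is illustrative only, not the paper's algorithm:

```python
import math

def greedy_capture(points, k):
    """Toy greedy capture on 1-D points: repeatedly pick the candidate
    center whose ceil(n/k)-th nearest unassigned point is closest, and
    capture that group as a cluster, until every point is assigned."""
    n = len(points)
    size = math.ceil(n / k)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        m = min(size, len(unassigned))
        best = None
        for c in unassigned:
            nearest = sorted(unassigned, key=lambda i: abs(points[i] - points[c]))[:m]
            radius = abs(points[nearest[-1]] - points[c])
            if best is None or radius < best[0]:
                best = (radius, nearest)
        clusters.append(best[1])
        unassigned -= set(best[1])
    return clusters

# Two well-separated groups of three points each should be captured cleanly.
clusters = greedy_capture([0.0, 0.1, 0.2, 5.0, 5.1, 5.2], k=2)
```

The fairness intuition is that a large, cohesive group of agents gets captured together early, rather than being split across clusters.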

[AI-4] A Monte Carlo Framework for Calibrated Uncertainty Estimation in Sequence Prediction

Link: https://arxiv.org/abs/2410.23272
Authors: Qidong Yang,Weicheng Zhu,Joseph Keslin,Laure Zanna,Tim G. J. Rudner,Carlos Fernandez-Granda
Keywords-EN: key challenge, risk-sensitive applications, Monte Carlo, Probabilistic prediction, Monte Carlo framework
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Probabilistic prediction of sequences from images and other high-dimensional data is a key challenge, particularly in risk-sensitive applications. In these settings, it is often desirable to quantify the uncertainty associated with the prediction (instead of just determining the most likely sequence, as in language modeling). In this paper, we propose a Monte Carlo framework to estimate probabilities and confidence intervals associated with the distribution of a discrete sequence. Our framework uses a Monte Carlo simulator, implemented as an autoregressively trained neural network, to sample sequences conditioned on an image input. We then use these samples to estimate the probabilities and confidence intervals. Experiments on synthetic and real data show that the framework produces accurate discriminative predictions, but can suffer from miscalibration. In order to address this shortcoming, we propose a time-dependent regularization method, which is shown to produce calibrated predictions.
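
The estimation step can be illustrated generically: draw sequences from a black-box sampler (standing in for the autoregressive network) and estimate an event probability with a confidence interval. The normal-approximation interval below is one common choice, not necessarily the paper's:

```python
import math, random

def estimate_event_probability(sampler, event, n_samples=10_000, z=1.96):
    """Draw sequences from a Monte Carlo sampler and estimate P(event)
    with a normal-approximation 95% confidence interval."""
    hits = sum(event(sampler()) for _ in range(n_samples))
    p = hits / n_samples
    half = z * math.sqrt(p * (1 - p) / n_samples)
    return p, (max(0.0, p - half), min(1.0, p + half))

# Toy sampler standing in for a conditioned autoregressive model:
# each of 5 steps independently emits 1 with probability 0.3.
rng = random.Random(0)
def sampler():
    return [int(rng.random() < 0.3) for _ in range(5)]

# Event: the sequence contains at least one 1. True prob = 1 - 0.7^5 ~ 0.832.
p, (lo, hi) = estimate_event_probability(sampler, lambda s: 1 in s)
```

In the paper's setting the sampler would be conditioned on an image input, and the calibration of such estimates is exactly what their time-dependent regularization targets.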

[AI-5] Keypoint Abstraction using Large Models for Object-Relative Imitation Learning

Link: https://arxiv.org/abs/2410.23254
Authors: Xiaolin Fang,Bo-Ruei Huang,Jiayuan Mao,Jasmine Shone,Joshua B. Tenenbaum,Tomás Lozano-Pérez,Leslie Pack Kaelbling
Keywords-EN: challenge in robotics, critical challenge, Generalization, object configurations, diverse tasks
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: CoRL LangRob Workshop, 2024

Abstract:Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have been proven effective as a succinct representation for capturing essential object features, and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their manual design nature and reliance on additional human labels limit their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifies them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Website: this https URL

[AI-6] A little less conversation, a little more action please: Investigating the physical common-sense of LLMs in a 3D embodied environment

Link: https://arxiv.org/abs/2410.23242
Authors: Matteo G. Mecattaf,Ben Slater,Marko Tešić,Jonathan Prunty,Konstantinos Voudouris,Lucy G. Cheke
Keywords-EN: Large Language Models, Large Language, Language Models, reason about everyday, physical
Subjects: Artificial Intelligence (cs.AI)
Comments: 25 pages, 4 figures

Abstract:As general-purpose tools, Large Language Models (LLMs) must often reason about everyday physical environments. In a question-and-answer capacity, understanding the interactions of physical objects may be necessary to give appropriate responses. Moreover, LLMs are increasingly used as reasoning engines in agentic systems, designing and controlling their action sequences. The vast majority of research has tackled this issue using static benchmarks, comprised of text or image-based questions about the physical world. However, these benchmarks do not capture the complexity and nuance of real-life physical processes. Here we advocate for a second, relatively unexplored, approach: ‘embodying’ the LLMs by granting them control of an agent within a 3D environment. We present the first embodied and cognitively meaningful evaluation of physical common-sense reasoning in LLMs. Our framework allows direct comparison of LLMs with other embodied agents, such as those based on Deep Reinforcement Learning, and human and non-human animals. We employ the Animal-AI (AAI) environment, a simulated 3D virtual laboratory, to study physical common-sense reasoning in LLMs. For this, we use the AAI Testbed, a suite of experiments that replicate laboratory studies with non-human animals, to study physical reasoning capabilities including distance estimation, tracking out-of-sight objects, and tool use. We demonstrate that state-of-the-art multi-modal models with no finetuning can complete this style of task, allowing meaningful comparison to the entrants of the 2019 Animal-AI Olympics competition and to human children. Our results show that LLMs are currently outperformed by human children on these tasks. We argue that this approach allows the study of physical reasoning using ecologically valid experiments drawn directly from cognitive science, improving the predictability and reliability of LLMs.

[AI-7] EMOTION: Expressive Motion Sequence Generation for Humanoid Robots with In-Context Learning

Link: https://arxiv.org/abs/2410.23234
Authors: Peide Huang,Yuhan Hu,Nataliya Nechyporenko,Daehwa Kim,Walter Talbott,Jian Zhang
Keywords-EN: humanlike non-verbal communication, enhancing their ability, paper introduces, ability to engage, engage in humanlike
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces a framework, called EMOTION, for generating expressive motion sequences in humanoid robots, enhancing their ability to engage in humanlike non-verbal communication. Non-verbal cues such as facial expressions, gestures, and body movements play a crucial role in effective interpersonal interactions. Despite the advancements in robotic behaviors, existing methods often fall short in mimicking the diversity and subtlety of human non-verbal communication. To address this gap, our approach leverages the in-context learning capability of large language models (LLMs) to dynamically generate socially appropriate gesture motion sequences for human-robot interaction. We use this framework to generate 10 different expressive gestures and conduct online user studies comparing the naturalness and understandability of the motions generated by EMOTION and its human-feedback version, EMOTION++, against those by human operators. The results demonstrate that our approach either matches or surpasses human performance in generating understandable and natural robot motions under certain scenarios. We also provide design implications for future research to consider a set of variables when generating expressive robotic gestures.

[AI-8] Aligning Audio-Visual Joint Representations with an Agentic Workflow

Link: https://arxiv.org/abs/2410.23230
Authors: Shentong Mo,Yibing Song
Keywords-EN: signals naturally formulate, audio, naturally formulate, data, audio signals
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Visual content and accompanied audio signals naturally formulate a joint representation to improve audio-visual (AV) related applications. While studies develop various AV representation learning frameworks, the importance of AV data alignment is usually undermined for achieving high-quality representation. We observe that an audio signal may contain background noise interference. Also, non-synchronization may appear between audio and video streams. These non-strict data alignment limits representation quality and downgrade application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. Our alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, our AVAgent uses a multi-modal LLM to convert audio and visual data into language descriptions separately (i.e., tool use). Then, AVAgent reasons whether this paired data is aligned well and plans to edit the audio signal if needed (i.e., planning). The audio editing is executed by predefined actions that filter noise or augment data. Moreover, we use a VLM to evaluate how modified audio signals match the visual content and provide feedback to AVAgent (i.e., reflection). The tool use, planning, and reflection steps operate cyclically to become an agentic workflow where audio signals are gradually aligned to visual content. To this end, existing methods can directly leverage the aligned AV data via our agentic workflow to improve AV joint representations. The experimental results comprehensively demonstrate the state-of-the-art performance of the proposed approach against previous baselines in diverse downstream tasks.
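
The cyclic tool-use, planning, and reflection workflow described above can be caricatured with stub functions. Everything below is invented for illustration; the real system calls multi-modal LLM/VLM services rather than string operations:

```python
# Schematic of a tool-use -> planning -> reflection loop in the style of the
# described AVAgent workflow. All components are stubs for illustration.

def describe(audio, video):
    """'Tool use': convert each modality to a language description."""
    return f"audio:{audio}", f"video:{video}"

def plan_edit(audio_desc, video_desc):
    """'Planning': decide whether the audio needs an edit."""
    return "denoise" if "noisy" in audio_desc else None

def apply_edit(audio, action):
    """Execute a predefined editing action on the audio."""
    return audio.replace("noisy ", "") if action == "denoise" else audio

def reflect(audio_desc, video_desc):
    """'Reflection': check whether the audio now matches the visual content."""
    return audio_desc.split(":", 1)[1] == video_desc.split(":", 1)[1]

def align(audio, video, max_rounds=3):
    for _ in range(max_rounds):
        a_desc, v_desc = describe(audio, video)
        if reflect(a_desc, v_desc):
            return audio, True
        action = plan_edit(a_desc, v_desc)
        if action is None:
            return audio, False
        audio = apply_edit(audio, action)
    return audio, False

aligned_audio, ok = align("noisy dog barking", "dog barking")
```

The point of the sketch is the cycle itself: description, alignment check, and editing repeat until the audio signal matches the visual content or the round budget runs out.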

[AI-9] Partial Channel Dependence with Channel Masks for Time Series Foundation Models NEURIPS

Link: https://arxiv.org/abs/2410.23222
Authors: Seunghan Lee,Taeyoung Park,Kibok Lee
Keywords-EN: Recent advancements, time series, successfully extended, emergence of large-scale, Recent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: NeurIPS Workshop on Time Series in the Age of Large Models, 2024. Oral presentation

Abstract:Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of large-scale TS datasets. However, previous efforts have primarily focused on designing model architectures to address explicit heterogeneity among datasets such as various numbers of channels, while often overlooking implicit heterogeneity such as varying dependencies between channels. In this work, we introduce the concept of partial channel dependence (PCD), which enables a more sophisticated adjustment of channel dependencies based on dataset-specific information. To achieve PCD, we propose a channel mask that captures the relationships between channels within a dataset using two key components: 1) a correlation matrix that encodes relative dependencies between channels, and 2) domain parameters that learn the absolute dependencies specific to each dataset, refining the correlation matrix. We validate the effectiveness of PCD across four tasks in TS including forecasting, classification, imputation, and anomaly detection, under diverse settings, including few-shot and zero-shot scenarios with both TS foundation models and single-task models. Code is available at this https URL.
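
A minimal numeric sketch of such a channel mask follows. The combination rule below (sigmoid-gated per-dataset parameters scaling an absolute correlation matrix) is an assumption for illustration, not the paper's exact formulation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_mask(corr, domain_params):
    """Refine a channel-correlation matrix with dataset-specific learned
    parameters: mask[i][j] = sigmoid(theta_i) * sigmoid(theta_j) * |corr[i][j]|.
    This gating rule is hypothetical, chosen only to show the mechanism."""
    g = [sigmoid(t) for t in domain_params]
    c = len(corr)
    return [[g[i] * g[j] * abs(corr[i][j]) for j in range(c)] for i in range(c)]

# Toy 3-channel correlation matrix and dataset-specific domain parameters;
# channel 2 gets a low gate, so its dependencies are suppressed.
corr = [[1.0, 0.8, -0.1],
        [0.8, 1.0, 0.2],
        [-0.1, 0.2, 1.0]]
mask = channel_mask(corr, domain_params=[2.0, 2.0, -2.0])
```

The resulting mask interpolates between channel independence (all gates near 0) and full channel dependence (all gates near 1), which is the "partial" dependence the abstract describes.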

[AI-10] DiaMond: Dementia Diagnosis with Multi-Modal Vision Transformers Using MRI and PET WACV

Link: https://arxiv.org/abs/2410.23219
Authors: Yitong Li,Morteza Ghahremani,Youssef Wally,Christian Wachinger
Keywords-EN: Alzheimer Disease, Diagnosing dementia, frontotemporal dementia, overlapping symptoms, complex due
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

Abstract:Diagnosing dementia, particularly for Alzheimer’s Disease (AD) and frontotemporal dementia (FTD), is complex due to overlapping symptoms. While magnetic resonance imaging (MRI) and positron emission tomography (PET) data are critical for the diagnosis, integrating these modalities in deep learning faces challenges, often resulting in suboptimal performance compared to using single modalities. Moreover, the potential of multi-modal approaches in differential diagnosis, which holds significant clinical importance, remains largely unexplored. We propose a novel framework, DiaMond, to address these issues with vision Transformers to effectively integrate MRI and PET. DiaMond is equipped with self-attention and a novel bi-attention mechanism that synergistically combine MRI and PET, alongside a multi-modal normalization to reduce redundant dependency, thereby boosting the performance. DiaMond significantly outperforms existing multi-modal methods across various datasets, achieving a balanced accuracy of 92.4% in AD diagnosis, 65.2% for AD-MCI-CN classification, and 76.5% in differential diagnosis of AD and FTD. We also validated the robustness of DiaMond in a comprehensive ablation study. The code is available at this https URL.

[AI-11] Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval

Link: https://arxiv.org/abs/2410.23214
Authors: Sheryl Hsu,Omar Khattab,Chelsea Finn,Archit Sharma
Keywords-EN: large language models, language models, real sources, hallucinations of large, large language
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The hallucinations of large language models (LLMs) are increasingly mitigated by allowing LLMs to search for information and to ground their answers in real sources. Unfortunately, LLMs often struggle with posing the right search queries, especially when dealing with complex or otherwise indirect topics. Observing that LLMs can learn to search for relevant facts by trying different queries and learning to up-weight queries that successfully produce relevant results, we introduce Learning to Retrieve by Trying (LeReT), a reinforcement learning framework that explores search queries and uses preference-based optimization to improve their quality. LeReT can improve the absolute retrieval accuracy by up to 29% and the downstream generator evaluations by 17%. The simplicity and flexibility of LeReT allow it to be applied to arbitrary off-the-shelf retrievers and make it a promising technique for improving general LLM pipelines. Project website: this http URL.
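
The query-exploration idea lends itself to a simple sketch: try several candidate queries, score each by retrieval success, and form (better, worse) pairs for preference-based optimization. The scoring function, corpus, and query names below are hypothetical, and the preference-optimization step itself (e.g. DPO-style training) is omitted:

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant documents found in the top-k retrieved ids."""
    return len(set(retrieved[:k]) & set(relevant)) / max(1, len(relevant))

def build_preference_pairs(candidate_queries, retrieve, relevant):
    """Try several search queries, score each by retrieval success, and form
    (better, worse) query pairs for preference-based optimization."""
    scored = sorted(
        ((recall_at_k(retrieve(q), relevant), q) for q in candidate_queries),
        reverse=True,
    )
    best_score, best_q = scored[0]
    return [(best_q, q) for s, q in scored[1:] if s < best_score]

# Toy corpus: each query 'retrieves' a fixed list of document ids.
corpus = {"q_broad": ["d1", "d9"], "q_specific": ["d1", "d2", "d3"]}
pairs = build_preference_pairs(
    ["q_broad", "q_specific"],
    retrieve=lambda q: corpus[q],
    relevant=["d1", "d2", "d3"],
)
```

Each pair would then be used to up-weight the query-generation policy toward the more successful query, matching the trial-and-error framing in the abstract.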

[AI-12] Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks

Link: https://arxiv.org/abs/2410.23208
Authors: Michael Matthews,Michael Beukman,Chris Lu,Jakob Foerster
Keywords-EN: sequential decision problems, decision problems remains, shown remarkable capabilities, image domains, open challenge
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The first two authors contributed equally. Project page located at: this https URL

Abstract:While large models trained with self-supervised learning on offline datasets have shown remarkable capabilities in text and image domains, achieving the same generalisation for agents that act in sequential decision problems remains an open challenge. In this work, we take a step towards this goal by procedurally generating tens of millions of 2D physics-based tasks and using these to train a general reinforcement learning (RL) agent for physical control. To this end, we introduce Kinetix: an open-ended space of physics-based RL environments that can represent tasks ranging from robotic locomotion and grasping to video games and classic RL environments, all within a unified framework. Kinetix makes use of our novel hardware-accelerated physics engine Jax2D that allows us to cheaply simulate billions of environment steps during training. Our trained agent exhibits strong physical reasoning capabilities, being able to zero-shot solve unseen human-designed environments. Furthermore, fine-tuning this general agent on tasks of interest shows significantly stronger performance than training an RL agent tabula rasa. This includes solving some environments that standard RL training completely fails at. We believe this demonstrates the feasibility of large scale, mixed-quality pre-training for online RL and we hope that Kinetix will serve as a useful framework to investigate this further.

[AI-13] ReasoningRec: Bridging Personalized Recommendations and Human-Interpretable Explanations through LLM Reasoning NAACL2025

链接: https://arxiv.org/abs/2410.23180
作者: Millennium Bismay,Xiangjue Dong,James Caverlee
关键词-EN: leverages Large Language, Large Language Models, Large Language, leverages Large, Language Models
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Large Language Model, Recommendation, Human-Interpretable Reasoning, Personalization, Submitted for NAACL 2025

点击查看摘要

Abstract:This paper presents ReasoningRec, a reasoning-based recommendation framework that leverages Large Language Models (LLMs) to bridge the gap between recommendations and human-interpretable explanations. In contrast to conventional recommendation systems that rely on implicit user-item interactions, ReasoningRec employs LLMs to model users and items, focusing on preferences, aversions, and explanatory reasoning. The framework utilizes a larger LLM to generate synthetic explanations for user preferences, subsequently used to fine-tune a smaller LLM for enhanced recommendation accuracy and human-interpretable explanation. Our experimental study investigates the impact of reasoning and contextual information on personalized recommendations, revealing that the quality of contextual and personalized data significantly influences the LLM’s capacity to generate plausible explanations. Empirical evaluations demonstrate that ReasoningRec surpasses state-of-the-art methods by up to 12.5% in recommendation prediction while concurrently providing human-intelligible explanations. The code is available here: this https URL.

[AI-14] FlexTSF: A Universal Forecasting Model for Time Series with Variable Regularities

链接: https://arxiv.org/abs/2410.23160
作者: Jingge Xiao,Yile Chen,Gao Cong,Wolfgang Nejdl,Simon Gottschalk
关键词-EN: time series forecasting, time series, Developing a foundation, irregular time series, attracted significant attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Developing a foundation model for time series forecasting across diverse domains has attracted significant attention in recent years. Existing works typically assume regularly sampled, well-structured data, limiting their applicability to more generalized scenarios where time series often contain missing values, unequal sequence lengths, and irregular time intervals between measurements. To cover diverse domains and handle variable regularities, we propose FlexTSF, a universal time series forecasting model that possesses better generalization and natively supports both regular and irregular time series. FlexTSF produces forecasts in an autoregressive manner and incorporates three novel designs: VT-Norm, a normalization strategy to ablate data domain barriers, IVP Patcher, a patching module to learn representations from flexibly structured time series, and LED attention, an attention mechanism to seamlessly integrate these two and propagate forecasts with awareness of domain and time information. Experiments on 12 datasets show that FlexTSF outperforms state-of-the-art forecasting models designed for regular and irregular time series, respectively. Furthermore, after self-supervised pre-training, FlexTSF shows exceptional performance in both zero-shot and few-shot settings for time series forecasting.
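
IVP Patcher 的具体设计见原文;这里仅给出“把不规则采样序列切成定长 patch、并显式保留相邻时间间隔”这一思路的最小示意(函数名与切分方式均为假设,非论文实现):

```python
def make_irregular_patches(times, values, patch_len=3):
    """把 (时间戳, 观测值) 序列切成定长 patch,并把相邻间隔作为特征保留。"""
    patches = []
    for i in range(0, len(values) - patch_len + 1, patch_len):
        ts = times[i:i + patch_len]
        patches.append({
            "values": values[i:i + patch_len],
            "deltas": [b - a for a, b in zip(ts, ts[1:])],  # 间隔不等长 → 不规则性被显式编码
        })
    return patches

# 以分钟为单位的不规则时间戳
p = make_irregular_patches([0, 1, 3, 6, 7, 10], [1.2, 1.5, 1.1, 0.9, 1.4, 1.3])
assert len(p) == 2
assert p[0]["deltas"] == [1, 2] and p[1]["deltas"] == [1, 3]
```

与假设等间隔采样的常规 patching 不同,时间间隔被当作输入特征,模型因此能同时处理规则与不规则序列。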

[AI-15] Fourier Amplitude and Correlation Loss: Beyond Using L2 Loss for Skillful Precipitation Nowcasting NEURIPS2024

链接: https://arxiv.org/abs/2410.23159
作者: Chiu-Wai Yan,Shi Quan Foo,Van Hoan Trinh,Dit-Yan Yeung,Ka-Hing Wong,Wai-Kin Wong
关键词-EN: Deep learning approaches, Deep learning, Fourier Correlation Loss, Fourier Amplitude Loss, Fourier Amplitude
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024. Camera-ready submission

点击查看摘要

Abstract:Deep learning approaches have been widely adopted for precipitation nowcasting in recent years. Previous studies mainly focus on proposing new model architectures to improve pixel-wise metrics. However, they frequently result in blurry predictions which provide limited utility to forecasting operations. In this work, we propose a new Fourier Amplitude and Correlation Loss (FACL) which consists of two novel loss terms: Fourier Amplitude Loss (FAL) and Fourier Correlation Loss (FCL). FAL regularizes the Fourier amplitude of the model prediction and FCL complements the missing phase information. The two loss terms work together to replace the traditional L_2 losses such as MSE and weighted MSE for the spatiotemporal prediction problem on signal-based data. Our method is generic, parameter-free and efficient. Extensive experiments using one synthetic dataset and three radar echo datasets demonstrate that our method improves perceptual metrics and meteorology skill scores, with a small trade-off to pixel-wise accuracy and structural similarity. Moreover, to improve the error margin in meteorological skill scores such as Critical Success Index (CSI) and Fractions Skill Score (FSS), we propose and adopt the Regional Histogram Divergence (RHD), a distance metric that considers the patch-wise similarity between signal-based imagery patterns with tolerance to local transforms. Code is available at this https URL
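
FAL/FCL 的完整定义见论文;下面用 NumPy 给出这两项损失的一个最小示意(归一化方式为假设,非官方实现),并验证“幅度谱对空间平移不敏感”这一关键性质:

```python
import numpy as np

def fourier_amplitude_loss(pred, target):
    """FAL 示意:比较两幅图像二维傅里叶变换的幅度谱。"""
    return float(np.mean((np.abs(np.fft.fft2(pred)) - np.abs(np.fft.fft2(target))) ** 2))

def fourier_correlation_loss(pred, target, eps=1e-8):
    """FCL 示意:频域归一化互相关,补充 FAL 缺失的相位信息。"""
    fp, ft = np.fft.fft2(pred), np.fft.fft2(target)
    corr = np.sum(fp * np.conj(ft)).real
    norm = np.sqrt(np.sum(np.abs(fp) ** 2) * np.sum(np.abs(ft) ** 2)) + eps
    return float(1.0 - corr / norm)

x = np.random.RandomState(0).rand(16, 16)
shifted = np.roll(x, 3, axis=0)
assert fourier_amplitude_loss(x, x) == 0.0
assert fourier_amplitude_loss(x, shifted) < 1e-8   # 平移只改相位,不改幅度
assert fourier_correlation_loss(x, shifted) > fourier_correlation_loss(x, x)
```

可以看到单独使用 FAL 无法区分平移后的图像,这正是需要 FCL 补充相位信息的原因。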

[AI-16] VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning

链接: https://arxiv.org/abs/2410.23156
作者: Yichao Liang,Nishanth Kumar,Hao Tang,Adrian Weller,Joshua B. Tenenbaum,Tom Silver,João F. Henriques,Kevin Ellis
关键词-EN: Broadly intelligent agents, raw sensorimotor space, Broadly intelligent, form task-specific abstractions, sensorimotor space
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: In submission

点击查看摘要

Abstract:Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro-Symbolic Predicates, a first-order abstraction language that combines the strengths of symbolic and neural knowledge representations. We outline an online algorithm for inventing such predicates and learning abstract world models. We compare our approach to hierarchical reinforcement learning, vision-language model planning, and symbolic predicate invention approaches, on both in- and out-of-distribution tasks across five simulated robotic domains. Results show that our approach offers better sample complexity, stronger out-of-distribution generalization, and improved interpretability.

[AI-17] Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms

链接: https://arxiv.org/abs/2410.23144
作者: Jordan Meyer,Nick Padgett,Cullen Miller,Laura Exline
关键词-EN: million high-quality public, present Public Domain, high-quality public domain, Public Domain, designed for training
类目: Artificial Intelligence (cs.AI)
*备注: Project Page: this https URL

点击查看摘要

Abstract:We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the this http URL platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.

[AI-18] Fair Division with Market Values

链接: https://arxiv.org/abs/2410.23137
作者: Siddharth Barman,Soroush Ebadian,Mohamad Latifian,Nisarg Shah
关键词-EN: market valuation, subjective valuations, market, respect, valuations
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce a model of fair division with market values, where indivisible goods must be partitioned among agents with (additive) subjective valuations, and each good additionally has a market value. The market valuation can be viewed as a separate additive valuation that holds identically across all the agents. We seek allocations that are simultaneously fair with respect to the subjective valuations and with respect to the market valuation. We show that an allocation that satisfies stochastically-dominant envy-freeness up to one good (SD-EF1) with respect to both the subjective valuations and the market valuation does not always exist, but the weaker guarantee of EF1 with respect to the subjective valuations along with SD-EF1 with respect to the market valuation can be guaranteed. We also study a number of other guarantees such as Pareto optimality, EFX, and MMS. In addition, we explore non-additive valuations and extend our model to cake-cutting. Along the way, we identify several tantalizing open questions.
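
作为背景,EF1(envy-freeness up to one good)的含义可以用几行代码说清:若智能体 i 嫉妒 j,只要从 j 的物品中去掉某一件就能消除嫉妒,即视为满足。下面是加性估值下的一个检查器示意(估值数据为假设的小例子):

```python
def is_ef1(bundles, valuations):
    """检查分配是否满足 EF1(加性估值)。
    bundles[i] 是智能体 i 分到的物品列表,valuations[i][g] 是 i 对物品 g 的估值。"""
    for i, v in enumerate(valuations):
        mine = sum(v[g] for g in bundles[i])
        for j, other in enumerate(bundles):
            if i == j:
                continue
            theirs = sum(v[g] for g in other)
            if mine >= theirs:
                continue  # 不嫉妒
            # 嫉妒时:去掉 j 的某一件物品后须不再嫉妒
            if not any(mine >= theirs - v[g] for g in other):
                return False
    return True

vals = [{0: 5, 1: 1, 2: 1, 3: 1},   # 智能体 0 偏爱物品 0
        {0: 1, 1: 3, 2: 3, 3: 3}]   # 智能体 1 偏爱其余物品
assert is_ef1([[0], [1, 2, 3]], vals)        # 各取所好,EF1 成立
assert not is_ef1([[], [0, 1, 2, 3]], vals)  # 空手方去掉任一件仍嫉妒
```

论文讨论的 SD-EF1 及“同时对主观估值与市场估值公平”的要求,则是在此基础上的更强约束。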

[AI-19] Revisiting MAE pre-training for 3D medical image segmentation

链接: https://arxiv.org/abs/2410.23132
作者: Tassilo Wald,Constantin Ulrich,Stanislav Lukyanenko,Andrei Goncharov,Alberto Paderno,Leander Maerkisch,Paul F. Jäger,Klaus Maier-Hein
关键词-EN: Self-Supervised Learning, untapped clinical datasets, presents an exciting, potential of vast, untapped clinical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Arxiv Preprint. Currently under Review

点击查看摘要

Abstract:Self-Supervised Learning (SSL) presents an exciting opportunity to unlock the potential of vast, untapped clinical datasets, for various downstream applications that suffer from the scarcity of labeled data. While SSL has revolutionized fields like natural language processing and computer vision, their adoption in 3D medical image computing has been limited by three key pitfalls: Small pre-training dataset sizes, architectures inadequate for 3D medical image analysis, and insufficient evaluation practices. We address these issues by i) leveraging a large-scale dataset of 44k 3D brain MRI volumes and ii) using a Residual Encoder U-Net architecture within the state-of-the-art nnU-Net framework. iii) A robust development framework, incorporating 5 development and 8 testing brain MRI segmentation datasets, allowed performance-driven design decisions to optimize the simple concept of Masked Auto Encoders (MAEs) for 3D CNNs. The resulting model not only surpasses previous SSL methods but also outperforms the strong nnU-Net baseline by an average of approximately 3 Dice points. Furthermore, our model demonstrates exceptional stability, achieving the highest average rank of 2 out of 7 methods, compared to the second-best method’s mean rank of 3.
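
论文优化的对象是 3D CNN 上的 MAE;其最核心的机制——随机掩码大部分 patch、仅让编码器看到少数可见 patch——可以用几行代码示意(掩码比例与划分方式为 MAE 的常见设定,非论文的 nnU-Net 集成实现):

```python
import random

def random_masking(n_patches, mask_ratio=0.75, seed=0):
    """MAE 式随机掩码:打乱 patch 索引,保留前 (1 - mask_ratio) 部分作为可见 patch。"""
    rng = random.Random(seed)
    ids = list(range(n_patches))
    rng.shuffle(ids)
    n_keep = int(n_patches * (1 - mask_ratio))
    visible, masked = sorted(ids[:n_keep]), sorted(ids[n_keep:])
    return visible, masked

# 例如 4x4x4 的体积按 2x2x2 patch 化后共 8 个 patch
visible, masked = random_masking(8)
assert len(visible) == 2 and len(masked) == 6
assert sorted(visible + masked) == list(range(8))
```

预训练目标即让模型仅凭可见 patch 重建被掩码的部分,从而在无标注 MRI 数据上学到可迁移的表示。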

[AI-20] Why Gradient Subspace? Identifying and Mitigating LoRAs Bottlenecks in Federated Fine-Tuning of Large Language Models

链接: https://arxiv.org/abs/2410.23111
作者: Navyansh Mahla,Ganesh Ramakrishnan
关键词-EN: Large Language Models, Large Language, demonstrated remarkable capabilities, Language Models, demonstrated remarkable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 24 pages, 10 figures, pre-print

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, particularly in task generalization for both text and vision data. While fine-tuning these models can significantly enhance their performance on specific downstream tasks, it often requires high-quality data that cannot be shared due to privacy concerns. Federated Learning (FL) offers a promising solution for collaborative training without direct data sharing. However, many parameter-efficient fine-tuning strategies for LLMs in FL, particularly those based on Low-Rank Adaptation (LoRA), face limitations. In this paper, we critically analyze the convergence and performance guarantees of popular FL frameworks utilizing LoRA, highlighting its suboptimal nature due to constrained subspace learning of low-rank matrices. This limitation hinders effective fine-tuning of LLMs in federated settings. Through rigorous analytical and empirical evaluations, we demonstrate that direct weight averaging outperforms LoRA-based strategies, leading to superior performance for fine-tuned models. Our comprehensive comparison exposes inefficiencies in LoRA approaches and underscores the advantages of full-rank weight aggregation. We extend our analysis to low-rank gradient-based optimizers, such as GaLore, used during local training steps. Our findings show that GaLore is a more effective alternative, outperforming federated LoRA methods like FlexLoRA and FFA-LoRA across both text and image modalities. While privacy remains paramount in FL discourse, our focus is on assessing performance outcomes of federated fine-tuned models and evaluating various FL frameworks from both theoretical and empirical perspectives. Our findings advocate reassessing the reliance on LoRA within FL contexts, paving the way for more efficient training methodologies.
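
文中指出的 LoRA 在联邦平均下的次优性,可以用一个小例子直观看到:对低秩因子分别做 FedAvg 再相乘,与先还原各客户端的全秩增量 ΔW_i = A_i B_i 再平均,结果并不相同,偏差正是交叉项(NumPy 示意,维度与客户端数为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
# 两个客户端各自训练出的 LoRA 因子(d=4, r=2)
A = [rng.normal(size=(4, 2)) for _ in range(2)]
B = [rng.normal(size=(2, 4)) for _ in range(2)]

# 全秩聚合:先还原各客户端的权重增量再平均
full_rank_avg = (A[0] @ B[0] + A[1] @ B[1]) / 2
# 朴素做法:对 A、B 分别做 FedAvg 再相乘
naive_avg = ((A[0] + A[1]) / 2) @ ((B[0] + B[1]) / 2)

# 两者一般不相等,偏差恰为交叉项 (A0B1 + A1B0 - A0B0 - A1B1)/4
residual = naive_avg - full_rank_avg
expected = (A[0] @ B[1] + A[1] @ B[0] - A[0] @ B[0] - A[1] @ B[1]) / 4
assert not np.allclose(full_rank_avg, naive_avg)
assert np.allclose(residual, expected)
```

这一交叉项偏差正是论文主张用全秩权重聚合(或 GaLore 这类低秩梯度优化器)替代朴素 LoRA 平均的直观依据之一。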

[AI-21] Controllable Game Level Generation: Assessing the Effect of Negative Examples in GAN Models

链接: https://arxiv.org/abs/2410.23108
作者: Mahsa Bazzaz,Seth Cooper
关键词-EN: Generative Adversarial Networks, Adversarial Networks, Generative Adversarial, Conditional Generative Adversarial, unsupervised models designed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) are unsupervised models designed to learn and replicate a target distribution. The vanilla versions of these models can be extended to more controllable models. Conditional Generative Adversarial Networks (CGANs) extend vanilla GANs by conditioning both the generator and discriminator on some additional information (labels). Controllable models based on complementary learning, such as Rumi-GAN, have been introduced. Rumi-GANs leverage negative examples to enhance the generator’s ability to learn positive examples. We evaluate the performance of two controllable GAN variants, CGAN and Rumi-GAN, in generating game levels targeting specific constraints of interest: playability and controllability. This evaluation is conducted under two scenarios: with and without the inclusion of negative examples. The goal is to determine whether incorporating negative examples helps the GAN models avoid generating undesirable outputs. Our findings highlight the strengths and weaknesses of each method in enforcing the generation of specific conditions when generating outputs based on given positive and negative examples.

[AI-22] Decoupling Semantic Similarity from Spatial Alignment for Neural Networks NEURIPS2024

链接: https://arxiv.org/abs/2410.23107
作者: Tassilo Wald,Constantin Ulrich,Gregor Köhler,David Zimmerer,Stefan Denner,Michael Baumgartner,Fabian Isensee,Priyank Jaini,Klaus H. Maier-Hein
关键词-EN: neural networks learn, similarity, neural networks, deep neural networks, networks learn
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS2024

点击查看摘要

Abstract:What representation do deep neural networks learn? How similar are images to each other for neural networks? Despite the overwhelming success of deep learning methods key questions about their internal workings still remain largely unanswered, due to their internal high dimensionality and complexity. To address this, one approach is to measure the similarity of activation responses to various inputs. Representational Similarity Matrices (RSMs) distill this similarity into scalar values for each input pair. These matrices encapsulate the entire similarity structure of a system, indicating which input leads to similar responses. While the similarity between images is ambiguous, we argue that the spatial location of semantic objects does neither influence human perception nor deep learning classifiers. Thus this should be reflected in the definition of similarity between image responses for computer vision systems. Revisiting the established similarity calculations for RSMs we expose their sensitivity to spatial alignment. In this paper, we propose to solve this through semantic RSMs, which are invariant to spatial permutation. We measure semantic similarity between input responses by formulating it as a set-matching problem. Further, we quantify the superiority of semantic RSMs over spatio-semantic RSMs through image retrieval and by comparing the similarity between representations to the similarity between predicted class probabilities.
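
文中“把语义相似度表述为集合匹配问题”的思路,可以在极小规模上用穷举置换来示意:逐空间位置比较对排列敏感,而在所有排列中取最优匹配的“集合”相似度则对位置置换不变(玩具数据,非论文实现;实际系统需要可扩展的匹配算法而非穷举):

```python
import math
from itertools import permutations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spatial_similarity(ra, rb):
    """逐空间位置比较两个激活响应(对位置排列敏感)。"""
    return sum(cosine(u, v) for u, v in zip(ra, rb)) / len(ra)

def semantic_similarity(ra, rb):
    """集合匹配:在所有空间位置的排列中取最优匹配(小规模穷举)。"""
    return max(
        sum(cosine(ra[i], rb[j]) for i, j in enumerate(perm)) / len(ra)
        for perm in permutations(range(len(rb)))
    )

# 同样的三个局部特征,只是空间位置被打乱
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
assert abs(semantic_similarity(a, b) - 1.0) < 1e-9
assert semantic_similarity(a, b) > spatial_similarity(a, b)
```

语义 RSM 即用这种置换不变的相似度填充矩阵的每个元素,从而消除空间对齐带来的偏差。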

[AI-23] Guided Game Level Repair via Explainable AI

链接: https://arxiv.org/abs/2410.23101
作者: Mahsa Bazzaz,Seth Cooper
关键词-EN: machine learning models, created by machine, machine learning, learning models, Procedurally generated levels
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Procedurally generated levels created by machine learning models can be unsolvable without further editing. Various methods have been developed to automatically repair these levels by enforcing hard constraints during the post-processing step. However, as levels increase in size, these constraint-based repairs become increasingly slow. This paper proposes using explainability methods to identify specific regions of a level that contribute to its unsolvability. By assigning higher weights to these regions, constraint-based solvers can prioritize these problematic areas, enabling more efficient repairs. Our results, tested across three games, demonstrate that this approach can help to repair procedurally generated levels faster.

[AI-24] From Hype to Reality: The Road Ahead of Deploying DRL in 6G Networks

链接: https://arxiv.org/abs/2410.23086
作者: Haiyuan Li,Hari Madhukumar,Peizheng Li,Yiran Teng,Shuangyi Yan,Dimitra Simeonidou
关键词-EN: high computational capacity, demand massive connectivity, massive connectivity, high computational, computational capacity
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:The industrial landscape is rapidly evolving with the advent of 6G applications, which demand massive connectivity, high computational capacity, and ultra-low latency. These requirements present new challenges, which can no longer be efficiently addressed by conventional strategies. In response, this article underscores the transformative potential of Deep Reinforcement Learning (DRL) for 6G, highlighting its advantages over classic machine learning solutions in meeting the demands of 6G. The necessity of DRL is further validated through three DRL applications in an end-to-end communication procedure, including wireless access control, baseband function placement, and network slicing coordination. However, DRL-based network management initiatives are far from mature. We extend the discussion to identify the challenges of applying DRL in practical networks and explore potential solutions along with their respective limitations. In the end, these insights are validated through a practical DRL deployment in managing network slices on the testbed.

[AI-25] S3PT: Scene Semantics and Structure Guided Clustering to Boost Self-Supervised Pre-Training for Autonomous Driving WACV2025

链接: https://arxiv.org/abs/2410.23085
作者: Maciej K. Wozniak,Hariprasath Govindarajan,Marvin Klingner,Camille Maurice,Ravi Kiran,Senthil Yogamani
关键词-EN: Recent self-supervised clustering-based, DINO and Cribo, clustering-based pre-training techniques, shown impressive results, self-supervised clustering-based pre-training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted for WACV 2025

点击查看摘要

Abstract:Recent self-supervised clustering-based pre-training techniques like DINO and Cribo have shown impressive results for downstream detection and segmentation tasks. However, real-world applications such as autonomous driving face challenges with imbalanced object class and size distributions and complex scene geometries. In this paper, we propose S3PT, a novel scene semantics and structure guided clustering approach that provides more scene-consistent objectives for self-supervised training. Specifically, our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering, to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose a depth-guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level. Our learned representations significantly improve performance in downstream semantic segmentation and 3D object detection tasks on the nuScenes, nuImages, and Cityscapes datasets and show promising domain translation properties.

[AI-26] An Event-Based Digital Compute-In-Memory Accelerator with Flexible Operand Resolution and Layer-Wise Weight/Output Stationarity ISCAS2025

链接: https://arxiv.org/abs/2410.23082
作者: Nicolas Chauvaux,Adrian Kneip,Christoph Posch,Kofi Makinwa,Charlotte Frenkel
关键词-EN: spiking neural networks, s-level inference latency, edge vision applications, accelerators for spiking, neural networks
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注: 5 pages, 7 figures, submitted to IEEE ISCAS 2025

点击查看摘要

Abstract:Compute-in-memory (CIM) accelerators for spiking neural networks (SNNs) are promising solutions to enable μs-level inference latency and ultra-low energy in edge vision applications. Yet, their current lack of flexibility at both the circuit and system levels prevents their deployment in a wide range of real-life scenarios. In this work, we propose a novel digital CIM macro that supports arbitrary operand resolution and shape, with a unified CIM storage for weights and membrane potentials. These circuit-level techniques enable a hybrid weight- and output-stationary dataflow at the system level to maximize operand reuse, thereby minimizing costly on- and off-chip data movements during the SNN execution. Measurement results of a fabricated FlexSpIM prototype in 40-nm CMOS demonstrate a 2× increase in bit-normalized energy efficiency compared to prior fixed-precision digital CIM-SNNs, while providing resolution reconfiguration with bitwise granularity. Our approach can save up to 90% energy in large-scale systems, while reaching a state-of-the-art classification accuracy of 95.8% on the IBM DVS gesture dataset.

[AI-27] CNN Explainability with Multivector Tucker Saliency Maps for Self-Supervised Models

链接: https://arxiv.org/abs/2410.23072
作者: Aymene Mohammed Bouayed,Samuel Deslauriers-Gauthier,Adrian Iaccovelli,David Naccache
关键词-EN: Convolutional Neural Networks, Neural Networks, Convolutional Neural, Interpreting the decisions, decisions of Convolutional
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 29 pages, 20 figures

点击查看摘要

Abstract:Interpreting the decisions of Convolutional Neural Networks (CNNs) is essential for understanding their behavior, yet explainability remains a significant challenge, particularly for self-supervised models. Most existing methods for generating saliency maps rely on ground truth labels, restricting their use to supervised tasks. EigenCAM is the only notable label-independent alternative, leveraging Singular Value Decomposition to generate saliency maps applicable across CNN models, but it does not fully exploit the tensorial structure of feature maps. In this work, we introduce the Tucker Saliency Map (TSM) method, which applies Tucker tensor decomposition to better capture the inherent structure of feature maps, producing more accurate singular vectors and values. These are used to generate high-fidelity saliency maps, effectively highlighting objects of interest in the input. We further extend EigenCAM and TSM into multivector variants -Multivec-EigenCAM and Multivector Tucker Saliency Maps (MTSM)- which utilize all singular vectors and values, further improving saliency map quality. Quantitative evaluations on supervised classification models demonstrate that TSM, Multivec-EigenCAM, and MTSM achieve competitive performance with label-dependent methods. Moreover, TSM enhances explainability by approximately 50% over EigenCAM for both supervised and self-supervised models. Multivec-EigenCAM and MTSM further advance state-of-the-art explainability performance on self-supervised models, with MTSM achieving the best results.
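
Tucker 分解版 TSM 的实现见论文;作为对照,这里给出文中提到的 EigenCAM 思路的一个 NumPy 最小示意——把 C×H×W 特征图展平后做 SVD,用第一右奇异向量加权各通道再投影回空间(符号处理与归一化为常见做法,非官方代码):

```python
import numpy as np

def eigencam_saliency(feature_map):
    """EigenCAM 风格显著图:feature_map 形状为 (C, H, W)。"""
    c, h, w = feature_map.shape
    m = feature_map.reshape(c, h * w).T            # (HW, C)
    _, _, vt = np.linalg.svd(m, full_matrices=False)
    saliency = (m @ vt[0]).reshape(h, w)           # 沿第一主方向投影
    if abs(saliency.min()) > abs(saliency.max()):  # SVD 符号不定,统一取正向
        saliency = -saliency
    saliency -= saliency.min()
    if saliency.max() > 0:
        saliency /= saliency.max()
    return saliency

# 玩具特征图:两个通道都只在右下角激活,显著图峰值应落在该处
fm = np.zeros((2, 4, 4))
fm[:, 3, 3] = 1.0
s = eigencam_saliency(fm)
assert s.shape == (4, 4)
assert s[3, 3] == s.max() == 1.0
```

论文的 TSM 即把这里的矩阵 SVD 换成对特征张量的 Tucker 分解,以保留特征图固有的张量结构;多向量变体则进一步利用全部奇异向量而不只是第一个。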

[AI-28] LLMs Integration in Software Engineering Team Projects: Roles Impact and a Pedagogical Design Space for AI Tools in Computing Education

链接: https://arxiv.org/abs/2410.23069
作者: Ahmed Kharrufa,Sami Alghamdi,Abeer Aziz,Christopher Bull
关键词-EN: undergraduate Software Engineering, Engineering Team Project, Software Engineering Team, Software Engineering, undergraduate Software
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This work takes a pedagogical lens to explore the implications of generative AI (GenAI) models and tools, such as ChatGPT and GitHub Copilot, in a semester-long 2nd-year undergraduate Software Engineering Team Project. Qualitative findings from survey (39 students) and interviews (eight students) provide insights into the students’ views on the impact of GenAI use on their coding experience, learning, and self-efficacy. Our results address a particular gap in understanding the role and implications of GenAI on teamwork, team-efficacy, and team dynamics. The analysis of the learning aspects is distinguished by the application of learning and pedagogy informed lenses to discuss the data. We propose a preliminary design space for GenAI-based programming learning tools highlighting the importance of considering the roles that GenAI can play during the learning process, the varying support-ability patterns that can be applied to each role, and the importance of supporting transparency in GenAI for team members and students in addition to educators.

[AI-29] Emotional RAG: Enhancing Role-Playing Agents through Emotional Retrieval

链接: https://arxiv.org/abs/2410.23041
作者: Le Huang,Hengzhi Lan,Zijun Sun,Chuan Shi,Ting Bai
关键词-EN: mimic human replies, role-playing research areas, role-playing agents, human-like capability, increasing attention
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As LLMs exhibit a high degree of human-like capability, increasing attention has been paid to role-playing research areas in which responses generated by LLMs are expected to mimic human replies. This has promoted the exploration of role-playing agents in various applications, such as chatbots that can engage in natural conversations with users and virtual assistants that can provide personalized support and guidance. The crucial factor in the role-playing task is the effective utilization of character memory, which stores characters’ profiles, experiences, and historical dialogues. Retrieval Augmented Generation (RAG) technology is used to access the related memory to enhance the response generation of role-playing agents. Most existing studies retrieve related information based on the semantic similarity of memory to maintain characters’ personalized traits, and few attempts have been made to incorporate the emotional factor in the retrieval augmented generation (RAG) of LLMs. Inspired by the Mood-Dependent Memory theory, which indicates that people recall an event better if they somehow reinstate during recall the original emotion they experienced during learning, we propose a novel emotion-aware memory retrieval framework, termed Emotional RAG, which recalls the related memory with consideration of emotional state in role-playing agents. Specifically, we design two kinds of retrieval strategies, i.e., combination strategy and sequential strategy, to incorporate both memory semantic and emotional states during the retrieval process. Extensive experiments on three representative role-playing datasets demonstrate that our Emotional RAG framework outperforms the method without considering the emotional factor in maintaining the personalities of role-playing agents. This provides evidence to further reinforce the Mood-Dependent Memory theory in psychology.
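
文中“组合策略”的打分思路可以示意为语义相似度与情绪相似度的加权和(权重、向量与记忆条目均为假设的玩具设定,非论文实现):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def emotional_retrieve(query_sem, query_emo, memory, alpha=0.5):
    """组合策略示意:得分 = α·语义相似度 + (1-α)·情绪相似度,返回最高分记忆。"""
    def score(m):
        return alpha * cosine(query_sem, m["sem"]) + (1 - alpha) * cosine(query_emo, m["emo"])
    return max(memory, key=score)["text"]

memory = [
    {"text": "和朋友庆祝胜利", "sem": [0.9, 0.1], "emo": [1.0, 0.0]},   # 情绪:高兴
    {"text": "比赛失利后的反思", "sem": [0.9, 0.2], "emo": [0.0, 1.0]},  # 情绪:低落
]
# 两条记忆的语义相似度几乎相同;在“低落”情绪状态下,应召回情绪一致的那条
recalled = emotional_retrieve([1.0, 0.1], [0.0, 1.0], memory)
assert recalled == "比赛失利后的反思"
```

纯语义检索会在两条记忆之间难以区分,而情绪项的加入使召回结果与当前情绪状态一致,正对应 Mood-Dependent Memory 理论的预期。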

[AI-30] Offline Reinforcement Learning and Sequence Modeling for Downlink Link Adaptation

链接: https://arxiv.org/abs/2410.23031
作者: Samuele Peri,Alessio Russo,Gabor Fodor,Pablo Soldati
关键词-EN: Contemporary radio access, employ link adaption, achieved spectral efficiency, prevailing propagation conditions, Contemporary radio
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Contemporary radio access networks employ link adaptation (LA) algorithms to optimize the modulation and coding schemes to adapt to the prevailing propagation conditions and are near-optimal in terms of the achieved spectral efficiency. LA is a challenging task in the presence of mobility, fast fading, and imperfect channel quality information and limited knowledge of the receiver characteristics at the transmitter, which render model-based LA algorithms complex and suboptimal. Model-based LA is especially difficult as connected user equipment devices become increasingly heterogeneous in terms of receiver capabilities, antenna configurations and hardware characteristics. Recognizing these difficulties, previous works have proposed reinforcement learning (RL) for LA, which faces deployment difficulties due to its potential negative impacts on live performance. To address this challenge, this paper considers offline RL to learn LA policies from data acquired in live networks with minimal or no intrusive effects on the network operation. We propose three LA designs based on batch-constrained deep Q-learning, conservative Q-learning, and decision transformers, showing that offline RL algorithms can achieve performance of state-of-the-art online RL methods when data is collected with a proper behavioral policy.
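
文中三种设计里,保守 Q 学习(CQL)的核心是在标准 TD 损失之外加一个正则项:压低所有动作的 Q 值(logsumexp 项),同时抬高离线数据中实际出现动作的 Q 值,从而抑制对数据外动作的高估。下面是该正则项的一个标量玩具示意(非论文实现):

```python
import math

def cql_penalty(q_values, data_action, alpha=1.0):
    """CQL 正则项:alpha * (logsumexp_a Q(s,a) - Q(s, a_data))。"""
    lse = math.log(sum(math.exp(q) for q in q_values))
    return alpha * (lse - q_values[data_action])

# 离线数据中只出现过动作 0;若策略高估了未见过的动作 1,惩罚显著增大
assert cql_penalty([1.0, 5.0], data_action=0) > cql_penalty([1.0, 1.0], data_action=0)
```

把该项加到 Q 网络的训练损失上,学到的策略就会倾向于贴近行为策略覆盖到的动作,这正是离线 RL 能安全地从现网数据学习链路自适应策略的关键。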

[AI-31] A Comparison of Prompt Engineering Techniques for Task Planning and Execution in Service Robotics

链接: https://arxiv.org/abs/2410.22997
作者: Jonas Bode,Bastian Pätzold,Raphael Memmesheimer,Sven Behnke
关键词-EN: vast general knowledge, autonomous robot control, Recent advances, prompt engineering techniques, instrumental in autonomous
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 6 pages, 3 figures, 2 tables, to be published in the 2024 IEEE-RAS International Conference on Humanoid Robots, We make our code, including all prompts, available at this https URL

点击查看摘要

Abstract:Recent advances in LLMs have been instrumental in autonomous robot control and human-robot interaction by leveraging their vast general knowledge and capabilities to understand and reason across a wide range of tasks and scenarios. Previous works have investigated various prompt engineering techniques for improving the performance of LLMs to accomplish tasks, while others have proposed methods that utilize LLMs to plan and execute tasks based on the available functionalities of a given robot platform. In this work, we consider both lines of research by comparing prompt engineering techniques and combinations thereof within the application of high-level task planning and execution in service robotics. We define a diverse set of tasks and a simple set of functionalities in simulation, and measure task completion accuracy and execution time for several state-of-the-art models.

[AI-32] Semantic Enrichment of the Quantum Cascade Laser Properties in Text- A Knowledge Graph Generation Approach

链接: https://arxiv.org/abs/2410.22996
作者: Deperias Kerre,Anne Laurent,Kenneth Maussang,Dickson Owuor
关键词-EN: Quantum Cascade Laser, Quantum Cascade, QCL properties Knowledge, QCL properties, Cascade Laser
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A well structured collection of the various Quantum Cascade Laser (QCL) design and working properties data provides a platform to analyze and understand the relationships between these properties. By analyzing these relationships, we can gain insights into how different design features impact laser performance properties such as the working temperature. Most of these QCL properties are captured in scientific text. There is therefore a need for efficient methodologies that can extract QCL properties from text and generate a semantically enriched and interlinked platform where the properties can be analyzed to uncover hidden relations. There is also a need to maintain provenance and reference information on which these properties are based. Semantic Web technologies such as Ontologies and Knowledge Graphs have proven capability in providing interlinked data platforms for knowledge representation in various domains. In this paper, we propose an approach for generating a QCL properties Knowledge Graph (KG) from text for semantic enrichment of the properties. The approach is based on the QCL ontology and a Retrieval Augmented Generation (RAG) enabled information extraction pipeline built on the GPT-4 Turbo language model. The properties of interest include: working temperature, laser design type, lasing frequency, laser optical power and the heterostructure. The experimental results demonstrate the feasibility and effectiveness of this approach for efficiently extracting QCL properties from unstructured text and generating a QCL properties Knowledge Graph, which has potential applications in semantic enrichment and analysis of QCL data.

[AI-33] Higher-order Cross-structural Embedding Model for Time Series Analysis

链接: https://arxiv.org/abs/2410.22984
作者: Guancen Lin,Cong Shen,Aijing Lin
关键词-EN: gained significant attention, significant attention due, Time series, Time series analysis, sensor networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series analysis has gained significant attention due to its critical applications in diverse fields such as healthcare, finance, and sensor networks. The complexity and non-stationarity of time series make it challenging to capture the interaction patterns across different timestamps. Current approaches struggle to model higher-order interactions within time series, and focus on learning temporal or spatial dependencies separately, which limits performance in downstream tasks. To address these gaps, we propose Higher-order Cross-structural Embedding Model for Time Series (High-TS), a novel framework that jointly models both temporal and spatial perspectives by combining multiscale Transformer with Topological Deep Learning (TDL). Meanwhile, High-TS utilizes contrastive learning to integrate these two structures for generating robust and discriminative representations. Extensive experiments show that High-TS outperforms state-of-the-art methods in various time series tasks and demonstrate the importance of higher-order cross-structural information in improving model performance.

[AI-34] PDSR: Efficient UAV Deployment for Swift and Accurate Post-Disaster Search and Rescue

链接: https://arxiv.org/abs/2410.22982
作者: Alaa Awad Abdellatif,Ali Elmancy,Amr Mohamed,Ahmed Massoud,Wadha Lebda,Khalid K. Naji
关键词-EN: Unmanned Aerial Vehicles, leveraging Unmanned Aerial, operations leveraging Unmanned, rescue operations leveraging, Search and Rescue
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: This paper is currently under review at IEEE IoT Magazine

点击查看摘要

Abstract:This paper introduces a comprehensive framework for Post-Disaster Search and Rescue (PDSR), aiming to optimize search and rescue operations leveraging Unmanned Aerial Vehicles (UAVs). The primary goal is to improve the precision and availability of sensing capabilities, particularly in various catastrophic scenarios. Central to this concept is the rapid deployment of UAV swarms equipped with diverse sensing, communication, and intelligence capabilities, functioning as an integrated system that incorporates multiple technologies and approaches for efficient detection of individuals buried beneath rubble or debris following a disaster. Within this framework, we propose architectural solution and address associated challenges to ensure optimal performance in real-world disaster scenarios. The proposed framework aims to achieve complete coverage of damaged areas significantly faster than traditional methods using a multi-tier swarm architecture. Furthermore, integrating multi-modal sensing data with machine learning for data fusion could enhance detection accuracy, ensuring precise identification of survivors.

[AI-35] Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

链接: https://arxiv.org/abs/2410.22952
作者: Wei Dong,Yuan Sun,Yiting Yang,Xing Zhang,Zhijun Lin,Qingsen Yan,Haokui Zhang,Peng Wang,Yang Yang,Hengtao Shen
关键词-EN: pre-trained Vision Transformers, Vision Transformers, common strategy, strategy for Parameter-Efficient, matrix
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of down-projection and up-projection matrices, with the bottleneck dimensionality being crucial for reducing the number of learnable parameters, as exemplified by prevalent methods like LoRA and Adapter. However, these low-rank strategies typically employ a fixed bottleneck dimensionality, which limits their flexibility in handling layer-wise variations. To address this limitation, we propose a novel PEFT approach inspired by Singular Value Decomposition (SVD) for representing the adaptation matrix. SVD decomposes a matrix into the product of a left unitary matrix, a diagonal matrix of scaling values, and a right unitary matrix. We utilize Householder transformations to construct orthogonal matrices that efficiently mimic the unitary matrices, requiring only a vector. The diagonal values are learned in a layer-wise manner, allowing them to flexibly capture the unique properties of each layer. This approach enables the generation of adaptation matrices with varying ranks across different layers, providing greater flexibility in adapting pre-trained models. Experiments on standard downstream vision tasks demonstrate that our method achieves promising fine-tuning performance.
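摘要中"用一个向量构造 Householder 正交矩阵、再由逐层对角值控制秩"的思路,可以用如下极简草图说明(函数名与参数取值均为示意性假设,并非论文实现):

```python
import numpy as np

def householder(v):
    """Orthogonal reflection H = I - 2 vv^T / (v^T v): a full orthogonal
    matrix parameterized by a single vector, mimicking a unitary SVD factor."""
    v = v / np.linalg.norm(v)
    return np.eye(v.size) - 2.0 * np.outer(v, v)

def adaptation_matrix(v_left, diag_vals, v_right):
    """SVD-style adaptation Delta W = H_l @ diag(d) @ H_r; the layer-wise
    diagonal d determines the effective rank of the update."""
    return householder(v_left) @ np.diag(diag_vals) @ householder(v_right)

rng = np.random.default_rng(0)
d = 8
H = householder(rng.standard_normal(d))
# Only two nonzero diagonal values -> a rank-2 adaptation for this layer.
diag_vals = np.array([0.1, 0.05, 0, 0, 0, 0, 0, 0], dtype=float)
dW = adaptation_matrix(rng.standard_normal(d), diag_vals, rng.standard_normal(d))
```

可以看到,每层只需存储两个向量和一个对角向量,秩则由非零对角元个数逐层自适应决定。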

[AI-36] SpiroActive: Active Learning for Efficient Data Acquisition for Spirometry

链接: https://arxiv.org/abs/2410.22950
作者: Ankita Kumari Jain,Nitish Sharma,Madhav Kanda,Nipun Batra
关键词-EN: global health burden, significant global health, Respiratory illnesses, significant global, health burden
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Respiratory illnesses, primarily chronic obstructive pulmonary disease (COPD), are a significant global health burden: the seventh leading cause of poor health worldwide and the third leading cause of death, responsible for 3.23 million deaths in 2019, necessitating early identification and diagnosis for effective mitigation. Among the diagnostic tools employed, spirometry plays a crucial role in detecting respiratory abnormalities. However, conventional clinical spirometry methods often entail considerable costs and practical limitations, such as the need for specialized equipment, trained personnel, and a dedicated clinical setting, making them less accessible. To address these challenges, wearable spirometry technologies have emerged as promising alternatives, offering accurate, cost-effective, and convenient solutions. The development of machine learning models for wearable spirometry heavily relies on the availability of high-quality ground truth spirometry data, which is laborious and expensive to collect. In this research, we propose using active learning, a sub-field of machine learning, to mitigate the challenges associated with data collection and labeling. By strategically selecting samples from the ground truth spirometer, we can reduce the need for resource-intensive data collection. We present evidence that models trained on small subsets obtained through active learning achieve comparable or better results than models trained on the complete dataset.
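主动学习中"策略性选样"的一个常见实现是不确定性采样:优先标注模型最没把握的样本。下面是一个假设性的最小示例(概率值为虚构,论文的具体采样准则未必相同):

```python
import numpy as np

def uncertainty_sampling(probs, k):
    """Select the k unlabeled pool samples whose predicted positive-class
    probability is closest to 0.5 -- the classic uncertainty heuristic for
    deciding which recordings are worth the cost of ground-truth labeling."""
    probs = np.asarray(probs, dtype=float)
    return np.argsort(np.abs(probs - 0.5))[:k]

# Model confidence on four unlabeled pool samples (hypothetical values).
picked = uncertainty_sampling([0.10, 0.45, 0.90, 0.52], k=2)
```

每一轮将选中样本送去真值肺量计标注、加入训练集并重训模型,循环往复,即可用远少于全量的数据逼近全量训练的效果。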

[AI-37] DiffLight: A Partial Rewards Conditioned Diffusion Model for Traffic Signal Control with Missing Data NEURIPS2024

链接: https://arxiv.org/abs/2410.22938
作者: Hanyang Chen,Yang Jiang,Shengnan Guo,Xiaowei Mao,Youfang Lin,Huaiyu Wan
关键词-EN: yielded notable achievements, notable achievements, TSC, extensively researched, researched and yielded
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:The application of reinforcement learning in traffic signal control (TSC) has been extensively researched and yielded notable achievements. However, most existing works for TSC assume that traffic data from all surrounding intersections is fully and continuously available through sensors. In real-world applications, this assumption often fails due to sensor malfunctions or data loss, making TSC with missing data a critical challenge. To meet the needs of practical applications, we introduce DiffLight, a novel conditional diffusion model for TSC under data-missing scenarios in the offline setting. Specifically, we integrate two essential sub-tasks, i.e., traffic data imputation and decision-making, by leveraging a Partial Rewards Conditioned Diffusion (PRCD) model to prevent missing rewards from interfering with the learning process. Meanwhile, to effectively capture the spatial-temporal dependencies among intersections, we design a Spatial-Temporal transFormer (STFormer) architecture. In addition, we propose a Diffusion Communication Mechanism (DCM) to promote better communication and control performance under data-missing scenarios. Extensive experiments on five datasets with various data-missing scenarios demonstrate that DiffLight is an effective controller to address TSC with missing data. The code of DiffLight is released at this https URL.

[AI-38] Thoughtful Adoption of NLP for Civic Participation: Understanding Differences Among Policymakers

链接: https://arxiv.org/abs/2410.22937
作者: Jose A. Guridi,Cristobal Cheyre,Qian Yang
关键词-EN: Natural language processing, Natural language, NLP, NLP tools, analyze citizen opinions
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Forthcoming in the Proceedings of the 2025 Conference on Computer Supported Cooperative Work and Social Computing (CSCW)

点击查看摘要

Abstract:Natural language processing (NLP) tools have the potential to boost civic participation and enhance democratic processes because they can significantly increase governments’ capacity to gather and analyze citizen opinions. However, their adoption in government remains limited, and harnessing their benefits while preventing unintended consequences remains a challenge. While prior work has focused on improving NLP performance, this work examines how different internal government stakeholders influence NLP tools’ thoughtful adoption. We interviewed seven politicians (politically appointed officials as heads of government institutions) and thirteen public servants (career government employees who design and administrate policy interventions), inquiring how they choose whether and how to use NLP tools to support civic participation processes. The interviews suggest that policymakers across both groups focused on their needs for career advancement and the need to showcase the legitimacy and fairness of their work when considering NLP tool adoption and use. Because these needs vary between politicians and public servants, their preferred NLP features and tool designs also differ. Interestingly, despite their differing needs and opinions, neither group clearly identifies who should advocate for NLP adoption to enhance civic participation or address the unintended consequences of a poorly considered adoption. This lack of clarity in responsibility might have caused the governments’ low adoption of NLP tools. We discuss how these findings reveal new insights for future HCI research. They inform the design of NLP tools for increasing civic participation efficiency and capacity, the design of other tools and methods that ensure thoughtful adoption of AI tools in government, and the design of NLP tools for collaborative use among users with different incentives and needs.

[AI-39] BIS: NL2SQL Service Evaluation Benchmark for Business Intelligence Scenarios

链接: https://arxiv.org/abs/2410.22925
作者: Bora Caglayan,Mingxue Wang,John D. Kelleher,Shen Fei,Gui Tong,Jiandong Ding,Puchao Zhang
关键词-EN: Structured Query Language, Natural Language, Structured Query, Query Language, Language to Structured
类目: Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by ICSOC (International Conference on Service-Oriented Computing) 2024

点击查看摘要

Abstract:NL2SQL (Natural Language to Structured Query Language) transformation has seen wide adoption in Business Intelligence (BI) applications in recent years. However, existing NL2SQL benchmarks are not suitable for production BI scenarios, as they are not designed for common business intelligence questions. To address this gap, we have developed a new benchmark focused on typical NL questions in industrial BI scenarios. We discuss the challenges of constructing a BI-focused benchmark and the shortcomings of existing benchmarks. Additionally, we introduce question categories in our benchmark that reflect common BI inquiries. Lastly, we propose two novel semantic similarity evaluation metrics for assessing NL2SQL capabilities in BI applications and services.

[AI-40] Self-optimization in distributed manufacturing systems using Modular State-based Stackelberg Games

链接: https://arxiv.org/abs/2410.22912
作者: Steve Yuwono,Ahmar Kamal Hussain,Dorothea Schwung,Andreas Schwung
关键词-EN: Modular State-based Stackelberg, introduce Modular State-based, State-based Stackelberg Games, Stackelberg Games, State-based Potential Games
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: This pre-print was submitted to Journal of Manufacturing Systems on October 30, 2024

点击查看摘要

Abstract:In this study, we introduce Modular State-based Stackelberg Games (Mod-SbSG), a novel game structure developed for distributed self-learning in modular manufacturing systems. Mod-SbSG enhances cooperative decision-making among self-learning agents within production systems by integrating State-based Potential Games (SbPG) with Stackelberg games. This hierarchical structure assigns more important modules of the manufacturing system a first-mover advantage, while less important modules respond optimally to the leaders’ decisions. This decision-making process differs from typical multi-agent learning algorithms in manufacturing systems, where decisions are made simultaneously. We provide convergence guarantees for the novel game structure and design learning algorithms to account for the hierarchical game structure. We further analyse the effects of single-leader/multiple-follower and multiple-leader/multiple-follower scenarios within a Mod-SbSG. To assess its effectiveness, we implement and test Mod-SbSG in an industrial control setting using two laboratory-scale testbeds featuring sequential and serial-parallel processes. The proposed approach delivers promising results compared to the vanilla SbPG, reducing overflow by 97.1% and, in some cases, preventing overflow entirely. Additionally, it decreases power consumption by 5-13% while satisfying the production demand, significantly improving potential (global objective) values.

[AI-41] YOLOv11 for Vehicle Detection: Advancements Performance and Applications in Intelligent Transportation Systems

链接: https://arxiv.org/abs/2410.22898
作者: Mujadded Al Rabbani Alif
关键词-EN: Accurate vehicle detection, intelligent transportation systems, Accurate vehicle, intelligent transportation, Accurate
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 16 pages

点击查看摘要

Abstract:Accurate vehicle detection is essential for the development of intelligent transportation systems, autonomous driving, and traffic monitoring. This paper presents a detailed analysis of YOLO11, the latest advancement in the YOLO series of deep learning models, focusing exclusively on vehicle detection tasks. Building upon the success of its predecessors, YOLO11 introduces architectural improvements designed to enhance detection speed, accuracy, and robustness in complex environments. Using a comprehensive dataset comprising multiple vehicle types (cars, trucks, buses, motorcycles, and bicycles), we evaluate YOLO11’s performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLO11 surpasses previous versions (YOLOv8 and YOLOv10) in detecting smaller and more occluded vehicles while maintaining a competitive inference time, making it well-suited for real-time applications. Comparative analysis shows significant improvements in the detection of complex vehicle geometries, further contributing to the development of efficient and scalable vehicle detection systems. This research highlights YOLO11’s potential to enhance autonomous vehicle performance and traffic monitoring systems, offering insights for future developments in the field.
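文中报告的 mAP 等检测指标都建立在交并比(IoU)之上:预测框与真值框的 IoU 超过某一阈值才计为真正例。其计算可以用几行代码说明:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2);
    detections count as true positives at a given IoU threshold when
    computing precision, recall, and mAP."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

例如两个 2×2 的框错位 1 个单位时,交集面积为 1、并集为 7,IoU = 1/7 ≈ 0.143,在常用的 0.5 阈值下不算命中。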

[AI-42] Adaptive Paradigm Synergy: Can a Cross-Paradigm Objective Enhance Long-Tailed Learning?

链接: https://arxiv.org/abs/2410.22883
作者: Haowen Xiao,Guanghui Liu,Xinyi Gao,Yang Li,Fengmao Lv,Jielei Chu
关键词-EN: computer vision tasks, achieved impressive results, rivaling supervised methods, vision tasks, achieved impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:Self-supervised learning (SSL) has achieved impressive results across several computer vision tasks, even rivaling supervised methods. However, its performance degrades on real-world datasets with long-tailed distributions due to difficulties in capturing inherent class imbalances. Although supervised long-tailed learning offers significant insights, the absence of labels in SSL prevents direct transfer of these insights. To bridge this gap, we introduce Adaptive Paradigm Synergy (APS), a cross-paradigm objective that seeks to unify the strengths of both paradigms. Our approach reexamines contrastive learning from a spatial structure perspective, dynamically adjusting the uniformity of the latent space structure through adaptive temperature tuning. Furthermore, we draw on a re-weighting strategy from supervised learning to compensate for the shortcomings of temperature adjustment in explicit quantity modeling. Extensive experiments on commonly used long-tailed datasets demonstrate that APS improves performance effectively and efficiently. Our findings reveal the potential for deeper integration between supervised and self-supervised learning, paving the way for robust models that handle real-world class imbalance.
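温度参数如何影响对比学习中隐空间的"均匀性",可以用单个锚点的 InfoNCE 损失来演示(仅为示意草图,APS 的自适应调温策略本身不在此复现,相似度数值为假设):

```python
import numpy as np

def info_nce(sim_pos, sim_negs, temperature):
    """InfoNCE loss for one anchor, given cosine similarities to its positive
    and its negatives. A lower temperature sharpens the softmax over
    negatives, tightening latent-space uniformity -- the knob that adaptive
    temperature tuning adjusts."""
    logits = np.concatenate(([sim_pos], sim_negs)) / temperature
    logits -= logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

loss_sharp = info_nce(0.9, [0.1, 0.0], temperature=0.1)  # low temperature
loss_flat = info_nce(0.9, [0.1, 0.0], temperature=1.0)   # high temperature
```

同样的相似度下,温度越低,正样本在 softmax 中的占比越大、损失越小,也就对负样本的区分约束越强;这正是需要对尾部类自适应调节的原因。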

[AI-43] SFA-UNet: More Attention to Multi-Scale Contrast and Contextual Information in Infrared Small Object Segmentation

链接: https://arxiv.org/abs/2410.22881
作者: Imad Ali Shah,Fahad Mumtaz Malik,Muhammad Waqas Ashraf
关键词-EN: Computer vision researchers, fundamental infrared visual, infrared visual recognition, Computer vision, past few decades
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted and Presented at PRIP 2023

点击查看摘要

Abstract:Computer vision researchers have extensively worked on fundamental infrared visual recognition for the past few decades. Among various approaches, deep learning has emerged as the most promising candidate. However, Infrared Small Object Segmentation (ISOS) remains a major focus due to several challenges including: 1) the lack of effective utilization of local contrast and global contextual information; 2) the potential loss of small objects in deep models; and 3) the struggle to capture fine-grained details while ignoring noise. To address these challenges, we propose a modified U-Net architecture, named SFA-UNet, by combining Scharr Convolution (SC) and Fast Fourier Convolution (FFC) in addition to vertical and horizontal Attention gates (AG) into UNet. SFA-UNet utilizes double convolution layers with the addition of SC and FFC in its encoder and decoder layers. SC helps to learn the foreground-to-background contrast information, whereas FFC provides multi-scale contextual information while mitigating the small-object vanishing problem. Additionally, the introduction of vertical AGs in encoder layers enhances the model’s focus on the targeted object by ignoring irrelevant regions. We evaluated the proposed approach on the publicly available SIRST and IRSTD datasets, achieving superior performance over existing state-of-the-art methods by an average of 0.75% (variance 0.025) across all combined metrics over multiple runs.

[AI-44] Conditioned quantum-assisted deep generative surrogate for particle-calorimeter interactions

链接: https://arxiv.org/abs/2410.22870
作者: J. Quetzalcoatl Toledo-Marin,Sebastian Gonzalez,Hao Jia,Ian Lu,Deniz Sogutlu,Abhishek Abhishek,Colin Gay,Eric Paquet,Roger Melko,Geoffrey C. Fox,Maximilian Swiatlowski,Wojciech Fedorko
关键词-EN: Large Hadron Collider, ATLAS and CMS, Hadron Collider, Large Hadron, enable exquisite measurements
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph); Computational Physics (physics.comp-ph); Instrumentation and Detectors (physics.ins-det)
*备注: 26 pages, 10 figures, 8 appendices

点击查看摘要

Abstract:Particle collisions at accelerators such as the Large Hadron Collider, recorded and analyzed by experiments such as ATLAS and CMS, enable exquisite measurements of the Standard Model and searches for new phenomena. Simulations of collision events at these detectors have played a pivotal role in shaping the design of future experiments and analyzing ongoing ones. However, the quest for accuracy in Large Hadron Collider (LHC) collisions comes at an imposing computational cost, with projections estimating the need for millions of CPU-years annually during the High Luminosity LHC (HL-LHC) run. Simulating a single LHC event with Geant4 currently devours around 1000 CPU seconds, with simulations of the calorimeter subdetectors in particular imposing substantial computational demands. To address this challenge, we propose a conditioned quantum-assisted deep generative model. Our model integrates a conditioned variational autoencoder (VAE) on the exterior with a conditioned Restricted Boltzmann Machine (RBM) in the latent space, providing enhanced expressiveness compared to conventional VAEs. The RBM nodes and connections are meticulously engineered to enable the use of qubits and couplers on D-Wave’s Pegasus-structured Advantage quantum annealer (QA) for sampling. We introduce a novel method for conditioning the quantum-assisted RBM using flux biases. We further propose a novel adaptive mapping to estimate the effective inverse temperature in quantum annealers. The effectiveness of our framework is illustrated using Dataset 2 of the CaloChallenge.

[AI-45] HijackRAG: Hijacking Attacks against Retrieval-Augmented Large Language Models

链接: https://arxiv.org/abs/2410.22832
作者: Yucheng Zhang,Qinfeng Li,Tianyu Du,Xuhong Zhang,Xinkui Zhao,Zhengwen Feng,Jianwei Yin
关键词-EN: enhance large language, Retrieval-Augmented Generation, systems enhance large, integrating external knowledge, large language models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge, making them adaptable and cost-effective for various applications. However, the growing reliance on these systems also introduces potential security risks. In this work, we reveal a novel vulnerability, the retrieval prompt hijack attack (HijackRAG), which enables attackers to manipulate the retrieval mechanisms of RAG systems by injecting malicious texts into the knowledge database. When the RAG system encounters target questions, it generates the attacker’s pre-determined answers instead of the correct ones, undermining the integrity and trustworthiness of the system. We formalize HijackRAG as an optimization problem and propose both black-box and white-box attack strategies tailored to different levels of the attacker’s knowledge. Extensive experiments on multiple benchmark datasets show that HijackRAG consistently achieves high attack success rates, outperforming existing baseline attacks. Furthermore, we demonstrate that the attack is transferable across different retriever models, underscoring the widespread risk it poses to RAG systems. Lastly, our exploration of various defense mechanisms reveals that they are insufficient to counter HijackRAG, emphasizing the urgent need for more robust security measures to protect RAG systems in real-world deployments.

[AI-46] owards Robust and Efficient Federated Low-Rank Adaptation with Heterogeneous Clients

链接: https://arxiv.org/abs/2410.22815
作者: Jabin Koo,Minwoo Jang,Jungseul Ok
关键词-EN: Large Language Models, large model updates, transmitting large model, Large Language, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated fine-tuning for Large Language Models (LLMs) has recently gained attention due to the heavy communication overhead of transmitting large model updates. Low Rank Adaptation (LoRA) has been proposed as a solution, yet its application in federated learning is complicated by discordance in aggregation. Existing methods addressing this discordance often suffer from performance degradation at low ranks in heterogeneous data settings. In response, we introduce LoRA-A2 (Low Rank Adaptation with Alternating freeze and Adaptive rank selection), which demonstrates robustness in challenging settings with low ranks and high data heterogeneity. Our experimental findings reveal that LoRA-A2 maintains performance even under extreme heterogeneity and low rank conditions, achieving up to a 99.8% reduction in uploaded parameters compared to full fine-tuning without compromising performance. This adaptive mechanism boosts robustness and communication efficiency in federated fine-tuning, enabling the practical deployment of LLMs in resource-constrained environments.

[AI-47] Universality of the π²/6 Pathway in Avoiding Model Collapse

链接: https://arxiv.org/abs/2410.22812
作者: Apratim Dey,David Donoho
关键词-EN: machine learning recently, learning recently spotlighted, so-called Model Collapse, augment workflow, real data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 30 pages

点击查看摘要

Abstract:Researchers in empirical machine learning recently spotlighted their fears of so-called Model Collapse. They imagined a discard workflow, where an initial generative model is trained with real data, after which the real data are discarded, and subsequently, the model generates synthetic data on which a new model is trained. They came to the conclusion that models degenerate as model-fitting generations proceed. However, other researchers considered an augment workflow, where the original real data continue to be used in each generation of training, augmented by synthetic data from models fit in all earlier generations. Empirical results on canonical datasets and learning procedures confirmed the occurrence of model collapse under the discard workflow and avoidance of model collapse under the augment workflow. Under the augment workflow, theoretical evidence also confirmed avoidance in particular instances; specifically, Gerstgrasser et al. (2024) found that for classical Linear Regression, test risk at any later generation is bounded by a moderate multiple, viz. π²/6, of the test risk of training with the original real data alone. Some commentators questioned the generality of theoretical conclusions based on the generative model assumed in Gerstgrasser et al. (2024): could similar conclusions be reached for other task/model pairings? In this work, we demonstrate the universality of the π²/6 augment risk bound across a large family of canonical statistical models, offering key insights into exactly why collapse happens under the discard workflow and is avoided under the augment workflow. In the process, we provide a framework that is able to accommodate a large variety of workflows (beyond discard and augment), thereby enabling an experimenter to judge the comparative merits of multiple different workflows by simulating a simple Gaussian process.
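π²/6 这个界来自级数 Σ 1/k² 的收敛:augment 工作流下测试风险的放大系数是该级数的部分和,有界;而 discard 工作流对应类似调和级数的无界增长。下面的数值草图直观对比二者(以调和级数作 discard 的代理是我们的简化,并非论文的精确结论):

```python
import math

def augment_multiplier(n_generations):
    """Partial sum of sum_{k>=1} 1/k^2: under the augment workflow the
    test-risk inflation after n generations stays below its limit pi^2/6."""
    return sum(1.0 / k ** 2 for k in range(1, n_generations + 1))

def discard_proxy(n_generations):
    """Harmonic partial sum, a stand-in for the unbounded error
    accumulation of the discard workflow in this toy comparison."""
    return sum(1.0 / k for k in range(1, n_generations + 1))
```

即便迭代十万代,augment 的放大系数仍被 π²/6 ≈ 1.645 压住,而调和和早已突破 10 并将继续对数增长。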

[AI-48] Causality-Enhanced Behavior Sequence Modeling in LLM s for Personalized Recommendation

链接: https://arxiv.org/abs/2410.22809
作者: Yang Zhang,Juntao You,Yimeng Bai,Jizhi Zhang,Keqin Bao,Wenjie Wang,Tat-Seng Chua
关键词-EN: leveraging Large Language, Large Language Models, yielding promising outcomes, Large Language, Recent advancements
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in recommender systems have focused on leveraging Large Language Models (LLMs) to improve user preference modeling, yielding promising outcomes. However, current LLM-based approaches struggle to fully leverage user behavior sequences, resulting in suboptimal preference modeling for personalized recommendations. In this study, we propose a novel Counterfactual Fine-Tuning (CFT) method to address this issue by explicitly emphasizing the role of behavior sequences when generating recommendations. Specifically, we employ counterfactual reasoning to identify the causal effects of behavior sequences on model output and introduce a task that directly fits the ground-truth labels based on these effects, achieving the goal of explicit emphasis. Additionally, we develop a token-level weighting mechanism to adjust the emphasis strength for different item tokens, reflecting the diminishing influence of behavior sequences from earlier to later tokens during predicting an item. Extensive experiments on real-world datasets demonstrate that CFT effectively improves behavior sequence modeling. Our codes are available at this https URL.
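"token 级权重随物品 token 位置递减"的机制可以用一个假设性的几何衰减方案来示意(衰减形式与速率均为说明性假设,并非论文的精确实现):

```python
import numpy as np

def token_weights(n_tokens, decay=0.8):
    """Geometrically decaying token-level loss weights: earlier item tokens,
    which depend most on the behavior sequence, receive larger weights than
    later ones. Weights are normalized so they average to 1, keeping the
    overall loss scale unchanged."""
    w = decay ** np.arange(n_tokens)
    return w * n_tokens / w.sum()

w = token_weights(5)  # weights for a 5-token item identifier
```

把这些权重乘到各 token 的交叉熵损失上,就实现了"前重后轻"的强调:越靠前的 token,其预测越依赖行为序列,训练信号也越强。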

[AI-49] Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising

链接: https://arxiv.org/abs/2410.22805
作者: Yoto Fujita,Aditya Arie Nugraha,Diego Di Carlo,Yoshiaki Bando,Mathieu Fontaine,Kazuyoshi Yoshii
关键词-EN: automatic speech recognition, paper describes speech, realtime automatic speech, real environments, describes speech enhancement
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to APSIPA2024

点击查看摘要

Abstract:This paper describes speech enhancement for real-time automatic speech recognition (ASR) in real environments. A standard approach to this task is to use neural beamforming that can work efficiently in an online manner. It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming. The performance of such a supervised approach, however, is drastically degraded under mismatched conditions. This calls for run-time adaptation of the DNN. Although the ground-truth speech spectrogram required for adaptation is not available at run time, blind dereverberation and separation methods such as weighted prediction error (WPE) and fast multichannel nonnegative matrix factorization (FastMNMF) can be used for generating pseudo ground-truth data from a mixture. Based on this idea, prior work proposed a dual-process system based on a cascade of WPE and minimum variance distortionless response (MVDR) beamforming asynchronously fine-tuned by block-online FastMNMF. To integrate the dereverberation capability into neural beamforming and make it fine-tunable at run time, we propose to use weighted power minimization distortionless response (WPD) beamforming, a unified version of WPE and minimum power distortionless response (MPDR), whose joint dereverberation and denoising filter is estimated using a DNN. We evaluated the impact of run-time adaptation under various conditions with different numbers of speakers, reverberation times, and signal-to-noise ratios (SNRs).
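
The WPD beamformer discussed above unifies WPE with MPDR; as background, here is a minimal NumPy sketch of the core MPDR solve, which minimizes output power subject to the distortionless constraint w^H d = 1. The weighted, dereverberating covariance that distinguishes WPD, and the DNN mask estimation, are omitted; the steering vector below is purely illustrative.

```python
import numpy as np

def mpdr_beamformer(R, d):
    """Minimum power distortionless response (MPDR) filter:
    w = R^{-1} d / (d^H R^{-1} d), so that w^H d = 1 while the
    output power w^H R w is minimized."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

# toy example: 2 microphones with an identity spatial covariance
R = np.eye(2, dtype=complex)
d = np.array([1.0, 1.0j])  # assumed steering vector toward the target
w = mpdr_beamformer(R, d)
```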

[AI-50] DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection

链接: https://arxiv.org/abs/2410.22803
作者: Yoto Fujita,Yoshiaki Bando,Keisuke Imoto,Masaki Onishi,Kazuyoshi Yoshii
关键词-EN: paper describes sound, sound event localization, describes sound event, FOA data annotated, localization and detection
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: Accepted to APSIPA2023

点击查看摘要

Abstract:This paper describes sound event localization and detection (SELD) for spatial audio recordings captured by first-order ambisonics (FOA) microphones. In this task, one may train a deep neural network (DNN) using FOA data annotated with the classes and directions of arrival (DOAs) of sound events. However, the performance of this approach is severely bounded by the amount of annotated data. To overcome this limitation, we propose a novel method of pretraining the feature extraction part of the DNN in a self-supervised manner. We use spatial audio-visual recordings abundantly available as virtual reality contents. Assuming that sound objects are concurrently observed by the FOA microphones and the omni-directional camera, we jointly train audio and visual encoders with contrastive learning such that the audio and visual embeddings of the same recording and DOA are made close. A key feature of our method is that the DOA-wise audio embeddings are jointly extracted from the raw audio data, while the DOA-wise visual embeddings are separately extracted from the local visual crops centered on the corresponding DOA. This encourages the latent features of the audio encoder to represent both the classes and DOAs of sound events. An experiment using the 20-hour DCASE2022 Task 3 dataset shows that pretraining on 100 hours of non-annotated audio-visual recordings reduced the error score of SELD from 36.4 pts to 34.9 pts.
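
The contrastive objective that pulls matching audio and visual embeddings together can be illustrated with a generic InfoNCE-style loss; the paper's exact objective and temperature are not specified here, so treat this as a sketch under assumed conventions (rows are paired embeddings, audio-to-visual direction only).

```python
import numpy as np

def info_nce(audio, visual, tau=0.1):
    """InfoNCE-style contrastive loss. Row i of `audio` and row i of
    `visual` form a positive pair, e.g. the DOA-wise audio embedding and
    the visual-crop embedding of the same recording and DOA."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    logits = (a @ v.T) / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_p)))  # cross-entropy on matching pairs
```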

[AI-51] Dual Contrastive Transformer for Hierarchical Preference Modeling in Sequential Recommendation

链接: https://arxiv.org/abs/2410.22790
作者: Chengkai Huang,Shoujin Wang,Xianzhi Wang,Lina Yao
关键词-EN: Sequential recommender systems, recommender systems, user-item interactions, high-level preference, comprehensively modeling users’
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sequential recommender systems (SRSs) aim to predict the subsequent items which may interest users via comprehensively modeling users’ complex preference embedded in the sequence of user-item interactions. However, most existing SRSs often model users’ single low-level preference based on item ID information while ignoring the high-level preference revealed by item attribute information, such as item category. Furthermore, they often utilize limited sequence context information to predict the next item while overlooking richer inter-item semantic relations. To this end, in this paper, we propose a novel hierarchical preference modeling framework to substantially model the complex low- and high-level preference dynamics for accurate sequential recommendation. Specifically, in the framework, a novel dual-transformer module and a novel dual contrastive learning scheme have been designed to discriminatively learn users’ low- and high-level preferences and to effectively enhance both low- and high-level preference learning respectively. In addition, a novel semantics-enhanced context embedding module has been devised to generate more informative context embedding for further improving the recommendation performance. Extensive experiments on six real-world datasets have demonstrated both the superiority of our proposed method over the state-of-the-art ones and the rationality of our design.

[AI-52] Contrastive Learning and Adversarial Disentanglement for Privacy-Preserving Task-Oriented Semantic Communications

链接: https://arxiv.org/abs/2410.22784
作者: Omar Erak,Omar Alhussein,Wen Tong
关键词-EN: intelligent data transmission, semantic communication systems, data transmission, systems have emerged, promising approach
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Image and Video Processing (eess.IV)
*备注: Submitted to IEEE Journal on Selected Areas in Communications (JSAC): Intelligent Communications for Real-Time Computer Vision (Comm4CV)

点击查看摘要

Abstract:Task-oriented semantic communication systems have emerged as a promising approach to achieving efficient and intelligent data transmission, where only information relevant to a specific task is communicated. However, existing methods struggle to fully disentangle task-relevant and task-irrelevant information, leading to privacy concerns and subpar performance. To address this, we propose an information-bottleneck method, named CLAD (contrastive learning and adversarial disentanglement). CLAD leverages contrastive learning to effectively capture task-relevant features while employing adversarial disentanglement to discard task-irrelevant information. Additionally, due to the lack of reliable and reproducible methods to gain insight into the informativeness and minimality of the encoded feature vectors, we introduce a new technique to compute the information retention index (IRI), a comparative metric used as a proxy for the mutual information between the encoded features and the input, reflecting the minimality of the encoded features. The IRI quantifies the minimality and informativeness of the encoded feature vectors across different task-oriented communication techniques. Our extensive experiments demonstrate that CLAD outperforms state-of-the-art baselines in terms of task performance, privacy preservation, and IRI. CLAD achieves a predictive performance improvement of around 2.5-3%, along with a 77-90% reduction in IRI and a 57-76% decrease in adversarial accuracy.

[AI-53] Reliability Assessment of Information Sources Based on Random Permutation Set

链接: https://arxiv.org/abs/2410.22772
作者: Juntao Xu,Tianxiang Zhan,Yong Deng
关键词-EN: significantly affects decision-making, Random Permutation Set, RPS, DST, significantly affects
类目: Artificial Intelligence (cs.AI)
*备注: 10 pages

点击查看摘要

Abstract:In pattern recognition, handling uncertainty is a critical challenge that significantly affects decision-making and classification accuracy. Dempster-Shafer Theory (DST) is an effective reasoning framework for addressing uncertainty, and the Random Permutation Set (RPS) extends DST by additionally considering the internal order of elements, forming a more ordered extension of DST. However, there is a lack of a transformation method based on permutation order between RPS and DST, as well as a sequence-based probability transformation method for RPS. Moreover, the reliability of RPS sources remains an issue that requires attention. To address these challenges, this paper proposes an RPS transformation approach and a probability transformation method tailored for RPS. On this basis, a reliability computation method for RPS sources, based on the RPS probability transformation, is introduced and applied to pattern recognition. Experimental results demonstrate that the proposed approach effectively bridges the gap between DST and RPS and achieves superior recognition accuracy in classification problems.

[AI-54] Self-Driving Car Racing: Application of Deep Reinforcement Learning

链接: https://arxiv.org/abs/2410.22766
作者: Florentiana Yuwono,Gan Pang Yen,Jason Christopher
关键词-EN: self-driving car racing, autonomous self-driving car, deep reinforcement learning, Proximal Policy Optimization, paper explores
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper explores the application of deep reinforcement learning (RL) techniques in the domain of autonomous self-driving car racing. Motivated by the rise of AI-driven mobility and autonomous racing events, the project aims to develop an AI agent that efficiently drives a simulated car in the OpenAI Gymnasium CarRacing environment. We investigate various RL algorithms, including Deep Q-Network (DQN), Proximal Policy Optimization (PPO), and novel adaptations that incorporate transfer learning and recurrent neural networks (RNNs) for enhanced performance. The project demonstrates that while DQN provides a strong baseline for policy learning, integrating ResNet and LSTM models significantly improves the agent’s ability to capture complex spatial and temporal dynamics. PPO, particularly in continuous action spaces, shows promising results for fine control, although challenges such as policy collapse remain. We compare the performance of these approaches and outline future research directions focused on improving computational efficiency and addressing model stability. Our findings contribute to the ongoing development of AI systems in autonomous driving and related control tasks.
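
For reference, the bootstrap target used by the DQN baseline above, r + γ·max Q(s′,·) with the bootstrap term zeroed on terminal transitions, can be sketched as follows (the discount factor is an assumed value):

```python
import numpy as np

def dqn_targets(rewards, next_q, dones, gamma=0.99):
    """Standard DQN target: r + gamma * max_a' Q(s', a'),
    with the bootstrap term dropped when the episode terminated."""
    return rewards + gamma * next_q.max(axis=1) * (1.0 - dones)
```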

[AI-55] SoftCTRL: Soft conservative KL-control of Transformer Reinforcement Learning for Autonomous Driving

链接: https://arxiv.org/abs/2410.22752
作者: Minh Tri Huynh,Duc Dung Nguyen
关键词-EN: popular problem due, urban self-driving cars, recent years, motion planning, self-driving cars
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: submitted to IEEE Open Journal of Intelligent Transportation Systems

点击查看摘要

Abstract:In recent years, motion planning for urban self-driving cars (SDV) has become a popular problem due to its complex interaction of road components. To tackle this, many methods have relied on large-scale, human-sampled data processed through Imitation learning (IL). Although effective, IL alone cannot adequately handle safety and reliability concerns. Combining IL with Reinforcement learning (RL) by adding KL divergence between RL and IL policy to the RL loss can alleviate IL’s weakness but suffer from over-conservation caused by covariate shift of IL. To address this limitation, we introduce a method that combines IL with RL using an implicit entropy-KL control that offers a simple way to reduce the over-conservation characteristic. In particular, we validate different challenging simulated urban scenarios from the unseen dataset, indicating that although IL can perform well in imitation tasks, our proposed method significantly improves robustness (over 17% reduction in failures) and generates human-like driving behavior.
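
The standard IL+RL combination this paper starts from adds a KL term between the RL and IL policies to the RL objective. A toy version for a discrete action space is sketched below; β is an assumed coefficient, and the paper's implicit entropy-KL control modifies this form to reduce over-conservation.

```python
import numpy as np

def kl_regularized_objective(q_values, pi_rl, pi_il, beta=0.1):
    """Expected return under the RL policy minus beta * KL(pi_RL || pi_IL).
    Over-conservation arises when the KL term keeps pi_RL too close to pi_IL."""
    expected_return = np.dot(pi_rl, q_values)
    kl = np.sum(pi_rl * np.log(pi_rl / pi_il))
    return expected_return - beta * kl
```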

[AI-56] Designing AI Personalities: Enhancing Human-Agent Interaction Through Thoughtful Persona Design

链接: https://arxiv.org/abs/2410.22744
作者: Nima Zargham,Mateusz Dubiel,Smit Desai,Thomas Mildner,Hanz-Joachim Belz
关键词-EN: rapidly evolving field, shaping user experience, artificial intelligence, rapidly evolving, evolving field
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 8 pages, the workshop accepted at the 23rd International Conference on Mobile and Ubiquitous Multimedia (MUM 2024)

点击查看摘要

Abstract:In the rapidly evolving field of artificial intelligence (AI) agents, designing the agent’s characteristics is crucial for shaping user experience. This workshop aims to establish a research community focused on AI agent persona design for various contexts, such as in-car assistants, educational tools, and smart home environments. We will explore critical aspects of persona design, such as voice, embodiment, and demographics, and their impact on user satisfaction and engagement. Through discussions and hands-on activities, we aim to propose practices and standards that enhance the ecological validity of agent personas. Topics include the design of conversational interfaces, the influence of agent personas on user experience, and approaches for creating contextually appropriate AI agents. This workshop will provide a platform for building a community dedicated to developing AI agent personas that better fit diverse, everyday interactions.

[AI-57] Offline Behavior Distillation NEURIPS2024

链接: https://arxiv.org/abs/2410.22728
作者: Shiye Lei,Sen Zhang,Dacheng Tao
关键词-EN: Massive reinforcement learning, Massive reinforcement, large data volume, training inefficiencies, typically collected
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Massive reinforcement learning (RL) data are typically collected to train policies offline without the need for interactions, but the large data volume can cause training inefficiencies. To tackle this issue, we formulate offline behavior distillation (OBD), which synthesizes limited expert behavioral data from sub-optimal RL data, enabling rapid policy learning. We propose two naive OBD objectives, DBC and PBC, which measure distillation performance via the decision difference between policies trained on distilled data and either offline data or a near-expert policy. Due to intractable bi-level optimization, the OBD objective is difficult to minimize to small values, which deteriorates PBC by its distillation performance guarantee with quadratic discount complexity $\mathcal{O}(1/(1-\gamma)^2)$. We theoretically establish the equivalence between the policy performance and action-value weighted decision difference, and introduce action-value weighted PBC (Av-PBC) as a more effective OBD objective. By optimizing the weighted decision difference, Av-PBC achieves a superior distillation guarantee with linear discount complexity $\mathcal{O}(1/(1-\gamma))$. Extensive experiments on multiple D4RL datasets reveal that Av-PBC offers significant improvements in OBD performance, fast distillation convergence speed, and robust cross-architecture/optimizer generalization.

[AI-58] Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization

链接: https://arxiv.org/abs/2410.22707
作者: Kento Kawaharazuka,Yoshiki Obinata,Naoaki Kanazawa,Kei Okada,Masayuki Inaba
关键词-EN: daily life support, State recognition, perform daily life, State, environment and objects
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at Humanoids2024

点击查看摘要

Abstract:State recognition of the environment and objects, such as the open/closed state of doors and the on/off of lights, is indispensable for robots that perform daily life support and security tasks. Until now, state recognition methods have been based on training neural networks from manual annotations, preparing special sensors for the recognition, or manually programming to extract features from point clouds or raw images. In contrast, we propose a robotic state recognition method using a pre-trained vision-language model, which is capable of Image-to-Text Retrieval (ITR) tasks. We prepare several kinds of language prompts in advance, calculate the similarity between these prompts and the current image by ITR, and perform state recognition. By applying the optimal weighting to each prompt using black-box optimization, state recognition can be performed with higher accuracy. Experiments show that this theory enables a variety of state recognitions by simply preparing multiple prompts without retraining neural networks or manual programming. In addition, since only prompts and their weights need to be prepared for each recognizer, there is no need to prepare multiple models, which facilitates resource management. It is possible to recognize the open/closed state of transparent doors, the state of whether water is running or not from a faucet, and even the qualitative state of whether a kitchen is clean or not, which have been challenging so far, through language.
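
The weighted prompt-similarity scoring can be pictured as follows: cosine similarities between the image embedding and each prompt embedding (from a CLIP-like vision-language model) are scaled by weights found via black-box optimization, and the best-scoring prompt names the state. All names, and the one-prompt-per-state simplification, are illustrative assumptions.

```python
import numpy as np

def recognize_state(image_emb, prompt_embs, weights):
    """Return the index of the state prompt (e.g. "an open door" vs.
    "a closed door") with the highest weighted cosine similarity."""
    img = image_emb / np.linalg.norm(image_emb)
    scores = [w * float(np.dot(img, p / np.linalg.norm(p)))
              for w, p in zip(weights, prompt_embs)]
    return int(np.argmax(scores))
```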

[AI-59] Permutation Invariant Learning with High-Dimensional Particle Filters

链接: https://arxiv.org/abs/2410.22695
作者: Akhilan Boopathy,Aneesh Muppidi,Peggy Yang,Abhiram Iyer,William Yue,Ila Fiete
关键词-EN: training data impacts, loss of plasticity, largely due, suffers from challenges, permutation dependence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Website: this https URL

点击查看摘要

Abstract:Sequential learning in deep models often suffers from challenges such as catastrophic forgetting and loss of plasticity, largely due to the permutation dependence of gradient-based algorithms, where the order of training data impacts the learning outcome. In this work, we introduce a novel permutation-invariant learning framework based on high-dimensional particle filters. We theoretically demonstrate that particle filters are invariant to the sequential ordering of training minibatches or tasks, offering a principled solution to mitigate catastrophic forgetting and loss-of-plasticity. We develop an efficient particle filter for optimizing high-dimensional models, combining the strengths of Bayesian methods with gradient-based optimization. Through extensive experiments on continual supervised and reinforcement learning benchmarks, including SplitMNIST, SplitCIFAR100, and ProcGen, we empirically show that our method consistently improves performance, while reducing variance compared to standard baselines.
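
The permutation invariance follows from Bayes' rule: the posterior multiplies minibatch likelihoods, and multiplication is order-independent. A minimal reweight-and-resample step (not the paper's high-dimensional variant) might look like:

```python
import numpy as np

def particle_filter_step(particles, log_weights, log_likelihood, rng):
    """Reweight particles by the minibatch log-likelihood, then resample.
    Because the posterior is a product of likelihoods, the result does not
    depend on the order in which minibatches arrive."""
    log_weights = log_weights + log_likelihood(particles)
    log_weights -= log_weights.max()          # numerical stability
    w = np.exp(log_weights)
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], np.zeros(len(particles))  # weights reset after resampling
```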

[AI-60] Choice between Partial Trajectories

链接: https://arxiv.org/abs/2410.22690
作者: Henrik Marklund,Benjamin Van Roy
关键词-EN: bootstrapped return, bootstrapped return model, return, generate increasingly sophisticated, bootstrapped
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As AI agents generate increasingly sophisticated behaviors, manually encoding human preferences to guide these agents becomes more challenging. To address this, it has been suggested that agents instead learn preferences from human choice data. This approach requires a model of choice behavior that the agent can use to interpret the data. For choices between partial trajectories of states and actions, previous models assume choice probabilities to be determined by the partial return or the cumulative advantage. We consider an alternative model based instead on the bootstrapped return, which adds to the partial return an estimate of the future return. Benefits of the bootstrapped return model stem from its treatment of human beliefs. Unlike partial return, choices based on bootstrapped return reflect human beliefs about the environment. Further, while recovering the reward function from choices based on cumulative advantage requires that those beliefs are correct, doing so from choices based on bootstrapped return does not. To motivate the bootstrapped return model, we formulate axioms and prove an Alignment Theorem. This result formalizes how, for a general class of human preferences, such models are able to disentangle goals from beliefs. This ensures recovery of an aligned reward function when learning from choices based on bootstrapped return. The bootstrapped return model also affords greater robustness to choice behavior. Even when choices are based on partial return, learning via a bootstrapped return model recovers an aligned reward function. The same holds with choices based on the cumulative advantage if the human and the agent both adhere to correct and consistent beliefs about the environment. On the other hand, if choices are based on bootstrapped return, learning via partial return or cumulative advantage models does not generally produce an aligned reward function. 
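
The bootstrapped return itself is a standard quantity, the partial return plus a discounted value estimate at the segment's final state, and a logistic (Bradley-Terry style) choice model over it can be sketched as below; the paper's axiomatic treatment is of course richer than this.

```python
import numpy as np

def bootstrapped_return(rewards, value_last, gamma=0.99):
    """Partial return of a trajectory segment plus the discounted value
    estimate of its final state (the 'estimate of the future return')."""
    partial = sum(gamma ** t * r for t, r in enumerate(rewards))
    return partial + gamma ** len(rewards) * value_last

def choice_prob(ret_a, ret_b):
    """Logistic choice probability of preferring trajectory A over B."""
    return 1.0 / (1.0 + np.exp(ret_b - ret_a))
```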

[AI-61] Multi-Task Interactive Robot Fleet Learning with Visual World Models

链接: https://arxiv.org/abs/2410.22689
作者: Huihan Liu,Yu Zhang,Vaarij Betala,Evan Zhang,James Liu,Crystal Ding,Yuke Zhu
关键词-EN: Recent advancements, perform diverse tasks, industrial settings, deploying robot fleets, offer the potential
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: In Proceedings of CoRL 2024

点击查看摘要

Abstract:Recent advancements in large-scale multi-task robot learning offer the potential for deploying robot fleets in household and industrial settings, enabling them to perform diverse tasks across various environments. However, AI-enabled robots often face challenges with generalization and robustness when exposed to real-world variability and uncertainty. We introduce Sirius-Fleet, a multi-task interactive robot fleet learning framework to address these challenges. Sirius-Fleet monitors robot performance during deployment and involves humans to correct the robot’s actions when necessary. We employ a visual world model to predict the outcomes of future actions and build anomaly predictors to predict whether they will likely result in anomalies. As the robot autonomy improves, the anomaly predictors automatically adapt their prediction criteria, leading to fewer requests for human intervention and gradually reducing human workload over time. Evaluations on large-scale benchmarks demonstrate Sirius-Fleet’s effectiveness in improving multi-task policy performance and monitoring accuracy. We demonstrate Sirius-Fleet’s performance in both RoboCasa in simulation and Mutex in the real world, two diverse, large-scale multi-task benchmarks. More information is available on the project website: this https URL

[AI-62] Backdoor Attack Against Vision Transformers via Attention Gradient-Based Image Erosion

链接: https://arxiv.org/abs/2410.22678
作者: Ji Guo,Hongwei Li,Wenbo Jiang,Guoming Lu
关键词-EN: Convolutional Neural Networks, traditional Convolutional Neural, outperformed traditional Convolutional, Neural Networks, Convolutional Neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by IEEE GLOBECOM 2024

点击查看摘要

Abstract:Vision Transformers (ViTs) have outperformed traditional Convolutional Neural Networks (CNN) across various computer vision tasks. However, akin to CNN, ViTs are vulnerable to backdoor attacks, where the adversary embeds the backdoor into the victim model, causing it to make wrong predictions about testing samples containing a specific trigger. Existing backdoor attacks against ViTs have the limitation of failing to strike an optimal balance between attack stealthiness and attack effectiveness. In this work, we propose an Attention Gradient-based Erosion Backdoor (AGEB) targeted at ViTs. Considering the attention mechanism of ViTs, AGEB selectively erodes pixels in areas of maximal attention gradient, embedding a covert backdoor trigger. Unlike previous backdoor attacks against ViTs, AGEB achieves an optimal balance between attack stealthiness and attack effectiveness, ensuring the trigger remains invisible to human detection while preserving the model’s accuracy on clean samples. Extensive experimental evaluations across various ViT architectures and datasets confirm the effectiveness of AGEB, achieving a remarkable Attack Success Rate (ASR) without diminishing Clean Data Accuracy (CDA). Furthermore, the stealthiness of AGEB is rigorously validated, demonstrating minimal visual discrepancies between the clean and the triggered images.
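
A toy rendering of the selective-erosion idea: take the k pixels with the largest attention gradient and attenuate them slightly to form a near-invisible trigger. The paper's erosion operator and attention-gradient computation are more involved; k, strength, and the function name here are assumptions.

```python
import numpy as np

def erode_top_gradient_pixels(image, attn_grad, k, strength=0.9):
    """Attenuate ('erode') the k pixels with the largest attention
    gradient, leaving the rest of the image untouched."""
    idx = np.argpartition(attn_grad.ravel(), -k)[-k:]  # k largest gradients
    triggered = image.copy().ravel()
    triggered[idx] *= strength
    return triggered.reshape(image.shape)
```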

[AI-63] A Walsh Hadamard Derived Linear Vector Symbolic Architecture NEURIPS2024

链接: https://arxiv.org/abs/2410.22669
作者: Mohammad Mahmudul Alam,Alexander Oberle,Edward Raff,Stella Biderman,Tim Oates,James Holt
关键词-EN: Vector Symbolic Architectures, Symbolic Architectures, developing Neuro-symbolic, Vector Symbolic, approach to developing
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: To appear in the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Vector Symbolic Architectures (VSAs) are one approach to developing Neuro-symbolic AI, where two vectors in $\mathbb{R}^d$ are 'bound' together to produce a new vector in the same space. VSAs support the commutativity and associativity of this binding operation, along with an inverse operation, allowing one to construct symbolic-style manipulations over real-valued vectors. Most VSAs were developed before deep learning and automatic differentiation became popular and instead focused on efficacy in hand-designed systems. In this work, we introduce the Hadamard-derived linear Binding (HLB), which is designed to have favorable computational efficiency, and efficacy in classic VSA tasks, and perform well in differentiable systems. Code is available at this https URL
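
HLB's exact operator is defined in the paper; the sketch below instead uses the simplest linear binding, an elementwise product over ±1 vectors as in the MAP architecture, to illustrate the algebraic properties at stake: binding is commutative, associative, and invertible (here, each vector is its own inverse).

```python
import numpy as np

def bind(x, y):
    """Elementwise binding. For +/-1 vectors this is commutative,
    associative, and each vector is its own inverse."""
    return x * y

rng = np.random.default_rng(0)
d = 64
x = rng.choice([-1.0, 1.0], size=d)
y = rng.choice([-1.0, 1.0], size=d)
b = bind(x, y)            # bound representation
recovered = bind(b, y)    # unbind by binding with y again
```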

[AI-64] EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents

链接: https://arxiv.org/abs/2410.22662
作者: Junting Chen,Checheng Yu,Xunzhe Zhou,Tianqi Xu,Yao Mu,Mengkang Hu,Wenqi Shao,Yikai Wang,Guohao Li,Lin Shao
关键词-EN: tackling complex tasks, tackling complex, Heterogeneous multi-robot systems, HMRS, multi-robot system
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 10 pages of main content, 3 pages of references, 5 pages of appendix, 7 figures in total

点击查看摘要

Abstract:Heterogeneous multi-robot systems (HMRS) have emerged as a powerful approach for tackling complex tasks that single robots cannot manage alone. Current large-language-model-based multi-agent systems (LLM-based MAS) have shown success in areas like software development and operating systems, but applying these systems to robot control presents unique challenges. In particular, the capabilities of each agent in a multi-robot system are inherently tied to the physical composition of the robots, rather than predefined roles. To address this issue, we introduce a novel multi-agent framework designed to enable effective collaboration among heterogeneous robots with varying embodiments and capabilities, along with a new benchmark named Habitat-MAS. One of our key designs is *Robot Resume*: Instead of adopting human-designed role play, we propose a self-prompted approach, where agents comprehend robot URDF files and call robot kinematics tools to generate descriptions of their physics capabilities to guide their behavior in task planning and action execution. The Habitat-MAS benchmark is designed to assess how a multi-agent framework handles tasks that require embodiment-aware reasoning, which includes 1) manipulation, 2) perception, 3) navigation, and 4) comprehensive multi-floor object rearrangement. The experimental results indicate that the robot’s resume and the hierarchical design of our multi-agent system are essential for the effective operation of the heterogeneous multi-robot system within this intricate problem context.

[AI-65] Incremental Learning of Retrievable Skills For Efficient Continual Task Adaptation

链接: https://arxiv.org/abs/2410.22658
作者: Daehee Lee,Minjong Yoo,Woo Kyung Kim,Wonje Choi,Honguk Woo
关键词-EN: Continual Imitation Learning, Continual Imitation, Imitation Learning, involves extracting, multi-task policy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Continual Imitation Learning (CiL) involves extracting and accumulating task knowledge from demonstrations across multiple stages and tasks to achieve a multi-task policy. With recent advancements in foundation models, there has been a growing interest in adapter-based CiL approaches, where adapters are established parameter-efficiently for tasks newly demonstrated. While these approaches isolate parameters for specific tasks and tend to mitigate catastrophic forgetting, they limit knowledge sharing among different demonstrations. We introduce IsCiL, an adapter-based CiL framework that addresses this limitation of knowledge sharing by incrementally learning shareable skills from different demonstrations, thus enabling sample-efficient task adaptation using the skills particularly in non-stationary CiL environments. In IsCiL, demonstrations are mapped into the state embedding space, where proper skills can be retrieved upon input states through prototype-based memory. These retrievable skills are incrementally learned on their corresponding adapters. Our CiL experiments with complex tasks in Franka-Kitchen and Meta-World demonstrate robust performance of IsCiL in both task adaptation and sample-efficiency. We also show a simple extension of IsCiL for task unlearning scenarios.
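
The prototype-based retrieval step can be pictured as a nearest-prototype lookup in the state embedding space. The cosine-similarity sketch below is illustrative only, not IsCiL's actual memory mechanism.

```python
import numpy as np

def retrieve_skill(state_emb, prototypes):
    """Return the index of the skill whose prototype is most similar
    (by cosine similarity) to the current state embedding."""
    s = state_emb / np.linalg.norm(state_emb)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(P @ s))
```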

[AI-66] DECRL: A Deep Evolutionary Clustering Jointed Temporal Knowledge Graph Representation Learning Approach NEURIPS2024

链接: https://arxiv.org/abs/2410.22631
作者: Qian Chen,Ling Chen
关键词-EN: low-dimensional vector space, continuous low-dimensional vector, Temporal Knowledge Graph, representation learning aims, map temporal evolving
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by NeurIPS 2024, 17 pages, and 3 figures

点击查看摘要

Abstract:Temporal Knowledge Graph (TKG) representation learning aims to map temporal evolving entities and relations to embedded representations in a continuous low-dimensional vector space. However, existing approaches cannot capture the temporal evolution of high-order correlations in TKGs. To this end, we propose a Deep Evolutionary Clustering jointed temporal knowledge graph Representation Learning approach (DECRL). Specifically, a deep evolutionary clustering module is proposed to capture the temporal evolution of high-order correlations among entities. Furthermore, a cluster-aware unsupervised alignment mechanism is introduced to ensure the precise one-to-one alignment of soft overlapping clusters across timestamps, thereby maintaining the temporal smoothness of clusters. In addition, an implicit correlation encoder is introduced to capture latent correlations between any pair of clusters under the guidance of a global graph. Extensive experiments on seven real-world datasets demonstrate that DECRL achieves state-of-the-art performance, outperforming the best baseline by an average of 9.53%, 12.98%, 10.42%, and 14.68% in MRR, Hits@1, Hits@3, and Hits@10, respectively.

[AI-67] CoGS: Model Agnostic Causality Constrained Counterfactual Explanations using goal-directed ASP

链接: https://arxiv.org/abs/2410.22615
作者: Sopam Dasgupta,Joaquín Arias,Elmer Salazar,Gopal Gupta
关键词-EN: approvals and hiring, black boxes, obscuring their decision-making, decision-making processes, critical areas
类目: Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2407.08179

点击查看摘要

Abstract:Machine learning models are increasingly used in critical areas such as loan approvals and hiring, yet they often function as black boxes, obscuring their decision-making processes. Transparency is crucial, as individuals need explanations to understand decisions, particularly when the decisions result in an undesired outcome. Our work introduces CoGS (Counterfactual Generation with s(CASP)), a model-agnostic framework capable of generating counterfactual explanations for classification models. CoGS leverages the goal-directed Answer Set Programming system s(CASP) to compute realistic and causally consistent modifications to feature values, accounting for causal dependencies between them. By using rule-based machine learning algorithms (RBML), notably the FOLD-SE algorithm, CoGS extracts the underlying logic of a statistical model to generate counterfactual solutions. By tracing a step-by-step path from an undesired outcome to a desired one, CoGS offers interpretable and actionable explanations of the changes required to achieve the desired outcome. We present details of the CoGS framework along with its evaluation.

[AI-68] Are Large-Language Models Graph Algorithmic Reasoners?

链接: https://arxiv.org/abs/2410.22597
作者: Alexander K Taylor,Anthony Cuturrufo,Vishal Yathish,Mingyu Derek Ma,Wei Wang
关键词-EN: Large Language Models, current Large Language, facing current Large, Language Models, Large Language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 13 Figures

点击查看摘要

Abstract:We seek to address a core challenge facing current Large Language Models (LLMs). LLMs have demonstrated superior performance in many tasks, yet continue to struggle with reasoning problems on explicit graphs that require multiple steps. To address this gap, we introduce a novel benchmark designed to evaluate LLM performance on classical algorithmic reasoning tasks on explicit graphs. Our benchmark encompasses five fundamental algorithms: Breadth-First Search (BFS) and Depth-First Search (DFS) for connectivity, Dijkstra's algorithm and the Floyd-Warshall algorithm for shortest paths, and Prim's Minimum Spanning Tree (MST-Prim's) algorithm. Through extensive experimentation, we assess the capabilities of state-of-the-art LLMs in executing these algorithms step-by-step and systematically evaluate their performance at each stage. Our findings highlight the persistent challenges LLMs face in this domain and underscore the necessity for advanced prompting techniques and algorithmic instruction to enhance their graph reasoning abilities. This work presents MAGMA, the first comprehensive benchmark focused on LLMs completing classical graph algorithms, and provides a critical step toward understanding and improving their structured problem-solving skills.
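As a concrete illustration of the kind of step-by-step execution the benchmark probes, here is a plain BFS that records its visit order one step at a time. This is a generic sketch, not MAGMA's evaluation code; the graph and step granularity are assumptions.

```python
from collections import deque

def bfs_trace(adj, start):
    """Run BFS on an adjacency-list graph and record one node per 'step',
    the kind of trace a model would be asked to reproduce stage by stage."""
    visited, order = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)  # one step of the traversal
        for nbr in adj.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return order

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_trace(graph, 0))  # [0, 1, 2, 3]
```

A stage-wise evaluation would then compare a model's predicted trace against this reference order step by step rather than only checking the final answer.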

[AI-69] FGCE: Feasible Group Counterfactual Explanations for Auditing Fairness

链接: https://arxiv.org/abs/2410.22591
作者: Christos Fragkathoulas,Vasiliki Papanikou,Evaggelia Pitoura,Evimaria Terzi
关键词-EN: trustworthy machine learning, group counterfactual explanations, generating group counterfactual, counterfactual explanations, audit model fairness
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper introduces the first graph-based framework for generating group counterfactual explanations to audit model fairness, a crucial aspect of trustworthy machine learning. Counterfactual explanations are instrumental in understanding and mitigating unfairness by revealing how inputs should change to achieve a desired outcome. Our framework, named Feasible Group Counterfactual Explanations (FGCEs), captures real-world feasibility constraints and constructs subgroups with similar counterfactuals, setting it apart from existing methods. It also addresses key trade-offs in counterfactual generation, including the balance between the number of counterfactuals, their associated costs, and the breadth of coverage achieved. To evaluate these trade-offs and assess fairness, we propose measures tailored to group counterfactual generation. Our experimental results on benchmark datasets demonstrate the effectiveness of our approach in managing feasibility constraints and trade-offs, as well as the potential of our proposed metrics in identifying and quantifying fairness issues.

[AI-70] Energy-Aware Multi-Agent Reinforcement Learning for Collaborative Execution in Mission-Oriented Drone Networks

链接: https://arxiv.org/abs/2410.22578
作者: Ying Li,Changling Li,Jiyao Chen,Christine Roinou
关键词-EN: Mission-oriented drone networks, disaster monitoring, border surveillance, structural inspection, drone networks
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
*备注: 2022 International Conference on Computer Communications and Networks

点击查看摘要

Abstract:Mission-oriented drone networks have been widely used for structural inspection, disaster monitoring, border surveillance, etc. Due to the limited battery capacity of drones, mission execution strategy impacts network performance and mission completion. However, collaborative execution is a challenging problem for drones in such a dynamic environment as it also involves efficient trajectory design. We leverage multi-agent reinforcement learning (MARL) to manage the challenge in this study, letting each drone learn to collaboratively execute tasks and plan trajectories based on its current status and environment. Simulation results show that the proposed collaborative execution model can successfully complete the mission at least 80% of the time, regardless of task locations and lengths, and can even achieve a 100% success rate when the task density is not too sparse. To the best of our knowledge, our work is one of the pioneer studies on leveraging MARL for collaborative execution in mission-oriented drone networks; the unique value of this work lies in letting drone battery levels drive the model design.

[AI-71] Unpicking Data at the Seams: VAEs Disentanglement and Independent Components

链接: https://arxiv.org/abs/2410.22559
作者: Carl Allen
关键词-EN: identifying salient statistically, generative process underlying, Generative Adversarial Networks, synthetic data generation, statistically independent factors
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Disentanglement, or identifying salient statistically independent factors of the data, is of interest in many areas of machine learning and statistics, with relevance to synthetic data generation with controlled properties, robust classification of features, parsimonious encoding, and a greater understanding of the generative process underlying the data. Disentanglement arises in several generative paradigms, including Variational Autoencoders (VAEs), Generative Adversarial Networks and diffusion models. Particular progress has recently been made in understanding disentanglement in VAEs, where the choice of diagonal posterior covariance matrices is shown to promote mutual orthogonality between columns of the decoder’s Jacobian. We continue this thread to show how this linear independence translates to statistical independence, completing the chain in understanding how the VAE’s objective identifies independent components of, or disentangles, the data.
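The orthogonality property discussed above can be checked numerically. The sketch below uses a toy linear "decoder" whose Jacobian columns are orthogonal by construction (an assumption chosen for illustration, not a trained VAE), and measures the off-diagonal entries of the Jacobian's Gram matrix:

```python
import numpy as np

def jacobian(decoder, z, eps=1e-6):
    """Central finite-difference Jacobian of decoder at latent point z;
    column i is the sensitivity of the output to latent coordinate i."""
    z = np.asarray(z, dtype=float)
    cols = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        cols.append((decoder(z + dz) - decoder(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)

# Toy linear decoder with mutually orthogonal Jacobian columns.
W = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
decoder = lambda z: W @ z

J = jacobian(decoder, np.array([0.3, -0.7]))
gram = J.T @ J
off_diag = gram - np.diag(np.diag(gram))
print(np.allclose(off_diag, 0.0, atol=1e-6))  # columns are orthogonal
```

For a real VAE decoder one would evaluate the same Gram-matrix diagnostic at sampled latent points; the claim in the abstract is that diagonal posterior covariances push these off-diagonal terms toward zero.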

[AI-72] ML Research Benchmark

链接: https://arxiv.org/abs/2410.22553
作者: Matthew Kenney
关键词-EN: Artificial intelligence agents, Artificial intelligence, increasingly capable, capable of performing, research
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Artificial intelligence agents are increasingly capable of performing complex tasks across various domains. As these agents advance, there is a growing need to accurately measure and benchmark their capabilities, particularly in accelerating AI research and development. Current benchmarks focus on general machine learning tasks, but lack comprehensive evaluation methods for assessing AI agents' abilities in tackling research-level problems and competition-level challenges in the field of AI. We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks. These tasks span activities typically undertaken by AI researchers, including model training efficiency, pretraining on limited data, domain-specific fine-tuning, and model compression. This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o. The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models. However, both tested agents struggled to perform non-trivial research iterations. We observed significant performance variations across tasks, highlighting the complexity of AI development and the challenges in creating versatile agent scaffolds. While current AI agents can successfully navigate complex instructions and produce baseline results, they fall short of the capabilities required for advanced AI research. The ML Research Benchmark provides a valuable framework for assessing and comparing AI agents on tasks mirroring real-world AI research challenges.

[AI-73] From Silos to Systems: Process-Oriented Hazard Analysis for AI Systems

链接: https://arxiv.org/abs/2410.22526
作者: Shalaleh Rismani,Roel Dobbe,AJung Moon
关键词-EN: effectively address potential, address potential harms, effectively address, address potential, essential to identify
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:To effectively address potential harms from AI systems, it is essential to identify and mitigate system-level hazards. Current analysis approaches focus on individual components of an AI system, like training data or models, in isolation, overlooking hazards from component interactions or how they are situated within a company’s development process. To this end, we draw from the established field of system safety, which considers safety as an emergent property of the entire system, not just its components. In this work, we translate System Theoretic Process Analysis (STPA) - a recognized system safety framework - for analyzing AI operation and development processes. We focus on systems that rely on machine learning algorithms and conducted STPA on three case studies involving linear regression, reinforcement learning, and transformer-based generative models. Our analysis explored how STPA’s control and system-theoretic perspectives apply to AI systems and whether unique AI traits - such as model opacity, capability uncertainty, and output complexity - necessitate significant modifications to the framework. We find that the key concepts and steps of conducting an STPA readily apply, albeit with a few adaptations tailored for AI systems. We present the Process-oriented Hazard Analysis for AI Systems (PHASE) as a guideline that adapts STPA concepts for AI, making STPA-based hazard analysis more accessible. PHASE enables four key affordances for analysts responsible for managing AI system harms: 1) detection of hazards at the systems level, including those from accumulation of disparate issues; 2) explicit acknowledgment of social factors contributing to experiences of algorithmic harms; 3) creation of traceable accountability chains between harms and those who can mitigate the harm; and 4) ongoing monitoring and mitigation of new hazards.

[AI-74] RealCQA-V2: Visual Premise Proving

链接: https://arxiv.org/abs/2410.22492
作者: Saleem Ahmed,Rangaraj Setlur,Venu Govindaraju
关键词-EN: Visual Premise Proving, Premise Proving, introduce Visual Premise, task tailored, tailored to refine
类目: Artificial Intelligence (cs.AI)
*备注: Under Review : Code and Data will be made public soon

点击查看摘要

Abstract:We introduce Visual Premise Proving (VPP), a novel task tailored to refine the process of chart question answering by deconstructing it into a series of logical premises. Each of these premises represents an essential step in comprehending a chart’s content and deriving logical conclusions, thereby providing a granular look at a model’s reasoning abilities. This approach represents a departure from conventional accuracy-based evaluation methods, emphasizing the model’s ability to sequentially validate each premise and ideally mimic human analytical processes. A model adept at reasoning is expected to demonstrate proficiency in both data retrieval and the structural understanding of charts, suggesting a synergy between these competencies. However, in our zero-shot study using the sophisticated MATCHA model on a scientific chart question answering dataset, an intriguing pattern emerged. The model showcased superior performance in chart reasoning (27%) over chart structure (19%) and data retrieval (14%). This performance gap suggests that models might more readily generalize reasoning capabilities across datasets, benefiting from consistent mathematical and linguistic semantics, even when challenged by changes in the visual domain that complicate structure comprehension and data retrieval. Furthermore, the efficacy of using accuracy of binary QA for evaluating chart reasoning comes into question if models can deduce correct answers without parsing chart data or structure. VPP highlights the importance of integrating reasoning with visual comprehension to enhance model performance in chart analysis, pushing for a balanced approach in evaluating visual data interpretation capabilities.

[AI-75] Predicting Future Actions of Reinforcement Learning Agents

链接: https://arxiv.org/abs/2410.22459
作者: Stephen Chung,Scott Niekum,David Krueger
关键词-EN: preventing catastrophic outcomes, reinforcement learning agents, future agent actions, catastrophic outcomes, reinforcement learning
类目: Artificial Intelligence (cs.AI)
*备注: 16 pages, 8 figures

点击查看摘要

Abstract:As reinforcement learning agents become increasingly deployed in real-world scenarios, predicting future agent actions and events during deployment is important for facilitating better human-agent interaction and preventing catastrophic outcomes. This paper experimentally evaluates and compares the effectiveness of future action and event prediction for three types of RL agents: explicitly planning, implicitly planning, and non-planning. We employ two approaches: the inner state approach, which involves predicting based on the inner computations of the agents (e.g., plans or neuron activations), and a simulation-based approach, which involves unrolling the agent in a learned world model. Our results show that the plans of explicitly planning agents are significantly more informative for prediction than the neuron activations of the other types. Furthermore, using internal plans proves more robust to model quality compared to simulation-based approaches when predicting actions, while the results for event prediction are more mixed. These findings highlight the benefits of leveraging inner states and simulations to predict future agent actions and events, thereby improving interaction and safety in real-world deployments.

[AI-76] Image2Struct: Benchmarking Structure Extraction for Vision-Language Models NEURIPS2024

链接: https://arxiv.org/abs/2410.22456
作者: Josselin Somerville Roberts,Tony Lee,Chi Heem Wong,Michihiro Yasunaga,Yifan Mai,Percy Liang
关键词-EN: evaluate vision-language models, vision-language models, VLMs, similarity, image
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024. First three authors contributed equally

点击查看摘要

Abstract:We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at this https URL.
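The round-trip scoring idea can be illustrated with the simplest of the five metrics. The normalization below is our assumption for illustration; Image2Struct's exact pixel-similarity definition may differ:

```python
import numpy as np

def pixel_similarity(img_a, img_b):
    """Score two equally-sized 8-bit images: 1.0 for identical pixels,
    decreasing toward 0.0 as mean absolute pixel difference grows."""
    a = np.asarray(img_a, dtype=float)
    b = np.asarray(img_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("images must share a shape for pixel comparison")
    return 1.0 - np.abs(a - b).mean() / 255.0

# Round-trip check: compare the input image against the re-rendered one.
original = np.full((4, 4), 200.0)  # stand-in for the input screenshot
rendered = np.full((4, 4), 200.0)  # stand-in for the rendered structure
print(pixel_similarity(original, rendered))  # 1.0 for a perfect render
```

Because the structure is rendered back to an image before scoring, a task with multiple valid structures (e.g. different LaTeX for the same equation) can still receive a high similarity score.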

[AI-77] Addressing Issues with Working Memory in Video Object Segmentation

链接: https://arxiv.org/abs/2410.22451
作者: Clayton Bromley,Alexander Moore,Amar Saini,Douglas Poland,Carmen Carrano
关键词-EN: compare incoming unannotated, incoming unannotated images, predict object masks, models compare incoming, working memory
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 12 pages, 11 figures

点击查看摘要

Abstract:Contemporary state-of-the-art video object segmentation (VOS) models compare incoming unannotated images to a history of image-mask relations via affinity or cross-attention to predict object masks. We refer to the internal memory state of the initial image-mask pair and past image-masks as a working memory buffer. While current state-of-the-art models perform very well on clean video data, their reliance on a working memory of previous frames leaves room for error. Affinity-based algorithms include the inductive bias that there is temporal continuity between consecutive frames. To account for inconsistent camera views of the desired object, working memory models need an algorithmic modification that regulates memory updates and avoids writing irrelevant frames into working memory. A simple algorithmic change is proposed that can be applied to any existing working memory-based VOS model to improve performance on inconsistent views, such as sudden camera cuts, frame interjections, and extreme context changes. The resulting models show significant improvements on video data with these frame interjections over the same models without the algorithmic addition. Our contribution is a simple decision function that determines whether working memory should be updated based on the detection of sudden, extreme changes and the assumption that the object is no longer in frame. By implementing algorithmic changes, such as this, we can increase the real-world applicability of current VOS models.
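A minimal sketch of such a decision function, assuming a mean-absolute-difference change measure and an illustrative threshold (both are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def should_update_memory(prev_frame, new_frame, threshold=60.0):
    """Gate working-memory writes: skip the update when the new frame
    differs so extremely from the last stored frame that a camera cut
    (object likely out of frame) is assumed."""
    diff = np.abs(new_frame.astype(float) - prev_frame.astype(float)).mean()
    return bool(diff < threshold)  # extreme change => do not write

stable = np.full((8, 8), 100.0)   # stand-in grayscale frames
cut = np.full((8, 8), 250.0)
print(should_update_memory(stable, stable + 5))  # small change: update
print(should_update_memory(stable, cut))         # extreme change: skip
```

In a real VOS pipeline this gate would wrap the existing memory-write call, leaving the rest of the affinity-based model untouched, which matches the abstract's claim that the change applies to any working-memory-based model.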

[AI-78] A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks

链接: https://arxiv.org/abs/2410.22391
作者: Thomas Schmied,Thomas Adler,Vihang Patil,Maximilian Beck,Korbinian Pöppel,Johannes Brandstetter,Günter Klambauer,Razvan Pascanu,Sepp Hochreiter
关键词-EN: Reinforcement Learning, field of Reinforcement, models trained offline, recent years, action models trained
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, there has been a trend in the field of Reinforcement Learning (RL) towards large action models trained offline on large-scale datasets via sequence modeling. Existing models are primarily based on the Transformer architecture, which results in powerful agents. However, due to slow inference times, Transformer-based approaches are impractical for real-time applications, such as robotics. Recently, modern recurrent architectures, such as xLSTM and Mamba, have been proposed that exhibit parallelization benefits during training similar to the Transformer architecture while offering fast inference. In this work, we study the aptitude of these modern recurrent architectures for large action models. Consequently, we propose a Large Recurrent Action Model (LRAM) with an xLSTM at its core that comes with linear-time inference complexity and natural sequence length extrapolation abilities. Experiments on 432 tasks from 6 domains show that LRAM compares favorably to Transformers in terms of performance and speed.

[AI-79] FNDEX: Fake News and Doxxing Detection with Explainable AI

链接: https://arxiv.org/abs/2410.22390
作者: Dorsaf Sallami,Esma Aïmeur
关键词-EN: diverse online media, online media platforms, internet-driven communication technologies, presented significant challenges, freedom of expression
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The widespread and diverse online media platforms and other internet-driven communication technologies have presented significant challenges in defining the boundaries of freedom of expression. Consequently, the internet has been transformed into a potential cyber weapon. Within this evolving landscape, two particularly hazardous phenomena have emerged: fake news and doxxing. Although these threats have been subjects of extensive scholarly analysis, the crossroads where they intersect remain unexplored. This research addresses this convergence by introducing a novel system. The Fake News and Doxxing Detection with Explainable Artificial Intelligence (FNDEX) system leverages the capabilities of three distinct transformer models to achieve high-performance detection for both fake news and doxxing. To enhance data security, a rigorous three-step anonymization process is employed, rooted in a pattern-based approach for anonymizing personally identifiable information. Finally, this research emphasizes the importance of generating coherent explanations for the outcomes produced by both detection models. Our experiments on realistic datasets demonstrate that our system significantly outperforms the existing baselines.
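The pattern-based anonymization step can be sketched as below. The regexes and placeholder tokens are illustrative assumptions, not FNDEX's actual patterns:

```python
import re

# Illustrative PII patterns; a production system would cover many more
# categories (names, addresses, IDs) and locale-specific formats.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def anonymize(text):
    """Replace each matched PII span with a labeled placeholder token."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(anonymize("Contact jane.doe@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].
```

Replacing spans with labeled placeholders rather than deleting them preserves sentence structure, which matters when the anonymized text is then fed to the transformer-based detectors.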

[AI-80] Robust training of implicit generative models for multivariate and heavy-tailed distributions with an invariant statistical loss

链接: https://arxiv.org/abs/2410.22381
作者: José Manuel de Frutos,Manuel A. Vázquez,Pablo Olmos,Joaquín Míguez
关键词-EN: learning highly complex, highly complex data, highly complex, data, complex data distributions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Traditional implicit generative models are capable of learning highly complex data distributions. However, their training involves distinguishing real data from synthetically generated data using adversarial discriminators, which can lead to unstable training dynamics and mode dropping issues. In this work, we build on the invariant statistical loss (ISL) method introduced in [de2024training], and extend it to handle heavy-tailed and multivariate data distributions. The data generated by many real-world phenomena can only be properly characterised using heavy-tailed probability distributions, and traditional implicit methods struggle to effectively capture their asymptotic behavior. To address this problem, we introduce a generator trained with ISL, that uses input noise from a generalised Pareto distribution (GPD). We refer to this generative scheme as Pareto-ISL for conciseness. Our experiments demonstrate that Pareto-ISL accurately models the tails of the distributions while still effectively capturing their central characteristics. The original ISL function was conceived for 1D data sets. When the actual data is n-dimensional, a straightforward extension of the method was obtained by targeting the n marginal distributions of the data. This approach is computationally infeasible and ineffective in high-dimensional spaces. To overcome this, we extend the 1D approach using random projections and define a new loss function suited for multivariate data, keeping problems tractable by adjusting the number of projections. We assess its performance in multidimensional generative modeling and explore its potential as a pretraining technique for generative adversarial networks (GANs) to prevent mode collapse, reporting promising results and highlighting its robustness across various hyperparameter settings.
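Drawing generator input noise from a generalized Pareto distribution, as Pareto-ISL does, can be sketched via inverse-CDF sampling. The shape and scale values here are illustrative choices, not the paper's settings:

```python
import numpy as np

def sample_gpd(n, xi=0.5, sigma=1.0, seed=0):
    """Sample n draws from a generalized Pareto distribution via the
    inverse CDF: x = sigma * ((1 - u)^(-xi) - 1) / xi for shape xi != 0."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    if xi == 0.0:
        return -sigma * np.log(1.0 - u)  # exponential limit of the GPD
    return sigma * ((1.0 - u) ** (-xi) - 1.0) / xi

noise = sample_gpd(10_000, xi=0.5)
print(noise.min() >= 0.0)  # support starts at 0 for these parameters
```

A positive shape parameter xi gives heavy (power-law) tails, which is what lets the generator reach the tail regions that light-tailed (e.g. Gaussian) input noise struggles to cover.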

[AI-81] Discrete Modeling via Boundary Conditional Diffusion Processes

链接: https://arxiv.org/abs/2410.22380
作者: Yuxuan Gu,Xiaocheng Feng,Lei Huang,Yingsheng Wu,Zekun Zhou,Weihong Zhong,Kun Zhu,Bing Qin
关键词-EN: framework for efficiently, efficiently and effectively, effectively extending, extending the powerful, powerful continuous diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeuraIPS 2024 poster

点击查看摘要

Abstract:We present a novel framework for efficiently and effectively extending the powerful continuous diffusion processes to discrete modeling. Previous approaches have suffered from the discrepancy between discrete data and continuous modeling. Our study reveals that the absence of guidance from discrete boundaries in learning probability contours is one of the main reasons. To address this issue, we propose a two-step forward process that first estimates the boundary as a prior distribution and then rescales the forward trajectory to construct a boundary conditional diffusion model. The reverse process is proportionally adjusted to guarantee that the learned contours yield more precise discrete data. Experimental results indicate that our approach achieves strong performance in both language modeling and discrete image generation tasks. In language modeling, our approach surpasses previous state-of-the-art continuous diffusion language models in three translation tasks and a summarization task, while also demonstrating competitive performance compared to auto-regressive transformers. Moreover, our method achieves comparable results to continuous diffusion models when using discrete ordinal pixels and establishes a new state-of-the-art for categorical image generation on the Cifar-10 dataset.

[AI-82] A Systematic Literature Review of Spatio-Temporal Graph Neural Network Models for Time Series Forecasting and Classification

链接: https://arxiv.org/abs/2410.22377
作者: Flavio Corradini,Marco Gori,Carlo Lucheroni,Marco Piangerelli,Martina Zannotti
关键词-EN: graph neural networks, attracted considerable interest, time series analysis, spatio-temporal graph neural, time series classification
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:In recent years, spatio-temporal graph neural networks (GNNs) have attracted considerable interest in the field of time series analysis, due to their ability to capture dependencies among variables and across time points. The objective of the presented systematic literature review is hence to provide a comprehensive overview of the various modeling approaches and application domains of GNNs for time series classification and forecasting. A database search was conducted, and over 150 journal papers were selected for a detailed examination of the current state-of-the-art in the field. This examination is intended to offer to the reader a comprehensive collection of proposed models, links to related source code, available datasets, benchmark models, and fitting results. All of this information is intended to assist researchers in future studies. To the best of our knowledge, this is the first systematic literature review presenting a detailed comparison of the results of current spatio-temporal GNN models in different domains. In addition, in its final part this review discusses current limitations and challenges in the application of spatio-temporal GNNs, such as comparability, reproducibility, explainability, poor information capacity, and scalability.

[AI-83] Machine Unlearning using Forgetting Neural Networks

链接: https://arxiv.org/abs/2410.22374
作者: Amartya Hatua,Trung T. Nguyen,Filip Cano,Andrew H. Sung
关键词-EN: Modern computer systems, computer systems store, systems store vast, store vast amounts, Modern computer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Modern computer systems store vast amounts of personal data, enabling advances in AI and ML but risking user privacy and trust. For privacy reasons, it is sometimes desirable for an ML model to forget part of the data it was trained on. This paper presents a new approach to machine unlearning using forgetting neural networks (FNN). FNNs are neural networks with specific forgetting layers that take inspiration from the processes involved when a human brain forgets. While FNNs had been proposed as a theoretical construct, they have not been previously used as a machine unlearning method. We describe four different types of forgetting layers and study their properties. In our experimental evaluation, we report our results on the MNIST handwritten digit recognition and fashion datasets. The effectiveness of the unlearned models was tested using Membership Inference Attacks (MIA). Successful experimental results demonstrate the great potential of our proposed method for dealing with the machine unlearning problem.
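A hypothetical "forgetting layer" in the spirit of the abstract might attenuate activations by a decay factor that grows with time since learning. The exponential form below is our assumption for illustration, not one of the paper's four layer types:

```python
import numpy as np

class ForgettingLayer:
    """Toy forgetting layer: scales activations by a retention factor
    exp(-decay_rate * age), loosely inspired by human forgetting curves."""

    def __init__(self, decay_rate=0.1):
        self.decay_rate = decay_rate
        self.age = 0  # steps elapsed since the knowledge was learned

    def step(self):
        self.age += 1

    def forward(self, activations):
        retention = np.exp(-self.decay_rate * self.age)
        return activations * retention

layer = ForgettingLayer(decay_rate=0.5)
x = np.ones(3)
print(layer.forward(x))  # age 0: fully retained
layer.step(); layer.step()
print(layer.forward(x))  # age 2: attenuated by exp(-1)
```

In an unlearning setting, the idea would be to let the contribution of the to-be-forgotten data decay through such layers, with success measured afterward by membership inference attacks as in the paper.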

[AI-84] Analytic Continual Test-Time Adaptation for Multi-Modality Corruption

链接: https://arxiv.org/abs/2410.22373
作者: Yufei Zhang,Yicheng Xu,Hongxin Wei,Zhiping Lin,Huiping Zhuang
关键词-EN: unlabelled test data, Continual Test-Time Adaptation, Test-Time Adaptation, test data, bridge the gap
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Test-Time Adaptation (TTA) aims to help a pre-trained model bridge the gap between source and target datasets using only the pre-trained model and unlabelled test data. A key objective of TTA is to address domain shifts in test data caused by corruption, such as weather changes, noise, or sensor malfunctions. Multi-Modal Continual Test-Time Adaptation (MM-CTTA), an extension of TTA with better real-world applications, further allows pre-trained models to handle multi-modal inputs and adapt to continuously-changing target domains. MM-CTTA typically faces challenges including error accumulation, catastrophic forgetting, and reliability bias, with few existing approaches effectively addressing these issues in multi-modal corruption scenarios. In this paper, we propose a novel approach, Multi-modality Dynamic Analytic Adapter (MDAA), for MM-CTTA tasks. We innovatively introduce analytic learning into TTA, using the Analytic Classifiers (ACs) to prevent model forgetting. Additionally, we develop Dynamic Selection Mechanism (DSM) and Soft Pseudo-label Strategy (SPS), which enable MDAA to dynamically filter reliable samples and integrate information from different modalities. Extensive experiments demonstrate that MDAA achieves state-of-the-art performance on MM-CTTA tasks while ensuring reliable model adaptation.

[AI-85] A Hierarchical Language Model For Interpretable Graph Reasoning

链接: https://arxiv.org/abs/2410.22372
作者: Sambhav Khurana,Xiner Li,Shurui Gui,Shuiwang Ji
关键词-EN: Large language models, Hierarchical Language Model, increasingly explored, Large language, graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are being increasingly explored for graph tasks. Despite their remarkable success in text-based tasks, LLMs’ capabilities in understanding explicit graph structures remain limited, particularly with large graphs. In this work, we introduce Hierarchical Language Model for Graph (HLM-G), which employs a two-block architecture to capture node-centric local information and interaction-centric global structure, effectively enhancing graph structure understanding abilities. The proposed scheme allows LLMs to address various graph queries with high efficacy, efficiency, and robustness, while reducing computational costs on large-scale graph tasks. Furthermore, we demonstrate the interpretability of our model using intrinsic attention weights and established explainers. Comprehensive evaluations across diverse graph reasoning and real-world tasks of node, link, and graph-levels highlight the superiority of our method, marking a significant advancement in the application of LLMs to graph understanding.

[AI-86] Error Bounds for Deep Learning-based Uncertainty Propagation in SDEs

链接: https://arxiv.org/abs/2410.22371
作者: Chun-Wei Kong,Luca Laurenti,Jay McMahon,Morteza Lahijanian
关键词-EN: Stochastic differential equations, Stochastic differential, stochastic processes, partial differential equation, Stochastic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注: pre-print under review

点击查看摘要

Abstract:Stochastic differential equations are commonly used to describe the evolution of stochastic processes. The uncertainty of such processes is best represented by the probability density function (PDF), whose evolution is governed by the Fokker-Planck partial differential equation (FP-PDE). However, it is generally infeasible to solve the FP-PDE in closed form. In this work, we show that physics-informed neural networks (PINNs) can be trained to approximate the solution PDF using existing methods. The main contribution is the analysis of the approximation error: we develop a theory to construct an arbitrary tight error bound with PINNs. In addition, we derive a practical error bound that can be efficiently constructed with existing training methods. Finally, we explain that this error-bound theory generalizes to approximate solutions of other linear PDEs. Several numerical experiments are conducted to demonstrate and validate the proposed methods.
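The FP-PDE setting can be made concrete on a case with a known answer (illustrative only; the paper trains PINNs rather than checking closed forms). For the Ornstein-Uhlenbeck process dx = -theta*x dt + sigma dW, the stationary Fokker-Planck equation d/dx(theta*x*p) + (sigma^2/2) d^2p/dx^2 = 0 is solved by p(x) proportional to exp(-theta*x^2/sigma^2), which a finite-difference residual check confirms:

```python
import numpy as np

# Numerical sanity check of the stationary Fokker-Planck residual for the
# OU process; the residual of the true density should vanish.
theta, sigma = 1.0, 1.0
x = np.linspace(-3.0, 3.0, 2001)
h = x[1] - x[0]
p = np.exp(-theta * x**2 / sigma**2)
p /= p.sum() * h                     # normalize to a density

d_drift = np.gradient(theta * x * p, h)
d2p = np.gradient(np.gradient(p, h), h)
residual = d_drift + 0.5 * sigma**2 * d2p
print(np.max(np.abs(residual[50:-50])))  # ~0, up to finite-difference error
```

A PINN trained on this PDE would minimize the same residual over sampled points; the paper's contribution is bounding how far such an approximate solution can be from the true PDF.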

[AI-87] Project MPG: towards a generalized performance benchmark for LLM capabilities

链接: https://arxiv.org/abs/2410.22368
作者: Lucas Spangher,Tianle Li,William F. Arnold,Nick Masiewicki,Xerxes Dotiwalla,Rama Parusmathi,Peter Grabowski,Eugene Ie,Dan Gruhl
关键词-EN: LLM benchmarking tasks, extremely wide array, array of LLM, LLM benchmarking, benchmarking tasks
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:There exists an extremely wide array of LLM benchmarking tasks, whereas oftentimes a single number is the most actionable for decision-making, especially by non-experts. No such aggregation schema exists that is not Elo-based, which could be costly or time-consuming. Here we propose a method to aggregate performance across a general space of benchmarks, nicknamed Project “MPG,” dubbed Model Performance and Goodness, additionally referencing a metric widely understood to be an important yet inaccurate and crude measure of car performance. Here, we create two numbers: a “Goodness” number (answer accuracy) and a “Fastness” number (cost or QPS). We compare models against each other and present a ranking according to our general metric as well as subdomains. We find significant agreement between the raw Pearson correlation of our scores and those of Chatbot Arena, even improving on the correlation of the MMLU leaderboard to Chatbot Arena.
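As a hedged illustration of the two-number idea (the abstract does not give Project MPG's exact formula), one can min-max normalize a "Goodness" column (accuracy) and a "Fastness" column (QPS) and average them into a single ranking score:

```python
import numpy as np

# Hypothetical aggregation in the spirit of the two numbers above: min-max
# normalize each axis and average them into one score per model.
def aggregate(goodness, fastness):
    g, f = np.asarray(goodness, float), np.asarray(fastness, float)
    norm = lambda v: (v - v.min()) / (v.max() - v.min())
    return 0.5 * (norm(g) + norm(f))

models = ["A", "B", "C", "D"]
scores = aggregate([0.90, 0.80, 0.85, 0.70],     # "Goodness": answer accuracy
                   [50.0, 200.0, 120.0, 300.0])  # "Fastness": queries/sec
print(models[int(np.argmax(scores))])  # B: strong on both axes at once
```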

[AI-88] Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

链接: https://arxiv.org/abs/2410.22366
作者: Viacheslav Surkov,Chris Wendler,Mikhail Terekhov,Justin Deschenaux,Robert West,Caglar Gulcehre
关键词-EN: SDXL Turbo, Sparse autoencoders, core ingredient, reverse engineering, engineering of large-language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of large-language models (LLMs). For LLMs, they have been shown to decompose intermediate representations that often are not interpretable directly into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigated the possibility of using SAEs to learn interpretable features for a few-step text-to-image diffusion models, such as SDXL Turbo. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo’s denoising U-net. We find that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. In particular, we find one block that deals mainly with image composition, one that is mainly responsible for adding local details, and one for color, illumination, and style. Therefore, our work is an important first step towards better understanding the internals of generative text-to-image models like SDXL Turbo and showcases the potential of features learned by SAEs for the visual domain. Code is available at this https URL
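The core mechanism is a sparse autoencoder: an overcomplete dictionary with a ReLU encoder whose L1-penalized activations decompose dense model activations into a few interpretable features. A minimal forward-pass sketch (not the authors' training code; shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sparse-autoencoder forward pass: the L1 term pushes most feature
# activations to zero, which is what makes them candidates for interpretation.
d_model, d_feat = 16, 64            # activation width, dictionary size
W_enc = rng.normal(0.0, 0.1, (d_model, d_feat))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0.0, 0.1, (d_feat, d_model))

def sae_forward(x, l1=1e-3):
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # sparse feature activations
    x_hat = f @ W_dec                        # reconstruction
    loss = np.mean((x - x_hat) ** 2) + l1 * np.abs(f).mean()
    return f, x_hat, loss

x = rng.normal(size=(8, d_model))            # stand-in for U-net activations
f, x_hat, loss = sae_forward(x)
print(f.shape, x_hat.shape)                  # (8, 64) (8, 16)
```

In the paper, x would be the update emitted by an SDXL Turbo transformer block, and training minimizes this loss over many diffusion steps.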

[AI-89] Learning Goal-oriented Bimanual Dough Rolling Using Dynamic Heterogeneous Graph Based on Human Demonstration

链接: https://arxiv.org/abs/2410.22355
作者: Junjia Liu,Chenzui Li,Shixiong Wang,Zhipeng Dong,Sylvain Calinon,Miao Li,Fei Chen
关键词-EN: requiring effective techniques, manipulation poses significant, Soft object manipulation, poses significant challenges, manipulation policy learning
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 7 pages, 5 figures Accepted by IEEE ROBIO 2024 conference

点击查看摘要

Abstract:Soft object manipulation poses significant challenges for robots, requiring effective techniques for state representation and manipulation policy learning. State representation involves capturing the dynamic changes in the environment, while manipulation policy learning focuses on establishing the relationship between robot actions and state transformations to achieve specific goals. To address these challenges, this research paper introduces a novel approach: a dynamic heterogeneous graph-based model for learning goal-oriented soft object manipulation policies. The proposed model utilizes graphs as a unified representation for both states and policy learning. By leveraging the dynamic graph, we can extract crucial information regarding object dynamics and manipulation policies. Furthermore, the model facilitates the integration of demonstrations, enabling guided policy learning. To evaluate the efficacy of our approach, we designed a dough rolling task and conducted experiments using both a differentiable simulator and a real-world humanoid robot. Additionally, several ablation studies were performed to analyze the effect of our method, demonstrating its superiority in achieving human-like behavior.

[AI-90] Neuromorphic Programming: Emerging Directions for Brain-Inspired Hardware

链接: https://arxiv.org/abs/2410.22352
作者: Steven Abreu,Jens E. Pedersen
关键词-EN: computers critically depends, relevant tasks, critically depends, ability to program, neuromorphic computers critically
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Programming Languages (cs.PL)
*备注: Accepted to International Conference on Neuromorphic Systems (ICONS) 2024. arXiv admin note: substantial text overlap with arXiv:2310.18260

点击查看摘要

Abstract:The value of brain-inspired neuromorphic computers critically depends on our ability to program them for relevant tasks. Currently, neuromorphic hardware often relies on machine learning methods adapted from deep learning. However, neuromorphic computers have potential far beyond deep learning if we can only harness their energy efficiency and full computational power. Neuromorphic programming will necessarily be different from conventional programming, requiring a paradigm shift in how we think about programming. This paper presents a conceptual analysis of programming within the context of neuromorphic computing, challenging conventional paradigms and proposing a framework that aligns more closely with the physical intricacies of these systems. Our analysis revolves around five characteristics that are fundamental to neuromorphic programming and provides a basis for comparison to contemporary programming methods and languages. By studying past approaches, we contribute a framework that advocates for underutilized techniques and calls for richer abstractions to effectively instrument the new hardware class.

[AI-91] Testing GPT-4-o1-preview on math and science problems: A follow-up study

链接: https://arxiv.org/abs/2410.22340
作者: Ernest Davis
关键词-EN: Code Interpreter plug-ins, original high-school level, Scott Aaronson, Wolfram Alpha, Alpha and Code
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In August 2023, Scott Aaronson and I reported the results of testing GPT4 with the Wolfram Alpha and Code Interpreter plug-ins over a collection of 105 original high-school level and college-level science and math problems (Davis and Aaronson, 2023). In September 2024, I tested the recently released model GPT-4o1-preview on the same collection. Overall I found that performance had significantly improved, but was still considerably short of perfect. In particular, problems that involve spatial reasoning are often stumbling blocks.

[AI-92] DAWN: Designing Distributed Agents in a Worldwide Network

链接: https://arxiv.org/abs/2410.22339
作者: Zahra Aminiranjbar,Jianan Tang,Qiudan Wang,Shubha Pant,Mahesh Viswanathan
关键词-EN: Large Language Models, Language Models, Large Language, basic conversational tools, evolution of Large
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:The rapid evolution of Large Language Models (LLMs) has transformed them from basic conversational tools into sophisticated entities capable of complex reasoning and decision-making. These advancements have led to the development of specialized LLM-based agents designed for diverse tasks such as coding and web browsing. As these agents become more capable, the need for a robust framework that facilitates global communication and collaboration among them towards advanced objectives has become increasingly critical. Distributed Agents in a Worldwide Network (DAWN) addresses this need by offering a versatile framework that integrates LLM-based agents with traditional software systems, enabling the creation of agentic applications suited for a wide range of use cases. DAWN enables distributed agents worldwide to register and be easily discovered through Gateway Agents. Collaborations among these agents are coordinated by a Principal Agent equipped with reasoning strategies. DAWN offers three operational modes: No-LLM Mode for deterministic tasks, Copilot for augmented decision-making, and LLM Agent for autonomous operations. Additionally, DAWN ensures the safety and security of agent collaborations globally through a dedicated safety, security, and compliance layer, protecting the network against attackers and adhering to stringent security and compliance standards. These features make DAWN a robust network for deploying agent-based applications across various industries.

[AI-93] Robot Policy Learning with Temporal Optimal Transport Reward NEURIPS2024

链接: https://arxiv.org/abs/2410.21795
作者: Yuwei Fu,Haichao Zhang,Di Wu,Wei Xu,Benoit Boulet
关键词-EN: Temporal Optimal Transport, problems in Reinforcement, requires tedious hand, tedious hand engineering
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Reward specification is one of the most tricky problems in Reinforcement Learning, which usually requires tedious hand engineering in practice. One promising approach to tackle this challenge is to adopt existing expert video demonstrations for policy learning. Some recent work investigates how to learn robot policies from only a single/few expert video demonstrations. For example, reward labeling via Optimal Transport (OT) has been shown to be an effective strategy to generate a proxy reward by measuring the alignment between the robot trajectory and the expert demonstrations. However, previous work mostly overlooks that the OT reward is invariant to temporal order information, which could bring extra noise to the reward signal. To address this issue, in this paper, we introduce the Temporal Optimal Transport (TemporalOT) reward to incorporate temporal order information for learning a more accurate OT-based proxy reward. Extensive experiments on the Meta-world benchmark tasks validate the efficacy of the proposed method. 
Code is available at: this https URL
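The temporal-order idea can be sketched numerically: an OT proxy reward that ignores time gives the same score to a trajectory and its reversal, while adding a penalty on mismatched time indices breaks that tie. The following is an illustrative sketch (hypothetical cost design, not the paper's exact formulation), using plain Sinkhorn iterations:

```python
import numpy as np

def temporal_ot_reward(traj, demo, lam=1.0, eps=0.1, iters=200):
    """Temporally-aware OT proxy reward: pairwise feature cost plus a
    penalty on mismatched normalized time indices, via Sinkhorn."""
    T, S = len(traj), len(demo)
    feat_cost = np.linalg.norm(traj[:, None, :] - demo[None, :, :], axis=-1)
    t = np.arange(T) / max(T - 1, 1)
    s = np.arange(S) / max(S - 1, 1)
    C = feat_cost + lam * np.abs(t[:, None] - s[None, :])  # temporal term
    K = np.exp(-C / eps)
    a, b = np.ones(T) / T, np.ones(S) / S                  # uniform marginals
    v = np.ones(S)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                        # transport plan
    return -float(np.sum(P * feat_cost))                   # higher is better

demo = np.linspace(0.0, 1.0, 10)[:, None]   # expert states over time
good = np.linspace(0.0, 1.0, 10)[:, None]   # follows the demo's ordering
bad = good[::-1].copy()                     # same states, reversed in time
print(temporal_ot_reward(good, demo) > temporal_ot_reward(bad, demo))  # True
```

Without the temporal term (lam=0), the reversed trajectory could be coupled to the demo along the anti-diagonal and score just as well, which is exactly the noise source the paper points out.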

[AI-94] Provably Optimal Memory Capacity for Modern Hopfield Models: Transformer-Compatible Dense Associative Memories as Spherical Codes NEURIPS2024

链接: https://arxiv.org/abs/2410.23126
作者: Jerry Yao-Chieh Hu,Dennis Wu,Han Liu
关键词-EN: Dense Associative Memories, Dense Associative, Kernelized Hopfield Models, class of Dense, Kernelized Hopfield
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:We study the optimal memorization capacity of modern Hopfield models and Kernelized Hopfield Models (KHMs), a transformer-compatible class of Dense Associative Memories. We present a tight analysis by establishing a connection between the memory configuration of KHMs and spherical codes from information theory. Specifically, we treat the stored memory set as a specialized spherical code. This enables us to cast the memorization problem in KHMs into a point arrangement problem on a hypersphere. We show that the optimal capacity of KHMs occurs when the feature space allows memories to form an optimal spherical code. This unique perspective leads to: (i) An analysis of how KHMs achieve optimal memory capacity, and identify corresponding necessary conditions. Importantly, we establish an upper capacity bound that matches the well-known exponential lower bound in the literature. This provides the first tight and optimal asymptotic memory capacity for modern Hopfield models. (ii) A sub-linear time algorithm \mathttU\text-\mathttHop + to reach KHMs’ optimal capacity. (iii) An analysis of the scaling behavior of the required feature dimension relative to the number of stored memories. These efforts improve both the retrieval capability of KHMs and the representation learning of corresponding transformers. Experimentally, we provide thorough numerical results to back up theoretical findings.

[AI-95] st-DTPM: Spatial-Temporal Guided Diffusion Transformer Probabilistic Model for Delayed Scan PET Image Prediction

链接: https://arxiv.org/abs/2410.22732
作者: Ran Hong,Yuxia Huang,Lei Liu,Zhonghui Wu,Bingxuan Li,Xuemei Wang,Qiegen Liu
关键词-EN: observing biological metabolic, biological metabolic activities, PET imaging, dual-time PET imaging, human body
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:PET imaging is widely employed for observing biological metabolic activities within the human body. However, numerous benign conditions can cause increased uptake of radiopharmaceuticals, confounding differentiation from malignant tumors. Several studies have indicated that dual-time PET imaging holds promise in distinguishing between malignant and benign tumor processes. Nevertheless, the hour-long distribution period of radiopharmaceuticals post-injection complicates the determination of optimal timing for the second scan, presenting challenges in both practical applications and research. Notably, we have identified that delay time PET imaging can be framed as an image-to-image conversion problem. Motivated by this insight, we propose a novel spatial-temporal guided diffusion transformer probabilistic model (st-DTPM) to solve dual-time PET imaging prediction problem. Specifically, this architecture leverages the U-net framework that integrates patch-wise features of CNN and pixel-wise relevance of Transformer to obtain local and global information. And then employs a conditional DDPM model for image synthesis. Furthermore, on spatial condition, we concatenate early scan PET images and noisy PET images on every denoising step to guide the spatial distribution of denoising sampling. On temporal condition, we convert diffusion time steps and delay time to a universal time vector, then embed it to each layer of model architecture to further improve the accuracy of predictions. Experimental results demonstrated the superiority of our method over alternative approaches in preserving image quality and structural information, thereby affirming its efficacy in predictive task.

[AI-96] Efficient Feature Extraction and Classification Architecture for MRI-Based Brain Tumor Detection

链接: https://arxiv.org/abs/2410.22619
作者: Plabon Paul,Md. Nazmul Islam,Fazle Rafsani,Pegah Khorasani,Shovito Barua Soumma
关键词-EN: Uncontrolled cell division, Uncontrolled cell, CNN model, brain, CNN
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Uncontrolled cell division in the brain is what gives rise to brain tumors. If the tumor size increases by more than half, there is little hope for the patient’s recovery. This emphasizes the need of rapid and precise brain tumor diagnosis. When it comes to analyzing, diagnosing, and planning therapy for brain tumors, MRI imaging plays a crucial role. A brain tumor’s development history is crucial information for doctors to have. When it comes to distinguishing between human soft tissues, MRI scans are superior. In order to get reliable classification results from MRI scans quickly, deep learning is one of the most practical methods. Early human illness diagnosis has been demonstrated to be more accurate when deep learning methods are used. In the case of diagnosing a brain tumor, when even a little misdiagnosis might have serious consequences, accuracy is especially important. Disclosure of brain tumors in medical images is still a difficult task. Brain MRIs are notoriously imprecise in revealing the presence or absence of tumors. Using MRI scans of the brain, a Convolutional Neural Network (CNN) was trained to identify the presence of a tumor in this research. Results from the CNN model showed an accuracy of 99.17%. The CNN model’s characteristics were also retrieved. In order to evaluate the CNN model’s capability for processing images, we applied the features via the following machine learning models: KNN, Logistic regression, SVM, Random Forest, Naive Bayes, and Perception. CNN and machine learning models were also evaluated using the standard metrics of Precision, Recall, Specificity, and F1 score. The significance of the doctor’s diagnosis enhanced the accuracy of the CNN model’s assistance in identifying the existence of tumor and treating the patient.
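The four metrics reported above follow directly from a binary confusion matrix (tumor = 1, no tumor = 0). A generic sketch, not the paper's code:

```python
import numpy as np

# Precision, recall (sensitivity), specificity, and F1 from the four
# confusion-matrix counts of a binary classifier.
def binary_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))    # (0.75, 0.75, 0.75, 0.75)
```

Specificity matters here because a screening tool that misses healthy patients (false positives) carries a different clinical cost than one that misses tumors (false negatives), which recall captures.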

[AI-97] Privacy-Preserving Dynamic Assortment Selection

链接: https://arxiv.org/abs/2410.22488
作者: Young Hyun Cho,Will Wei Sun
关键词-EN: personalized assortment recommendations, effective privacy-preserving strategies, concerns over data, highlighting the urgent, growing demand
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the growing demand for personalized assortment recommendations, concerns over data privacy have intensified, highlighting the urgent need for effective privacy-preserving strategies. This paper presents a novel framework for privacy-preserving dynamic assortment selection using the multinomial logit (MNL) bandits model. Our approach employs a perturbed upper confidence bound method, integrating calibrated noise into user utility estimates to balance between exploration and exploitation while ensuring robust privacy protection. We rigorously prove that our policy satisfies Joint Differential Privacy (JDP), which better suits dynamic environments than traditional differential privacy, effectively mitigating inference attack risks. This analysis is built upon a novel objective perturbation technique tailored for MNL bandits, which is also of independent interest. Theoretically, we derive a near-optimal regret bound of \tildeO(\sqrtT) for our policy and explicitly quantify how privacy protection impacts regret. Through extensive simulations and an application to the Expedia hotel dataset, we demonstrate substantial performance enhancements over the benchmark method.
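The perturbed-upper-confidence-bound idea can be illustrated with a simple bandit step: calibrated noise is injected into each utility estimate before the confidence bonus is added, so individual users' data is masked while exploration still favors under-sampled arms. This is a hypothetical single-arm sketch, not the paper's MNL-bandit policy or its JDP noise calibration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative perturbed-UCB arm choice: noisy estimate + exploration bonus.
def perturbed_ucb_choice(counts, means, t, noise_scale=0.1):
    noisy = means + rng.normal(0.0, noise_scale, size=means.shape)
    bonus = np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1))
    return int(np.argmax(noisy + bonus))

counts = np.array([50, 50, 5])          # arm 2 is barely explored
means = np.array([0.40, 0.42, 0.30])    # empirical utility estimates
arm = perturbed_ucb_choice(counts, means, t=105)
print(arm)  # 2: the large exploration bonus dominates the noise here
```

The paper's regret analysis quantifies exactly how much this kind of injected noise inflates the O(sqrt(T)) regret of the non-private policy.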

[AI-98] Ethical Statistical Practice and Ethical AI

链接: https://arxiv.org/abs/2410.22475
作者: Rochelle E. Tractenberg
关键词-EN: Artificial Intelligence, ethical statistical practice, statistical practice, ethical statistical, make predictions
类目: Other Statistics (stat.OT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: 10 pages; Preprint of submission to Proceedings of JSM 2024 Portland, OR

点击查看摘要

Abstract:Artificial Intelligence (AI) is a field that utilizes computing and often, data and statistics, intensively together to solve problems or make predictions. AI has been evolving with literally unbelievable speed over the past few years, and this has led to an increase in social, cultural, industrial, scientific, and governmental concerns about the ethical development and use of AI systems worldwide. The ASA has issued a statement on ethical statistical practice and AI (ASA, 2024), which echoes similar statements from other groups. Here we discuss the support for ethical statistical practice and ethical AI that has been established in long-standing human rights law and ethical practice standards for computing and statistics. There are multiple sources of support for ethical statistical practice and ethical AI deriving from these source documents, which are critical for strengthening the operationalization of the “Statement on Ethical AI for Statistics Practitioners”. These resources are explicated for interested readers to utilize to guide their development and use of AI in, and through, their statistical practice.

[AI-99] Debiasing Alternative Data for Credit Underwriting Using Causal Inference

链接: https://arxiv.org/abs/2410.22382
作者: Chris Lam
关键词-EN: expand credit access, Alternative data, borrower creditworthiness, valuable insights, insights for lenders
类目: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Alternative data provides valuable insights for lenders to evaluate a borrower’s creditworthiness, which could help expand credit access to underserved groups and lower costs for borrowers. But some forms of alternative data have historically been excluded from credit underwriting because it could act as an illegal proxy for a protected class like race or gender, causing redlining. We propose a method for applying causal inference to a supervised machine learning model to debias alternative data so that it might be used for credit underwriting. We demonstrate how our algorithm can be used against a public credit dataset to improve model accuracy across different racial groups, while providing theoretically robust nondiscrimination guarantees.

[AI-100] MAMMAL – Molecular Aligned Multi-Modal Architecture and Language

链接: https://arxiv.org/abs/2410.22367
作者: Yoel Shoshan,Moshiko Raboh,Michal Ozery-Flato,Vadim Ratner,Alex Golts,Jeffrey K. Weber,Ella Barkan,Simona Rabinovici-Cohen,Sagi Polaczek,Ido Amos,Ben Shapira,Liam Hazan,Matan Ninio,Sivan Ravid,Michael M. Danziger,Joseph A. Morrone,Parthasarathy Suryanarayanan,Michal Rosen-Zvi,Efrat Hexter
关键词-EN: discovery typically consists, Drug discovery typically, target protein key, Drug discovery, disease etiology
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Drug discovery typically consists of multiple steps, including identifying a target protein key to a disease’s etiology, validating that interacting with this target could prevent symptoms or cure the disease, discovering a small molecule or biologic therapeutic to interact with it, and optimizing the candidate molecule through a complex landscape of required properties. Drug discovery related tasks often involve prediction and generation while considering multiple entities that potentially interact, which poses a challenge for typical AI models. For this purpose we present MAMMAL - Molecular Aligned Multi-Modal Architecture and Language - a method that we applied to create a versatile multi-task foundation model ibm/biomed.this http URL-ted-458m that learns from large-scale biological datasets (2 billion samples) across diverse modalities, including proteins, small molecules, and genes. We introduce a prompt syntax that supports a wide range of classification, regression, and generation tasks. It allows combining different modalities and entity types as inputs and/or outputs. Our model handles combinations of tokens and scalars and enables the generation of small molecules and proteins, property prediction, and transcriptomic lab test predictions. We evaluated the model on 11 diverse downstream tasks spanning different steps within a typical drug discovery pipeline, where it reaches new SOTA in 9 tasks and is comparable to SOTA in 2 tasks. This performance is achieved while using a unified architecture serving all tasks, in contrast to the original SOTA performance achieved using tailored architectures. The model code and pretrained weights are publicly available at this https URL and this https URL. 

[AI-101] Vascular Segmentation of Functional Ultrasound Images using Deep Learning

链接: https://arxiv.org/abs/2410.22365
作者: Hana Sebia(AISTROSIGHT),Thomas Guyet(AISTROSIGHT),Mickaël Pereira(CERMEP - imagerie du vivant),Marco Valdebenito(CERMEP - imagerie du vivant),Hugues Berry(AISTROSIGHT),Benjamin Vidal(CERMEP - imagerie du vivant, CRNL)
关键词-EN: numerous applications, fundamental task, task with numerous, dynamic CBV quantification, CBV quantification
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Segmentation of medical images is a fundamental task with numerous applications. While MRI, CT, and PET modalities have significantly benefited from deep learning segmentation techniques, more recent modalities, like functional ultrasound (fUS), have seen limited progress. fUS is a non invasive imaging method that measures changes in cerebral blood volume (CBV) with high spatio-temporal resolution. However, distinguishing arterioles from venules in fUS is challenging due to opposing blood flow directions within the same pixel. Ultrasound localization microscopy (ULM) can enhance resolution by tracking microbubble contrast agents but is invasive, and lacks dynamic CBV quantification. In this paper, we introduce the first deep learning-based segmentation tool for fUS images, capable of differentiating signals from different vascular compartments, based on ULM automatic annotation and enabling dynamic CBV quantification. We evaluate various UNet architectures on fUS images of rat brains, achieving competitive segmentation performance, with 90% accuracy, a 71% F1 score, and an IoU of 0.59, using only 100 temporal frames from a fUS stack. These results are comparable to those from tubular structure segmentation in other imaging modalities. Additionally, models trained on resting-state data generalize well to images captured during visual stimulation, highlighting robustness. This work offers a non-invasive, cost-effective alternative to ULM, enhancing fUS data interpretation and improving understanding of vessel function. Our pipeline shows high linear correlation coefficients between signals from predicted and actual compartments in both cortical and deeperregions, showcasing its ability to accurately capture blood flow dynamics.

[AI-102] MMM-RS: A Multi-modal Multi-GSD Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation NEURIPS2024

链接: https://arxiv.org/abs/2410.22362
作者: Jialin Luo,Yuanzhi Wang,Ziqi Gu,Yide Qiu,Shuaizhen Yao,Fuyun Wang,Chunyan Xu,Wenhua Zhang,Dan Wang,Zhen Cui
关键词-EN: stable training process, diffusion-based generative paradigm, accurate distribution modeling, achieved impressive general, remote sensing
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Recently, the diffusion-based generative paradigm has achieved impressive general image generation capabilities with text prompts due to its accurate distribution modeling and stable training process. However, generating diverse remote sensing (RS) images that are tremendously different from general images in terms of scale and perspective remains a formidable challenge due to the lack of a comprehensive remote sensing image generation dataset with various modalities, ground sample distances (GSD), and scenes. In this paper, we propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios. Specifically, we first collect nine publicly available RS datasets and conduct standardization for all samples. To bridge RS images to textual semantic information, we utilize a large-scale pretrained vision-language model to automatically output text prompts and perform hand-crafted rectification, resulting in information-rich text-image pairs (including multi-modal images). In particular, we design some methods to obtain the images with different GSD and various environments (e.g., low-light, foggy) in a single sample. With extensive manual screening and refining annotations, we ultimately obtain a MMM-RS dataset that comprises approximately 2.1 million text-image pairs. Extensive experimental results verify that our proposed MMM-RS dataset allows off-the-shelf diffusion models to generate diverse RS images across various modalities, scenes, weather conditions, and GSD. The dataset is available at this https URL.

[AI-103] MAPUNetR: A Hybrid Vision Transformer and U-Net Architecture for Efficient and Interpretable Medical Image Segmentation

链接: https://arxiv.org/abs/2410.22223
作者: Ovais Iqbal Shah,Danish Raza Rizvi,Aqib Nazir Mir
关键词-EN: informing treatment strategies, tracking disease progression, enhancing diagnostic accuracy, Medical image segmentation, pivotal in healthcare
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image segmentation is pivotal in healthcare, enhancing diagnostic accuracy, informing treatment strategies, and tracking disease progression. This process allows clinicians to extract critical information from visual data, enabling personalized patient care. However, developing neural networks for segmentation remains challenging, especially when preserving image resolution, which is essential in detecting subtle details that influence diagnoses. Moreover, the lack of transparency in these deep learning models has slowed their adoption in clinical practice. Efforts in model interpretability are increasingly focused on making these models’ decision-making processes more transparent. In this paper, we introduce MAPUNetR, a novel architecture that synergizes the strengths of transformer models with the proven U-Net framework for medical image segmentation. Our model addresses the resolution preservation challenge and incorporates attention maps highlighting segmented regions, increasing accuracy and interpretability. Evaluated on the BraTS 2020 dataset, MAPUNetR achieved a dice score of 0.88 and a dice coefficient of 0.92 on the ISIC 2018 dataset. Our experiments show that the model maintains stable performance and potential as a powerful tool for medical image segmentation in clinical practice.

计算机视觉

[CV-0] ReferEverything: Towards Segmenting Everything We Can Speak of in Videos

链接: https://arxiv.org/abs/2410.23287
作者: Anurag Bagchi,Zhipeng Bao,Yu-Xiong Wang,Pavel Tokmakov,Martial Hebert
关键词-EN: natural language, segmenting a wide, wide range, Referral Video Process, Video Process Segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page at this https URL

点击查看摘要

Abstract:We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method capitalizes on visual-language representations learned by video diffusion models on Internet-scale datasets. A key insight of our approach is preserving as much of the generative model’s original representation as possible, while fine-tuning it on narrow-domain Referral Object Segmentation datasets. As a result, our framework can accurately segment and track rare and unseen objects, despite being trained on object masks from a limited set of categories. Additionally, it can generalize to non-object dynamic concepts, such as waves crashing in the ocean, as demonstrated in our newly introduced benchmark for Referral Video Process Segmentation (Ref-VPS). Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while outperforming them by up to twelve points in terms of region similarity on out-of-domain data, leveraging the power of Internet-scale pre-training.

[CV-1] RelationBooth: Towards Relation-Aware Customized Object Generation

链接: https://arxiv.org/abs/2410.23280
作者: Qingyu Shi,Lu Qi,Jianzong Wu,Jinbin Bai,Jingbo Wang,Yunhai Tong,Xiangtai Li,Ming-Hsuan Yang
关键词-EN: delivering personalized content, personalized content based, Customized image generation, aligning large-scale, user-provided image prompts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Customized image generation is crucial for delivering personalized content based on user-provided image prompts, aligning large-scale text-to-image diffusion models with individual needs. However, existing models often overlook the relationships between customized objects in generated images. Instead, this work addresses that gap by focusing on relation-aware customized image generation, which aims to preserve the identities from image prompts while maintaining the predicate relations described in text prompts. Specifically, we introduce RelationBooth, a framework that disentangles identity and relation learning through a well-curated dataset. Our training data consists of relation-specific images, independent object images containing identity information, and text prompts to guide relation generation. Then, we propose two key modules to tackle the two main challenges: generating accurate and natural relations, especially when significant pose adjustments are required, and avoiding object confusion in cases of overlap. First, we introduce a keypoint matching loss that effectively guides the model in adjusting object poses closely tied to their relationships. Second, we incorporate local features from the image prompts to better distinguish between objects, preventing confusion in overlapping cases. Extensive results on three benchmarks demonstrate the superiority of RelationBooth in generating precise relations while preserving object identities across a diverse set of objects and relations. The source code and trained models will be made available to the public.

[CV-2] OpenSatMap: A Fine-grained High-resolution Satellite Dataset for Large-scale Map Construction NEURIPS2024

链接: https://arxiv.org/abs/2410.23278
作者: Hongbo Zhao,Lue Fan,Yuntao Chen,Haochen Wang,Yuran Yang,Xiaojuan Jin,Yixin Zhang,Gaofeng Meng,Zhaoxiang Zhang
关键词-EN: map construction, large-scale map construction, construct large-scale maps, map, satellite-based map construction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024 DB Track. Project Page: this https URL

点击查看摘要

Abstract:In this paper, we propose OpenSatMap, a fine-grained, high-resolution satellite dataset for large-scale map construction. Map construction is one of the foundations of the transportation industry, such as navigation and autonomous driving. Extracting road structures from satellite images is an efficient way to construct large-scale maps. However, existing satellite datasets provide only coarse semantic-level labels with a relatively low resolution (up to level 19), impeding the advancement of this field. In contrast, the proposed OpenSatMap (1) has fine-grained instance-level annotations; (2) consists of high-resolution images (level 20); (3) is currently the largest one of its kind; (4) collects data with high diversity. Moreover, OpenSatMap covers and aligns with the popular nuScenes dataset and Argoverse 2 dataset to potentially advance autonomous driving technologies. By publishing and maintaining the dataset, we provide a high-quality benchmark for satellite-based map construction and downstream tasks like autonomous driving.

[CV-3] PointRecon: Online Point-based 3D Reconstruction via Ray-based 2D-3D Matching

链接: https://arxiv.org/abs/2410.23245
作者: Chen Ziwen,Zexiang Xu,Li Fuxin
关键词-EN: monocular RGB videos, posed monocular RGB, RGB videos, monocular RGB, posed monocular
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose a novel online, point-based 3D reconstruction method from posed monocular RGB videos. Our model maintains a global point cloud representation of the scene, continuously updating the features and 3D locations of points as new images are observed. It expands the point cloud with newly detected points while carefully removing redundancies. The point cloud updates and depth predictions for new points are achieved through a novel ray-based 2D-3D feature matching technique, which is robust against errors in previous point position predictions. In contrast to offline methods, our approach processes infinite-length sequences and provides real-time updates. Additionally, the point cloud imposes no pre-defined resolution or scene size constraints, and its unified global representation ensures view consistency across perspectives. Experiments on the ScanNet dataset show that our method achieves state-of-the-art quality among online MVS approaches. Project page: this https URL

[CV-4] LGU-SLAM: Learnable Gaussian Uncertainty Matching with Deformable Correlation Sampling for Deep Visual SLAM

链接: https://arxiv.org/abs/2410.23231
作者: Yucheng Huang,Luping Ji,Hudong Liu,Mao Ye
关键词-EN: visual Simultaneous Localization, Deep visual Simultaneous, Localization and Mapping, leveraging deep visual, deep visual odometry
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep visual Simultaneous Localization and Mapping (SLAM) techniques, e.g., DROID, have made significant advancements by leveraging deep visual odometry on dense flow fields. In general, they heavily rely on global visual similarity matching. However, the ambiguous similarity interference in uncertain regions could often lead to excessive noise in correspondences, ultimately misleading SLAM in geometric modeling. To address this issue, we propose a Learnable Gaussian Uncertainty (LGU) matching. It mainly focuses on precise correspondence construction. In our scheme, a learnable 2D Gaussian uncertainty model is designed to associate matching-frame pairs. It could generate input-dependent Gaussian distributions for each correspondence map. Additionally, a multi-scale deformable correlation sampling strategy is devised to adaptively fine-tune the sampling of each direction by a priori look-up ranges, enabling reliable correlation construction. Furthermore, a KAN-bias GRU component is adopted to improve a temporal iterative enhancement for accomplishing sophisticated spatio-temporal modeling with limited parameters. The extensive experiments on real-world and synthetic datasets are conducted to validate the effectiveness and superiority of our method.
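为便于理解摘要中“可学习 2D 高斯不确定性”用于加权匹配的思路,下面给出一个对角协方差高斯权重图的极简示意(纯属演示,`gaussian_weight_map` 及其参数均为笔者假设的名称,并非论文实现):

```python
import numpy as np

def gaussian_weight_map(h, w, mu, sigma):
    """Unnormalized 2D Gaussian weights over an h x w grid.
    mu = (row, col) center; sigma = (row_std, col_std), a
    diagonal-covariance stand-in for a learnable uncertainty model."""
    ys, xs = np.mgrid[0:h, 0:w]
    d = ((ys - mu[0]) / sigma[0]) ** 2 + ((xs - mu[1]) / sigma[1]) ** 2
    return np.exp(-0.5 * d)

# Weight map peaked at the matched location (2, 2):
w_map = gaussian_weight_map(5, 5, mu=(2, 2), sigma=(1.0, 1.0))
```

直观上,距匹配中心越远的相关响应权重越低,从而抑制不确定区域的相似度干扰。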

[CV-5] ELMGS: Enhancing memory and computation scaLability through coMpression for 3D Gaussian Splatting

链接: https://arxiv.org/abs/2410.23213
作者: Muhammad Salman Ali,Sung-Ho Bae,Enzo Tartaglione
关键词-EN: Neural Radiance Fields, Gaussian Splatting models, Gaussian Splatting, Neural Radiance, Radiance Fields
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:3D models have recently been popularized by the potentiality of end-to-end training offered first by Neural Radiance Fields and most recently by 3D Gaussian Splatting models. The latter has the big advantage of naturally providing fast training convergence and high editability. However, as the research around these is still in its infancy, there is still a gap in the literature regarding the model’s scalability. In this work, we propose an approach enabling both memory and computation scalability of such models. More specifically, we propose an iterative pruning strategy that removes redundant information encoded in the model. We also enhance compressibility for the model by including in the optimization strategy a differentiable quantization and entropy coding estimator. Our results on popular benchmarks showcase the effectiveness of the proposed approach and open the road to the broad deployability of such a solution even on resource-constrained devices.

[CV-6] HEX: Hierarchical Emergence Exploitation in Self-Supervised Algorithms

链接: https://arxiv.org/abs/2410.23200
作者: Kiran Kokilepersaud,Seulgi Kim,Mohit Prabhushankar,Ghassan AlRegib
关键词-EN: SSL algorithms, SSL approaches, SSL, dimensional collapse, SSL approaches typically
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose an algorithm that can be used on top of a wide variety of self-supervised (SSL) approaches to take advantage of hierarchical structures that emerge during training. SSL approaches typically work through some invariance term to ensure consistency between similar samples and a regularization term to prevent global dimensional collapse. Dimensional collapse refers to data representations spanning a lower-dimensional subspace. Recent work has demonstrated that the representation space of these algorithms gradually reflects a semantic hierarchical structure as training progresses. Data samples of the same hierarchical grouping tend to exhibit greater dimensional collapse locally compared to the dataset as a whole due to sharing features in common with each other. Ideally, SSL algorithms would take advantage of this hierarchical emergence to have an additional regularization term to account for this local dimensional collapse effect. However, the construction of existing SSL algorithms does not account for this property. To address this, we propose an adaptive algorithm that performs a weighted decomposition of the denominator of the InfoNCE loss into two terms: local hierarchical and global collapse regularization respectively. This decomposition is based on an adaptive threshold that gradually lowers to reflect the emerging hierarchical structure of the representation space throughout training. It is based on an analysis of the cosine similarity distribution of samples in a batch. We demonstrate that this hierarchical emergence exploitation (HEX) approach can be integrated across a wide variety of SSL algorithms. Empirically, we show performance improvements of up to 5.6% relative improvement over baseline SSL approaches on classification accuracy on Imagenet with 100 epochs of training.
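摘要所述“按相似度阈值把 InfoNCE 分母加权分解为局部/全局两项”的思想,可用如下玩具代码示意(函数名、加权方式与常数均为笔者假设,并非论文的自适应阈值实现):

```python
import numpy as np

def hex_style_infonce(sim_pos, sim_negs, tau=0.1, thresh=0.5, alpha=0.5):
    """InfoNCE loss for one anchor, with the denominator split into a
    'local' term (negatives whose cosine similarity exceeds `thresh`,
    i.e. likely the same hierarchical group) and a 'global' term,
    recombined with weight alpha. Toy illustration only."""
    sim_negs = np.asarray(sim_negs, dtype=float)
    local = sim_negs[sim_negs >= thresh]      # same-group negatives
    other = sim_negs[sim_negs < thresh]       # rest of the batch
    num = np.exp(sim_pos / tau)
    den = num + alpha * np.exp(local / tau).sum() \
              + (1 - alpha) * np.exp(other / tau).sum()
    return -np.log(num / den)

loss_hi = hex_style_infonce(0.9, [0.7, 0.2, -0.1])   # good positive
loss_lo = hex_style_infonce(0.5, [0.7, 0.2, -0.1])   # weaker positive
```

正样本相似度越高,损失越小;论文中的阈值是依据批内余弦相似度分布自适应下调的,这里为演示固定为常数。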

[CV-7] Continuous Spatio-Temporal Memory Networks for 4D Cardiac Cine MRI Segmentation WACV2025

链接: https://arxiv.org/abs/2410.23191
作者: Meng Ye,Bingyu Xin,Leon Axel,Dimitris Metaxas
关键词-EN: magnetic resonance image, Current cardiac cine, cine magnetic resonance, abundant temporal information, cardiac cine magnetic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to WACV 2025

点击查看摘要

Abstract:Current cardiac cine magnetic resonance image (cMR) studies focus on the end diastole (ED) and end systole (ES) phases, while ignoring the abundant temporal information in the whole image sequence. This is because whole sequence segmentation is currently a tedious process and inaccurate. Conventional whole sequence segmentation approaches first estimate the motion field between frames, which is then used to propagate the mask along the temporal axis. However, the mask propagation results could be prone to error, especially for the basal and apex slices, where through-plane motion leads to significant morphology and structural change during the cardiac cycle. Inspired by recent advances in video object segmentation (VOS), based on spatio-temporal memory (STM) networks, we propose a continuous STM (CSTM) network for semi-supervised whole heart and whole sequence cMR segmentation. Our CSTM network takes full advantage of the spatial, scale, temporal and through-plane continuity prior of the underlying heart anatomy structures, to achieve accurate and fast 4D segmentation. Results of extensive experiments across multiple cMR datasets show that our method can improve the 4D cMR segmentation performance, especially for the hard-to-segment regions.

[CV-8] FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training

链接: https://arxiv.org/abs/2410.23142
作者: Tejaswini Medi,Steffen Jung,Margret Keuper
关键词-EN: Deep neural networks, Deep neural, Adversarial Training, adversarial, neural networks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep neural networks are susceptible to adversarial attacks and common corruptions, which undermine their robustness. In order to enhance model resilience against such challenges, Adversarial Training (AT) has emerged as a prominent solution. Nevertheless, adversarial robustness is often attained at the expense of model fairness during AT, i.e., disparity in class-wise robustness of the model. While distinctive classes become more robust towards such adversaries, hard to detect classes suffer. Recently, research has focused on improving model fairness specifically for perturbed images, overlooking the accuracy of the most likely non-perturbed data. Additionally, despite their robustness against the adversaries encountered during model training, state-of-the-art adversarial trained models have difficulty maintaining robustness and fairness when confronted with diverse adversarial threats or common corruptions. In this work, we address the above concerns by introducing a novel approach called Fair Targeted Adversarial Training (FAIR-TAT). We show that using targeted adversarial attacks for adversarial training (instead of untargeted attacks) can allow for more favorable trade-offs with respect to adversarial fairness. Empirical results validate the efficacy of our approach.
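FAIR-TAT 的核心是在对抗训练中用目标式(targeted)攻击替代非目标式攻击。下面以线性 softmax 分类器为例给出一步 targeted FGSM 的极简示意(梯度为解析解;仅作演示,并非 FAIR-TAT 本身):

```python
import numpy as np

def targeted_fgsm_linear(x, W, b, target, eps=0.1):
    """One targeted FGSM step on a linear softmax classifier:
    step x in the direction that DEcreases the cross-entropy of the
    chosen target class, i.e. x - eps * sign(grad_x CE(target))."""
    logits = W @ x + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    onehot = np.zeros_like(p)
    onehot[target] = 1.0
    grad_x = W.T @ (p - onehot)          # analytic d CE(target) / d x
    return x - eps * np.sign(grad_x)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.normal(size=4)
x_adv = targeted_fgsm_linear(x, W, b, target=2, eps=0.2)
```

与非目标式 FGSM(沿梯度上升方向远离真实标签)相比,目标式攻击将样本推向指定类别,可用于在训练时更精细地控制类别间的鲁棒性权衡。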

[CV-9] Why Fine-grained Labels in Pretraining Benefit Generalization?

链接: https://arxiv.org/abs/2410.23129
作者: Guan Zhe Hong,Yin Cui,Ariel Fuxman,Stanely Chan,Enming Luo
关键词-EN: Recent studies show, Recent studies, coarse-labeled data, studies show, yields better generalization
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: arXiv admin note: substantial text overlap with arXiv:2303.16887

点击查看摘要

Abstract:Recent studies show that pretraining a deep neural network with fine-grained labeled data, followed by fine-tuning on coarse-labeled data for downstream tasks, often yields better generalization than pretraining with coarse-labeled data. While there is ample empirical evidence supporting this, the theoretical justification remains an open problem. This paper addresses this gap by introducing a “hierarchical multi-view” structure to confine the input data distribution. Under this framework, we prove that: 1) coarse-grained pretraining only allows a neural network to learn the common features well, while 2) fine-grained pretraining helps the network learn the rare features in addition to the common ones, leading to improved accuracy on hard downstream test samples.

[CV-10] NASM: Neural Anisotropic Surface Meshing SIGGRAPH

链接: https://arxiv.org/abs/2410.23109
作者: Hongbo Li,Haikuan Zhu,Sikai Zhong,Ningna Wang,Cheng Lin,Xiaohu Guo,Shiqing Xin,Wenping Wang,Jing Hua,Zichun Zhong
关键词-EN: learning-based method, Euclidean embedding space, anisotropic surface meshing, paper introduces, Euclidean embedding
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Graphics (cs.GR)
*备注: SIGGRAPH Asia 2024 (Conference Track)

点击查看摘要

Abstract:This paper introduces a new learning-based method, NASM, for anisotropic surface meshing. Our key idea is to propose a graph neural network to embed an input mesh into a high-dimensional (high-d) Euclidean embedding space to preserve curvature-based anisotropic metric by using a dot product loss between high-d edge vectors. This can dramatically reduce the computational time and increase the scalability. Then, we propose a novel feature-sensitive remeshing on the generated high-d embedding to automatically capture sharp geometric features. We define a high-d normal metric, and then derive an automatic differentiation on a high-d centroidal Voronoi tessellation (CVT) optimization with the normal metric to simultaneously preserve geometric features and curvature anisotropy that exhibit in the original 3D shapes. To our knowledge, this is the first time that a deep learning framework and a large dataset are proposed to construct a high-d Euclidean embedding space for 3D anisotropic surface meshing. Experimental results are evaluated and compared with the state-of-the-art in anisotropic surface meshing on a large number of surface models from Thingi10K dataset as well as tested on extensive unseen 3D shapes from Multi-Garment Network dataset and FAUST human dataset.

[CV-11] Automated Image-Based Identification and Consistent Classification of Fire Patterns with Quantitative Shape Analysis and Spatial Location Identification

链接: https://arxiv.org/abs/2410.23105
作者: Pengkun Liu,Shuna Ni,Stanislav I. Stoliarov,Pingbo Tang
关键词-EN: investigators’ visual observations, traditionally classified based, Fire, Fire patterns, behavior and origin
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Fire patterns, consisting of fire effects that offer insights into fire behavior and origin, are traditionally classified based on investigators’ visual observations, leading to subjective interpretations. This study proposes a framework for quantitative fire pattern classification to support fire investigators, aiming for consistency and accuracy. The framework integrates four components. First, it leverages human-computer interaction to extract fire patterns from surfaces, combining investigator expertise with computational analysis. Second, it employs an aspect ratio-based random forest model to classify fire pattern shapes. Third, fire scene point cloud segmentation enables precise identification of fire-affected areas and the mapping of 2D fire patterns to 3D scenes. Lastly, spatial relationships between fire patterns and indoor elements support an interpretation of the fire scene. These components provide a method for fire pattern analysis that synthesizes qualitative and quantitative data. The framework’s classification results achieve 93% precision on synthetic data and 83% on real fire patterns.

[CV-12] First Place Solution to the ECCV 2024 ROAD Challenge @ ROAD Atomic Activity Recognition 2024

链接: https://arxiv.org/abs/2410.23092
作者: Ruyang Li,Tengfei Zhang,Heng Zhang,Tiejun Liu,Yanwei Wang,Xuelei Li
关键词-EN: team technical solution, ECCV ROAD, Track, report presents, presents our team
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This report presents our team’s technical solution for participating in Track 3 of the 2024 ECCV ROAD++ Challenge. The task of Track 3 is atomic activity recognition, which aims to identify 64 types of atomic activities in road scenes based on video content. Our approach primarily addresses the challenges of small objects, discriminating between single object and a group of objects, as well as model overfitting in this task. Firstly, we construct a multi-branch activity recognition framework that not only separates different object categories but also the tasks of single object and object group recognition, thereby enhancing recognition accuracy. Subsequently, we develop various model ensembling strategies, including integrations of multiple frame sampling sequences, different frame sampling sequence lengths, multiple training epochs, and different backbone networks. Furthermore, we propose an atomic activity recognition data augmentation method, which greatly expands the sample space by flipping video frames and road topology, effectively mitigating model overfitting. Our methods rank first in the test set of Track 3 for the ROAD++ Challenge 2024, and achieve 69% mAP.
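摘要中“翻转视频帧”的数据增强可示意如下;注意方向敏感的活动标签(如左转/右转)在镜像后需要互换,这里的 `label_swap` 映射为假设示例,并非该方案的实际标签表:

```python
import numpy as np

def hflip_clip(clip, label, label_swap):
    """Horizontally flip a video clip of shape (T, H, W, C) and remap
    direction-sensitive activity labels (a left turn becomes a right
    turn in the mirrored clip)."""
    flipped = clip[:, :, ::-1, :].copy()   # reverse the width axis
    return flipped, label_swap.get(label, label)

clip = np.arange(2 * 2 * 3 * 1).reshape(2, 2, 3, 1)
flipped, new_label = hflip_clip(
    clip, "turn_left",
    {"turn_left": "turn_right", "turn_right": "turn_left"},
)
```

这种翻转将样本空间近似扩大一倍,同时保持活动语义一致,有助于缓解摘要中提到的过拟合问题。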

[CV-13] CausalDiff: Causality-Inspired Disentanglement via Diffusion Model for Adversarial Defense NEURIPS2024

链接: https://arxiv.org/abs/2410.23091
作者: Mingkun Zhang,Keping Bi,Wei Chen,Quanrun Chen,Jiafeng Guo,Xueqi Cheng
关键词-EN: defend neural classifiers, remain vulnerable, ongoing efforts, efforts to defend, defend neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by NeurIPS 2024

点击查看摘要

Abstract:Despite ongoing efforts to defend neural classifiers from adversarial attacks, they remain vulnerable, especially to unseen attacks. In contrast, humans are difficult to be cheated by subtle manipulations, since we make judgments only based on essential factors. Inspired by this observation, we attempt to model label generation with essential label-causative factors and incorporate label-non-causative factors to assist data generation. For an adversarial example, we aim to discriminate the perturbations as non-causative factors and make predictions only based on the label-causative factors. Concretely, we propose a causal diffusion model (CausalDiff) that adapts diffusion models for conditional data generation and disentangles the two types of causal factors by learning towards a novel causal information bottleneck objective. Empirically, CausalDiff has significantly outperformed state-of-the-art defense methods on various unseen attacks, achieving an average robustness of 86.39% (+4.01%) on CIFAR-10, 56.25% (+3.13%) on CIFAR-100, and 82.62% (+4.93%) on GTSRB (German Traffic Sign Recognition Benchmark).

[CV-14] PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures

链接: https://arxiv.org/abs/2410.23089
作者: Tianxiang Wu,Minxin Nie,Ziqiang Cao
关键词-EN: Multimodal Large Language, Large Language Models, capabilities of Large Language, Large Language, solving visual-language tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The Multimodal Large Language Models (MLLMs) have activated the capabilities of Large Language Models (LLMs) in solving visual-language tasks by integrating visual information. The prevailing approach in existing MLLMs involves employing an image encoder to extract visual features, converting these features into visual tokens via an adapter, and then integrating them with the prompt into the LLM. However, because the process of image encoding is prompt-agnostic, the extracted visual features only provide a coarse description of the image, impossible to focus on the requirements of the prompt. On one hand, it is easy for image features to lack information about the prompt-specified objects, resulting in unsatisfactory responses. On the other hand, the visual features contain a large amount of irrelevant information, which not only increases the burden on memory but also worsens the generation effectiveness. To address the aforementioned issues, we propose PIP-MM, a framework that Pre-Integrates Prompt information into the visual encoding process using existing modules of MLLMs. Specifically, we utilize the frozen LLM in the MLLM to vectorize the input prompt, which summarizes the requirements of the this http URL, we input the prompt vector into our trained Multi-Layer Perceptron (MLP) to align with the visual input requirements, and subsequently replace the class embedding in the image encoder. Since our model only requires adding a trainable MLP, it can be applied to any MLLM. To validate the effectiveness of PIP-MM, we conducted experiments on multiple benchmarks. Automated evaluation metrics and manual assessments demonstrate the strong performance of this http URL noteworthy is that our model maintains excellent generation results even when half of the visual tokens are reduced.

[CV-15] First Place Solution to the ECCV 2024 ROAD Challenge @ ROAD Spatiotemporal Agent Detection 2024

链接: https://arxiv.org/abs/2410.23077
作者: Tengfei Zhang,Heng Zhang,Ruyang Li,Qi Deng,Yaqian Zhao,Rengang Li
关键词-EN: ECCV ROAD, report presents, presents our team, Track, spatiotemporal agent detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This report presents our team’s solutions for the Track 1 of the 2024 ECCV ROAD++ Challenge. The task of Track 1 is spatiotemporal agent detection, which aims to construct an “agent tube” for road agents in consecutive video frames. Our solutions focus on the challenges in this task, including extreme-size objects, low-light scenarios, class imbalance, and fine-grained classification. Firstly, the extreme-size object detection heads are introduced to improve the detection performance of large and small objects. Secondly, we design a dual-stream detection model with a low-light enhancement stream to improve the performance of spatiotemporal agent detection in low-light scenes, and the feature fusion module to integrate features from different branches. Subsequently, we develop a multi-branch detection framework to mitigate the issues of class imbalance and fine-grained classification, and we design a pre-training and fine-tuning approach to optimize the above multi-branch framework. Besides, we employ some common data augmentation techniques, and improve the loss function and upsampling operation. We rank first in the test set of Track 1 for the ROAD++ Challenge 2024, and achieve 30.82% average video-mAP.

[CV-16] RSNet: A Light Framework for The Detection of Multi-scale Remote Sensing Targets

链接: https://arxiv.org/abs/2410.23073
作者: Hongyu Chen,Chengcheng Chen,Fei Wang,Yuhu Shi,Weiming Zeng
关键词-EN: synthetic aperture radar, deep learning techniques, Recent developments, learning techniques achieve, techniques achieve remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recent developments in synthetic aperture radar (SAR) ship detection have seen deep learning techniques achieve remarkable progress in accuracy and speed. However, the detection of small targets against complex backgrounds remains a significant challenge. To tackle these difficulties, this letter presents RSNet, a lightweight framework aimed at enhancing ship detection capabilities in SAR imagery. RSNet features the Waveletpool-ContextGuided (WCG) backbone for enhanced accuracy with fewer parameters, and the Waveletpool-StarFusion (WSF) head for efficient parameter reduction. Additionally, a Lightweight-Shared (LS) module minimizes the detection head’s parameter load. Experiments on the SAR Ship Detection Dataset (SSDD) and High-Resolution SAR Image Dataset (HRSID) demonstrate that RSNet achieves a strong balance between lightweight design and detection performance, surpassing many state-of-the-art detectors, reaching 72.5% and 67.6% in mAP(.50:95) respectively with 1.49M parameters. Our code will be released soon.

[CV-17] Neural Attention Field: Emerging Point Relevance in 3D Scenes for One-Shot Dexterous Grasping

链接: https://arxiv.org/abs/2410.23039
作者: Qianxu Wang,Congyue Deng,Tyler Ga Wei Lum,Yuanpei Chen,Yaodong Yang,Jeannette Bohg,Yixin Zhu,Leonidas Guibas
关键词-EN: challenging problem, context variations, One-shot transfer, feature fields, feature
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:One-shot transfer of dexterous grasps to novel scenes with object and context variations has been a challenging problem. While distilled feature fields from large vision models have enabled semantic correspondences across 3D scenes, their features are point-based and restricted to object surfaces, limiting their capability of modeling complex semantic feature distributions for hand-object interactions. In this work, we propose the *neural attention field* for representing semantic-aware dense feature fields in the 3D space by modeling inter-point relevance instead of individual point features. Core to it is a transformer decoder that computes the cross-attention between any 3D query point with all the scene points, and provides the query point feature with an attention-based aggregation. We further propose a self-supervised framework for training the transformer decoder from only a few 3D pointclouds without hand demonstrations. Post-training, the attention field can be applied to novel scenes for semantics-aware dexterous grasping from one-shot demonstration. Experiments show that our method provides better optimization landscapes by encouraging the end-effector to focus on task-relevant scene regions, resulting in significant improvements in success rates on real robots compared with the feature-field-based methods.
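The attention-based aggregation described above (a 3D query point attending over all scene points) can be sketched in a few lines. This is a toy stand-in with random projection matrices, not the paper's trained transformer decoder; the function name, feature dimensions, and projections are illustrative assumptions:

```python
import numpy as np

def cross_attention_feature(query_xyz, scene_feat, d_k=64, seed=0):
    """Aggregate a feature for one 3D query point via softmax(qK^T/sqrt(d))V
    over all scene points. Projections are random here, for illustration;
    a trained decoder would learn them."""
    rng = np.random.default_rng(seed)
    d_in = scene_feat.shape[1]
    Wq = rng.standard_normal((3, d_k)) / np.sqrt(3)       # hypothetical weights
    Wk = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    Wv = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    q = query_xyz @ Wq                                    # (d_k,)
    K, V = scene_feat @ Wk, scene_feat @ Wv               # (N, d_k)
    scores = K @ q / np.sqrt(d_k)                         # (N,) relevance of each scene point
    w = np.exp(scores - scores.max())
    w /= w.sum()                                          # attention distribution over the scene
    return w @ V                                          # aggregated query-point feature

rng = np.random.default_rng(1)
scene_feat = rng.standard_normal((128, 32))               # 128 scene points, 32-d features
feat = cross_attention_feature(np.array([0.3, -0.2, 0.5]), scene_feat)
```

The returned vector is dense in the sense that it is defined for any query location, not only on object surfaces, which is the property the abstract emphasizes.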

[CV-18] DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes

链接: https://arxiv.org/abs/2410.23004
作者: Jialiang Zhang,Haoran Liu,Danshi Li,Xinqiang Yu,Haoran Geng,Yufei Ding,Jiayi Chen,He Wang
关键词-EN: remains highly challenging, scenes remains highly, dexterous hands due, remains highly, highly challenging
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Grasping in cluttered scenes remains highly challenging for dexterous hands due to the scarcity of data. To address this problem, we present a large-scale synthetic benchmark, encompassing 1319 objects, 8270 scenes, and 427 million grasps. Beyond benchmarking, we also propose a novel two-stage grasping method that learns efficiently from data by using a diffusion model that conditions on local geometry. Our proposed generative method outperforms all baselines in simulation experiments. Furthermore, with the aid of test-time depth restoration, our method demonstrates zero-shot sim-to-real transfer, attaining 90.7% real-world dexterous grasping success rate in cluttered scenes.

[CV-19] LumiSculpt: A Consistency Lighting Control Network for Video Generation

链接: https://arxiv.org/abs/2410.22979
作者: Yuxin Zhang,Dandan Zheng,Biao Gong,Jingdong Chen,Ming Yang,Weiming Dong,Changsheng Xu
关键词-EN: video generation, significantly influencing, generated content, plays a pivotal, pivotal role
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Lighting plays a pivotal role in ensuring the naturalness of video generation, significantly influencing the aesthetic quality of the generated content. However, due to the deep coupling between lighting and the temporal features of videos, it remains challenging to disentangle and model independent and coherent lighting attributes, limiting the ability to control lighting in video generation. In this paper, inspired by the established controllable T2I models, we propose LumiSculpt, which, for the first time, enables precise and consistent lighting control in T2V generation. LumiSculpt equips the video generation with strong interactive capabilities, allowing the input of custom lighting reference image sequences. Furthermore, the core learnable plug-and-play module of LumiSculpt facilitates remarkable control over lighting intensity, position, and trajectory in latent video diffusion models based on the advanced DiT architecture. Moreover, to effectively train LumiSculpt and address the issue of insufficient lighting data, we construct LumiHuman, a new lightweight and flexible dataset for portrait lighting of images and videos. Experimental results demonstrate that LumiSculpt achieves precise and high-quality lighting control in video generation.

[CV-20] EnsIR: An Ensemble Algorithm for Image Restoration via Gaussian Mixture Models

链接: https://arxiv.org/abs/2410.22959
作者: Shangquan Sun,Wenqi Ren,Zikun Liu,Hyunhee Park,Rui Wang,Xiaochun Cao
关键词-EN: experienced significant advancements, significant advancements due, experienced significant, significant advancements, advancements due
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages for main manuscript, additional 17 pages for appendix, 18 figures, 17MB

点击查看摘要

Abstract:Image restoration has experienced significant advancements due to the development of deep learning. Nevertheless, it encounters challenges related to ill-posed problems, resulting in deviations between single model predictions and ground-truths. Ensemble learning, as a powerful machine learning technique, aims to address these deviations by combining the predictions of multiple base models. Most existing works adopt ensemble learning during the design of restoration models, while only limited research focuses on the inference-stage ensemble of pre-trained restoration models. Regression-based methods fail to enable efficient inference, leading researchers in academia and industry to prefer averaging as their choice for post-training ensemble. To address this, we reformulate the ensemble problem of image restoration into Gaussian mixture models (GMMs) and employ an expectation maximization (EM)-based algorithm to estimate ensemble weights for aggregating prediction candidates. We estimate the range-wise ensemble weights on a reference set and store them in a lookup table (LUT) for efficient ensemble inference on the test set. Our algorithm is model-agnostic and training-free, allowing seamless integration and enhancement of various pre-trained image restoration models. It consistently outperforms regression-based methods and averaging ensemble approaches on 14 benchmarks across 3 image restoration tasks, including super-resolution, deblurring and deraining. The code and all estimated weights have been released on GitHub.
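The EM estimation of ensemble weights under a Gaussian mixture can be sketched as follows. This is a simplified global-weight version for illustration only: the paper estimates range-wise weights and caches them in a LUT, and the fixed noise scale `sigma` below is an assumption:

```python
import numpy as np

def em_ensemble_weights(preds, target, sigma=0.1, iters=50):
    """Estimate mixture weights w_m for M model predictions, modeling each
    ground-truth pixel as drawn from sum_m w_m * N(pred_m, sigma^2).
    preds: (M, N) flattened predictions; target: (N,) flattened reference."""
    M, N = preds.shape
    w = np.full(M, 1.0 / M)                                   # uniform init
    for _ in range(iters):
        # E-step: per-pixel responsibility of each model (log-domain for stability)
        log_lik = -0.5 * ((target - preds) ** 2) / sigma**2   # (M, N)
        log_post = np.log(w)[:, None] + log_lik
        log_post -= log_post.max(axis=0, keepdims=True)
        r = np.exp(log_post)
        r /= r.sum(axis=0, keepdims=True)
        # M-step: weights = average responsibility
        w = r.mean(axis=1)
    return w

# Toy demo: model 0 tracks the target closely, model 1 is much noisier
rng = np.random.default_rng(0)
target = rng.standard_normal(1000)
preds = np.stack([target + 0.05 * rng.standard_normal(1000),
                  target + 0.50 * rng.standard_normal(1000)])
w = em_ensemble_weights(preds, target)   # w[0] should dominate
```

The weighted combination `w @ preds` then serves as the ensemble output, which is the training-free aggregation the abstract describes.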

[CV-21] AdaptiveISP: Learning an Adaptive Image Signal Processor for Object Detection NEURIPS2024

链接: https://arxiv.org/abs/2410.22939
作者: Yujin Wang,Tianyi Xu,Fan Zhang,Tianfan Xue,Jinwei Gu
关键词-EN: Image Signal Processors, raw sensor signals, Signal Processors, convert raw sensor, ISP
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at NeurIPS2024

点击查看摘要

Abstract:Image Signal Processors (ISPs) convert raw sensor signals into digital images, which significantly influence the image quality and the performance of downstream computer vision tasks. Designing ISP pipeline and tuning ISP parameters are two key steps for building an imaging and vision system. To find optimal ISP configurations, recent works use deep neural networks as a proxy to search for ISP parameters or ISP pipelines. However, these methods are primarily designed to maximize the image quality, which are sub-optimal in the performance of high-level computer vision tasks such as detection, recognition, and tracking. Moreover, after training, the learned ISP pipelines are mostly fixed at the inference time, whose performance degrades in dynamic scenes. To jointly optimize ISP structures and parameters, we propose AdaptiveISP, a task-driven and scene-adaptive ISP. One key observation is that for the majority of input images, only a few processing modules are needed to improve the performance of downstream recognition tasks, and only a few inputs require more processing. Based on this, AdaptiveISP utilizes deep reinforcement learning to automatically generate an optimal ISP pipeline and the associated ISP parameters to maximize the detection performance. Experimental results show that AdaptiveISP not only surpasses the prior state-of-the-art methods for object detection but also dynamically manages the trade-off between detection performance and computational cost, especially suitable for scenes with large dynamic range variations. Project website: this https URL.

[CV-22] Bringing NeRFs to the Latent Space: Inverse Graphics Autoencoder

链接: https://arxiv.org/abs/2410.22936
作者: Antoine Schnepf,Karim Kassab,Jean-Yves Franceschi,Laurent Caraffa,Flavian Vasile,Jeremie Mary,Andrew Comport,Valerie Gouet-Brunet
关键词-EN: inverse graphics, Inverse Graphics Autoencoder, latent, applying inverse graphics, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While pre-trained image autoencoders are increasingly utilized in computer vision, the application of inverse graphics in 2D latent spaces has been under-explored. Yet, besides reducing the training and rendering complexity, applying inverse graphics in the latent space enables a valuable interoperability with other latent-based 2D methods. The major challenge is that inverse graphics cannot be directly applied to such image latent spaces because they lack an underlying 3D geometry. In this paper, we propose an Inverse Graphics Autoencoder (IG-AE) that specifically addresses this issue. To this end, we regularize an image autoencoder with 3D-geometry by aligning its latent space with jointly trained latent 3D scenes. We utilize the trained IG-AE to bring NeRFs to the latent space with a latent NeRF training pipeline, which we implement in an open-source extension of the Nerfstudio framework, thereby unlocking latent scene learning for its supported methods. We experimentally confirm that Latent NeRFs trained with IG-AE present an improved quality compared to a standard autoencoder, all while exhibiting training and rendering accelerations with respect to NeRFs trained in the image space. Our project page can be found at this https URL .

[CV-23] An Individual Identity-Driven Framework for Animal Re-Identification

链接: https://arxiv.org/abs/2410.22927
作者: Yihao Wu,Di Zhao,Jingfeng Zhang,Yun Sing Koh
关键词-EN: large wildlife populations, Reliable re-identification, Animal ReID, ecological research, wildlife conservation
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Reliable re-identification of individuals within large wildlife populations is crucial for biological studies, ecological research, and wildlife conservation. Classic computer vision techniques offer a promising direction for Animal Re-identification (Animal ReID), but their backbones’ closed-set nature limits their applicability and generalizability. Despite the demonstrated effectiveness of vision-language models like CLIP in re-identifying persons and vehicles, their application to Animal ReID remains limited due to unique challenges, such as the various visual representations of animals, including variations in poses and forms. To address these limitations, we leverage CLIP’s cross-modal capabilities to introduce a two-stage framework, the Individual Animal IDentity-Driven (IndivAID) framework, specifically designed for Animal ReID. In the first stage, IndivAID trains a text description generator by extracting individual semantic information from each image, generating both image-specific and individual-specific textual descriptions that fully capture the diverse visual concepts of each individual across animal images. In the second stage, IndivAID refines its learning of visual concepts by dynamically incorporating individual-specific textual descriptions with an integrated attention module to further highlight discriminative features of individuals for Animal ReID. Evaluation against state-of-the-art methods across eight benchmark datasets and a real-world Stoat dataset demonstrates IndivAID’s effectiveness and applicability. Code is available at this https URL.

[CV-24] High-Fidelity Document Stain Removal via A Large-Scale Real-World Dataset and A Memory-Augmented Transformer WACV2025

链接: https://arxiv.org/abs/2410.22922
作者: Mingxian Li,Hao Sun,Yingtie Lei,Xiaofeng Zhang,Yihang Dong,Yilin Zhou,Zimeng Li,Xuhang Chen
关键词-EN: hindering downstream applications, document stain removal, stain removal, Document, stain
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by WACV2025

点击查看摘要

Abstract:Document images are often degraded by various stains, significantly impacting their readability and hindering downstream applications such as document digitization and analysis. The absence of a comprehensive stained document dataset has limited the effectiveness of existing document enhancement methods in removing stains while preserving fine-grained details. To address this challenge, we construct StainDoc, the first large-scale, high-resolution (2145×2245) dataset specifically designed for document stain removal. StainDoc comprises over 5,000 pairs of stained and clean document images across multiple scenes. This dataset encompasses a diverse range of stain types, severities, and document backgrounds, facilitating robust training and evaluation of document stain removal algorithms. Furthermore, we propose StainRestorer, a Transformer-based document stain removal approach. StainRestorer employs a memory-augmented Transformer architecture that captures hierarchical stain representations at part, instance, and semantic levels via the DocMemory module. The Stain Removal Transformer (SRTransformer) leverages these feature representations through a dual attention mechanism: an enhanced spatial attention with an expanded receptive field, and a channel attention captures channel-wise feature importance. This combination enables precise stain removal while preserving document content integrity. Extensive experiments demonstrate StainRestorer’s superior performance over state-of-the-art methods on the StainDoc dataset and its variants StainDoc_Mark and StainDoc_Seal, establishing a new benchmark for document stain removal. Our work highlights the potential of memory-augmented Transformers for this task and contributes a valuable dataset to advance future research.

[CV-25] UniRiT: Towards Few-Shot Non-Rigid Point Cloud Registration

链接: https://arxiv.org/abs/2410.22909
作者: Geng Li,Haozhi Cao,Mingyang Liu,Chenxi Jiang,Jianfei Yang
关键词-EN: scene understanding, surgical navigation, critical challenge, Non-rigid point cloud, Non-rigid
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 21 pages, 14 figures, under review

点击查看摘要

Abstract:Non-rigid point cloud registration is a critical challenge in 3D scene understanding, particularly in surgical navigation. Although existing methods achieve excellent performance when trained on large-scale, high-quality datasets, these datasets are prohibitively expensive to collect and annotate, e.g., organ data in authentic medical scenarios. With insufficient training samples and data noise, existing methods degrade significantly since non-rigid patterns are more flexible and complicated than rigid ones, and the distributions across samples are more distinct, leading to higher difficulty in representation learning with few data. In this work, we aim to deal with this challenging few-shot non-rigid point cloud registration problem. Based on the observation that complex non-rigid transformation patterns can be decomposed into rigid and small non-rigid transformations, we propose a novel and effective framework, UniRiT. UniRiT adopts a two-step registration strategy that first aligns the centroids of the source and target point clouds and then refines the registration with non-rigid transformations, thereby significantly reducing the problem complexity. To validate the performance of UniRiT on real-world datasets, we introduce a new dataset, MedMatch3D, which consists of real human organs and exhibits high variability in sample distribution. We further establish a new challenging benchmark for few-shot non-rigid registration. Extensive empirical results demonstrate that UniRiT achieves state-of-the-art performance on MedMatch3D, improving the existing best approach by 94.22%.
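The rigid first step of the coarse-to-fine strategy described above (centroid alignment followed by a least-squares rotation) can be sketched with the standard Kabsch algorithm. This is generic textbook code under that reading of the abstract, not UniRiT's implementation, and the non-rigid refinement stage is not reproduced here:

```python
import numpy as np

def rigid_align(source, target):
    """Least-squares rigid alignment (Kabsch): center both clouds on their
    centroids, solve for the optimal rotation via SVD, and map the source
    onto the target frame. Assumes source[i] corresponds to target[i]."""
    sc, tc = source.mean(axis=0), target.mean(axis=0)
    H = (source - sc).T @ (target - tc)               # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return (source - sc) @ R.T + tc, R

# Demo: a rotated + translated copy of the target should align exactly
rng = np.random.default_rng(0)
target = rng.standard_normal((100, 3))
theta = 0.5
R0 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
source = target @ R0.T + np.array([5.0, -2.0, 1.0])
aligned, R = rigid_align(source, target)
```

After this coarse rigid step, only a small residual non-rigid deformation remains to be estimated, which is the complexity reduction the abstract argues for.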

[CV-26] HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models

链接: https://arxiv.org/abs/2410.22901
作者: Shengkai Zhang,Nianhong Jiao,Tian Li,Chaojie Yang,Chenhui Xue,Boya Niu,Jun Gao
关键词-EN: complex downstream tasks, propose an effective, enables the execution, execution of complex, complex downstream
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 7 figures, 2 tables

点击查看摘要

Abstract:We propose an effective method for inserting adapters into text-to-image foundation models, which enables the execution of complex downstream tasks while preserving the generalization ability of the base model. The core idea of this method is to optimize the attention mechanism related to 2D feature maps, which enhances the performance of the adapter. This approach was validated on the task of meme video generation and achieved significant results. We hope this work can provide insights for post-training tasks of large text-to-image models. Additionally, as this method demonstrates good compatibility with SD1.5 derivative models, it holds certain value for the open-source community. Therefore, we will release the related code (this https URL).

[CV-27] Wormhole Loss for Partial Shape Matching NEURIPS

链接: https://arxiv.org/abs/2410.22899
作者: Amit Bracha,Thomas Dagès,Ron Kimmel
关键词-EN: fundamental question arises, matching process, question arises, fundamental question, partial shape matching
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for publication at the conference on Neural Information Processing Systems (NeurIPS) 2024

点击查看摘要

Abstract:When matching parts of a surface to its whole, a fundamental question arises: Which points should be included in the matching process? The issue is intensified when using isometry to measure similarity, as it requires the validation of whether distances measured between pairs of surface points should influence the matching process. The approach we propose treats surfaces as manifolds equipped with geodesic distances, and addresses the partial shape matching challenge by introducing a novel criterion to meticulously search for consistent distances between pairs of points. The new criterion explores the relation between intrinsic geodesic distances between the points, geodesic distances between the points and surface boundaries, and extrinsic distances between boundary points measured in the embedding space. It is shown to be less restrictive compared to previous measures and achieves state-of-the-art results when used as a loss function in training networks for partial shape matching.

[CV-28] Prune and Repaint: Content-Aware Image Retargeting for any Ratio NEURIPS24

链接: https://arxiv.org/abs/2410.22865
作者: Feihong Shen,Chao Li,Yifeng Geng,Yongjian Deng,Hao Chen
关键词-EN: presentation environments, task of adjusting, suit different display, display devices, devices or presentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS24

点击查看摘要

Abstract:Image retargeting is the task of adjusting the aspect ratio of images to suit different display devices or presentation environments. However, existing retargeting methods often struggle to balance the preservation of key semantics and image quality, resulting in either deformation or loss of important objects, or the introduction of local artifacts such as discontinuous pixels and inconsistent regenerated content. To address these issues, we propose a content-aware retargeting method called PruneRepaint. It incorporates semantic importance for each pixel to guide the identification of regions that need to be pruned or preserved in order to maintain key semantics. Additionally, we introduce an adaptive repainting module that selects image regions for repainting based on the distribution of pruned pixels and the proportion between foreground size and target aspect ratio, thus achieving local smoothness after pruning. By focusing on the content and structure of the foreground, our PruneRepaint approach adaptively avoids key content loss and deformation, while effectively mitigating artifacts with local repainting. We conduct experiments on the public RetargetMe benchmark and demonstrate through objective experimental results and subjective user studies that our method outperforms previous approaches in terms of preserving semantics and aesthetics, as well as better generalization across diverse aspect ratios. Codes will be available at this https URL.

[CV-29] AtGCN: A Graph Convolutional Network For Ataxic Gait Detection

链接: https://arxiv.org/abs/2410.22862
作者: Karan Bania,Tanmay Verlekar
关键词-EN: Video-based gait analysis, Video-based gait, diagnosing pathologies, ataxic gait, task of diagnosing
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Video-based gait analysis can be defined as the task of diagnosing pathologies, such as ataxia, using videos of patients walking in front of a camera. This paper presents a graph convolution network called AtGCN for detecting ataxic gait and identifying its severity using 2D videos. The problem is especially challenging as the deviation of an ataxic gait from a healthy gait is very subtle. The datasets for ataxic gait detection are also quite small, with the largest dataset having only 149 videos. The paper addresses the first problem using special spatiotemporal graph convolution that successfully captures important gait-related features. To handle the small dataset size, a deep spatiotemporal graph convolution network pre-trained on an action recognition dataset is systematically truncated and then fine-tuned on the ataxia dataset to obtain the AtGCN model. The paper also presents an augmentation strategy that segments a video sequence into multiple gait cycles. The proposed AtGCN model then operates on a graph of body part locations belonging to a single gait cycle. The evaluation results support the strength of the proposed AtGCN model, as it outperforms the state-of-the-art in detection and severity prediction with an accuracy of 93.46% and a MAE of 0.4169, respectively.

[CV-30] DAVINCI: A Single-Stage Architecture for Constrained CAD Sketch Inference BMVC2024

链接: https://arxiv.org/abs/2410.22857
作者: Ahmet Serdar Karadeniz,Dimitrios Mallis,Nesryne Mejri,Kseniya Cherenkova,Anis Kacem,Djamila Aouada
关键词-EN: single-stage Computer-Aided Design, Computer-Aided Design, CAD sketch, CAD, work presents DAVINCI
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: Accepted at BMVC 2024

点击查看摘要

Abstract:This work presents DAVINCI, a unified architecture for single-stage Computer-Aided Design (CAD) sketch parameterization and constraint inference directly from raster sketch images. By jointly learning both outputs, DAVINCI minimizes error accumulation and enhances the performance of constrained CAD sketch inference. Notably, DAVINCI achieves state-of-the-art results on the large-scale SketchGraphs dataset, demonstrating effectiveness on both precise and hand-drawn raster CAD sketches. To reduce DAVINCI’s reliance on large-scale annotated datasets, we explore the efficacy of CAD sketch augmentations. We introduce Constraint-Preserving Transformations (CPTs), i.e. random permutations of the parametric primitives of a CAD sketch that preserve its constraints. This data augmentation strategy allows DAVINCI to achieve reasonable performance when trained with only 0.1% of the SketchGraphs dataset. Furthermore, this work contributes a new version of SketchGraphs, augmented with CPTs. The newly introduced CPTSketchGraphs dataset includes 80 million CPT-augmented sketches, thus providing a rich resource for future research in the CAD sketch domain.

[CV-31] SFDFusion: An Efficient Spatial-Frequency Domain Fusion Network for Infrared and Visible Image Fusion ECAI2024

链接: https://arxiv.org/abs/2410.22837
作者: Kun Hu,Qingle Zhang,Maoxun Yuan,Yitian Zhang
关键词-EN: frequency domain, domain, rich texture details, Frequency Domain Fusion, frequency domain information
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accept in ECAI 2024

点击查看摘要

Abstract:Infrared and visible image fusion aims to utilize the complementary information from two modalities to generate fused images with prominent targets and rich texture details. Most existing algorithms only perform pixel-level or feature-level fusion from different modalities in the spatial domain. They usually overlook the information in the frequency domain, and some of them suffer from inefficiency due to excessively complex structures. To tackle these challenges, this paper proposes an efficient Spatial-Frequency Domain Fusion (SFDFusion) network for infrared and visible image fusion. First, we propose a Dual-Modality Refinement Module (DMRM) to extract complementary information. This module extracts useful information from both the infrared and visible modalities in the spatial domain and enhances fine-grained spatial details. Next, to introduce frequency domain information, we construct a Frequency Domain Fusion Module (FDFM) that transforms the spatial domain to the frequency domain through Fast Fourier Transform (FFT) and then integrates frequency domain information. Additionally, we design a frequency domain fusion loss to provide guidance for the fusion process. Extensive experiments on public datasets demonstrate that our method produces fused images with significant advantages in various fusion metrics and visual effects. Furthermore, our method demonstrates high efficiency in image fusion and good performance on downstream detection tasks, thereby satisfying the real-time demands of advanced visual tasks.
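The FFT round-trip at the heart of a frequency-domain fusion module can be illustrated with a hand-crafted rule: keep, at each frequency, the coefficient with larger magnitude. This is only a stand-in for intuition; the paper's FDFM learns its fusion rather than applying this max rule:

```python
import numpy as np

def freq_fuse(img_a, img_b):
    """Toy frequency-domain fusion of two single-channel images:
    FFT both, select the larger-magnitude coefficient per frequency,
    then inverse-FFT back to the spatial domain."""
    Fa, Fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    fused_spec = np.where(np.abs(Fa) >= np.abs(Fb), Fa, Fb)
    return np.real(np.fft.ifft2(fused_spec))

# Stand-ins for the two modalities: a bright "target" and a fine texture
a = np.zeros((32, 32)); a[8:24, 8:24] = 1.0                      # infrared-like
b = np.sin(np.linspace(0, 8 * np.pi, 32))[None, :] * np.ones((32, 1))  # visible-like
fused = freq_fuse(a, b)
```

Because real inputs have conjugate-symmetric spectra and the magnitude test respects that symmetry, the inverse FFT is real up to floating-point error.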

[CV-32] Situational Scene Graph for Structured Human-centric Situation Understanding WACV2025

链接: https://arxiv.org/abs/2410.22829
作者: Chinthani Sugandhika,Chen Li,Deepu Rajan,Basura Fernando
关键词-EN: modelling spatio-temporal relationships, Situational Scene Graph, semantic properties, modelling spatio-temporal, Scene Graph
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for WACV 2025

点击查看摘要

Abstract:Graph based representation has been widely used in modelling spatio-temporal relationships in video understanding. Although effective, existing graph-based approaches focus on capturing the human-object relationships while ignoring fine-grained semantic properties of the action components. These semantic properties are crucial for understanding the current situation, such as where the action takes place, what tools are used, and the functional properties of the objects. In this work, we propose a graph-based representation called Situational Scene Graph (SSG) to encode both human-object relationships and the corresponding semantic properties. The semantic details are represented as predefined roles and values inspired by situation frame, which was originally designed to represent a single action. Based on our proposed representation, we introduce the task of situational scene graph generation and propose a multi-stage pipeline Interactive and Complementary Network (InComNet) to address the task. Given that the existing datasets are not applicable to the task, we further introduce a SSG dataset whose annotations consist of semantic role-value frames for human, objects and verb predicates of human-object relations. Finally, we demonstrate the effectiveness of our proposed SSG representation by testing on different downstream tasks. Experimental results show that the unified representation can not only benefit predicate classification and semantic role-value classification, but also benefit reasoning tasks on human-centric situation understanding. We will release the code and the dataset soon.

[CV-33] Epipolar-Free 3D Gaussian Splatting for Generalizable Novel View Synthesis NEURIPS2024

链接: https://arxiv.org/abs/2410.22817
作者: Zhiyuan Min,Yawei Luo,Jianwen Sun,Yi Yang
关键词-EN: scene-specific retraining required, feed-forward inference manner, inference manner, required in conventional, sparse-view observations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Generalizable 3D Gaussian splatting (3DGS) can reconstruct new scenes from sparse-view observations in a feed-forward inference manner, eliminating the need for scene-specific retraining required in conventional 3DGS. However, existing methods rely heavily on epipolar priors, which can be unreliable in complex real-world scenes, particularly in non-overlapping and occluded regions. In this paper, we propose eFreeSplat, an efficient feed-forward 3DGS-based model for generalizable novel view synthesis that operates independently of epipolar line constraints. To enhance multiview feature extraction with 3D perception, we employ a self-supervised Vision Transformer (ViT) with cross-view completion pre-training on large-scale datasets. Additionally, we introduce an Iterative Cross-view Gaussians Alignment method to ensure consistent depth scales across different views. Our eFreeSplat represents an innovative approach for generalizable novel view synthesis. Different from the existing pure geometry-free methods, eFreeSplat focuses more on achieving epipolar-free feature matching and encoding by providing 3D priors through cross-view pretraining. We evaluate eFreeSplat on wide-baseline novel view synthesis tasks using the RealEstate10K and ACID datasets. Extensive experiments demonstrate that eFreeSplat surpasses state-of-the-art baselines that rely on epipolar priors, achieving superior geometry reconstruction and novel view synthesis quality. Project page: this https URL.

[CV-34] Adaptive Multi Scale Document Binarisation Using Vision Mamba

链接: https://arxiv.org/abs/2410.22811
作者: Mohd. Azfar,Siddhant Bharadwaj,Ashwin Sasikumar
关键词-EN: Enhancing and preserving, document image analysis, effective document image, image analysis, preserving the readability
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Enhancing and preserving the readability of document images, particularly historical ones, is crucial for effective document image analysis. Numerous models have been proposed for this task, including convolutional-based, transformer-based, and hybrid convolutional-transformer architectures. While hybrid models address the limitations of purely convolutional or transformer-based methods, they often suffer from issues like quadratic time complexity. In this work, we propose a Mamba-based architecture for document binarisation, which efficiently handles long sequences by scaling linearly and optimizing memory usage. Additionally, we introduce novel modifications to the skip connections by incorporating Difference of Gaussians (DoG) features, inspired by conventional signal processing techniques. These multiscale high-frequency features enable the model to produce high-quality, detailed outputs.
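The Difference-of-Gaussians skip features mentioned above are straightforward to compute: band-pass responses between successive Gaussian blurs of the input. The sigma schedule below is an assumption rather than the paper's, and `scipy` supplies the blurring:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_features(img, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Multiscale Difference-of-Gaussians features: each channel is the
    difference between two consecutive Gaussian blurs, i.e. a band-pass
    response isolating structure at one scale (hypothetical sigma schedule)."""
    blurs = [gaussian_filter(img, sigma=s) for s in sigmas]
    return np.stack([blurs[i] - blurs[i + 1] for i in range(len(blurs) - 1)])

# Toy document-like input: a dark square on a light page
img = np.zeros((64, 64)); img[24:40, 24:40] = 1.0
feats = dog_features(img)   # (len(sigmas) - 1, 64, 64) feature stack
```

Concatenating such channels onto the skip connections injects explicit high-frequency cues, which is the signal-processing-inspired modification the abstract describes.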

[CV-35] Wavelet Burst Accumulation for turbulence mitigation

链接: https://arxiv.org/abs/2410.22802
作者: Jerome Gilles,Stanley Osher
关键词-EN: Fourier burst accumulation, weighted Fourier burst, recently proposed weighted, proposed weighted Fourier, burst accumulation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we investigate the extension of the recently proposed weighted Fourier burst accumulation (FBA) method into the wavelet domain. The purpose of FBA is to reconstruct a clean and sharp image from a sequence of blurred frames. This concept lies in the construction of weights to amplify dominant frequencies in the Fourier spectrum of each frame. The reconstructed image is then obtained by taking the inverse Fourier transform of the average of all processed spectra. In this paper, we first suggest to replace the rigid registration step used in the original algorithm by a non-rigid registration in order to be able to process sequences acquired through atmospheric turbulence. Second, we propose to work in a wavelet domain instead of the Fourier one. This leads us to the construction of two types of algorithms. Finally, we propose an alternative approach to replace the weighting idea by an approach promoting the sparsity in the used space. Several experiments are provided to illustrate the efficiency of the proposed methods.
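The weighted Fourier burst accumulation (FBA) baseline that this paper extends into the wavelet domain admits a compact sketch: per-frequency weights proportional to |F_i|^p amplify whichever frame is sharpest at each frequency, and the fused image is the inverse FFT of the weighted average spectrum. The exponent `p` below is a typical choice, not a value taken from this paper:

```python
import numpy as np

def fourier_burst_accumulation(frames, p=11):
    """Weighted FBA over a burst of registered frames:
    w_i(xi) = |F_i(xi)|^p / sum_j |F_j(xi)|^p, fused = IFFT(sum_i w_i F_i)."""
    specs = np.stack([np.fft.fft2(f) for f in frames])        # (B, H, W) spectra
    mags = np.abs(specs) ** p
    weights = mags / (mags.sum(axis=0, keepdims=True) + 1e-12)
    fused_spec = (weights * specs).sum(axis=0)
    return np.real(np.fft.ifft2(fused_spec))

# Demo burst: the same sharp scene under small independent noise
rng = np.random.default_rng(0)
sharp = np.zeros((32, 32)); sharp[8:24, 8:24] = 1.0
frames = [sharp + 0.01 * rng.standard_normal((32, 32)) for _ in range(5)]
out = fourier_burst_accumulation(frames)
```

The wavelet variants proposed in the paper follow the same weighting idea but build the weights on wavelet coefficients instead of Fourier ones.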

[CV-36] Open Turbulent Image Set (OTIS)

链接: https://arxiv.org/abs/2410.22791
作者: Nicholas B. Ferrante,Jerome Gilles
关键词-EN: Long distance imaging, Long distance, distance imaging, imaging is subject, Open Turbulent Images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Long distance imaging is subject to the impact of the turbulent atmosphere. This results in geometric distortions and some blur effect in the observed frames. Despite the existence of several turbulence mitigation algorithms in the literature, no common dataset exists to objectively evaluate their efficiency. In this paper, we describe a new dataset called OTIS (Open Turbulent Images Set) which contains several sequences (either static or dynamic) acquired through the turbulent atmosphere. For almost all sequences, we provide the corresponding groundtruth in order to make the comparison between algorithms easier. We also discuss possible metrics to perform such comparisons.

[CV-37] Bregman implementation of Meyers G-norm for cartoon textures decomposition

链接: https://arxiv.org/abs/2410.22777
作者: Jerome Gilles,Stanley Osher
关键词-EN: Split Bregman iterations, textures decomposition model, simple algorithm based, Split Bregman, model of Meyer
类目: Computer Vision and Pattern Recognition (cs.CV); Functional Analysis (math.FA)
*备注:

点击查看摘要

Abstract:In this paper, we design a very simple algorithm based on Split Bregman iterations to numerically solve the cartoon + textures decomposition model of Meyer. This results in a significant gain in speed compared to Chambolle’s nonlinear projectors.

[CV-38] Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models

链接: https://arxiv.org/abs/2410.22775
作者: Arash Marioriyad,Parham Rezaei,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
关键词-EN: shown remarkable proficiency, Stable Diffusion, textual descriptions, shown remarkable, remarkable proficiency
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-image (T2I) generative models, such as Stable Diffusion and DALL-E, have shown remarkable proficiency in producing high-quality, realistic, and natural images from textual descriptions. However, these models sometimes fail to accurately capture all the details specified in the input prompts, particularly concerning entities, attributes, and spatial relationships. This issue becomes more pronounced when the prompt contains novel or complex compositions, leading to what are known as compositional generation failure modes. Recently, a new open-source diffusion-based T2I model, FLUX, has been introduced, demonstrating strong performance in high-quality image generation. Additionally, autoregressive T2I models like LlamaGen have claimed competitive visual quality performance compared to diffusion-based models. In this study, we evaluate the compositional generation capabilities of these newly introduced models against established models using the T2I-CompBench benchmark. Our findings reveal that LlamaGen, as a vanilla autoregressive model, is not yet on par with state-of-the-art diffusion models for compositional generation tasks under the same criteria, such as model size and inference time. On the other hand, the open-source diffusion-based model FLUX exhibits compositional generation capabilities comparable to the state-of-the-art closed-source model DALL-E3.

[CV-39] FuseAnyPart: Diffusion-Driven Facial Parts Swapping via Multiple Reference Images NEURIPS2024

链接: https://arxiv.org/abs/2410.22771
作者: Zheng Yu,Yaohua Wang,Siying Cui,Aixi Zhang,Wei-Long Zheng,Senzhang Wang
关键词-EN: target image unchanged, target image, selectively transfer regions, Facial parts swapping, parts swapping aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by the NeurIPS 2024 (Spotlight). Homepage: this https URL

点击查看摘要

Abstract:Facial parts swapping aims to selectively transfer regions of interest from the source image onto the target image while leaving the rest of the target image unchanged. Most studies on face swapping, designed specifically for full-face swapping, are either unable or significantly limited when it comes to swapping individual facial parts, which hinders fine-grained and customized character designs. However, designing such an approach specifically for facial parts swapping is challenged by the need for a reasonable fusion of multiple reference features, which must be both efficient and effective. To overcome this challenge, FuseAnyPart is proposed to facilitate the seamless "fuse-any-part" customization of the face. In FuseAnyPart, facial parts from different people are assembled into a complete face in latent space within the Mask-based Fusion Module. Subsequently, the consolidated feature is dispatched to the Addition-based Injection Module for fusion within the UNet of the diffusion model to create novel characters. Extensive experiments qualitatively and quantitatively validate the superiority and robustness of FuseAnyPart. Source codes are available at this https URL.
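The "assemble parts in latent space" idea can be illustrated as a per-pixel convex combination of reference latents under part masks. This is only a generic sketch of mask-based fusion, not the paper's actual Mask-based Fusion Module; names and shapes are assumptions:

```python
import numpy as np

def fuse_parts(latents, masks):
    """Assemble facial-part latents from several references into one face latent.

    latents: (N, C, H, W) features from N reference faces.
    masks:   (N, H, W) soft masks selecting one part per reference;
             they are renormalized so each spatial location sums to 1.
    """
    latents = np.asarray(latents, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    masks = masks / (masks.sum(axis=0, keepdims=True) + 1e-8)  # per-pixel convex weights
    return (masks[:, None] * latents).sum(axis=0)              # fused (C, H, W) latent
```

A real diffusion pipeline would fuse learned feature maps this way before injecting them into the UNet, rather than raw pixel-like arrays.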

[CV-40] Analysis of Classifier Training on Synthetic Data for Cross-Domain Datasets

链接: https://arxiv.org/abs/2410.22748
作者: Andoni Cortés,Clemente Rodríguez,Gorka Velez,Javier Barandiarán,Marcos Nieto
关键词-EN: collect huge amounts, deep learning, major challenges, challenges of deep, necessity to collect
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:A major challenge of deep learning (DL) is the necessity to collect huge amounts of training data. Often, the lack of a sufficiently large dataset discourages the use of DL in certain applications. Typically, acquiring the required amounts of data costs considerable time, material and effort. To mitigate this problem, the use of synthetic images combined with real data is a popular approach, widely adopted in the scientific community to effectively train various detectors. In this study, we examined the potential of synthetic data-based training in the field of intelligent transportation systems. Our focus is on camera-based traffic sign recognition applications for advanced driver assistance systems and autonomous driving. The proposed augmentation pipeline of synthetic datasets includes novel augmentation processes such as structured shadows and gaussian specular highlights. A well-known DL model was trained with different datasets to compare the performance of synthetic and real image-based trained models. Additionally, a new, detailed method to objectively compare these models is proposed. Synthetic images are generated using a semi-supervised error-guided method, which is also described. Our experiments showed that a synthetic image-based approach outperforms real image-based training in most cases when applied to cross-domain test datasets (+10% precision on the GTSRB dataset); consequently, the generalization of the model is improved, decreasing the cost of acquiring images.
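One of the named augmentations, gaussian specular highlights, can be sketched as blending a Gaussian blob towards white on top of the image. The paper's exact pipeline is not reproduced here; the function and its parameters are illustrative assumptions:

```python
import numpy as np

def add_gaussian_specular_highlight(img, center, sigma, strength=0.8):
    """Overlay a Gaussian specular highlight on a float image in [0, 1].

    img: shape (H, W) or (H, W, C).
    center: (row, col) of the highlight; sigma: spatial spread in pixels;
    strength: how far pixels are pushed towards white at the peak.
    """
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    blob = np.exp(-(((yy - center[0]) ** 2 + (xx - center[1]) ** 2)
                    / (2.0 * sigma ** 2)))
    if img.ndim == 3:
        blob = blob[..., None]
    # blend towards white (value 1.0) where the highlight is strong
    return np.clip(img + strength * blob * (1.0 - img), 0.0, 1.0)
```

Randomizing `center`, `sigma`, and `strength` per sample would turn this into a simple augmentation step for synthetic traffic-sign images.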

[CV-41] ETO:Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses

链接: https://arxiv.org/abs/2410.22733
作者: Junjie Ni,Guofeng Zhang,Guanglin Li,Yijin Li,Xinyang Liu,Zhaoyang Huang,Hujun Bao
关键词-EN: http URL advancements, deep learning techniques, learning local feature, http URL, http URL technique
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We tackle the efficiency problem of learning local feature matching. Recent advancements have given rise to purely CNN-based and transformer-based approaches, each augmented with deep learning techniques. While CNN-based methods often excel in matching speed, transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. This technique is built on constructing multiple homography hypotheses to approximate the continuous correspondence in the real world and uni-directional cross-attention to accelerate the refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while the inference speed is boosted to 4 times, even outperforming the CNN-based methods. Evaluations on other open datasets such as Megadepth, ScanNet, and HPatches demonstrate our method's efficacy, highlighting its potential to significantly enhance a wide array of downstream applications.
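The core idea, approximating a non-planar correspondence field with several homography hypotheses, can be illustrated by a toy selector that assigns each correspondence to the hypothesis with the smallest transfer error. The paper's hypothesis organization and refinement are learned; this sketch and its names are ours:

```python
import numpy as np

def apply_h(H, pts):
    """Apply a 3x3 homography to (N, 2) points in inhomogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

def best_hypothesis_per_point(hypotheses, src, dst):
    """For each src->dst correspondence, pick the homography hypothesis
    with the smallest transfer error -- a toy version of approximating a
    continuous correspondence by multiple planar homographies."""
    errs = np.stack([np.linalg.norm(apply_h(H, src) - dst, axis=1)
                     for H in hypotheses])      # (K, N) transfer errors
    return errs.argmin(axis=0)                  # best hypothesis index per point
```

In a scene with several near-planar regions, each region's points would end up assigned to the homography that fits that region.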

[CV-42] One Prompt to Verify Your Models: Black-Box Text-to-Image Models Verification via Non-Transferable Adversarial Attacks

链接: https://arxiv.org/abs/2410.22725
作者: Ji Guo,Wenbo Jiang,Rui Zhang,Guoming Lu,Hongwei Li,Weiren Wu
关键词-EN: provide cheaper API, cheaper API services, numerous third-party platforms, cheaper API, API services
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, the success of Text-to-Image (T2I) models has led to the rise of numerous third-party platforms, which claim to provide cheaper API services and more flexibility in model options. However, this also raises a new security concern: Are these third-party services truly offering the models they claim? To address this problem, we propose the first T2I model verification method named Text-to-Image Model Verification via Non-Transferable Adversarial Attacks (TVN). The non-transferability of adversarial examples means that these examples are only effective on a target model and ineffective on other models, thereby allowing for the verification of the target model. TVN utilizes the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to optimize the cosine similarity of a prompt’s text encoding, generating non-transferable adversarial prompts. By calculating the CLIP-text scores between the non-transferable adversarial prompts without perturbations and the images, we can verify if the model matches the claimed target model, based on a 3-sigma threshold. The experiments showed that TVN performed well in both closed-set and open-set scenarios, achieving a verification accuracy of over 90%. Moreover, the adversarial prompts generated by TVN significantly reduced the CLIP-text scores of the target model, while having little effect on other models.
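The 3-sigma decision rule can be sketched as follows, under one plausible reading of the abstract: scores of ordinary prompts on the claimed model define a reference distribution, and a non-transferable adversarial prompt whose CLIP-text score falls more than three standard deviations below it indicates the service really is the target model. All names and the exact rule here are illustrative assumptions:

```python
import numpy as np

def verify_model(adv_prompt_score, reference_scores, k=3.0):
    """Flag the claimed model as the target if the adversarial prompt's
    CLIP-text score drops more than k standard deviations below the
    reference distribution of unperturbed prompt scores."""
    mu = float(np.mean(reference_scores))
    sigma = float(np.std(reference_scores))
    return adv_prompt_score < mu - k * sigma   # True -> matches target model
```

Because the adversarial prompts are non-transferable, a non-target model's score should stay inside the reference band and the check returns False.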

[CV-43] SCRREAM : SCan Register REnder And Map:A Framework for Annotating Accurate and Dense 3D Indoor Scenes with a Benchmark

链接: https://arxiv.org/abs/2410.22715
作者: HyunJun Jung,Weihang Li,Shun-Cheng Wu,William Bittner,Nikolas Brasch,Jifei Song,Eduardo Pérez-Pellitero,Zhensong Zhang,Arthur Moreau,Nassir Navab,Benjamin Busam
关键词-EN: obtain improved generalization, generally prioritized scale, improved generalization, generally prioritized, prioritized scale
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Traditionally, 3D indoor datasets have generally prioritized scale over ground-truth accuracy in order to obtain improved generalization. However, using these datasets to evaluate dense geometry tasks, such as depth rendering, can be problematic, as the meshes of the dataset are often incomplete and may produce wrong ground truth to evaluate the details. In this paper, we propose SCRREAM, a dataset annotation framework that allows annotation of fully dense meshes of objects in the scene and registers camera poses on the real image sequence, which can produce accurate ground truth for both sparse 3D as well as dense 3D tasks. We show the details of the dataset annotation pipeline and showcase four possible variants of datasets that can be obtained from our framework with example scenes: indoor reconstruction and SLAM, scene editing and object removal, human reconstruction, and 6D pose estimation. Recent pipelines for indoor reconstruction and SLAM serve as new benchmarks. In contrast to previous indoor datasets, our design allows us to evaluate dense geometry tasks on eleven sample scenes against accurately rendered ground-truth depth maps.

[CV-44] LoFLAT: Local Feature Matching using Focused Linear Attention Transformer

链接: https://arxiv.org/abs/2410.22710
作者: Naijian Cao,Renjie He,Yuchao Dai,Mingyi He
关键词-EN: Local feature matching, Feature Transformer Module, Feature Extraction Module, Transformer-based detector-free local, detector-free local feature
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Local feature matching is an essential technique in image matching and plays a critical role in a wide range of vision-based applications. However, existing Transformer-based detector-free local feature matching methods encounter challenges due to the quadratic computational complexity of attention mechanisms, especially at high resolutions. While methods that instead adopt linear attention mechanisms reduce this computational cost, they still struggle to capture detailed local interactions, which affects the accuracy and robustness of precise local correspondences. In order to enhance representations of attention mechanisms while preserving low computational complexity, we propose LoFLAT, a novel Local Feature matching method using a Focused Linear Attention Transformer. Our LoFLAT consists of three main modules: the Feature Extraction Module, the Feature Transformer Module, and the Matching Module. Specifically, the Feature Extraction Module first uses ResNet and a Feature Pyramid Network to extract hierarchical features. The Feature Transformer Module further employs Focused Linear Attention to refine the attention distribution with a focused mapping function and to enhance feature diversity with a depth-wise convolution. Finally, the Matching Module predicts accurate and robust matches through a coarse-to-fine strategy. Extensive experimental evaluations demonstrate that the proposed LoFLAT outperforms the LoFTR method in terms of both efficiency and accuracy.
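Linear attention with a focused mapping can be sketched in NumPy: a feature map phi is applied to queries and keys, and the key-value product is computed once, giving O(N·d²) rather than O(N²) cost. The focused map below (element-wise power with norm restoration) follows the general Focused Linear Attention recipe, but the exact form LoFLAT uses may differ:

```python
import numpy as np

def focused_map(x, p=3):
    """Sharpen feature directions with an element-wise power while
    preserving each row's norm; assumes features are made non-negative."""
    x = np.maximum(x, 0.0) + 1e-6
    xp = x ** p
    return xp * (np.linalg.norm(x, axis=-1, keepdims=True)
                 / np.linalg.norm(xp, axis=-1, keepdims=True))

def linear_attention(Q, K, V, p=3):
    """O(N*d^2) attention: phi(Q) @ (phi(K)^T V), normalized per query,
    avoiding the N x N attention matrix entirely."""
    q, k = focused_map(Q, p), focused_map(K, p)
    kv = k.T @ V                          # (d, d_v), computed once
    z = q @ k.sum(axis=0)                 # per-query normalizer
    return (q @ kv) / (z[:, None] + 1e-6)
```

Because the weights are non-negative and sum to (almost) one per query, each output row is a convex combination of value rows, just as in softmax attention.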

[CV-45] FilterViT and DropoutViT: Lightweight Vision Transformer Models for Efficient Attention Mechanisms

链接: https://arxiv.org/abs/2410.22709
作者: Bohang Sun(School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China)
关键词-EN: version of MobileViT, early-stage downsampling, enhanced version, leverages an attention-based, Traditional QKV operations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this study, we introduce FilterViT, an enhanced version of MobileViT, which leverages an attention-based mechanism for early-stage downsampling. Traditional QKV operations on high-resolution feature maps are computationally intensive due to the abundance of tokens. To address this, we propose a filter attention mechanism using a convolutional neural network (CNN) to generate an importance mask, focusing attention on key image regions. The method significantly reduces computational complexity while maintaining interpretability, as it highlights essential image areas. Experimental results show that FilterViT achieves substantial gains in both efficiency and accuracy compared to other models. We also introduce DropoutViT, a variant that uses a stochastic approach for pixel selection, further enhancing robustness.
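The filter attention idea, attending only where an importance mask says it matters, can be sketched by selecting the top-k tokens and running ordinary attention among them while leaving all other tokens untouched. In the paper the importance comes from a CNN; here it is simply a given score vector, and all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def filter_attention(tokens, importance, k):
    """Attend only over the k most important tokens.

    tokens: (N, d) token features; importance: (N,) scores (e.g. from a
    small CNN); returns (N, d) with only the selected tokens updated.
    """
    idx = np.argsort(importance)[-k:]                 # top-k token indices
    sel = tokens[idx]                                 # (k, d)
    attn = softmax(sel @ sel.T / np.sqrt(tokens.shape[1]))
    out = tokens.copy()
    out[idx] = attn @ sel                             # update selected tokens only
    return out
```

With k much smaller than N, the attention cost drops from O(N²·d) to O(k²·d), which is the point of filtering before the QKV step.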

[CV-46] Geometry Cloak: Preventing TGS-based 3D Reconstruction from Copyrighted Images NEURIPS2024

链接: https://arxiv.org/abs/2410.22705
作者: Qi Song,Ziyuan Luo,Ka Chun Cheung,Simon See,Renjie Wan
关键词-EN: Triplane Gaussian Splatting, Gaussian Splatting, Triplane Gaussian, single image input, enabled high-quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Single-view 3D reconstruction methods like Triplane Gaussian Splatting (TGS) have enabled high-quality 3D model generation from just a single image input within seconds. However, this capability raises concerns about potential misuse, where malicious users could exploit TGS to create unauthorized 3D models from copyrighted images. To prevent such infringement, we propose a novel image protection approach that embeds invisible geometry perturbations, termed “geometry cloaks”, into images before supplying them to TGS. These carefully crafted perturbations encode a customized message that is revealed when TGS attempts 3D reconstructions of the cloaked image. Unlike conventional adversarial attacks that simply degrade output quality, our method forces TGS to fail the 3D reconstruction in a specific way - by generating an identifiable customized pattern that acts as a watermark. This watermark allows copyright holders to assert ownership over any attempted 3D reconstructions made from their protected images. Extensive experiments have verified the effectiveness of our geometry cloak. Our project is available at this https URL.

[CV-47] Persistent Homology for MCI Classification: A Comparative Analysis between Graph and Vietoris-Rips Filtrations

链接: https://arxiv.org/abs/2410.22681
作者: Debanjali Bhattacharya,Rajneet Kaur,Ninad Aithal,Neelam Sinha,Thomas Gregor Issac
关键词-EN: Mild cognitive impairment, Mild cognitive, MCI, Indian Urban cohort, declines and disruptions
类目: Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT)
*备注: 17 pages, 5 figures, 4 tables

点击查看摘要

Abstract:Mild cognitive impairment (MCI), often linked to early neurodegeneration, is characterized by subtle cognitive declines and disruptions in brain connectivity. The present study offers a detailed analysis of topological changes associated with MCI, focusing on two subtypes: Early MCI and Late MCI. This analysis utilizes fMRI time series data from two distinct populations: the publicly available ADNI dataset (Western cohort) and the in-house TLSA dataset (Indian Urban cohort). Persistent Homology, a topological data analysis method, is employed with two distinct filtration techniques, Vietoris-Rips and graph filtration, for classifying MCI subtypes. For Vietoris-Rips filtration, inter-ROI Wasserstein distance matrices between persistence diagrams are used for classification, while graph filtration relies on the top ten most persistent homology features. Comparative analysis shows that the Vietoris-Rips filtration significantly outperforms graph filtration, capturing subtle variations in brain connectivity with greater accuracy. The Vietoris-Rips filtration method achieved the highest classification accuracy of 85.7% for distinguishing between age- and gender-matched healthy controls and MCI, whereas graph filtration reached a maximum accuracy of 71.4% for the same task. This superior performance highlights the sensitivity of Vietoris-Rips filtration in detecting intricate topological features associated with neurodegeneration. The findings underscore the potential of persistent homology, particularly when combined with the Wasserstein distance, as a powerful tool for early diagnosis and precise classification of cognitive impairments, offering valuable insights into brain connectivity changes in MCI.
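For a graph filtration, the 0-dimensional persistent homology used in pipelines like this reduces to Kruskal-style union-find over edges sorted by weight: each merge kills a connected component at that edge weight. Below is a minimal sketch under our own assumptions (all vertices born at filtration value 0, a symmetric distance matrix as input, e.g. 1 − correlation for fMRI connectivity):

```python
import numpy as np

def zeroth_persistence(dist):
    """Death times of 0-dimensional features of a graph filtration.

    dist: (n, n) symmetric distance matrix. Edges enter the filtration in
    order of increasing weight; when an edge merges two components, one
    component dies at that weight. Returns the n-1 finite death times,
    most persistent first.
    """
    n = dist.shape[0]
    parent = list(range(n))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj           # a component dies at weight w
            deaths.append(w)
    return sorted(deaths, reverse=True)
```

The "top ten most persistent homology features" in the graph-filtration branch would then simply be the first ten entries of this list (for dimension 0).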

[CV-48] Practical and Accurate Reconstruction of an Illuminants Spectral Power Distribution for Inverse Rendering Pipelines

链接: https://arxiv.org/abs/2410.22679
作者: Parisha Joshi,Daljit Singh J.Dhillon
关键词-EN: Inverse rendering pipelines, virtual reality scenes, realizing photo-realistic reconstruction, Inverse rendering, reality scenes
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 3 pages, 3 Figures, Submitted as a Tiny Paper at ICVGIP’24, Bangalore, India

点击查看摘要

Abstract:Inverse rendering pipelines are gaining prominence in realizing photo-realistic reconstruction of real-world objects for emulating them in virtual reality scenes. Apart from material reflectances, spectral rendering and in-scene illuminants' spectral power distributions (SPDs) play important roles in producing photo-realistic images. We present a simple, low-cost technique to capture and reconstruct the SPD of uniform illuminants. Instead of requiring a costly spectrometer for such measurements, our method uses a diffractive compact disk (CD-ROM) and a machine learning approach for accurate estimation. We show that our method works well with spotlights in simulations and on a few real-world examples. The presented results clearly demonstrate the reliability of our approach through quantitative and qualitative evaluations, especially in spectral rendering of iridescent materials.

[CV-49] FlowDCN: Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution NEURIPS24

链接: https://arxiv.org/abs/2410.22655
作者: Shuai Wang,Zexian Li,Tianhui Song,Xubin Li,Tiezheng Ge,Bo Zheng,Limin Wang
关键词-EN: Arbitrary-resolution image generation, requires handling varying, handling varying resolutions, maintaining high visual, high visual quality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted on NeurIPS24

点击查看摘要

Abstract:Arbitrary-resolution image generation still remains a challenging task in AIGC, as it requires handling varying resolutions and aspect ratios while maintaining high visual quality. Existing transformer-based diffusion methods suffer from quadratic computation cost and limited resolution extrapolation capabilities, making them less effective for this task. In this paper, we propose FlowDCN, a purely convolution-based generative model with linear time and memory complexity, that can efficiently generate high-quality images at arbitrary resolutions. Equipped with a new design of learnable group-wise deformable convolution block, our FlowDCN yields higher flexibility and capability to handle different resolutions with a single model. FlowDCN achieves the state-of-the-art 4.30 sFID on the 256×256 ImageNet benchmark and comparable resolution extrapolation results, surpassing transformer-based counterparts in terms of convergence speed (only 1/5 of the images), visual quality, parameters (8% reduction) and FLOPs (20% reduction). We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.

[CV-50] SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset

链接: https://arxiv.org/abs/2410.22648
作者: Ngoc Dung Huynh,Mohamed Reda Bouadjenek,Sunil Aryal,Imran Razzak,Hakim Hacid
关键词-EN: Visual Question Answering, develop AI-based systems, promising area, develop AI-based, VQA
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual Question Answering (VQA) has emerged as a promising area of research to develop AI-based systems for enabling interactive and immersive learning. Numerous VQA datasets have been introduced to facilitate various tasks, such as answering questions or identifying unanswerable ones. However, most of these datasets are constructed using real-world images, leaving the performance of existing models on cartoon images largely unexplored. Hence, in this paper, we present “SimpsonsVQA”, a novel dataset for VQA derived from The Simpsons TV show, designed to promote inquiry-based learning. Our dataset is specifically designed to address not only the traditional VQA task but also to identify irrelevant questions related to images, as well as the reverse scenario where a user provides an answer to a question that the system must evaluate (e.g., as correct, incorrect, or ambiguous). It aims to cater to various visual applications, harnessing the visual content of “The Simpsons” to create engaging and informative interactive systems. SimpsonsVQA contains approximately 23K images, 166K QA pairs, and 500K judgments (this https URL). Our experiments show that current large vision-language models like ChatGPT4o underperform in zero-shot settings across all three tasks, highlighting the dataset’s value for improving model performance on cartoon images. We anticipate that SimpsonsVQA will inspire further research, innovation, and advancements in inquiry-based learning VQA.

[CV-51] Unbiased Regression Loss for DETRs

链接: https://arxiv.org/abs/2410.22638
作者: Edric,Ueta Daisuke,Kurokawa Yukimasa,Karlekar Jayashree,Sugiri Pranata
关键词-EN: DETR-based detectors, regression loss, unbiased regression loss, loss, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a novel unbiased regression loss for DETR-based detectors. The conventional L_1 regression loss tends to bias towards larger boxes, as they disproportionately contribute more towards the overall loss compared to smaller boxes. Consequently, the detection performance for small objects suffers. To alleviate this bias, the proposed new unbiased loss, termed Sized L_1 loss, normalizes the size of all boxes based on their individual width and height. Our experiments demonstrate consistent improvements in both fully-supervised and semi-supervised settings using the MS-COCO benchmark dataset.
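The normalization idea is simple to sketch: divide each coordinate error by the target box's width and height, so equal relative errors on large and small boxes yield equal loss. The exact normalization of the paper's Sized L1 loss may differ; the box convention (cx, cy, w, h) and the function below are our assumptions:

```python
import numpy as np

def sized_l1_loss(pred, target):
    """Size-normalized L1 box regression loss (sketch).

    pred, target: boxes as (cx, cy, w, h). Center errors are divided by
    the target (w, h) and size errors by themselves, so small boxes are
    no longer dominated by large ones in the overall loss.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    wh = target[..., 2:4]
    scale = np.concatenate([wh, wh], axis=-1)   # (w, h, w, h) per box
    return np.abs(pred - target) / (scale + 1e-8)
```

With plain L1, a 10-pixel center error on a 100-pixel box and a 1-pixel error on a 10-pixel box differ by 10×; after normalization both contribute 0.1.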

[CV-52] Consistency Diffusion Bridge Models NEURIPS2024

链接: https://arxiv.org/abs/2410.22637
作者: Guande He,Kaiwen Zheng,Jianfei Chen,Fan Bao,Jun Zhu
关键词-EN: learning stochastic processes, generative modeling, stochastic processes, builds stochastic processes, variety of domains
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Diffusion models (DMs) have become the dominant paradigm of generative modeling in a variety of domains by learning stochastic processes from noise to data. Recently, diffusion denoising bridge models (DDBMs), a new formulation of generative modeling that builds stochastic processes between fixed data endpoints based on a reference diffusion process, have achieved empirical success across tasks with coupled data distribution, such as image-to-image translation. However, DDBM's sampling process typically requires hundreds of network evaluations to achieve decent performance, which may impede their practical deployment due to high computational demands. In this work, inspired by the recent advance of consistency models in DMs, we tackle this problem by learning the consistency function of the probability-flow ordinary differential equation (PF-ODE) of DDBMs, which directly predicts the solution at a starting step given any point on the ODE trajectory. Based on a dedicated general-form ODE solver, we propose two paradigms: consistency bridge distillation and consistency bridge training, which is flexible to apply on DDBMs with broad design choices. Experimental results show that our proposed method could sample 4× to 50× faster than the base DDBM and produce better visual quality given the same step in various tasks with pixel resolution ranging from 64×64 to 256×256, as well as supporting downstream tasks such as semantic interpolation in the data space.

[CV-53] CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation

链接: https://arxiv.org/abs/2410.22629
作者: Ziyang Gong,Zhixiang Wei,Di Wang,Xianzheng Ma,Hongruixuan Chen,Yuru Jia,Yupeng Deng,Zhenming Ji,Xiangwei Zhu,Naoto Yokoya,Jing Zhang,Bo Du,Liangpei Zhang
关键词-EN: Remote Sensing Domain, Sensing Domain Generalization, Remote Sensing, valuable research frontier, field of Remote
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The codes and models will be available at this https URL

点击查看摘要

Abstract:The field of Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. Despite the substantial domain gaps in RS images that are characterized by variabilities such as location, wavelength, and sensor type, research in this area remains underexplored: (1) Current cross-domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) Few studies targeting the RSDG issue, especially for semantic segmentation tasks, where existing models are developed for specific unknown domains, struggling with issues of underfitting on other unknown scenarios; (3) Existing RS foundation models tend to prioritize in-domain performance over cross-domain generalization. To this end, we introduce the first vision foundation model for RSDG semantic segmentation, CrossEarth. CrossEarth demonstrates strong cross-domain generalization through a specially designed data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. In addition, for the semantic segmentation task, we have curated an RSDG benchmark comprising 28 cross-domain settings across various regions, spectral bands, platforms, and climates, providing a comprehensive framework for testing the generalizability of future RSDG models. Extensive experiments on this benchmark demonstrate the superiority of CrossEarth over existing state-of-the-art methods.

[CV-54] Symbolic Graph Inference for Compound Scene Understanding

链接: https://arxiv.org/abs/2410.22626
作者: FNU Aryan,Simon Stepputtis,Sarthak Bhagat,Joseph Campbell,Kwonjoon Lee,Hossein Nourkhiz Mahjoub,Katia Sycara
关键词-EN: fundamental capability needed, ranging from question-answering, question-answering to robotics, fundamental capability, capability needed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Scene understanding is a fundamental capability needed in many domains, ranging from question-answering to robotics. Unlike recent end-to-end approaches that must explicitly learn varying compositions of the same scene, our method reasons over their constituent objects and analyzes their arrangement to infer a scene’s meaning. We propose a novel approach that reasons over a scene’s scene- and knowledge-graph, capturing spatial information while being able to utilize general domain knowledge in a joint graph search. Empirically, we demonstrate the feasibility of our method on the ADE20K dataset and compare it to current scene understanding approaches.

[CV-55] PV-VTT: A Privacy-Centric Dataset for Mission-Specific Anomaly Detection and Natural Language Interpretation WACV2025

链接: https://arxiv.org/abs/2410.22623
作者: Ryozo Masukawa,Sanggeon Yun,Yoshiki Yamaguchi,Mohsen Imani
关键词-EN: Video crime detection, privacy violations, artificial intelligence, Video, significant application
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to WACV 2025

点击查看摘要

Abstract:Video crime detection is a significant application of computer vision and artificial intelligence. However, existing datasets primarily focus on detecting severe crimes by analyzing entire video clips, often neglecting the precursor activities (i.e., privacy violations) that could potentially prevent these crimes. To address this limitation, we present PV-VTT (Privacy Violation Video To Text), a unique multimodal dataset aimed at identifying privacy violations. PV-VTT provides detailed annotations for both video and text in these scenarios. To ensure the privacy of individuals in the videos, we only provide video feature vectors, avoiding the release of any raw video data. This privacy-focused approach allows researchers to use the dataset while protecting participant confidentiality. Recognizing that privacy violations are often ambiguous and context-dependent, we propose a Graph Neural Network (GNN)-based video description model. Our model generates a GNN-based prompt with an image for a Large Language Model (LLM), which delivers cost-effective and high-quality video descriptions. By leveraging a single video frame along with relevant text, our method reduces the number of input tokens required, maintaining descriptive quality while optimizing LLM API usage. Extensive experiments validate the effectiveness and interpretability of our approach in video description tasks and the flexibility of our PV-VTT dataset.

[CV-56] FISC: Federated Domain Generalization via Interpolative Style Transfer and Contrastive Learning

链接: https://arxiv.org/abs/2410.22622
作者: Dung Thuy Nguyen,Taylor T. Johnson,Kevin Leach
关键词-EN: enabling collaborative learning, promise in preserving, preserving privacy, privacy and enabling, enabling collaborative
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) shows promise in preserving privacy and enabling collaborative learning. However, most current solutions focus on private data collected from a single domain. A significant challenge arises when client data comes from diverse domains (i.e., domain shift), leading to poor performance on unseen domains. Existing Federated Domain Generalization approaches address this problem but assume each client holds data for an entire domain, limiting their practicality in real-world scenarios with domain-based heterogeneity and client sampling. To overcome this, we introduce FISC, a novel FL domain generalization paradigm that handles more complex domain distributions across clients. FISC enables learning across domains by extracting an interpolative style from local styles and employing contrastive learning. This strategy gives clients multi-domain representations and unbiased convergent targets. Empirical results on multiple datasets, including PACS, Office-Home, and IWildCam, show FISC outperforms state-of-the-art (SOTA) methods. Our method achieves accuracy improvements ranging from 3.64% to 57.22% on unseen domains. Our code is available at this https URL.

[CV-57] GRADE: Quantifying Sample Diversity in Text-to-Image Models

链接: https://arxiv.org/abs/2410.22592
作者: Royi Rassin,Aviv Slobodkin,Shauli Ravfogel,Yanai Elazar,Yoav Goldberg
关键词-EN: generating realistic images, realistic images based, remarkable at generating, generating realistic, diversity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: For project page and code see this https URL

点击查看摘要

Abstract:Text-to-image (T2I) models are remarkable at generating realistic images based on textual descriptions. However, textual prompts are inherently underspecified: they do not specify all possible attributes of the required image. This raises two key questions: Do T2I models generate diverse outputs on underspecified prompts? How can we automatically measure diversity? We propose GRADE: Granular Attribute Diversity Evaluation, an automatic method for quantifying sample diversity. GRADE leverages the world knowledge embedded in large language models and visual question-answering systems to identify relevant concept-specific axes of diversity (e.g., "shape" and "color" for the concept "cookie"). It then estimates frequency distributions of concepts and their attributes and quantifies diversity using (normalized) entropy. GRADE achieves over 90% human agreement while exhibiting weak correlation to commonly used diversity metrics. We use GRADE to measure the overall diversity of 12 T2I models using 400 concept-attribute pairs, revealing that all models display limited variation. Further, we find that these models often exhibit default behaviors, a phenomenon where the model consistently generates concepts with the same attributes (e.g., 98% of the cookies are round). Finally, we demonstrate that a key reason for low diversity is due to underspecified captions in training data. Our work proposes a modern, semantically-driven approach to measure sample diversity and highlights the stunning homogeneity in outputs by T2I models.
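GRADE's core measurement reduces to the (normalized) entropy of an attribute's frequency distribution. The sketch below assumes normalization by the log of the number of observed attribute values, which the abstract does not pin down; the example numbers mirror the "98% of cookies are round" observation:

```python
from collections import Counter
from math import log

def normalized_entropy(attribute_values):
    """Normalized entropy of an attribute distribution: 0 = no diversity, 1 = uniform."""
    counts = Counter(attribute_values)
    n = len(attribute_values)
    k = len(counts)
    if k <= 1:
        return 0.0  # a single observed value carries no diversity
    h = -sum((c / n) * log(c / n) for c in counts.values())
    return h / log(k)

# 98% of generated "cookies" are round -> diversity score close to 0
shapes = ["round"] * 98 + ["square"] * 2
print(round(normalized_entropy(shapes), 3))  # → 0.141
```

A perfectly uniform distribution over the observed attribute values scores 1.0, so the metric is comparable across attributes with different numbers of categories.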

[CV-58] Pre-Trained Vision Models as Perception Backbones for Safety Filters in Autonomous Driving

链接: https://arxiv.org/abs/2410.22585
作者: Yuxuan Yang,Hussein Sibai
关键词-EN: achieved impressive success, safety filters, major concern, achieved impressive, remains a major
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:End-to-end vision-based autonomous driving has achieved impressive success, but safety remains a major concern. The safe control problem has been addressed in low-dimensional settings using safety filters, e.g., those based on control barrier functions. Designing safety filters for vision-based controllers in the high-dimensional settings of autonomous driving can similarly alleviate the safety problem, but is significantly more challenging. In this paper, we address this challenge by using frozen pre-trained vision representation models as perception backbones to design vision-based safety filters, inspired by these models’ success as backbones of robotic control policies. We empirically evaluate the offline performance of four common pre-trained vision models in this context. We try three existing methods for training safety filters for black-box dynamics, as the dynamics over representation spaces are not known. We use the DeepAccident dataset that consists of action-annotated videos from multiple cameras on vehicles in CARLA simulating real accident scenarios. Our results show that the filters resulting from our approach are competitive with the ones that are given the ground truth state of the ego vehicle and its environment.

[CV-59] Remote Sensing for Weed Detection and Control

链接: https://arxiv.org/abs/2410.22554
作者: Ishita Bansal,Peder Olsen,Roberto Estevão
关键词-EN: winter wheat, winter wheat fields, Italian ryegrass, grass weed commonly, weed commonly found
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Italian ryegrass is a grass weed commonly found in winter wheat fields that are competitive with winter wheat for moisture and nutrients. Ryegrass can cause substantial reductions in yield and grain quality if not properly controlled with the use of herbicides. To control the cost and environmental impact we detect weeds in drone and satellite imagery. Satellite imagery is too coarse to be used for precision spraying, but can aid in planning drone flights and treatments. Drone images on the other hand have sufficiently good resolution for precision spraying. However, ryegrass is hard to distinguish from the crop and annotation requires expert knowledge. We used the Python segmentation models library to test more than 600 different neural network architectures for weed segmentation in drone images and we map accuracy versus the cost of the model prediction for these. Our best system applies herbicides to over 99% of the weeds while only spraying an area 30% larger than the annotated weed area. These models yield large savings if the weed covers a small part of the field.

[CV-60] FairSkin: Fair Diffusion for Skin Disease Image Generation

链接: https://arxiv.org/abs/2410.22551
作者: Ruichen Zhang,Yuguang Yao,Zhen Tan,Zhiming Li,Pan Wang,Jingtong Hu,Sijia Liu,Tianlong Chen
关键词-EN: reducing healthcare disparities, advancing diagnostic accuracy, Frechet Inception Distance, clinical data augmentation, healthcare disparities
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image generation is a prevailing technique for clinical data augmentation for advancing diagnostic accuracy and reducing healthcare disparities. Diffusion Model (DM) has become a leading method in generating synthetic medical images, but it suffers from a critical twofold bias: (1) The quality of images generated for Caucasian individuals is significantly higher, as measured by the Frechet Inception Distance (FID). (2) The ability of the downstream-task learner to learn critical features from disease images varies across different skin tones. These biases pose significant risks, particularly in skin disease detection, where underrepresentation of certain skin tones can lead to misdiagnosis or neglect of specific conditions. To address these challenges, we propose FairSkin, a novel DM framework that mitigates these biases through a three-level resampling mechanism, ensuring fairer representation across racial and disease categories. Our approach significantly improves the diversity and quality of generated images, contributing to more equitable skin disease detection in clinical settings.

[CV-61] AffectNet+: A Database for Enhancing Facial Expression Recognition with Soft-Labels

链接: https://arxiv.org/abs/2410.22506
作者: Ali Pourramezan Fard,Mohammad Mehdi Hosseini,Timothy D. Sweeny,Mohammad H. Mahoor
关键词-EN: Automated Facial Expression, Automated Facial, Facial Expression, inter-class similarities, Facial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Automated Facial Expression Recognition (FER) is challenging due to intra-class variations and inter-class similarities. FER can be especially difficult when facial expressions reflect a mixture of various emotions (aka compound expressions). Existing FER datasets, such as AffectNet, provide discrete emotion labels (hard-labels), where a single category of emotion is assigned to an expression. To alleviate inter- and intra-class challenges, as well as provide a better facial expression descriptor, we propose a new approach to create FER datasets through a labeling method in which an image is labeled with more than one emotion (called soft-labels), each with different confidences. Specifically, we introduce the notion of soft-labels for facial expression datasets, a new approach to affective computing for more realistic recognition of facial expressions. To achieve this goal, we propose a novel methodology to accurately calculate soft-labels: a vector representing the extent to which multiple categories of emotion are simultaneously present within a single facial expression. Finding smoother decision boundaries, enabling multi-labeling, and mitigating bias and imbalanced data are some of the advantages of our proposed method. Building upon AffectNet, we introduce AffectNet+, the next-generation facial expression dataset. This dataset contains soft-labels, three categories of data complexity subsets, and additional metadata such as age, gender, ethnicity, head pose, facial landmarks, valence, and arousal. AffectNet+ will be made publicly accessible to researchers.
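The soft-label idea, a vector giving the extent to which each emotion category is present, can be sketched in a few lines. The eight category names follow AffectNet's labels; the confidence scores and the normalize-to-one scheme below are hypothetical illustrations, not the paper's exact methodology:

```python
EMOTIONS = ["neutral", "happy", "sad", "surprise", "fear",
            "disgust", "anger", "contempt"]

def soft_label(confidences):
    """Build a soft-label vector: per-emotion confidences normalized to sum to 1.

    `confidences` maps an emotion name to a non-negative score; emotions not
    mentioned get 0. (Hypothetical scoring scheme, for illustration only.)
    """
    v = [float(confidences.get(e, 0.0)) for e in EMOTIONS]
    total = sum(v)
    if total <= 0:
        raise ValueError("at least one emotion needs a positive confidence")
    return [x / total for x in v]

# A compound expression: mostly happy, partly surprised
label = soft_label({"happy": 0.7, "surprise": 0.3})
print([round(x, 2) for x in label])  # → [0.0, 0.7, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0]
```

Training against such vectors (e.g., with a cross-entropy loss over the full distribution) is what yields the smoother decision boundaries the abstract mentions, compared with a single hard label.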

[CV-62] The PV-ALE Dataset: Enhancing Apple Leaf Disease Classification Through Transfer Learning with Convolutional Neural Networks

链接: https://arxiv.org/abs/2410.22490
作者: Joseph Damilola Akinyemi,Kolawole John Adebayo
关键词-EN: global food security, security landscape continues, food security landscape, crop disease diagnosis, food security concerns
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To appear in th Sixth International Conference on Soft Computing and its Engineering Applications (icSoftComp2024)

点击查看摘要

Abstract:As the global food security landscape continues to evolve, the need for accurate and reliable crop disease diagnosis has never been more pressing. To address global food security concerns, we extend the widely used PlantVillage dataset with additional apple leaf disease classes, enhancing diversity and complexity. Experimental evaluations on both original and extended datasets reveal that existing models struggle with the new additions, highlighting the need for more robust and generalizable computer vision models. Test F1 scores of 99.63% and 97.87% were obtained on the original and extended datasets, respectively. Our study provides a more challenging and diverse benchmark, paving the way for the development of accurate and reliable models for identifying apple leaf diseases under varying imaging conditions. The expanded dataset is available at this https URL enabling future research to build upon our findings.

[CV-63] Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation

链接: https://arxiv.org/abs/2410.22489
作者: Zhaochong An,Guolei Sun,Yun Liu,Runjia Li,Min Wu,Ming-Ming Cheng,Ender Konukoglu,Serge Belongie
关键词-EN: annotated support samples, minimal annotated support, point cloud segmentation, aims at generalizing, support samples
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text-aware semantic guidance. Additionally, we propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly-ignored free modalities for FS-PCS, providing valuable insights for future research. The code is available at this https URL .

[CV-64] Unified Domain Generalization and Adaptation for Multi-View 3D Object Detection NEURIPS2024

链接: https://arxiv.org/abs/2410.22461
作者: Gyusam Chang,Jiwon Lee,Donghyun Kim,Jinkyu Kim,Dongwook Lee,Daehyun Ji,Sujin Jang,Sangpil Kim
关键词-EN: challenging vision tasks, Recent advances, leveraging multi-view cameras, vision tasks, object detection leveraging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Recent advances in 3D object detection leveraging multi-view cameras have demonstrated their practical and economical value in various challenging vision tasks. However, typical supervised learning approaches face challenges in achieving satisfactory adaptation toward unseen and unlabeled target datasets (i.e., direct transfer) due to the inevitable geometric misalignment between the source and target domains. In practice, we also encounter constraints on resources for training models and collecting annotations for the successful deployment of 3D object detectors. In this paper, we propose Unified Domain Generalization and Adaptation (UDGA), a practical solution to mitigate those drawbacks. We first propose Multi-view Overlap Depth Constraint that leverages the strong association between multi-view, significantly alleviating geometric gaps due to perspective view changes. Then, we present a Label-Efficient Domain Adaptation approach to handle unfamiliar targets with significantly fewer labels (i.e., 1% and 5%), while preserving well-defined source knowledge for training efficiency. Overall, the UDGA framework enables stable detection performance in both source and target domains, effectively bridging inevitable domain gaps, while demanding fewer annotations. We demonstrate the robustness of UDGA with large-scale benchmarks: nuScenes, Lyft, and Waymo, where our framework outperforms the current state-of-the-art methods.

[CV-65] Brain age identification from diffusion MRI synergistically predicts neurodegenerative disease

链接: https://arxiv.org/abs/2410.22454
作者: Chenyu Gao,Michael E. Kim,Karthik Ramadass,Praitayini Kanakaraj,Aravind R. Krishnan,Adam M. Saunders,Nancy R. Newlin,Ho Hin Lee,Qi Yang,Warren D. Taylor,Brian D. Boyd,Lori L. Beason-Held,Susan M. Resnick,Lisa L. Barnes,David A. Bennett,Katherine D. Van Schaik,Derek B. Archer,Timothy J. Hohman,Angela L. Jefferson,Ivana Išgum,Daniel Moyer,Yuankai Huo,Kurt G. Schilling,Lianrui Zuo,Shunxing Bao,Nazirah Mohd Khairi,Zhiyuan Li,Christos Davatzikos,Bennett A. Landman
关键词-EN: Estimated brain age, brain age, MRI-based brain age, supporting early detection, provide early insights
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Estimated brain age from magnetic resonance image (MRI) and its deviation from chronological age can provide early insights into potential neurodegenerative diseases, supporting early detection and implementation of prevention strategies. Diffusion MRI (dMRI), a widely used modality for brain age estimation, presents an opportunity to build an earlier biomarker for neurodegenerative disease prediction because it captures subtle microstructural changes that precede more perceptible macrostructural changes. However, the coexistence of macro- and micro-structural information in dMRI raises the question of whether current dMRI-based brain age estimation models are leveraging the intended microstructural information or if they inadvertently rely on the macrostructural information. To develop a microstructure-specific brain age, we propose a method for brain age identification from dMRI that minimizes the model’s use of macrostructural information by non-rigidly registering all images to a standard template. Imaging data from 13,398 participants across 12 datasets were used for the training and evaluation. We compare our brain age models, trained with and without macrostructural information minimized, with an architecturally similar T1-weighted (T1w) MRI-based brain age model and two state-of-the-art T1w MRI-based brain age models that primarily use macrostructural information. We observe differences between our dMRI-based brain age and T1w MRI-based brain age across stages of neurodegeneration, with dMRI-based brain age being older than T1w MRI-based brain age in participants transitioning from cognitively normal (CN) to mild cognitive impairment (MCI), but younger in participants already diagnosed with Alzheimer’s disease (AD). Approximately 4 years before MCI diagnosis, dMRI-based brain age yields better performance than T1w MRI-based brain ages in predicting transition from CN to MCI.

[CV-66] Embedding Watermarks in Diffusion Process for Model Intellectual Property Protection

链接: https://arxiv.org/abs/2410.22445
作者: Jijia Yang,Sen Peng,Xiaohua Jia
关键词-EN: necessitates substantial investment, widespread deployment, necessitates substantial, substantial investment, practical application
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:In practical application, the widespread deployment of diffusion models often necessitates substantial investment in training. As diffusion models find increasingly diverse applications, concerns about potential misuse highlight the imperative for robust intellectual property protection. Current protection strategies either employ backdoor-based methods, integrating a watermark task as a simpler training objective with the main model task, or embedding watermarks directly into the final output samples. However, the former approach is fragile compared to existing backdoor defense techniques, while the latter fundamentally alters the expected output. In this work, we introduce a novel watermarking framework by embedding the watermark into the whole diffusion process, and theoretically ensure that our final output samples contain no additional information. Furthermore, we utilize statistical algorithms to verify the watermark from internally generated model samples without necessitating triggers as conditions. Detailed theoretical analysis and experimental validation demonstrate the effectiveness of our proposed method.

[CV-67] Gradient Distance Function

链接: https://arxiv.org/abs/2410.22422
作者: Hieu Le,Federico Stella,Benoit Guillard,Pascal Fua
关键词-EN: deep learning framework, Unsigned Distance Functions, Gradient Distance Functions, Distance Functions, represent non-watertight surfaces
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: We developed this concurrently with ‘Neural Vector Field,’ and there are similarities between the two works so please pay them a visit as well. Here, we demonstrate how directly learning the gradient vector is much easier than learning the UDF

点击查看摘要

Abstract:Unsigned Distance Functions (UDFs) can be used to represent non-watertight surfaces in a deep learning framework. However, UDFs tend to be brittle and difficult to learn, in part because the surface is located exactly where the UDF is non-differentiable. In this work, we show that Gradient Distance Functions (GDFs) can remedy this by being differentiable at the surface while still being able to represent open surfaces. This is done by associating to each 3D point a 3D vector whose norm is taken to be the unsigned distance to the surface and whose orientation is taken to be the direction towards the closest surface point. We demonstrate the effectiveness of GDFs on ShapeNet Car, Multi-Garment, and 3D-Scene datasets with both single-shape reconstruction networks or categorical auto-decoders.
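The GDF definition itself is concrete: each query point maps to the vector pointing at its closest surface point, whose norm is the unsigned distance. Below is a brute-force sketch over a point-sampled surface; the paper learns this field with a neural network, so the helper and the sampled plane here are purely illustrative:

```python
import math

def gdf(query, surface_points):
    """Gradient-distance vector for `query`: its norm is the unsigned distance
    to the nearest surface sample, its direction points towards that sample.
    Brute force over an explicit point set (a sketch of the GDF definition)."""
    nearest = min(surface_points,
                  key=lambda p: sum((q - s) ** 2 for q, s in zip(query, p)))
    return tuple(s - q for q, s in zip(query, nearest))

# Open surface: a patch of the z = 0 plane, sampled on an integer grid
plane = [(x, y, 0.0) for x in range(-2, 3) for y in range(-2, 3)]
v = gdf((0.0, 0.0, 1.5), plane)
print(v)                         # → (0.0, 0.0, -1.5)
print(math.dist(v, (0, 0, 0)))   # unsigned distance: 1.5
```

Unlike a scalar UDF, this field stays differentiable across the surface: the vector flips direction smoothly as the query crosses it, instead of the distance value hitting a non-differentiable minimum.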

[CV-68] Exploiting Semantic Scene Reconstruction for Estimating Building Envelope Characteristics

链接: https://arxiv.org/abs/2410.22383
作者: Chenghao Xu,Malcolm Mielle,Antoine Laborde,Ali Waseem,Florent Forest,Olga Fink
关键词-EN: climate neutrality goal, neutrality goal requires, goal requires retrofitting, requires retrofitting existing, building
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Achieving the EU’s climate neutrality goal requires retrofitting existing buildings to reduce energy use and emissions. A critical step in this process is the precise assessment of geometric building envelope characteristics to inform retrofitting decisions. Previous methods for estimating building characteristics, such as window-to-wall ratio, building footprint area, and the location of architectural elements, have primarily relied on applying deep-learning-based detection or segmentation techniques on 2D images. However, these approaches tend to focus on planar facade properties, limiting their accuracy and comprehensiveness when analyzing complete building envelopes in 3D. While neural scene representations have shown exceptional performance in indoor scene reconstruction, they remain under-explored for external building envelope analysis. This work addresses this gap by leveraging cutting-edge neural surface reconstruction techniques based on signed distance function (SDF) representations for 3D building analysis. We propose BuildNet3D, a novel framework to estimate geometric building characteristics from 2D image inputs. By integrating SDF-based representation with semantic modality, BuildNet3D recovers fine-grained 3D geometry and semantics of building envelopes, which are then used to automatically extract building characteristics. Our framework is evaluated on a range of complex building structures, demonstrating high accuracy and generalizability in estimating window-to-wall ratio and building footprint. The results underscore the effectiveness of BuildNet3D for practical applications in building analysis and retrofitting.

[CV-69] Accelerating Augmentation Invariance Pretraining

链接: https://arxiv.org/abs/2410.22364
作者: Jinhong Lin,Cheng-En Wu,Yibing Wei,Pedro Morgado
关键词-EN: Vision Transformers, work tackles, challenges of contrastive, contrastive learning, contrastive learning methods
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Our work tackles the computational challenges of contrastive learning methods, particularly for the pretraining of Vision Transformers (ViTs). Despite the effectiveness of contrastive learning, the substantial computational resources required for training often hinder their practical application. To mitigate this issue, we propose an acceleration framework, leveraging ViT’s unique ability to generalize across inputs of varying sequence lengths. Our method employs a mix of sequence compression strategies, including randomized token dropout and flexible patch scaling, to reduce the cost of gradient estimation and accelerate convergence. We further provide an in-depth analysis of the gradient estimation error of various acceleration strategies as well as their impact on downstream tasks, offering valuable insights into the trade-offs between acceleration and performance. We also propose a novel procedure to identify an optimal acceleration schedule to adjust the sequence compression ratios to the training progress, ensuring efficient training without sacrificing downstream performance. Our approach significantly reduces computational overhead across various self-supervised learning algorithms on large-scale datasets. In ImageNet, our method achieves speedups of 4× in MoCo, 3.3× in SimCLR, and 2.5× in DINO, demonstrating substantial efficiency gains.
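Of the sequence-compression strategies mentioned, randomized token dropout is the simplest to sketch: drop a random subset of patch tokens while preserving order, shrinking the self-attention cost (which grows roughly quadratically in sequence length). The function below is a generic illustration, not the paper's schedule-aware implementation:

```python
import random

def random_token_dropout(tokens, keep_ratio, seed=None):
    """Keep a random, order-preserving subset of patch tokens."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx]

patches = [f"patch_{i}" for i in range(196)]  # 14x14 patch grid of a 224px image
kept = random_token_dropout(patches, keep_ratio=0.5, seed=0)
print(len(kept))  # → 98: half the tokens, so roughly 4x cheaper self-attention
```

The paper's scheduling procedure would then vary `keep_ratio` over the course of training, trading gradient-estimation error early on for full-resolution sequences near convergence.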

[CV-70] bit2bit: 1-bit quanta video reconstruction via self-supervised photon prediction NEURIPS2024

链接: https://arxiv.org/abs/2410.23247
作者: Yehe Liu,Alexander Krull,Hector Basevi,Ales Leonardis,Michael W. Jenkins
关键词-EN: emerging sensor technology, Quanta image sensors, arrays representing photon, representing photon detection, sensor technology
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: NeurIPS 2024

点击查看摘要

Abstract:Quanta image sensors, such as SPAD arrays, are an emerging sensor technology, producing 1-bit arrays representing photon detection events over exposures as short as a few nanoseconds. In practice, raw data are post-processed using heavy spatiotemporal binning to create more useful and interpretable images at the cost of degrading spatiotemporal resolution. In this work, we propose bit2bit, a new method for reconstructing high-quality image stacks at the original spatiotemporal resolution from sparse binary quanta image data. Inspired by recent work on Poisson denoising, we developed an algorithm that creates a dense image sequence from sparse binary photon data by predicting the photon arrival location probability distribution. However, due to the binary nature of the data, we show that the assumption of a Poisson distribution is inadequate. Instead, we model the process with a Bernoulli lattice process from the truncated Poisson. This leads to the proposal of a novel self-supervised solution based on a masked loss function. We evaluate our method using both simulated and real data. On simulated data from a conventional video, we achieve 34.35 mean PSNR with extremely photon-sparse binary input (0.06 photons per pixel per frame). We also present a novel dataset containing a wide range of real SPAD high-speed videos under various challenging imaging conditions. The scenes cover strong/weak ambient light, strong motion, ultra-fast events, etc., which will be made available to the community, on which we demonstrate the promise of our approach. Both reconstruction quality and throughput substantially surpass the state-of-the-art methods (e.g., Quanta Burst Photography (QBP)). Our approach significantly enhances the visualization and usability of the data, enabling the application of existing analysis techniques.
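The distributional point in the abstract is concrete: binarizing Poisson photon counts yields a Bernoulli variable with parameter 1 − exp(−λ), so the per-pixel rate can be recovered by inverting the empirical detection frequency. A small simulation sketch at the paper's quoted sparsity (0.06 photons per pixel per frame); the estimator here illustrates the statistical relationship, not the paper's learned reconstruction:

```python
import math
import random

def detection_prob(rate):
    """Bernoulli parameter induced by binarizing Poisson(rate) counts:
    P(at least one photon) = 1 - exp(-rate)."""
    return 1.0 - math.exp(-rate)

def estimate_rate(binary_frames):
    """Invert the empirical detection frequency to recover the photon rate."""
    p = sum(binary_frames) / len(binary_frames)
    return -math.log(1.0 - p)

random.seed(0)
true_rate = 0.06  # photons per pixel per frame, matching the paper's simulation
frames = [1 if random.random() < detection_prob(true_rate) else 0
          for _ in range(200_000)]
print(round(estimate_rate(frames), 3))  # close to 0.06
```

At such low rates 1 − exp(−λ) ≈ λ, which is why a plain Poisson model nearly works, but the Bernoulli truncation matters once multi-photon events become non-negligible.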

[CV-71] Nested ResNet: A Vision-Based Method for Detecting the Sensing Area of a Drop-in Gamma Probe

链接: https://arxiv.org/abs/2410.23154
作者: Songyu Xu,Yicheng Hu,Jionglong Su,Daniel Elson,Baoru Huang
关键词-EN: lymph node detection, robotic-assisted minimally invasive, minimally invasive surgery, node detection, robotic-assisted minimally
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Purpose: Drop-in gamma probes are widely used in robotic-assisted minimally invasive surgery (RAMIS) for lymph node detection. However, these devices only provide audio feedback on signal intensity, lacking the visual feedback necessary for precise localisation. Previous work attempted to predict the sensing area location using laparoscopic images, but the prediction accuracy was unsatisfactory. Improvements are needed in the deep learning-based regression approach. Methods: We introduce a three-branch deep learning framework to predict the sensing area of the probe. Specifically, we utilise the stereo laparoscopic images as input for the main branch and develop a Nested ResNet architecture. The framework also incorporates depth estimation via transfer learning and orientation guidance through probe axis sampling. The combined features from each branch enhanced the accuracy of the prediction. Results: Our approach has been evaluated on a publicly available dataset, demonstrating superior performance over previous methods. In particular, our method resulted in a 22.10% decrease in 2D mean error and a 41.67% reduction in 3D mean error. Additionally, qualitative comparisons further demonstrated the improved precision of our approach. Conclusion: With extensive evaluation, our solution significantly enhances the accuracy and reliability of sensing area predictions. This advancement enables visual feedback during the use of the drop-in gamma probe in surgery, providing surgeons with more accurate and reliable localisation.

[CV-72] Compositional Segmentation of Cardiac Images Leveraging Metadata WACV

链接: https://arxiv.org/abs/2410.23130
作者: Abbas Khan,Muhammad Asad,Martin Benning,Caroline Roney,Gregory Slabaugh
关键词-EN: automated cardiac function, cardiac function assessment, Cross-Modal Feature Integration, structures over time, essential for automated
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:Cardiac image segmentation is essential for automated cardiac function assessment and monitoring of changes in cardiac structures over time. Inspired by coarse-to-fine approaches in image analysis, we propose a novel multitask compositional segmentation approach that can simultaneously localize the heart in a cardiac image and perform part-based segmentation of different regions of interest. We demonstrate that this compositional approach achieves better results than direct segmentation of the anatomies. Further, we propose a novel Cross-Modal Feature Integration (CMFI) module to leverage the metadata related to cardiac imaging collected during image acquisition. We perform experiments on two different modalities, MRI and ultrasound, using public datasets, Multi-disease, Multi-View, and Multi-Centre (M&Ms-2) and Multi-structure Ultrasound Segmentation (CAMUS) data, to showcase the efficiency of the proposed compositional segmentation method and Cross-Modal Feature Integration module incorporating metadata within the proposed compositional segmentation network. The source code is available: this https URL.

[CV-73] AI-assisted prostate cancer detection and localisation on biparametric MR by classifying radiologist-positives

Link: https://arxiv.org/abs/2410.23084
Authors: Xiangcen Wu, Yipei Wang, Qianye Yang, Natasha Thorley, Shonit Punwani, Veeru Kasivisvanathan, Ester Bonmati, Yipeng Hu
Keywords-EN: Prostate cancer diagnosis, whilst modern AI-based, modern AI-based methods, detect clinically significant, clinically significant cancers
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Prostate cancer diagnosis through MR imaging currently relies on radiologists’ interpretation, whilst modern AI-based methods have been developed to detect clinically significant cancers independent of radiologists. In this study, we propose to develop deep learning models that improve the overall cancer diagnostic accuracy, by classifying radiologist-identified patients or lesions (i.e. radiologist-positives), as opposed to the existing models that are trained to discriminate over all patients. We develop a single voxel-level classification model, with a simple percentage threshold to determine positive cases, at levels of lesions, Barzell-zones and patients. Based on the presented experiments from two clinical data sets, consisting of histopathology-labelled MR images from more than 800 and 500 patients in the respective UCLA and UCL PROMIS studies, we show that the proposed strategy can improve the diagnostic accuracy, by augmenting the radiologist reading of the MR imaging. Among varying definitions of clinical significance, the proposed strategy, for example, achieved a specificity of 44.1% (with AI assistance) from 36.3% (by radiologists alone), at a controlled sensitivity of 80.0% on the publicly available UCLA data set. This provides measurable clinical values in a range of applications such as reducing unnecessary biopsies, lowering cost in cancer screening and quantifying risk in therapies.
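The percentage-threshold rule described above can be sketched as follows. The 0.5 voxel cutoff and the 5% threshold are illustrative assumptions for this sketch, not values taken from the paper.

```python
import numpy as np

def classify_lesion(voxel_probs, percent_threshold=0.05):
    # A lesion (or Barzell zone, or patient) is called positive when the
    # fraction of its voxels classified positive exceeds a simple
    # percentage threshold, as described in the abstract.
    positive_fraction = (voxel_probs >= 0.5).mean()
    return positive_fraction > percent_threshold

rng = np.random.default_rng(0)
benign = rng.uniform(0.0, 0.4, size=1000)                 # no voxel reaches 0.5
suspicious = np.concatenate([benign, np.full(100, 0.9)])  # ~9% confident voxels
print(classify_lesion(benign), classify_lesion(suspicious))
```

The same function can be reused at each level of the hierarchy (lesion, zone, patient) by changing which voxels are pooled.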

[CV-74] Towards Population Scale Testis Volume Segmentation in DIXON MRI

Link: https://arxiv.org/abs/2410.22866
Authors: Jan Ernsting, Phillip Nikolas Beeken, Lynn Ogoniak, Jacqueline Kockwelp, Tim Hahn, Alexander Siegfried Busch, Benjamin Risse
Keywords-EN: male fertility, main predictors, predictors of male, assessed in clinical, clinical workup
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Testis size is known to be one of the main predictors of male fertility, usually assessed in clinical workup via palpation or imaging. Despite its potential, population-level evaluation of testicular volume using imaging remains underexplored. Previous studies, limited by small and biased datasets, have demonstrated the feasibility of machine learning for testis volume segmentation. This paper presents an evaluation of segmentation methods for testicular volume using Magnetic Resonance Imaging data from the UK Biobank. The best model achieves a median Dice score of 0.87, compared to a median Dice score of 0.83 for human interrater reliability on the same dataset, enabling large-scale annotation on a population scale for the first time. Our overall aim is to provide a trained model, comparative baseline methods, and annotated training data to enhance accessibility and reproducibility in testis MRI segmentation research.

[CV-75] Latent Diffusion Implicit Amplification: Efficient Continuous-Scale Super-Resolution for Remote Sensing Images

Link: https://arxiv.org/abs/2410.22830
Authors: Hanlin Wu, Jiangwei Mo, Xiaohui Sun, Jie Ma
Keywords-EN: significantly improved performance, Recent advancements, general image generation, performance in super-resolution, significantly improved
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Recent advancements in diffusion models have significantly improved performance in super-resolution (SR) tasks. However, previous research often overlooks the fundamental differences between SR and general image generation. General image generation involves creating images from scratch, while SR focuses specifically on enhancing existing low-resolution (LR) images by adding typically missing high-frequency details. This oversight not only increases the training difficulty but also limits their inference efficiency. Furthermore, previous diffusion-based SR methods are typically trained and inferred at fixed integer scale factors, lacking flexibility to meet the needs of up-sampling with non-integer scale factors. To address these issues, this paper proposes an efficient and elastic diffusion-based SR model (E^2DiffSR), specially designed for continuous-scale SR in remote sensing imagery. E^2DiffSR employs a two-stage latent diffusion paradigm. During the first stage, an autoencoder is trained to capture the differential priors between high-resolution (HR) and LR images. The encoder intentionally ignores the existing LR content to alleviate the encoding burden, while the decoder introduces an SR branch equipped with a continuous scale upsampling module to accomplish the reconstruction under the guidance of the differential prior. In the second stage, a conditional diffusion model is learned within the latent space to predict the true differential prior encoding. Experimental results demonstrate that E^2DiffSR achieves superior objective metrics and visual quality compared to the state-of-the-art SR methods. Additionally, it reduces the inference time of diffusion-based SR methods to a level comparable to that of non-diffusion methods.

[CV-76] Deep Priors for Video Quality Prediction

Link: https://arxiv.org/abs/2410.22566
Authors: Siddharath Narayan Shakya, Parimala Kancharla
Keywords-EN: deep video prior, video, deep video, video prior, completely blind video
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP) 2024 conference tiny paper

Abstract:In this work, we designed a completely blind video quality assessment algorithm using the deep video prior. This work mainly explores the utility of deep video prior in estimating the visual quality of the video. In our work, we have used a single distorted video and a reference video pair to learn the deep video prior. At inference time, the learned deep prior is used to restore the original videos from the distorted videos. The ability of learned deep video prior to restore the original video from the distorted video is measured to quantify distortion in the video. Our hypothesis is that the learned deep video prior fails in restoring the highly distorted videos. The restoring ability of deep video prior is proportional to the distortion present in the video. Therefore, we propose to use the distance between the distorted video and the restored video as the perceptual quality of the video. Our algorithm is trained using a single video pair and it does not need any labelled data. We show that our proposed algorithm outperforms the existing unsupervised video quality assessment algorithms in terms of LCC and SROCC on a synthetically distorted video quality assessment dataset.
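The proposed quality measure, the distance between a distorted video and its prior-based restoration, can be sketched as follows. The restoration network is stubbed out here, and the squared per-pixel distance is an illustrative choice, not necessarily the distance the paper uses.

```python
import numpy as np

def quality_score(distorted, restored):
    # Hypothesis from the abstract: the more the learned deep video prior
    # fails to restore a video, the more distorted it was. We negate the
    # mean squared distance so that higher scores mean higher quality.
    return -float(np.mean((distorted - restored) ** 2))

rng = np.random.default_rng(0)
clean = rng.random((8, 16, 16))                       # toy "video": 8 frames of 16x16
mild = clean + 0.05 * rng.normal(size=clean.shape)    # lightly distorted
heavy = clean + 0.50 * rng.normal(size=clean.shape)   # heavily distorted

# Stand-in for the restoration: assume the prior maps both videos back
# toward `clean`; a real system would run the learned prior network here.
print(quality_score(mild, clean), quality_score(heavy, clean))
```

Under this sketch the mildly distorted video scores higher than the heavily distorted one, matching the paper's hypothesis.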

[CV-77] Adaptive Aggregation Weights for Federated Segmentation of Pancreas MRI

Link: https://arxiv.org/abs/2410.22530
Authors: Hongyi Pan, Gorkem Durak, Zheyuan Zhang, Yavuz Taktak, Elif Keles, Halil Ertugrul Aktas, Alpay Medetalibeyoglu, Yury Velichko, Concetto Spampinato, Ivo Schoots, Marco J. Bruno, Rajesh N. Keswani, Pallavi Tiwari, Candice Bolan, Tamas Gonda, Michael G. Goggins, Michael B. Wallace, Ziyue Xu, Ulas Bagci
Keywords-EN: sharing sensitive data, medical imaging tasks, enables collaborative model, collaborative model training, Federated learning
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Abstract:Federated learning (FL) enables collaborative model training across institutions without sharing sensitive data, making it an attractive solution for medical imaging tasks. However, traditional FL methods, such as Federated Averaging (FedAvg), face difficulties in generalizing across domains due to variations in imaging protocols and patient demographics across institutions. This challenge is particularly evident in pancreas MRI segmentation, where anatomical variability and imaging artifacts significantly impact performance. In this paper, we conduct a comprehensive evaluation of FL algorithms for pancreas MRI segmentation and introduce a novel approach that incorporates adaptive aggregation weights. By dynamically adjusting the contribution of each client during model aggregation, our method accounts for domain-specific differences and improves generalization across heterogeneous datasets. Experimental results demonstrate that our approach enhances segmentation accuracy and reduces the impact of domain shift compared to conventional FL methods while maintaining privacy-preserving capabilities. Significant performance improvements are observed across multiple hospitals (centers).
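The adaptive aggregation idea can be sketched as score-weighted federated averaging. The softmax-over-client-scores rule below is an assumption for illustration; the paper's actual weighting scheme may differ.

```python
import numpy as np

def adaptive_aggregate(client_params, client_scores, temperature=1.0):
    # Turn per-client scores (e.g., local validation Dice) into aggregation
    # weights via a softmax, then average the flattened model parameters.
    # Contrast with FedAvg, which weights only by client dataset size.
    s = np.asarray(client_scores, dtype=float) / temperature
    w = np.exp(s - s.max())
    w /= w.sum()
    stacked = np.stack(client_params)              # (n_clients, n_params)
    return w, (w[:, None] * stacked).sum(axis=0)

# Three toy clients whose parameters are constant vectors 1, 2, 3.
params = [np.ones(4) * c for c in (1.0, 2.0, 3.0)]
weights, global_params = adaptive_aggregate(params, [0.9, 0.7, 0.8])
print(weights, global_params)
```

The best-scoring client receives the largest weight, and the aggregated parameters stay inside the convex hull of the client models.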

[CV-78] EfficientNet with Hybrid Attention Mechanisms for Enhanced Breast Histopathology Classification: A Comprehensive Approach

Link: https://arxiv.org/abs/2410.22392
Authors: Naren Sengodan
Keywords-EN: Breast cancer histopathology, early cancer detection, reduce mortality rates, Breast cancer, cancer histopathology image
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Breast cancer histopathology image classification is crucial for early cancer detection, offering the potential to reduce mortality rates through timely diagnosis. This paper introduces a novel approach integrating Hybrid EfficientNet models with advanced attention mechanisms, including Convolutional Block Attention Module (CBAM), Self-Attention, and Deformable Attention, to enhance feature extraction and focus on critical image regions. We evaluate the performance of our models across multiple magnification scales using publicly available histopathological datasets. Our method achieves significant improvements, with accuracy reaching 98.42% at 400X magnification, surpassing several state-of-the-art models, including VGG and ResNet architectures. The results are validated using metrics such as accuracy, F1-score, precision, and recall, demonstrating the clinical potential of our model in improving diagnostic accuracy. Furthermore, the proposed method shows increased computational efficiency, making it suitable for integration into real-time diagnostic workflows.

Machine Learning

[LG-0] Bridging the Human to Robot Dexterity Gap through Object-Oriented Rewards

Link: https://arxiv.org/abs/2410.23289
Authors: Irmak Guzey, Yinlong Dai, Georgy Savva, Raunaq Bhirangi, Lerrel Pinto
Keywords-EN: Training robots directly, Training robots, computer vision, emerging area, area in robotics
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Abstract:Training robots directly from human videos is an emerging area in robotics and computer vision. While there has been notable progress with two-fingered grippers, learning autonomous tasks for multi-fingered robot hands in this way remains challenging. A key reason for this difficulty is that a policy trained on human hands may not directly transfer to a robot hand due to morphology differences. In this work, we present HuDOR, a technique that enables online fine-tuning of policies by directly computing rewards from human videos. Importantly, this reward function is built using object-oriented trajectories derived from off-the-shelf point trackers, providing meaningful learning signals despite the morphology gap and visual differences between human and robot hands. Given a single video of a human solving a task, such as gently opening a music box, HuDOR enables our four-fingered Allegro hand to learn the task with just an hour of online interaction. Our experiments across four tasks show that HuDOR achieves a 4x improvement over baselines. Code and videos are available on our website, this https URL.
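The object-oriented reward can be sketched with point-track trajectories. The negative mean point distance used here is an illustrative stand-in for HuDOR's actual reward function, which the abstract does not spell out.

```python
import numpy as np

def trajectory_reward(robot_traj, human_traj):
    # Object-centric reward (sketch): negative mean distance between the
    # object's tracked points in the robot rollout and in the human video.
    # Comparing object trajectories, not hands, sidesteps the morphology gap.
    return -float(np.mean(np.linalg.norm(robot_traj - human_traj, axis=-1)))

human = np.linspace([0.0, 0.0], [1.0, 1.0], 50)   # object path in the human video
good = human + 0.01                                # rollout that tracks the path
bad = np.linspace([0.0, 0.0], [0.2, 0.1], 50)      # rollout that barely moves it
print(trajectory_reward(good, human), trajectory_reward(bad, human))
```

A rollout whose object trajectory matches the human video earns a higher reward, which is the signal the online fine-tuning loop would maximize.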

[LG-1] Attribute-to-Delete: Machine Unlearning via Datamodel Matching

Link: https://arxiv.org/abs/2410.23232
Authors: Kristian Georgiev, Roy Rinberg, Sung Min Park, Shivam Garg, Andrew Ilyas, Aleksander Madry, Seth Neel
Keywords-EN: recently attracted significant, attracted significant research, significant research interest, machine learning model, pre-trained machine learning
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Machine unlearning – efficiently removing the effect of a small “forget set” of training data on a pre-trained machine learning model – has recently attracted significant research interest. Despite this interest, however, recent work shows that existing machine unlearning techniques do not hold up to thorough evaluation in non-convex settings. In this work, we introduce a new machine unlearning technique that exhibits strong empirical performance even in such challenging settings. Our starting point is the perspective that the goal of unlearning is to produce a model whose outputs are statistically indistinguishable from those of a model re-trained on all but the forget set. This perspective naturally suggests a reduction from the unlearning problem to that of data attribution, where the goal is to predict the effect of changing the training set on a model’s outputs. Thus motivated, we propose the following meta-algorithm, which we call Datamodel Matching (DMM): given a trained model, we (a) use data attribution to predict the output of the model if it were re-trained on all but the forget set points; then (b) fine-tune the pre-trained model to match these predicted outputs. In a simple convex setting, we show how this approach provably outperforms a variety of iterative unlearning algorithms. Empirically, we use a combination of existing evaluations and a new metric based on the KL-divergence to show that even in non-convex settings, DMM achieves strong unlearning performance relative to existing algorithms. An added benefit of DMM is that it is a meta-algorithm, in the sense that future advances in data attribution translate directly into better unlearning algorithms, pointing to a clear direction for future progress in unlearning.
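The two-step DMM recipe lends itself to a toy sketch. Below, the "model" output is exactly linear in the training-set inclusion mask, so a least-squares datamodel recovers step (a) exactly; the linear toy model and all names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train = 20
contrib = rng.normal(size=n_train)   # per-example effect on the model output

def retrained_output(mask):
    # Toy stand-in for "retrain on the subset selected by mask and evaluate
    # on a test point": here the output is exactly linear in the mask.
    return contrib[mask].sum() / n_train

# Step (a): fit a datamodel -- a linear map from inclusion masks to model
# outputs -- from many (random subset, retrained output) pairs.
masks = rng.random((500, n_train)) < 0.5
outs = np.array([retrained_output(m) for m in masks])
A = np.column_stack([masks.astype(float), np.ones(len(masks))])
theta, *_ = np.linalg.lstsq(A, outs, rcond=None)

# Step (b): predict the output of a model retrained on everything but the
# forget set; DMM then fine-tunes the pre-trained model to match this.
forget = np.zeros(n_train, dtype=bool)
forget[:5] = True
predicted = np.append((~forget).astype(float), 1.0) @ theta
actual = retrained_output(~forget)
print(predicted, actual)
```

Because the toy output is linear in the mask, the datamodel's prediction coincides with the true retrained output; real models only satisfy this approximately.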

[LG-2] Emergence of meta-stable clustering in mean-field transformer models

Link: https://arxiv.org/abs/2410.23228
Authors: Giuseppe Bruno, Federico Pasqualotto, Andrea Agazzi
Keywords-EN: interacting particle system, mean-field interacting particle, Partial Differential Equation, mean-field Partial Differential, stack of Transformer
Subjects: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*Comments: 37 pages, 6 figures

Abstract:We model the evolution of tokens within a deep stack of Transformer layers as a continuous-time flow on the unit sphere, governed by a mean-field interacting particle system, building on the framework introduced in (Geshkovski et al., 2023). Studying the corresponding mean-field Partial Differential Equation (PDE), which can be interpreted as a Wasserstein gradient flow, in this paper we provide a mathematical investigation of the long-term behavior of this system, with a particular focus on the emergence and persistence of meta-stable phases and clustering phenomena, key elements in applications like next-token prediction. More specifically, we perform a perturbative analysis of the mean-field PDE around the iid uniform initialization and prove that, in the limit of large number of tokens, the model remains close to a meta-stable manifold of solutions with a given structure (e.g., periodicity). Further, the structure characterizing the meta-stable manifold is explicitly identified, as a function of the inverse temperature parameter of the model, by the index maximizing a certain rescaling of Gegenbauer polynomials.

[LG-3] (FL)^2: Overcoming Few Labels in Federated Semi-Supervised Learning NEURIPS2024

Link: https://arxiv.org/abs/2410.23227
Authors: Seungjoo Lee, Thanh-Long V. Le, Jaemin Shin, Sung-Ju Lee
Keywords-EN: trains accurate global, accurate global models, preserving clients’ privacy-sensitive, distributed machine learning, machine learning framework
Subjects: Machine Learning (cs.LG)
*Comments: Accepted to NeurIPS 2024

Abstract:Federated Learning (FL) is a distributed machine learning framework that trains accurate global models while preserving clients’ privacy-sensitive data. However, most FL approaches assume that clients possess labeled data, which is often not the case in practice. Federated Semi-Supervised Learning (FSSL) addresses this label deficiency problem, targeting situations where only the server has a small amount of labeled data while clients do not. However, a significant performance gap exists between Centralized Semi-Supervised Learning (SSL) and FSSL. This gap arises from confirmation bias, which is more pronounced in FSSL due to multiple local training epochs and the separation of labeled and unlabeled data. We propose (FL)^2 , a robust training method for unlabeled clients using sharpness-aware consistency regularization. We show that regularizing the original pseudo-labeling loss is suboptimal, and hence we carefully select unlabeled samples for regularization. We further introduce client-specific adaptive thresholding and learning status-aware aggregation to adjust the training process based on the learning progress of each client. Our experiments on three benchmark datasets demonstrate that our approach significantly improves performance and bridges the gap with SSL, particularly in scenarios with scarce labeled data.
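The pseudo-label selection step can be sketched as confidence-filtered pseudo-labeling. The fixed 0.7 threshold below stands in for the client-specific, learning-status-aware adaptive threshold described above; it is an illustrative value, not the paper's.

```python
import numpy as np

def select_pseudo_labels(probs, client_threshold):
    # Keep only unlabeled samples whose maximum class probability clears
    # this client's threshold; the abstract's adaptive scheme would raise
    # the threshold as the client's learning progresses.
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = conf >= client_threshold
    return labels[keep], keep

probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],   # too uncertain: dropped
                  [0.20, 0.80]])
labels, keep = select_pseudo_labels(probs, client_threshold=0.7)
print(labels, keep)
```

Filtering out low-confidence predictions is one standard way to limit the confirmation bias the abstract identifies.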

[LG-4] Does equivariance matter at scale?

Link: https://arxiv.org/abs/2410.23179
Authors: Johann Brehmer, Sönke Behrends, Pim de Haan, Taco Cohen
Keywords-EN: large data sets, design neural architectures, beneficial to design, design neural, structure and symmetries
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Given large data sets and sufficient compute, is it beneficial to design neural architectures for the structure and symmetries of each problem? Or is it more efficient to learn them from data? We study empirically how equivariant and non-equivariant networks scale with compute and training samples. Focusing on a benchmark problem of rigid-body interactions and on general-purpose transformer architectures, we perform a series of experiments, varying the model size, training steps, and dataset size. We find evidence for three conclusions. First, equivariance improves data efficiency, but training non-equivariant models with data augmentation can close this gap given sufficient epochs. Second, scaling with compute follows a power law, with equivariant models outperforming non-equivariant ones at each tested compute budget. Finally, the optimal allocation of a compute budget onto model size and training duration differs between equivariant and non-equivariant models.
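The compute scaling reported above follows a power law, loss ≈ a · C^(−b); fitting such a law is a linear regression in log-log space. The data below is synthetic and noise-free, purely to illustrate the fitting procedure.

```python
import numpy as np

# Synthetic scaling curve with exponent b = 0.3 and prefactor a = 2.0.
compute = np.array([1e1, 1e2, 1e3, 1e4, 1e5])
loss = 2.0 * compute ** -0.3

# In log space, log(loss) = log(a) - b * log(C): an ordinary linear fit.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
exponent, prefactor = -slope, np.exp(intercept)
print(exponent, prefactor)   # recovers b = 0.3 and a = 2.0
```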

[LG-5] The Persistence of Neural Collapse Despite Low-Rank Bias: An Analytic Perspective Through Unconstrained Features

Link: https://arxiv.org/abs/2410.23169
Authors: Connall Garrod, Jonathan P. Keating
Keywords-EN: Modern deep neural, deep neural collapse, Modern deep, neural collapse, deep neural
Subjects: Machine Learning (cs.LG)
*Comments: 40 pages

Abstract:Modern deep neural networks have been observed to exhibit a simple structure in their final layer features and weights, commonly referred to as neural collapse. This phenomenon has also been noted in layers beyond the final one, an extension known as deep neural collapse. Recent findings indicate that such a structure is generally not optimal in the deep unconstrained feature model, an approximation of an expressive network. This is attributed to a low-rank bias induced by regularization, which favors solutions with lower-rank than those typically associated with deep neural collapse. In this work, we extend these observations to the cross-entropy loss and analyze how the low-rank bias influences various solutions. Additionally, we explore how this bias induces specific structures in the singular values of the weights at global optima. Furthermore, we examine the loss surface of these models and provide evidence that the frequent observation of deep neural collapse in practice, despite its suboptimality, may result from its higher degeneracy on the loss surface.

[LG-6] TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Link: https://arxiv.org/abs/2410.23168
Authors: Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele
Keywords-EN: foundation models due, model, parameters, model parameters, Transformers
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at this https URL.
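A minimal numpy sketch of the token-parameter attention layer described above: input tokens act as queries over learnable key-value parameter pairs, and scaling the model means appending new pairs. Dimensions and the plain softmax are illustrative assumptions; the actual TokenFormer layer may normalize differently.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def token_parameter_attention(x, key_params, value_params):
    # x: (n_tokens, d_in). key_params: (n_params, d_in), value_params:
    # (n_params, d_out). This replaces a d_in -> d_out linear projection:
    # each input token attends over the parameter tokens.
    scores = x @ key_params.T            # (n_tokens, n_params)
    return softmax(scores) @ value_params

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
K = rng.normal(size=(16, 8))             # 16 parameter tokens
V = rng.normal(size=(16, 32))
y = token_parameter_attention(x, K, V)

# Scaling the model = appending new key-value pairs; old parameters are
# kept as-is, so no retraining from scratch is required.
K2 = np.concatenate([K, rng.normal(size=(8, 8))])
V2 = np.concatenate([V, rng.normal(size=(8, 32))])
y2 = token_parameter_attention(x, K2, V2)
print(y.shape, y2.shape)
```

Note that the output dimensionality is fixed by the value parameters, so the interface of the layer is unchanged as parameter tokens are added.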

[LG-7] Directional anomaly detection

Link: https://arxiv.org/abs/2410.23158
Authors: Oliver Urs Lenz, Matthijs van Leeuwen
Keywords-EN: Semi-supervised anomaly detection, normal training data, Semi-supervised anomaly, principle that potential, normal training
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Semi-supervised anomaly detection is based on the principle that potential anomalies are those records that look different from normal training data. However, in some cases we are specifically interested in anomalies that correspond to high attribute values (or low, but not both). We present two asymmetrical distance measures that take this directionality into account: ramp distance and signed distance. Through experiments on synthetic and real-life datasets we show that ramp distance performs as well or better than the absolute distance traditionally used in anomaly detection. While signed distance also performs well on synthetic data, it performs substantially poorer on real-life datasets. We argue that this reflects the fact that in practice, good scores on some attributes should not be allowed to compensate for bad scores on others.
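The abstract does not spell out the two measures, so the following is a plausible sketch under stated assumptions: ramp distance counts only deviations above the normal profile, while signed distance lets low values on some attributes compensate for high values on others.

```python
import numpy as np

def ramp_distance(x, center):
    # Only attribute values *above* the normal center contribute; values
    # below are clipped to zero, since only high values are of interest.
    return float(np.linalg.norm(np.maximum(x - center, 0.0)))

def signed_distance(x, center):
    # Deviations are summed with their signs, so low values can cancel
    # high ones -- the behaviour the paper finds brittle on real data.
    d = x - center
    return float(np.sum(d) / np.sqrt(len(d)))

center = np.zeros(3)
x = np.array([2.0, -5.0, 0.0])
print(ramp_distance(x, center))    # only the +2 deviation counts
print(signed_distance(x, center))  # negative: the -5 dominates
```

On this example the ramp distance flags the high attribute regardless of the low one, while the signed distance is dragged negative by it.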

[LG-8] QWO: Speeding Up Permutation-Based Causal Discovery in LiGAMs

Link: https://arxiv.org/abs/2410.23155
Authors: Mohammad Shahverdikondori, Ehsan Mokhtarian, Negar Kiyavash
Keywords-EN: Gaussian Acyclic Models, Linear Gaussian Acyclic, scientific domains, discovery is essential, essential for understanding
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*Comments: 21 pages, 4 figures

Abstract:Causal discovery is essential for understanding relationships among variables of interest in many scientific domains. In this paper, we focus on permutation-based methods for learning causal graphs in Linear Gaussian Acyclic Models (LiGAMs), where the permutation encodes a causal ordering of the variables. Existing methods in this setting are not scalable due to their high computational complexity. These methods are comprised of two main components: (i) constructing a specific DAG, \mathcal{G}^{\pi}, for a given permutation \pi, which represents the best structure that can be learned from the available data while adhering to \pi, and (ii) searching over the space of permutations (i.e., causal orders) to minimize the number of edges in \mathcal{G}^{\pi}. We introduce QWO, a novel approach that significantly enhances the efficiency of computing \mathcal{G}^{\pi} for a given permutation \pi. QWO has a speed-up of O(n^2) (n is the number of variables) compared to the state-of-the-art BIC-based method, making it highly scalable. We show that our method is theoretically sound and can be integrated into existing search strategies such as GRASP and hill-climbing-based methods to improve their performance.

[LG-9] HiBO: Hierarchical Bayesian Optimization via Adaptive Search Space Partitioning

Link: https://arxiv.org/abs/2410.23148
Authors: Wenxuan Li, Taiyi Wang, Eiko Yoneki
Keywords-EN: traditional Bayesian Optimization, Optimizing black-box functions, Bayesian Optimization, Optimizing black-box, traditional Bayesian
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Optimizing black-box functions in high-dimensional search spaces has been known to be challenging for traditional Bayesian Optimization (BO). In this paper, we introduce HiBO, a novel hierarchical algorithm integrating global-level search space partitioning information into the acquisition strategy of a local BO-based optimizer. HiBO employs a search-tree-based global-level navigator to adaptively split the search space into partitions with different sampling potential. The local optimizer then utilizes this global-level information to guide its acquisition strategy towards most promising regions within the search space. A comprehensive set of evaluations demonstrates that HiBO outperforms state-of-the-art methods in high-dimensional synthetic benchmarks and presents significant practical effectiveness in the real-world task of tuning configurations of database management systems (DBMSs).

[LG-10] FoLDTree: A ULDA-Based Decision Tree Framework for Efficient Oblique Splits and Feature Selection

Link: https://arxiv.org/abs/2410.23147
Authors: Siyu Wang
Keywords-EN: true decision boundaries, oblique decision tree, decision tree methods, Linear Discriminant Analysis, Uncorrelated Linear Discriminant
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*Comments:

Abstract:Traditional decision trees are limited by axis-orthogonal splits, which can perform poorly when true decision boundaries are oblique. While oblique decision tree methods address this limitation, they often face high computational costs, difficulties with multi-class classification, and a lack of effective feature selection. In this paper, we introduce LDATree and FoLDTree, two novel frameworks that integrate Uncorrelated Linear Discriminant Analysis (ULDA) and Forward ULDA into a decision tree structure. These methods enable efficient oblique splits, handle missing values, support feature selection, and provide both class labels and probabilities as model outputs. Through evaluations on simulated and real-world datasets, LDATree and FoLDTree consistently outperform axis-orthogonal and other oblique decision tree methods, achieving accuracy levels comparable to the random forest. The results highlight the potential of these frameworks as robust alternatives to traditional single-tree methods.
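The core idea, an oblique split along a discriminant direction, can be sketched for two classes. This is the textbook Fisher/LDA split, not the papers' full ULDA or forward-selection procedure; it illustrates why oblique splits beat axis-orthogonal ones on slanted boundaries.

```python
import numpy as np

def lda_oblique_split(X, y):
    # Project onto the two-class Fisher/LDA direction and threshold at the
    # midpoint of the projected class means. One such split replaces many
    # axis-orthogonal "staircase" splits when the boundary is oblique.
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)   # within-class scatter
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), m1 - m0)
    t = 0.5 * (m0 + m1) @ w
    return w, t, X @ w > t

rng = np.random.default_rng(0)
# An oblique boundary (x1 + x2 = 0) that axis-orthogonal splits handle poorly.
X = rng.normal(size=(400, 2))
y = (X.sum(axis=1) > 0).astype(int)
w, t, pred = lda_oblique_split(X, y)
print((pred == y).mean())
```

A single LDA split recovers this boundary almost exactly, whereas a single axis-orthogonal split caps out near 75% accuracy on the same data.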

[LG-11] Federated Learning under Periodic Client Participation and Heterogeneous Data: A New Communication-Efficient Algorithm and Analysis NEURIPS2024

Link: https://arxiv.org/abs/2410.23131
Authors: Michael Crawshaw, Mingrui Liu
Keywords-EN: federated learning, devices in practice, common to assume, feasible with user, user devices
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: NeurIPS 2024

Abstract:In federated learning, it is common to assume that clients are always available to participate in training, which may not be feasible with user devices in practice. Recent works analyze federated learning under more realistic participation patterns, such as cyclic client availability or arbitrary participation. However, all such works either require strong assumptions (e.g., all clients participate almost surely within a bounded window), do not achieve linear speedup and reduced communication rounds, or are not applicable in the general non-convex setting. In this work, we focus on nonconvex optimization and consider participation patterns in which the chance of participation over a fixed window of rounds is equal among all clients, which includes cyclic client availability as a special case. Under this setting, we propose a new algorithm, named Amplified SCAFFOLD, and prove that it achieves linear speedup, reduced communication, and resilience to data heterogeneity simultaneously. In particular, for cyclic participation, our algorithm is proved to enjoy \mathcal{O}(\epsilon^{-2}) communication rounds to find an \epsilon-stationary point in the non-convex stochastic setting. In contrast, the prior work under the same setting requires \mathcal{O}(\kappa^2 \epsilon^{-4}) communication rounds, where \kappa denotes the data heterogeneity. Therefore, our algorithm significantly reduces communication rounds due to better dependency in terms of \epsilon and \kappa. Our analysis relies on a fine-grained treatment of the nested dependence between client participation and errors in the control variates, which results in tighter guarantees than previous work. We also provide experimental results with (1) synthetic data and (2) real-world data with a large number of clients (N = 250), demonstrating the effectiveness of our algorithm under periodic client participation.

[LG-12] Statistical-Computational Trade-offs for Density Estimation NEURIPS2024

Link: https://arxiv.org/abs/2410.23087
Authors: Anders Aamand, Alexandr Andoni, Justin Y. Chen, Piotr Indyk, Shyam Narayanan, Sandeep Silwal, Haike Xu
Keywords-EN: data structure, estimation problem defined, data, problem defined, query time
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: To appear at NeurIPS 2024

Abstract:We study the density estimation problem defined as follows: given k distributions p_1, \ldots, p_k over a discrete domain [n], as well as a collection of samples chosen from a “query” distribution q over [n], output the p_i that is “close” to q. Recently, Aamand et al. (2023) gave the first and only known result that achieves sublinear bounds in both the sampling complexity and the query time while preserving polynomial data structure space. However, their improvement over linear samples and time is only by subpolynomial factors. Our main result is a lower bound showing that, for a broad class of data structures, their bounds cannot be significantly improved. In particular, if an algorithm uses O(n/\log^c k) samples for some constant c > 0 and polynomial space, then the query time of the data structure must be at least k^{1 - O(1)/\log\log k}, i.e., close to linear in the number of distributions k. This is a novel statistical-computational trade-off for density estimation, demonstrating that any data structure must use close to a linear number of samples or take close to linear query time. The lower bound holds even in the realizable case where q = p_i for some i, and when the distributions are flat (specifically, all distributions are uniform over half of the domain [n]). We also give a simple data structure for our lower bound instance with asymptotically matching upper bounds. Experiments show that the data structure is quite efficient in practice.
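The query task can be made concrete with the naive linear-scan baseline that the lower bound is measured against, run on a miniature version of the paper's hard "flat" instance. The log-likelihood scan below is an illustrative baseline, not the paper's data structure.

```python
import numpy as np

def nearest_distribution(distributions, samples):
    # Linear scan: return the index of the p_i with the highest
    # log-likelihood on the query samples. The lower bound says, roughly,
    # that with few samples no data structure can beat this
    # near-linear-in-k query time by much.
    best, best_ll = None, -np.inf
    for i, p in enumerate(distributions):
        ll = np.sum(np.log(np.maximum(p[samples], 1e-12)))
        if ll > best_ll:
            best, best_ll = i, ll
    return best

# The hard instance in miniature: flat distributions, each uniform on
# half of the domain [n] with n = 8.
p0 = np.array([0.25] * 4 + [0.0] * 4)
p1 = np.array([0.0] * 4 + [0.25] * 4)
rng = np.random.default_rng(0)
samples = rng.choice(8, size=20, p=p1)   # realizable case: q = p1
print(nearest_distribution([p0, p1], samples))
```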

[LG-13] Legitimate ground-truth-free metrics for deep uncertainty classification scoring

链接: https://arxiv.org/abs/2410.23046
作者: Arthur Pignet,Chiara Regniez,John Klein
关键词-EN: production remains limited, machine learning practices, safer machine learning, Uncertainty Quantification, remains limited
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the increasing demand for safer machine learning practices, the use of Uncertainty Quantification (UQ) methods in production remains limited. This limitation is exacerbated by the challenge of validating UQ methods in absence of UQ ground truth. In classification tasks, when only a usual set of test data is at hand, several authors suggested different metrics that can be computed from such test points while assessing the quality of quantified uncertainties. This paper investigates such metrics and proves that they are theoretically well-behaved and actually tied to some uncertainty ground truth which is easily interpretable in terms of model prediction trustworthiness ranking. Equipped with those new results, and given the applicability of those metrics in the usual supervised paradigm, we argue that our contributions will help promoting a broader use of UQ in deep learning.
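The metrics discussed above rank predictions by quantified uncertainty and check how well that ranking tracks prediction trustworthiness. A generic sketch of one such test-set-computable quantity, selective accuracy at several coverage levels, is below; the synthetic confidence/correctness data are hypothetical and not from the paper.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical test-set outputs: a confidence score and whether the model
# was correct; correctness is made more likely at high confidence.
conf = rng.uniform(size=1000)
correct = rng.uniform(size=1000) < conf

order = np.argsort(-conf)                      # most confident first
coverages = [0.2, 0.5, 1.0]
selective_acc = [correct[order[: int(c * 1000)]].mean() for c in coverages]
print(selective_acc)  # accuracy shrinks as less-confident predictions enter
```

A well-behaved uncertainty estimate yields a monotonically decreasing accuracy-coverage curve, which needs no UQ ground truth to compute.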

[LG-14] Toward Understanding In-context vs. In-weight Learning

链接: https://arxiv.org/abs/2410.23042
作者: Bryan Chan,Xinyi Chen,András György,Dale Schuurmans
关键词-EN: training data, distributional properties, recently been demonstrated, demonstrated empirically, simplified distributional properties
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:It has recently been demonstrated empirically that in-context learning emerges in transformers when certain distributional properties are present in the training data, but this ability can also diminish upon further training. We provide a new theoretical understanding of these phenomena by identifying simplified distributional properties that give rise to the emergence and eventual disappearance of in-context learning. We do so by first analyzing a simplified model that uses a gating mechanism to choose between an in-weight and an in-context predictor. Through a combination of a generalization error and regret analysis we identify conditions where in-context and in-weight learning emerge. These theoretical findings are then corroborated experimentally by comparing the behaviour of a full transformer on the simplified distributions to that of the stylized model, demonstrating aligned results. We then extend the study to a full large language model, showing how fine-tuning on various collections of natural language prompts can elicit similar in-context and in-weight learning behaviour.
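The stylized gating model described above can be caricatured in a few lines: a scalar gate interpolates between an in-context predictor (copy the label paired with the query in the context) and an in-weight predictor (a fixed lookup table standing in for knowledge stored in the weights). All names and values are illustrative.

```python
# Knowledge "stored in the weights": a fixed lookup table (hypothetical).
in_weight_table = {"a": 0, "b": 1}

def predict(context, query, gate):
    """gate in [0, 1]: 1 -> trust the context, 0 -> trust the weights."""
    ic = dict(context).get(query, 0)      # in-context: copy from the context
    iw = in_weight_table.get(query, 0)    # in-weight: table lookup
    return gate * ic + (1 - gate) * iw

# The context relabels "a" -> 1, contradicting the stored knowledge.
context = [("a", 1), ("b", 0)]
print(predict(context, "a", gate=1.0), predict(context, "a", gate=0.0))
```

The paper's analysis concerns when training drives such a gate toward the in-context or the in-weight branch; this sketch only shows the two branches the gate arbitrates between.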

[LG-15] Planning and Learning in Risk-Aware Restless Multi-Arm Bandit Problem

链接: https://arxiv.org/abs/2410.23029
作者: Nima Akbarzadeh,Erick Delage,Yossiri Adulyasak
关键词-EN: Markov decision process, optimally distributing limited, distributing limited resources, restless multi-arm bandits, Markov decision
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:In restless multi-arm bandits, a central agent is tasked with optimally distributing limited resources across several bandits (arms), with each arm being a Markov decision process. In this work, we generalize the traditional restless multi-arm bandit problem with a risk-neutral objective by incorporating risk-awareness. We establish indexability conditions for the case of a risk-aware objective and provide a solution based on Whittle index. In addition, we address the learning problem when the true transition probabilities are unknown by proposing a Thompson sampling approach and show that it achieves bounded regret that scales sublinearly with the number of episodes and quadratically with the number of arms. The efficacy of our method in reducing risk exposure in restless multi-arm bandits is illustrated through a set of numerical experiments.
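The learning component above (Thompson sampling under unknown probabilities) can be illustrated with a deliberately simplified sketch: plain Thompson sampling for Bernoulli arms with Beta posteriors. This drops the restless Markov dynamics, the risk-aware objective, and the Whittle index entirely; the arm probabilities and horizon are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (unknown to the learner) success probabilities of 3 arms.
true_p = np.array([0.2, 0.5, 0.8])
K, T = len(true_p), 2000

alpha, beta = np.ones(K), np.ones(K)   # Beta posterior parameters per arm
pulls = np.zeros(K, dtype=int)

for _ in range(T):
    theta = rng.beta(alpha, beta)      # posterior sampling step
    k = int(np.argmax(theta))          # act greedily w.r.t. the sample
    reward = rng.random() < true_p[k]  # Bernoulli feedback
    alpha[k] += reward
    beta[k] += 1 - reward
    pulls[k] += 1

print(pulls)  # the best arm (index 2) should dominate the pull counts
```

The paper's version maintains posteriors over full transition matrices and proves sublinear regret in episodes; the concentration-of-pulls behavior sketched here is the same mechanism in miniature.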

[LG-16] Scoring Rules and Calibration for Imprecise Probabilities

链接: https://arxiv.org/abs/2410.23001
作者: Christian Fröhlich,Robert C. Williamson
关键词-EN: proper scoring rules, probability for rain, rain tomorrow, scoring rules, proper scoring
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:What does it mean to say that, for example, the probability for rain tomorrow is between 20% and 30%? The theory for the evaluation of precise probabilistic forecasts is well-developed and is grounded in the key concepts of proper scoring rules and calibration. For the case of imprecise probabilistic forecasts (sets of probabilities), such theory is still lacking. In this work, we therefore generalize proper scoring rules and calibration to the imprecise case. We develop these concepts as relative to data models and decision problems. As a consequence, the imprecision is embedded in a clear context. We establish a close link to the paradigm of (group) distributional robustness and in doing so provide new insights for it. We argue that proper scoring rules and calibration serve two distinct goals, which are aligned in the precise case, but intriguingly are not necessarily aligned in the imprecise case. The concept of decision-theoretic entropy plays a key role for both goals. Finally, we demonstrate the theoretical insights in machine learning practice, in particular we illustrate subtle pitfalls relating to the choice of loss function in distributional robustness.
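As background for the machinery the paper generalizes, here is a minimal sketch of a precise proper scoring rule, the Brier score, showing numerically that reporting the true probability minimizes the expected score (the imprecise, set-valued case studied in the paper is not covered by this sketch).

```python
import numpy as np

def brier(p, y):
    """Brier score of forecast p for binary outcome y; lower is better."""
    return (p - y) ** 2

def expected_brier(p, q):
    """Expected score when the true rain probability is q but we report p."""
    return q * brier(p, 1) + (1 - q) * brier(p, 0)

q = 0.3                                # true probability of rain
grid = np.linspace(0, 1, 101)
best_p = min(grid, key=lambda p: expected_brier(p, q))
print(best_p)  # propriety: the expected score is minimized by reporting p = q
```

Propriety is exactly this property: honest reporting is optimal in expectation. The paper asks what the analogue is when the forecast is a set of probabilities such as [20%, 30%].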

[LG-17] Dynamic Matching with Post-allocation Service and its Application to Refugee Resettlement

链接: https://arxiv.org/abs/2410.22992
作者: Kirk Bansak,Soonbong Lee,Vahideh Manshadi,Rad Niazadeh,Elisabeth Paulson
关键词-EN: fixed annual quota, major refugee resettlement, dynamic matching problem, annual quota, matched immediately
类目: Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Preliminary conference version appeared in ACM Economics and Computation (EC 2024)

点击查看摘要

Abstract:Motivated by our collaboration with a major refugee resettlement agency in the U.S., we study a dynamic matching problem where each new arrival (a refugee case) must be matched immediately and irrevocably to one of the static resources (a location with a fixed annual quota). In addition to consuming the static resource, each case requires post-allocation service from a server, such as a translator. Given the time-consuming nature of service, a server may not be available at a given time, thus we refer to it as a dynamic resource. Upon matching, the case waits to receive service in a first-come, first-served manner. Bursty matching to a location may result in undesirable congestion at its corresponding server. Consequently, the central planner (the agency) faces a dynamic matching problem with an objective that combines the matching reward (captured by pair-specific employment outcomes) with the costs of congestion for dynamic resources and over-allocation for the static ones. Motivated by the observed fluctuations in the composition of refugee pools across the years, we design algorithms that do not rely on distributional knowledge constructed from past years' data. To that end, we develop learning-based algorithms that are asymptotically optimal in certain regimes, easy to interpret, and computationally fast. Our design is based on learning the dual variables of the underlying optimization problem; however, the main challenge lies in the time-varying nature of the dual variables associated with dynamic resources. To overcome this challenge, our theoretical development brings together techniques from Lyapunov analysis, adversarial online learning, and stochastic optimization. On the application side, when tested on real data from our partner agency, our method outperforms existing ones, making it a viable candidate for replacing the current practice upon experimentation.
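The dual-variable-based design can be sketched as a toy online allocation rule: each arrival is matched to the location maximizing reward minus a learned dual price, and prices rise for locations consuming their quota too quickly. The numbers, step size, and update rule below are illustrative choices, not the paper's algorithm.

```python
# Toy dual-based online matching over 2 locations and 4 arrivals.
quotas = [2, 2]                  # per-location annual quotas (toy numbers)
duals = [0.0, 0.0]               # learned shadow prices for static resources
used = [0, 0]
eta = 0.1                        # dual step size

arrivals = [[0.9, 0.5], [0.8, 0.6], [0.7, 0.65], [0.4, 0.6]]  # match rewards
assignment = []
for rewards in arrivals:
    # Primal decision: best reward adjusted by the current dual prices.
    j = max(range(2), key=lambda k: rewards[k] - duals[k])
    assignment.append(j)
    used[j] += 1
    # Dual ascent: penalize locations filling faster than their fair share.
    t = len(assignment)
    for k in range(2):
        duals[k] = max(0.0, duals[k] + eta * (used[k] / t - quotas[k] / len(arrivals)))

print(assignment)  # pricing spreads cases across both locations: [0, 0, 1, 1]
```

Without the dual adjustment, every arrival here would chase location 0 and blow its quota; the rising price redirects later arrivals, which is the intuition behind learning the duals online.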

[LG-18] V2X-Assisted Distributed Computing and Control Framework for Connected and Automated Vehicles under Ramp Merging Scenario

链接: https://arxiv.org/abs/2410.22987
作者: Qiong Wu,Jiahou Chu,Pingyi Fan,Kezhi Wang,Nan Cheng,Wen Chen,Khaled B. Letaief
关键词-EN: ramp merging scenario, paper investigates distributed, transportation cyber-physical system, ramp merging, merging scenario
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: This paper has been submitted to IEEE Journal. The source code has been released at: this https URL

点击查看摘要

Abstract:This paper investigates distributed computing and cooperative control of connected and automated vehicles (CAVs) in a ramp merging scenario under a transportation cyber-physical system. Firstly, a centralized cooperative trajectory planning problem is formulated subject to the safety constraints and traffic performance in the ramp merging scenario, where the trajectories of all vehicles are jointly optimized. To get rid of the reliance on a central controller and reduce computation time, a distributed solution to this problem implemented among CAVs through Vehicle-to-Everything (V2X) communication is proposed. Unlike existing methods, our method can distribute the computational task among CAVs and carry out parallel solving through V2X communication. Then, a multi-vehicle model predictive control (MPC) problem aimed at maximizing system stability and minimizing control input is formulated based on the solution of the first problem, subject to strict safety constraints and input limits. Due to these complex constraints, this problem becomes high-dimensional, centralized, and non-convex. To solve it in a short time, a decomposition and convex reformulation method, namely distributed cooperative iterative model predictive control (DCIMPC), is proposed. This method leverages the communication capability of CAVs to decompose the problem, making full use of the computational resources on vehicles to achieve fast solutions and distributed control. The two problems above, with their corresponding solving methods, form the systemic framework of V2X-assisted distributed computing and control. Simulations have been conducted to evaluate the framework's convergence, safety, and solving speed. Additionally, extra experiments are conducted to validate the performance of DCIMPC. The results show that our method can greatly improve computation speed without sacrificing system performance.

[LG-19] Dual-Optimized Adaptive Graph Reconstruction for Multi-View Graph Clustering ACM-MM2024

链接: https://arxiv.org/abs/2410.22983
作者: Zichen Wen,Tianyi Wu,Yazhou Ren,Yawen Ling,Chenhang Cui,Xiaorong Pu,Lifang He
关键词-EN: important machine learning, machine learning task, multi-view graph clustering, graph, encompassing various domains
类目: Machine Learning (cs.LG)
*备注: Accepted by ACM MM 2024

点击查看摘要

Abstract:Multi-view clustering is an important machine learning task for multi-media data, encompassing various domains such as images, videos, and texts. Moreover, with the growing abundance of graph data, the significance of multi-view graph clustering (MVGC) has become evident. Most existing methods focus on graph neural networks (GNNs) to extract information from both graph structure and feature data to learn distinguishable node representations. However, traditional GNNs are designed with the assumption of homophilous graphs, making them unsuitable for widely prevalent heterophilous graphs. Several techniques have been introduced to enhance GNNs for heterophilous graphs. While these methods partially mitigate the heterophilous graph issue, they often neglect the advantages of traditional GNNs, such as their simplicity, interpretability, and efficiency. In this paper, we propose a novel multi-view graph clustering method based on dual-optimized adaptive graph reconstruction, named DOAGC. It mainly aims to reconstruct the graph structure adapted to traditional GNNs to deal with heterophilous graph issues while maintaining the advantages of traditional GNNs. Specifically, we first develop an adaptive graph reconstruction mechanism that accounts for node correlation and original structural information. To further optimize the reconstruction graph, we design a dual optimization strategy and demonstrate the feasibility of our optimization strategy through mutual information theory. Numerous experiments demonstrate that DOAGC effectively mitigates the heterophilous graph problem.

[LG-20] DisenTS: Disentangled Channel Evolving Pattern Modeling for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2410.22981
作者: Zhiding Liu,Jiqian Yang,Qingyang Mao,Yuze Zhao,Mingyue Cheng,Zhi Li,Qi Liu,Enhong Chen
关键词-EN: Multivariate time series, time series forecasting, series forecasting plays, time series, real-world applications
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate time series forecasting plays a crucial role in various real-world applications. Significant efforts have been made to integrate advanced network architectures and training strategies that enhance the capture of temporal dependencies, thereby improving forecasting accuracy. On the other hand, mainstream approaches typically utilize a single unified model with simplistic channel-mixing embedding or cross-channel attention operations to account for the critical intricate inter-channel dependencies. Moreover, some methods even trade capacity for robust prediction based on the channel-independent assumption. Nonetheless, as time series data may display distinct evolving patterns due to the unique characteristics of each channel (including multiple strong seasonalities and trend changes), the unified modeling methods could yield suboptimal results. To this end, we propose DisenTS, a tailored framework for modeling disentangled channel evolving patterns in general multivariate time series forecasting. The central idea of DisenTS is to model the potential diverse patterns within the multivariate time series data in a decoupled manner. Technically, the framework employs multiple distinct forecasting models, each tasked with uncovering a unique evolving pattern. To guide the learning process without supervision of pattern partition, we introduce a novel Forecaster Aware Gate (FAG) module that generates the routing signals adaptively according to both the forecasters’ states and input series’ characteristics. The forecasters’ states are derived from the Linear Weight Approximation (LWA) strategy, which quantizes the complex deep neural networks into compact matrices. Additionally, the Similarity Constraint (SC) is further proposed to guide each model to specialize in an underlying pattern by minimizing the mutual information between the representations.
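The routing idea behind the FAG module can be sketched as a softmax gate that weights several distinct linear forecasters based on summary features of the input window. All shapes, features, and initializations below are hypothetical toy choices, not DisenTS internals.

```python
import numpy as np

rng = np.random.default_rng(1)

L, H, M = 24, 8, 3          # lookback, horizon, number of forecasters (toy)
x = rng.normal(size=L)      # one input series window

# Each forecaster is a distinct linear map, standing in for one pattern model.
forecasters = [rng.normal(scale=0.1, size=(H, L)) for _ in range(M)]

# Gate: routing logits from simple summary features (an illustrative choice).
features = np.array([x.mean(), x.std(), x[-1] - x[0]])
gate_w = rng.normal(scale=0.1, size=(M, features.size))
logits = gate_w @ features
weights = np.exp(logits - logits.max())
weights /= weights.sum()    # softmax routing signal over forecasters

# Final forecast: gate-weighted combination of the per-pattern forecasts.
forecast = sum(w * (W @ x) for w, W in zip(weights, forecasters))
print(forecast.shape, weights)
```

In the paper the gate additionally conditions on the forecasters' states (via the LWA quantization) and the forecasters are encouraged to specialize by a mutual-information constraint, neither of which this sketch models.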

[LG-21] Dynamic Threshold-based Two-layer Online Unsupervised Anomaly Detector

链接: https://arxiv.org/abs/2410.22967
作者: Yachao Yuan,Yu Huang,Yali Yuan,Jin Wang
关键词-EN: Internet of Things, Anomaly Detection Systems, Detection Systems, Adaptive NAD, develop Anomaly Detection
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The proliferation of the Internet of Things (IoT) has heightened the vulnerability to cyber threats, making it imperative to develop Anomaly Detection Systems (ADSs) capable of adapting to emerging or novel attacks. Prior research has predominantly concentrated on offline unsupervised learning techniques to protect ADSs, which are impractical for real-world applications. Furthermore, these studies often rely heavily on the assumption of known legitimate behaviors and fall short of meeting the interpretability requirements in security contexts, thereby hindering their practical adoption. In response, this paper introduces Adaptive NAD, a comprehensive framework aimed at enhancing and interpreting online unsupervised anomaly detection within security domains. We propose an interpretable two-layer anomaly detection approach that generates dependable, high-confidence pseudo-labels. Subsequently, we incorporate an online learning mechanism that updates Adaptive NAD using an innovative threshold adjustment method to accommodate new threats. Experimental findings reveal that Adaptive NAD surpasses state-of-the-art solutions by achieving improvements of over 5.4% and 23.0% in SPAUC on the CIC-Darknet2020 and CIC-DoHBrw-2020 datasets, respectively. The code for Adaptive NAD is publicly available at this https URL.
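The online threshold-adjustment idea can be illustrated generically: score a stream, maintain a sliding-window quantile as a dynamic threshold, and flag points above it. This is a minimal stand-in, not Adaptive NAD's actual two-layer mechanism; the score distribution and window sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

window, flags = [], []
for t in range(500):
    s = 10.0 if t == 400 else abs(rng.normal())  # score stream with one spike
    if len(window) >= 50:
        # Dynamic threshold: a high quantile of recent benign-looking scores.
        threshold = np.quantile(window[-200:], 0.99)
        flags.append(s > threshold)
    else:
        flags.append(False)                      # warm-up period: no decision
    window.append(s)

print(flags[400])  # True: the injected spike exceeds the adaptive threshold
```

Because the threshold is re-estimated from recent data, it tracks drift in the benign score distribution without labels, which is the property an online unsupervised detector needs.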

[LG-22] Scalable Sampling for High Utility Patterns

链接: https://arxiv.org/abs/2410.22964
作者: Lamine Diop,Marc Plantevit
关键词-EN: Discovering valuable insights, Discovering valuable, crucial task, insights from data, data through meaningful
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: Accepted at 2024 IEEE International Conference on Big Data

点击查看摘要

Abstract:Discovering valuable insights from data through meaningful associations is a crucial task. However, it becomes challenging when trying to identify representative patterns in quantitative databases, especially with large datasets, as enumeration-based strategies struggle due to the vast search space involved. To tackle this challenge, output space sampling methods have emerged as a promising solution thanks to their ability to discover valuable patterns with reduced computational overhead. However, existing sampling methods often encounter limitations when dealing with large quantitative databases, resulting in scalability-related challenges. In this work, we propose a novel high utility pattern sampling algorithm and its on-disk version, both designed for large quantitative databases and based on two original theorems. Our approach ensures both the interactivity required for user-centered methods and strong statistical guarantees through random sampling. Thanks to our method, users can instantly discover relevant and representative utility patterns, facilitating efficient exploration of the database within seconds. To demonstrate the interest of our approach, we present a compelling use case involving archaeological knowledge graph sub-profile discovery. Experiments on semantic and non-semantic quantitative databases show that our approach outperforms the state-of-the-art methods.
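Output-space sampling can be illustrated with a classic two-step scheme adapted naively to utilities: draw a transaction with probability proportional to its total utility, then draw a random itemset from it. This sketch ignores the paper's theorems and on-disk machinery, and the toy database is made up.

```python
import random

random.seed(3)

# Toy quantitative database: each transaction maps items to utility values.
db = [
    {"a": 5, "b": 1},
    {"a": 2, "c": 7},
    {"b": 3, "c": 4, "d": 1},
]

def sample_pattern(db):
    # Step 1: pick a transaction with probability proportional to its utility.
    weights = [sum(t.values()) for t in db]
    t = random.choices(db, weights=weights, k=1)[0]
    # Step 2: draw a non-empty random itemset (pattern) from that transaction.
    items = list(t)
    pattern = {i for i in items if random.random() < 0.5}
    return pattern or {random.choice(items)}

patterns = [sample_pattern(db) for _ in range(5)]
print(patterns)
```

The appeal over enumeration is that each draw costs time proportional to one transaction, not to the exponential pattern space; the paper's contribution is making such sampling exact and scalable for utility-based interestingness.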

[LG-23] A Study of Secure Algorithms for Vertical Federated Learning: Take Secure Logistic Regression as an Example

链接: https://arxiv.org/abs/2410.22960
作者: Huan-Chih Wang,Ja-Ling Wu
关键词-EN: companies build services, machine learning techniques, entering the era, era of big, build services
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: accepted by the 20th International Conference on Security Management (SAM 2021)

点击查看摘要

Abstract:After entering the era of big data, more and more companies build services with machine learning techniques. However, it is costly for companies to collect data and extract helpful handcrafted features on their own. Although combining with other companies' data can boost a model's performance, this approach may be prohibited by law. In other words, finding the balance between sharing data with others and keeping data from privacy leakage is a crucial topic worthy of close attention. This paper focuses on distributed data and conducts secure model training tasks on a vertical federated learning scheme. Here, secure implies that the whole process is executed in the encrypted domain. Therefore, the privacy concern is relieved.
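The vertical split can be shown in plaintext: each party computes a partial score over its own feature columns, and only the aggregated logit is needed for prediction. The real scheme runs this in the encrypted domain, which this sketch omits entirely; all shapes and weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 6
X_a = rng.normal(size=(n, 2))   # party A's feature columns
X_b = rng.normal(size=(n, 3))   # party B's feature columns (same samples)
w_a = rng.normal(size=2)        # each party holds its own model weights
w_b = rng.normal(size=3)
b = 0.1

# Each party computes only a partial score; raw features never leave home.
z_a = X_a @ w_a
z_b = X_b @ w_b
logits = z_a + z_b + b          # aggregated (under encryption in the paper)
probs = 1 / (1 + np.exp(-logits))

# Sanity check: identical to a centralized model on concatenated features.
X = np.concatenate([X_a, X_b], axis=1)
w = np.concatenate([w_a, w_b])
central = 1 / (1 + np.exp(-(X @ w + b)))
print(np.allclose(probs, central))
```

Because the logit decomposes additively across feature partitions, each party's share can be exchanged as a ciphertext and summed homomorphically, which is what makes logistic regression a natural fit for vertical federated learning.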

[LG-24] Retrieval-Augmented Generation with Estimation of Source Reliability

链接: https://arxiv.org/abs/2410.22954
作者: Jeongyeon Hwang,Junyoung Park,Hyejin Park,Sangdon Park,Jungseul Ok
关键词-EN: addresses key limitations, large language models, Retrieval-augmented generation, incorporating external databases, addresses key
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) addresses key limitations of large language models (LLMs), such as hallucinations and outdated knowledge, by incorporating external databases. These databases typically consult multiple sources to encompass up-to-date and varied information. However, standard RAG methods often overlook the heterogeneous source reliability in the multi-source database and retrieve documents solely based on relevance, making them prone to propagating misinformation. To address this, we propose Reliability-Aware RAG (RA-RAG), which estimates the reliability of multiple sources and incorporates this information into both retrieval and aggregation processes. Specifically, it iteratively estimates source reliability and true answers for a set of queries without labeling. Then, it selectively retrieves relevant documents from a few reliable sources and aggregates them using weighted majority voting, where the selective retrieval ensures scalability without compromising performance. We also introduce a benchmark designed to reflect real-world scenarios with heterogeneous source reliability and demonstrate the effectiveness of RA-RAG compared to a set of baselines.
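The iterative reliability/consensus loop can be sketched EM-style: start from uniform source weights, form weighted-majority answers, then set each source's reliability to its agreement rate with the consensus. The toy answers below are fabricated for illustration, and this update rule is a simplification of RA-RAG's.

```python
from collections import Counter

# Fabricated answers from 3 sources over 5 queries; source 2 is unreliable.
answers = [
    ["paris", "paris", "rome"],
    ["h2o",   "h2o",   "co2"],
    ["mars",  "mars",  "mars"],
    ["blue",  "blue",  "red"],
    ["7",     "7",     "9"],
]

S = len(answers[0])
weights = [1.0] * S
for _ in range(5):                       # alternate consensus and reliability
    consensus = []
    for row in answers:
        tally = Counter()
        for s, a in enumerate(row):
            tally[a] += weights[s]       # reliability-weighted vote
        consensus.append(tally.most_common(1)[0][0])
    # Reliability estimate: agreement rate with the current consensus.
    weights = [
        sum(row[s] == c for row, c in zip(answers, consensus)) / len(answers)
        for s in range(S)
    ]

print(weights)  # sources 0 and 1 converge to 1.0, source 2 to 0.2
```

No labels are needed: agreement among sources bootstraps both the consensus answers and the per-source weights, and downstream retrieval can then favor the high-weight sources.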

[LG-25] MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering NEURIPS2024

链接: https://arxiv.org/abs/2410.22949
作者: Yizhen Luo,Zikun Nie,Massimo Hong,Suyuan Zhao,Hao Zhou,Zaiqing Nie
关键词-EN: amino acid sequences, acid sequences holds, sequences holds tremendous, holds tremendous significance, Studying protein mutations
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: NeurIPS 2024 poster

点击查看摘要

Abstract:Studying protein mutations within amino acid sequences holds tremendous significance in life sciences. Protein language models (PLMs) have demonstrated strong capabilities in broad biological applications. However, due to architectural design and lack of supervision, PLMs model mutations implicitly with evolutionary plausibility, which is not satisfactory to serve as explainable and engineerable tools in real-world studies. To address these issues, we present MutaPLM, a unified framework for interpreting and navigating protein mutations with protein language models. MutaPLM introduces a protein delta network that captures explicit protein mutation representations within a unified feature space, and a transfer learning pipeline with a chain-of-thought (CoT) strategy to harvest protein mutation knowledge from biomedical texts. We also construct MutaDescribe, the first large-scale protein mutation dataset with rich textual annotations, which provides cross-modal supervision signals. Through comprehensive experiments, we demonstrate that MutaPLM excels at providing human-understandable explanations for mutational effects and prioritizing novel mutations with desirable properties. Our code, model, and data are open-sourced at this https URL.

[LG-26] ELBOing Stein: Variational Bayes with Stein Mixture Inference

链接: https://arxiv.org/abs/2410.22948
作者: Ola Rønning,Eric Nalisnick,Christophe Ley,Padhraic Smyth,Thomas Hamelryck
关键词-EN: approximate Bayesian inference, variational gradient descent, performs approximate Bayesian, Stein variational gradient, gradient descent
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Stein variational gradient descent (SVGD) [Liu and Wang, 2016] performs approximate Bayesian inference by representing the posterior with a set of particles. However, SVGD suffers from variance collapse, i.e. poor predictions due to underestimating uncertainty [Ba et al., 2021], even for moderately-dimensional models such as small Bayesian neural networks (BNNs). To address this issue, we generalize SVGD by letting each particle parameterize a component distribution in a mixture model. Our method, Stein Mixture Inference (SMI), optimizes a lower bound to the evidence (ELBO) and introduces user-specified guides parameterized by particles. SMI extends the Nonlinear SVGD framework [Wang and Liu, 2019] to the case of variational Bayes. SMI effectively avoids variance collapse, judging by a previously described test developed for this purpose, and performs well on standard data sets. In addition, SMI requires considerably fewer particles than SVGD to accurately estimate uncertainty for small BNNs. The synergistic combination of NSVGD, ELBO optimization and user-specified guides establishes a promising approach towards variational Bayesian inference in the case of tall and wide data.
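For orientation, the SVGD baseline that SMI generalizes moves each particle along a kernelized gradient of the log density: an attractive term drives particles toward high probability while a repulsive kernel-gradient term keeps them spread out. A minimal 1D version with an RBF kernel and a standard normal target follows; the bandwidth, step size, and particle count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def svgd_step(x, grad_logp, h=0.5, lr=0.1):
    """One SVGD update for 1D particles x; RBF kernel with bandwidth h."""
    diff = x[:, None] - x[None, :]            # diff[i, j] = x_i - x_j
    K = np.exp(-diff**2 / (2 * h**2))         # kernel matrix k(x_j, x_i)
    drive = K @ grad_logp                     # attraction toward high density
    repulse = (diff / h**2 * K).sum(axis=1)   # sum_j d/dx_j k(x_j, x_i)
    return x + lr * (drive + repulse) / len(x)

x = rng.uniform(2.0, 4.0, size=50)            # particles start off-target
for _ in range(1000):
    x = svgd_step(x, grad_logp=-x)            # target N(0,1): grad log p = -x
print(x.mean(), x.std())
```

SMI replaces each point particle with the parameters of a mixture component and optimizes an ELBO instead, precisely to counter the variance collapse that plain SVGD can exhibit in this kind of setup.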

[LG-27] KALAM: toolKit for Automating high-Level synthesis of Analog computing systeMs

链接: https://arxiv.org/abs/2410.22946
作者: Ankita Nandi,Krishil Gandhi,Mahendra Pratap Singh,Shantanu Chakrabartty,Chetan Singh Thakur
关键词-EN: Diverse computing paradigms, Diverse computing, Analog computing systeMs, analog computing, emerged to meet
类目: Systems and Control (eess.SY); Hardware Architecture (cs.AR); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 5 Pages, 4 figures

点击查看摘要

Abstract:Diverse computing paradigms have emerged to meet the growing needs for intelligent energy-efficient systems. The Margin Propagation (MP) framework, being one such initiative in the analog computing domain, stands out due to its scalability across biasing conditions, temperatures, and diminishing process technology nodes. However, the lack of digital-like automation tools for designing analog systems (including that of MP analog) hinders their adoption for designing large systems. The inherent scalability and modularity of MP systems present a unique opportunity in this regard. This paper introduces KALAM (toolKit for Automating high-Level synthesis of Analog computing systeMs), which leverages factor graphs as the foundational paradigm for synthesizing MP-based analog computing systems. Factor graphs are the basis of various signal processing tasks and, when coupled with MP, can be used to design scalable and energy-efficient analog signal processors. Using Python scripting language, the KALAM automation flow translates an input factor graph to its equivalent SPICE-compatible circuit netlist that can be used to validate the intended functionality. KALAM also allows the integration of design optimization strategies such as precision tuning, variable elimination, and mathematical simplification. We demonstrate KALAM’s versatility for tasks such as Bayesian inference, Low-Density Parity Check (LDPC) decoding, and Artificial Neural Networks (ANN). Simulation results of the netlists align closely with software implementations, affirming the efficacy of our proposed automation tool.

[LG-28] Simulation-Free Training of Neural ODEs on Paired Data

链接: https://arxiv.org/abs/2410.22918
作者: Semin Kim,Jaehoon Yoo,Jinwoo Kim,Yeonwoo Cha,Saehoon Kim,Seunghoon Hong
关键词-EN: Ordinary Differential Equations, Neural Ordinary Differential, Differential Equations, Neural Ordinary, Ordinary Differential
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we investigate a method for simulation-free training of Neural Ordinary Differential Equations (NODEs) for learning deterministic mappings between paired data. Despite the analogy of NODEs as continuous-depth residual networks, their application in typical supervised learning tasks has not been popular, mainly due to the large number of function evaluations required by ODE solvers and numerical instability in gradient estimation. To alleviate this problem, we employ the flow matching framework for simulation-free training of NODEs, which directly regresses the parameterized dynamics function to a predefined target velocity field. Contrary to generative tasks, however, we show that applying flow matching directly between paired data can often lead to an ill-defined flow that breaks the coupling of the data pairs (e.g., due to crossing trajectories). We propose a simple extension that applies flow matching in the embedding space of data pairs, where the embeddings are learned jointly with the dynamic function to ensure the validity of the flow which is also easier to learn. We demonstrate the effectiveness of our method on both regression and classification tasks, where our method outperforms existing NODEs with a significantly lower number of function evaluations. The code is available at this https URL.
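The flow-matching objective used here is easy to state concretely: sample t, form the straight-line point x_t = (1-t)·x0 + t·x1, and regress the velocity model onto the constant target x1 - x0. A 1D toy with a linear model follows; the paper applies this in a jointly learned embedding space (to avoid crossing trajectories), which is omitted, and all model/data choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Paired 1D data: each source point maps to itself shifted by 1 (toy pairs).
x0 = rng.normal(size=256)
x1 = x0 + 1.0

# Tiny linear velocity model v(x, t) = a*x + b*t + c.
params = np.zeros(3)

def v(x, t, p):
    return p[0] * x + p[1] * t + p[2]

lr = 0.05
for _ in range(2000):
    t = rng.uniform(size=x0.shape)
    xt = (1 - t) * x0 + t * x1      # point on the straight path x0 -> x1
    target = x1 - x0                # flow-matching regression target
    err = v(xt, t, params) - target
    # Gradient of the mean squared error w.r.t. (a, b, c).
    params -= lr * np.array([(err * xt).mean(), (err * t).mean(), err.mean()])

# The learned field should be close to the constant velocity 1.
test_t = rng.uniform(size=x0.shape)
loss = np.mean((v((1 - test_t) * x0 + test_t * x1, test_t, params) - 1.0) ** 2)
print(params, loss)
```

No ODE solver is invoked during training, which is the "simulation-free" property: the target velocity is available in closed form from the data pair itself.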

[LG-29] CopRA: A Progressive LoRA Training Strategy NEURIPS2024

链接: https://arxiv.org/abs/2410.22911
作者: Zhan Zhuang,Xiequn Wang,Yulong Zhang,Wei Li,Yu Zhang,Ying Wei
关键词-EN: Low-Rank Adaptation, rapidly fine-tuning foundation, fine-tuning foundation models, parameter-efficient technique, technique for rapidly
类目: Machine Learning (cs.LG)
*备注: Published in UniReps Workshop (Extended Abstract Track), NeurIPS 2024

点击查看摘要

Abstract:Low-Rank Adaptation (LoRA) is a parameter-efficient technique for rapidly fine-tuning foundation models. In standard LoRA training dynamics, models tend to quickly converge to a local optimum near the initialization. However, this local optimum may not be ideal for out-of-distribution data or tasks such as merging and pruning. In this work, we propose a novel progressive training strategy for LoRA with random layer dropping. This strategy also optimizes the Shapley value of LoRA parameters in each layer, treating each layer as a player in a cooperative game. We refer to this method as Cooperative LoRA (CopRA). Our experimental results demonstrate that parameters trained with CopRA exhibit linear mode connectivity, which enables efficient model merging. This also paves the way for federated learning and multi-task learning via LoRA merging. Additionally, by optimizing the Shapley value, CopRA shows superior performance in pruning tasks.

[LG-30] Federated UCBVI: Communication-Efficient Federated Regret Minimization with Heterogeneous Agents

链接: https://arxiv.org/abs/2410.22908
作者: Safwan Labbi,Daniil Tiapkin,Lorenzo Mancini,Paul Mangold,Eric Moulines
关键词-EN: Upper Confidence Bound, Federated Upper Confidence, Iteration algorithm, Upper Confidence, Confidence Bound
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we present the Federated Upper Confidence Bound Value Iteration algorithm ($\texttt{Fed-UCBVI}$), a novel extension of the $\texttt{UCBVI}$ algorithm (Azar et al., 2017) tailored for the federated learning framework. We prove that the regret of $\texttt{Fed-UCBVI}$ scales as $\tilde{\mathcal{O}}(\sqrt{H^3 |\mathcal{S}| |\mathcal{A}| T / M})$, with a small additional term due to heterogeneity, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, $H$ is the episode length, $M$ is the number of agents, and $T$ is the number of episodes. Notably, in the single-agent setting, this upper bound matches the minimax lower bound up to polylogarithmic factors, while in the multi-agent scenario, $\texttt{Fed-UCBVI}$ has linear speed-up. To conduct our analysis, we introduce a new measure of heterogeneity, which may hold independent theoretical interest. Furthermore, we show that, unlike existing federated reinforcement learning approaches, $\texttt{Fed-UCBVI}$'s communication complexity only marginally increases with the number of agents.

[LG-31] Data subsampling for Poisson regression with pth-root-link NEURIPS2024

链接: https://arxiv.org/abs/2410.22872
作者: Han Cheng Lie,Alexander Munteanu
关键词-EN: Poisson regression, analyze data subsampling, Poisson, data subsampling techniques, develop and analyze
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: NeurIPS 2024

点击查看摘要

Abstract:We develop and analyze data subsampling techniques for Poisson regression, the standard model for count data $y \in \mathbb{N}$. In particular, we consider the Poisson generalized linear model with ID- and square root-link functions. We consider the method of coresets, which are small weighted subsets that approximate the loss function of Poisson regression up to a factor of $1 \pm \varepsilon$. We show $\Omega(n)$ lower bounds against coresets for Poisson regression that continue to hold against arbitrary data reduction techniques up to logarithmic factors. By introducing a novel complexity parameter and a domain shifting approach, we show that sublinear coresets with $1 \pm \varepsilon$ approximation guarantee exist when the complexity parameter is small. In particular, the dependence on the number of input points can be reduced to polylogarithmic. We show that the dependence on other input parameters can also be bounded sublinearly, though not always logarithmically. In particular, we show that the square root-link admits an $O(\log(y_{\max}))$ dependence, where $y_{\max}$ denotes the largest count presented in the data, while the ID-link requires a $\Theta(\sqrt{y_{\max}}/\log(y_{\max}))$ dependence. As an auxiliary result for proving the tightness of the bound with respect to $y_{\max}$ in the case of the ID-link, we show an improved bound on the principal branch $W_0$ of the Lambert function, which may be of independent interest. We further show the limitations of our analysis when $p$th-root-link functions for $p \geq 3$ are considered, which indicate that other analytical or computational methods would be required if such a generalization is even possible.

[LG-32] MILP-StuDio: MILP Instance Generation via Block Structure Decomposition NEURIPS2024

链接: https://arxiv.org/abs/2410.22806
作者: Haoyang Liu,Jie Wang,Wanbo Zhang,Zijie Geng,Yufei Kuang,Xijun Li,Bin Li,Yongdong Zhang,Feng Wu
关键词-EN: Mixed-integer linear programming, Mixed-integer linear, popular mathematical formulations, MILP, linear programming
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
*备注: Published in the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Mixed-integer linear programming (MILP) is one of the most popular mathematical formulations with numerous applications. In practice, improving the performance of MILP solvers often requires a large amount of high-quality data, which can be challenging to collect. Researchers thus turn to generation techniques to generate additional MILP instances. However, existing approaches do not take into account specific block structures – which are closely related to the problem formulations – in the constraint coefficient matrices (CCMs) of MILPs. Consequently, they are prone to generate computationally trivial or infeasible instances due to the disruptions of block structures and thus problem formulations. To address this challenge, we propose a novel MILP generation framework, called Block Structure Decomposition (MILP-StuDio), to generate high-quality instances by preserving the block structures. Specifically, MILP-StuDio begins by identifying the blocks in CCMs and decomposing the instances into block units, which serve as the building blocks of MILP instances. We then design three operators to construct new instances by removing, substituting, and appending block units in the original instances, enabling us to generate instances with flexible sizes. An appealing feature of MILP-StuDio is its strong ability to preserve the feasibility and computational hardness of the generated instances. Experiments on the commonly-used benchmarks demonstrate that using instances generated by MILP-StuDio is able to significantly reduce over 10% of the solving time for learning-based solvers.
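
The decomposition step can be illustrated with a small sketch: block units correspond to connected components of the bipartite constraint-variable graph read off the constraint coefficient matrix (CCM). The operators that remove, substitute, or append blocks are not shown; function and variable names are illustrative, not MILP-StuDio's API.

```python
import numpy as np

def block_units(ccm):
    """Split a constraint coefficient matrix into independent block units:
    connected components of the bipartite constraint-variable graph.
    Returns a list of (row_indices, col_indices) per block."""
    m, n = ccm.shape
    seen_rows = set()
    blocks = []
    for r0 in range(m):
        if r0 in seen_rows:
            continue
        rows, cols, stack = set(), set(), [("r", r0)]
        while stack:
            kind, i = stack.pop()
            if kind == "r":
                if i in rows:
                    continue
                rows.add(i)
                seen_rows.add(i)
                stack += [("c", j) for j in np.nonzero(ccm[i])[0] if j not in cols]
            else:
                if i in cols:
                    continue
                cols.add(i)
                stack += [("r", k) for k in np.nonzero(ccm[:, i])[0] if k not in rows]
        blocks.append((sorted(rows), sorted(cols)))
    return blocks

# A block-diagonal CCM with two independent blocks.
ccm = np.array([[1, 1, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 0, 1]])
blocks = block_units(ccm)
```

On a block-diagonal CCM this recovers the two blocks exactly; on real MILPs the paper's framework additionally recombines such units to generate new instances of flexible size.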

[LG-33] Solving Differential Equations with Constrained Learning

链接: https://arxiv.org/abs/2410.22796
作者: Viggo Moro,Luiz F. O. Chamon
关键词-EN: describing natural phenomena, differential equations, natural phenomena, science and engineering, fundamental tools
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:(Partial) differential equations (PDEs) are fundamental tools for describing natural phenomena, making their solution crucial in science and engineering. While traditional methods, such as the finite element method, provide reliable solutions, their accuracy is often tied to the use of computationally intensive fine meshes. Moreover, they do not naturally account for measurements or prior solutions, and any change in the problem parameters requires results to be fully recomputed. Neural network-based approaches, such as physics-informed neural networks and neural operators, offer a mesh-free alternative by directly fitting those models to the PDE solution. They can also integrate prior knowledge and tackle entire families of PDEs by simply aggregating additional training losses. Nevertheless, they are highly sensitive to hyperparameters such as collocation points and the weights associated with each loss. This paper addresses these challenges by developing a science-constrained learning (SCL) framework. It demonstrates that finding a (weak) solution of a PDE is equivalent to solving a constrained learning problem with worst-case losses. This explains the limitations of previous methods that minimize the expected value of aggregated losses. SCL also organically integrates structural constraints (e.g., invariances) and (partial) measurements or known solutions. The resulting constrained learning problems can be tackled using a practical algorithm that yields accurate solutions across a variety of PDEs, neural network architectures, and prior knowledge levels without extensive hyperparameter tuning and sometimes even at a lower computational cost.
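
The reduction to a constrained learning problem can be illustrated on a one-dimensional toy: gradient descent-ascent on the Lagrangian of min f(x) s.t. g(x) <= 0, the same primal-dual pattern a practical constrained-learning algorithm applies per constraint. This is a generic sketch, not the paper's SCL algorithm; all names are illustrative.

```python
def primal_dual(f_grad, g, g_grad, x0, eta=0.05, steps=4000):
    """Gradient descent-ascent on the Lagrangian L(x, lam) = f(x) + lam * g(x)
    for min f(x) subject to g(x) <= 0. The dual variable lam is projected
    onto lam >= 0 after each ascent step."""
    x, lam = x0, 0.0
    for _ in range(steps):
        x -= eta * (f_grad(x) + lam * g_grad(x))   # primal descent
        lam = max(0.0, lam + eta * g(x))           # projected dual ascent
    return x, lam

# Toy problem: min x^2 subject to x >= 1, i.e. g(x) = 1 - x <= 0.
# KKT conditions give the solution x* = 1 with multiplier lam* = 2.
x_star, lam_star = primal_dual(lambda x: 2 * x,
                               lambda x: 1 - x,
                               lambda x: -1.0,
                               x0=0.0)
```

In the PDE setting, x plays the role of network parameters and each g corresponds to a worst-case residual or measurement constraint.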

[LG-34] Theoretical Investigations and Practical Enhancements on Tail Task Risk Minimization in Meta Learning

链接: https://arxiv.org/abs/2410.22788
作者: Yiqin Lv,Qi Wang,Dong Liang,Zheng Xie
关键词-EN: task distributional robustness, Meta learning, real-world scenarios, promising paradigm, indispensable consideration
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Meta learning is a promising paradigm in the era of large models and task distributional robustness has become an indispensable consideration in real-world scenarios. Recent advances have examined the effectiveness of tail task risk minimization in fast adaptation robustness improvement (Wang et al., 2023). This work contributes to more theoretical investigations and practical enhancements in the field. Specifically, we reduce the distributionally robust strategy to a max-min optimization problem, constitute the Stackelberg equilibrium as the solution concept, and estimate the convergence rate. In the presence of tail risk, we further derive the generalization bound, establish connections with estimated quantiles, and practically improve the studied strategy. Accordingly, extensive evaluations demonstrate the significance of our proposal and its scalability to multimodal large models in boosting robustness.
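
The max-min, tail-risk objective can be sketched with a CVaR-style estimator over per-task losses: instead of averaging all tasks, average only the worst alpha-fraction. This is a common formulation of tail task risk, not necessarily the paper's exact quantile estimator.

```python
import numpy as np

def tail_task_risk(task_losses, alpha=0.3):
    """Average of the worst alpha-fraction of per-task losses (CVaR-style).
    With alpha = 1 this reduces to the ordinary mean over tasks; smaller
    alpha focuses the objective on the hardest tasks."""
    losses = np.sort(np.asarray(task_losses, dtype=float))[::-1]
    k = max(1, int(np.ceil(alpha * len(losses))))
    return float(losses[:k].mean())
```

For example, with task losses [0.1, 0.2, 0.3, 1.0, 2.0] and alpha = 0.4, the worst two tasks are averaged, so the tail risk is 1.5 while the plain mean is 0.72.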

[LG-35] Understanding Aggregations of Proper Learners in Multiclass Classification

链接: https://arxiv.org/abs/2410.22749
作者: Julian Asilis,Mikael Møller Høgsgaard,Grigoris Velegkas
关键词-EN: proper learners, epsilon, learners, Graph dimension, proper
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 23 pages

点击查看摘要

Abstract:Multiclass learnability is known to exhibit a properness barrier: there are learnable classes which cannot be learned by any proper learner. Binary classification faces no such barrier for learnability, but a similar one for optimal learning, which can in general only be achieved by improper learners. Fortunately, recent advances in binary classification have demonstrated that this requirement can be satisfied using aggregations of proper learners, some of which are strikingly simple. This raises a natural question: to what extent can simple aggregations of proper learners overcome the properness barrier in multiclass classification? We give a positive answer to this question for classes which have finite Graph dimension, $d_G$. Namely, we demonstrate that the optimal binary learners of Hanneke, Larsen, and Aden-Ali et al. (appropriately generalized to the multiclass setting) achieve sample complexity $O\left(\frac{d_G + \ln(1/\delta)}{\epsilon}\right)$. This forms a strict improvement upon the sample complexity of ERM. We complement this with a lower bound demonstrating that for certain classes of Graph dimension $d_G$, majorities of ERM learners require $\Omega\left(\frac{d_G + \ln(1/\delta)}{\epsilon}\right)$ samples. Furthermore, we show that a single ERM requires $\Omega\left(\frac{d_G \ln(1/\epsilon) + \ln(1/\delta)}{\epsilon}\right)$ samples on such classes, exceeding the lower bound of Daniely et al. (2015) by a factor of $\ln(1/\epsilon)$. For multiclass learning in full generality – i.e., for classes of finite DS dimension but possibly infinite Graph dimension – we give a strong refutation to these learning strategies, by exhibiting a learnable class which cannot be learned to constant error by any aggregation of a finite number of proper learners.
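
The simplest aggregation discussed above, a majority (plurality) vote over several proper learners, can be sketched as:

```python
import numpy as np
from collections import Counter

def majority_vote(predictions):
    """Plurality vote over the predictions of several (proper) learners.
    `predictions` has shape (n_learners, n_examples); ties are broken in
    favor of the smallest label. A sketch of the simplest aggregation
    scheme, not the paper's optimal construction."""
    predictions = np.asarray(predictions)
    out = []
    for col in predictions.T:                 # one column per example
        counts = Counter(col.tolist())
        out.append(max(sorted(counts), key=lambda c: counts[c]))
    return np.array(out)
```

For three learners predicting [0,1,2], [0,1,1], and [1,1,2], the aggregate prediction is [0, 1, 2]; the improper power of such a vote comes precisely from the fact that the aggregate need not lie in the original hypothesis class.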

[LG-36] MIXAD: Memory-Induced Explainable Time Series Anomaly Detection ICPR2024

链接: https://arxiv.org/abs/2410.22735
作者: Minha Kim,Kishor Kumar Bhaumik,Amin Ahsan Ali,Simon S. Woo
关键词-EN: modern industrial applications, multivariate time series, time series data, Explainable Time Series, industrial applications
类目: Machine Learning (cs.LG)
*备注: ICPR 2024 (oral paper)

点击查看摘要

Abstract:For modern industrial applications, accurately detecting and diagnosing anomalies in multivariate time series data is essential. Despite such need, most state-of-the-art methods often prioritize detection performance over model interpretability. Addressing this gap, we introduce MIXAD (Memory-Induced Explainable Time Series Anomaly Detection), a model designed for interpretable anomaly detection. MIXAD leverages a memory network alongside spatiotemporal processing units to understand the intricate dynamics and topological structures inherent in sensor relationships. We also introduce a novel anomaly scoring method that detects significant shifts in memory activation patterns during anomalies. Our approach not only ensures decent detection performance but also outperforms state-of-the-art baselines by 34.30% and 34.51% in interpretability metrics.
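
The idea of scoring anomalies by shifts in memory activation patterns can be sketched as a divergence between the current attention distribution over memory slots and a reference profile estimated from normal data. This is an illustrative form; MIXAD's exact scoring function may differ.

```python
import numpy as np

def memory_shift_score(attn, ref, eps=1e-8):
    """Anomaly score as the KL divergence between the current memory
    activation (attention weights over memory slots) and a reference
    activation profile from normal operation. Larger scores indicate a
    stronger shift in which memory slots are being used."""
    p = np.asarray(attn, dtype=float) + eps
    q = np.asarray(ref, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```

An activation profile identical to the reference scores near zero, while a profile concentrated on different memory slots scores high and would be flagged.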

[LG-37] Extensional Properties of Recurrent Neural Networks

链接: https://arxiv.org/abs/2410.22730
作者: Evgeny Dantsin,Alexander Wolpert
关键词-EN: recurrent neural network, loosely speaking, neural network, recurrent neural, function computed
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:A property of a recurrent neural network (RNN) is called \emphextensional if, loosely speaking, it is a property of the function computed by the RNN rather than a property of the RNN algorithm. Many properties of interest in RNNs are extensional, for example, robustness against small changes of input or good clustering of inputs. Given an RNN, it is natural to ask whether it has such a property. We give a negative answer to the general question about testing extensional properties of RNNs. Namely, we prove a version of Rice’s theorem for RNNs: any nontrivial extensional property of RNNs is undecidable.

[LG-38] Enhancing binary classification: A new stacking method via leveraging computational geometry

链接: https://arxiv.org/abs/2410.22722
作者: Wei Wu,Liang Tang,Zhongjie Zhao,Chung-Piaw Teo
关键词-EN: multiple base models, harness the strengths, base models, potent ensemble learning, established learning models
类目: Machine Learning (cs.LG); Computational Geometry (cs.CG)
*备注: 11 pages

点击查看摘要

Abstract:Stacking, a potent ensemble learning method, leverages a meta-model to harness the strengths of multiple base models, thereby enhancing prediction accuracy. Traditional stacking techniques typically utilize established learning models, such as logistic regression, as the meta-model. This paper introduces a novel approach that integrates computational geometry techniques, specifically solving the maximum weighted rectangle problem, to develop a new meta-model for binary classification. Our method is evaluated on multiple open datasets, with statistical analysis showing its stability and demonstrating improvements in accuracy compared to current state-of-the-art stacking methods with out-of-fold predictions. This new stacking method also boasts two significant advantages: enhanced interpretability and the elimination of hyperparameter tuning for the meta-model, thus increasing its practicality. These merits make our method highly applicable not only in stacking ensemble learning but also in various real-world applications, such as hospital health evaluation scoring and bank credit scoring systems, offering a fresh evaluation perspective.
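
A toy version of the maximum weighted rectangle meta-model: with two base models, search over axis-aligned rectangles in the space of their scores for the one enclosing maximum total weight (+1 for positive examples, -1 for negatives). This brute force is O(n^4) and purely illustrative; the paper's computational-geometry solver is far more efficient, and all names here are hypothetical.

```python
from itertools import combinations_with_replacement as cwr

def max_weighted_rectangle(points, weights):
    """Exhaustive search for the axis-aligned rectangle, in the 2-D space
    of two base models' scores, that encloses maximum total weight. With
    +1/-1 weights the best rectangle is an interpretable 'predict positive
    here' rule for the meta-model."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    best, best_rect = float("-inf"), None
    for x1, x2 in cwr(xs, 2):                  # x1 <= x2 by construction
        for y1, y2 in cwr(ys, 2):              # y1 <= y2 by construction
            tot = sum(w for (x, y), w in zip(points, weights)
                      if x1 <= x <= x2 and y1 <= y <= y2)
            if tot > best:
                best, best_rect = tot, (x1, x2, y1, y2)
    return best, best_rect

# Two positives with high base-model scores, two negatives with low scores.
points = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1)]
weights = [1, 1, -1, -1]
best, rect = max_weighted_rectangle(points, weights)
```

The resulting rectangle captures both positives and excludes both negatives, which hints at the interpretability advantage the abstract claims: the learned rule is a readable region, not opaque coefficients.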

[LG-39] Community search signatures as foundation features for human-centered geospatial modeling ICML2024

链接: https://arxiv.org/abs/2410.22721
作者: Mimi Sun,Chaitanya Kamath,Mohit Agarwal,Arbaaz Muslim,Hector Yee,David Schottlander,Shailesh Bavadekar,Niv Efron,Shravya Shetty,Gautam Prasad
关键词-EN: reflecting people habits, unique composite signal, composite signal reflecting, signal reflecting people, relative search frequencies
类目: Machine Learning (cs.LG)
*备注: 8 pages, 8 figures, presented at the DMLR workshop at ICML 2024

点击查看摘要

Abstract:Aggregated relative search frequencies offer a unique composite signal reflecting people’s habits, concerns, interests, intents, and general information needs, which are not found in other readily available datasets. Temporal search trends have been successfully used in time series modeling across a variety of domains such as infectious diseases, unemployment rates, and retail sales. However, most existing applications require curating specialized datasets of individual keywords, queries, or query clusters, and the search data need to be temporally aligned with the outcome variable of interest. We propose a novel approach for generating an aggregated and anonymized representation of search interest as foundation features at the community level for geospatial modeling. We benchmark these features using spatial datasets across multiple domains. In zip codes with a population greater than 3000 that cover over 95% of the contiguous US population, our models for predicting missing values in a 20% set of holdout counties achieve an average R^2 score of 0.74 across 21 health variables, and 0.80 across 6 demographic and environmental variables. Our results demonstrate that these search features can be used for spatial predictions without strict temporal alignment, and that the resulting models outperform spatial interpolation and state of the art methods using satellite imagery features.

[LG-40] Exactly Minimax-Optimal Locally Differentially Private Sampling NEURIPS2024

链接: https://arxiv.org/abs/2410.22699
作者: Hyun-Young Park,Shahab Asoodeh,Si-Hyeon Lee
关键词-EN: local differential privacy, remains incomplete, generative models, privacy-utility trade-off, problem under local
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 32 pages and 7 figures. Accepted by NeurIPS 2024

点击查看摘要

Abstract:The sampling problem under local differential privacy has recently been studied with potential applications to generative models, but a fundamental analysis of its privacy-utility trade-off (PUT) remains incomplete. In this work, we define the fundamental PUT of private sampling in the minimax sense, using the f-divergence between original and sampling distributions as the utility measure. We characterize the exact PUT for both finite and continuous data spaces under some mild conditions on the data distributions, and propose sampling mechanisms that are universally optimal for all f-divergences. Our numerical experiments demonstrate the superiority of our mechanisms over baselines, in terms of theoretical utilities for finite data space and of empirical utilities for continuous data space.

[LG-41] An Iterative Algorithm for Regularized Non-negative Matrix Factorizations

链接: https://arxiv.org/abs/2410.22698
作者: Steven E. Pav
关键词-EN: non-negative matrix factorization, matrix factorization algorithm, Lasso regularization, Lee and Seung, ridge and Lasso
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP); Computation (stat.CO)
*备注: 6 figures

点击查看摘要

Abstract:We generalize the non-negative matrix factorization algorithm of Lee and Seung to accept a weighted norm, and to support ridge and Lasso regularization. We recast the Lee and Seung multiplicative update as an additive update which does not get stuck on zero values. We apply the companion R package rnnmf to the problem of finding a reduced rank representation of a database of cocktails.
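
For reference, the classical Lee-Seung multiplicative updates with a ridge penalty look as follows. Note this is a sketch of the standard scheme the paper generalizes; the paper's rnnmf package instead recasts it as an additive update that avoids getting stuck at zero values, and additionally supports weighted norms and Lasso.

```python
import numpy as np

def nmf_ridge(V, k, n_iter=300, lam=0.01, seed=0):
    """Lee-Seung multiplicative updates for V ~ W @ H under Frobenius loss
    with ridge penalties lam * (||W||_F^2 + ||H||_F^2). Updates are
    element-wise multiplicative, so nonnegativity is preserved."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    eps = 1e-12                      # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + lam * H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + lam * W + eps)
    return W, H

# Exactly rank-2 nonnegative data should be recovered closely.
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 15))
W, H = nmf_ridge(V, k=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The multiplicative form makes the zero-locking problem visible: once an entry of W or H hits exactly zero it can never leave, which is precisely what the paper's additive reformulation fixes.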

[LG-42] MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs

链接: https://arxiv.org/abs/2410.22697
作者: Aishwarya Sarkar,Sayan Ghosh,Nathan R. Tallent,Ali Jannesari
关键词-EN: Graph Neural Networks, Neural Networks, rising computational costs, pose significant challenges, massively connected graphs
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
*备注: In Proc. of the IEEE International Conference on Cluster Computing (CLUSTER), 2024

点击查看摘要

Abstract:Graph Neural Networks (GNN) are indispensable in learning from graph-structured data, yet their rising computational costs, especially on massively connected graphs, pose significant challenges in terms of execution performance. To tackle this, distributed-memory solutions such as partitioning the graph to concurrently train multiple replicas of GNNs are in practice. However, approaches requiring a partitioned graph usually suffer from communication overhead and load imbalance, even under optimal partitioning and communication strategies due to irregularities in the neighborhood minibatch sampling. This paper proposes practical trade-offs for improving the sampling and communication overheads for representation learning on distributed graphs (using popular GraphSAGE architecture) by developing a parameterized continuous prefetch and eviction scheme on top of the state-of-the-art Amazon DistDGL distributed GNN framework, demonstrating about 15-40% improvement in end-to-end training performance on the National Energy Research Scientific Computing Center’s (NERSC) Perlmutter supercomputer for various OGB datasets.

[LG-43] Byzantine-Robust Federated Learning: An Overview With Focus on Developing Sybil-based Attacks to Backdoor Augmented Secure Aggregation Protocols

链接: https://arxiv.org/abs/2410.22680
作者: Atharv Deshmukh
关键词-EN: collaboratively train Machine, train Machine Learning, Machine Learning models, paradigms enable large, enable large numbers
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 16 pages, 4 figures, 1 appendix

点击查看摘要

Abstract:Federated Learning (FL) paradigms enable large numbers of clients to collaboratively train Machine Learning models on private data. However, due to their multi-party nature, traditional FL schemes are left vulnerable to Byzantine attacks that attempt to hurt model performance by injecting malicious backdoors. A wide variety of prevention methods have been proposed to protect frameworks from such attacks. This paper provides an exhaustive and updated taxonomy of existing methods and frameworks, before zooming in and conducting an in-depth analysis of the strengths and weaknesses of the Robustness of Federated Learning (RoFL) protocol. From there, we propose two novel Sybil-based attacks that take advantage of vulnerabilities in RoFL. Finally, we conclude with comprehensive proposals for future testing, describe and detail implementation of the proposed attacks, and offer direction for improvements in the RoFL protocol as well as Byzantine-robust frameworks as a whole.

[LG-44] Is Function Similarity Over-Engineered? Building a Benchmark NEURIPS2024

链接: https://arxiv.org/abs/2410.22677
作者: Rebecca Saul,Chang Liu,Noah Fleischmann,Richard Zak,Kristopher Micinski,Edward Raff,James Holt
关键词-EN: including reverse engineering, critical security tasks, including reverse, reverse engineering, core component
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: To appear in the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

点击查看摘要

Abstract:Binary analysis is a core component of many critical security tasks, including reverse engineering, malware analysis, and vulnerability detection. Manual analysis is often time-consuming, but identifying commonly-used or previously-seen functions can reduce the time it takes to understand a new file. However, given the complexity of assembly, and the NP-hard nature of determining function equivalence, this task is extremely difficult. Common approaches often use sophisticated disassembly and decompilation tools, graph analysis, and other expensive pre-processing steps to perform function similarity searches over some corpus. In this work, we identify a number of discrepancies between the current research environment and the underlying application need. To remedy this, we build a new benchmark, REFuSE-Bench, for binary function similarity detection consisting of high-quality datasets and tests that better reflect real-world use cases. In doing so, we address issues like data duplication and accurate labeling, experiment with real malware, and perform the first serious evaluation of ML binary function similarity models on Windows data. Our benchmark reveals that a new, simple baseline, which looks only at the raw bytes of a function and requires no disassembly or other pre-processing, is able to achieve state-of-the-art performance in multiple settings. Our findings challenge conventional assumptions that complex models with highly-engineered features are being used to their full potential, and demonstrate that simpler approaches can provide significant value.
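
A raw-bytes baseline of the kind the benchmark highlights can be as simple as a byte histogram plus nearest-neighbor search, with no disassembly at all. This sketch is illustrative, not the paper's exact baseline.

```python
import numpy as np

def byte_histogram(data: bytes) -> np.ndarray:
    """256-bin normalized byte histogram of a function's raw bytes."""
    h = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256).astype(float)
    return h / max(1.0, h.sum())

def most_similar(query: bytes, corpus: list) -> int:
    """Index of the corpus function whose byte histogram has the highest
    cosine similarity with the query -- no disassembly, decompilation, or
    graph analysis required."""
    q = byte_histogram(query)
    sims = []
    for func_bytes in corpus:
        v = byte_histogram(func_bytes)
        denom = (np.linalg.norm(q) * np.linalg.norm(v)) or 1.0
        sims.append(float(q @ v / denom))
    return int(np.argmax(sims))
```

On toy byte strings, a query sharing its byte distribution with one corpus entry is matched to it; the surprising result in the paper is that representations this simple remain competitive on realistic corpora.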

[LG-45] Calibrating Practical Privacy Risks for Differentially Private Machine Learning

链接: https://arxiv.org/abs/2410.22673
作者: Yuechun Gu,Keke Chen
关键词-EN: Differential privacy quantifies, Differential privacy, privacy, privacy quantifies privacy, attacking success rate
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Differential privacy quantifies privacy through the privacy budget $\epsilon$, yet its practical interpretation is complicated by variations across models and datasets. Recent research on differentially private machine learning and membership inference has highlighted that with the same theoretical $\epsilon$ setting, the likelihood-ratio-based membership inference (LiRA) attacking success rate (ASR) may vary according to specific datasets and models, which might be a better indicator for evaluating real-world privacy risks. Inspired by this practical privacy measure, we study the approaches that can lower the attacking success rate to allow for more flexible privacy budget settings in model training. We find that by selectively suppressing privacy-sensitive features, we can achieve lower ASR values without compromising application-specific data utility. We use the SHAP and LIME model explainers to evaluate feature sensitivities and develop feature-masking strategies. Our findings demonstrate that the LiRA $ASR^M$ on model $M$ can properly indicate the inherent privacy risk of a dataset for modeling, and it’s possible to modify datasets to enable the use of larger theoretical $\epsilon$ settings to achieve equivalent practical privacy protection. We have conducted extensive experiments to show the inherent link between ASR and the dataset’s privacy risk. By carefully selecting features to mask, we can preserve more data utility with equivalent practical privacy protection and relaxed $\epsilon$ settings. The implementation details are shared online at this https URL.
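
The feature-masking step can be sketched as follows, assuming per-feature sensitivity scores (e.g. SHAP/LIME attribution magnitudes) have already been computed upstream; the function name and interface are illustrative.

```python
import numpy as np

def mask_sensitive_features(X, sensitivities, k):
    """Zero out the k feature columns ranked most privacy-sensitive by the
    given scores before training. The abstract's finding is that masking
    such features lowers the LiRA attack success rate while retaining
    application-specific utility."""
    order = np.argsort(np.asarray(sensitivities))[::-1][:k]
    Xm = np.array(X, dtype=float)        # copy; leave the original intact
    Xm[:, order] = 0.0
    return Xm, sorted(int(i) for i in order)

X = np.arange(12.0).reshape(3, 4)
Xm, masked = mask_sensitive_features(X, [0.1, 0.9, 0.5, 0.2], k=2)
```

Here the two highest-scoring columns (indices 1 and 2) are suppressed while the rest of the data is untouched.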

[LG-46] Reweighting Local Minima with Tilted SAM

链接: https://arxiv.org/abs/2410.22656
作者: Tian Li,Tianyi Zhou,Jeffrey A. Bilmes
关键词-EN: optimizing model parameters, Sharpness-Aware Minimization, overparameterized models, loss landscape, optimizing model
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sharpness-Aware Minimization (SAM) has been demonstrated to improve the generalization performance of overparameterized models by seeking flat minima on the loss landscape through optimizing model parameters that incur the largest loss within a neighborhood. Nevertheless, such min-max formulations are computationally challenging especially when the problem is highly non-convex. Additionally, focusing only on the worst-case local solution while ignoring potentially many other local solutions may be suboptimal when searching for flat minima. In this work, we propose Tilted SAM (TSAM), a generalization of SAM inspired by exponential tilting that effectively assigns higher priority to local solutions that are flatter and that incur larger losses. TSAM is parameterized by a tilt hyperparameter t and reduces to SAM as t approaches infinity. We prove that (1) the TSAM objective is smoother than SAM and thus easier to optimize; and (2) TSAM explicitly favors flatter minima as t increases. This is desirable as flatter minima could have better generalization properties for certain tasks. We develop algorithms motivated by the discretization of Hamiltonian dynamics to solve TSAM. Empirically, TSAM arrives at flatter local minima and results in superior test performance than the baselines of SAM and ERM across a range of image and text tasks.
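
The exponential tilting at the heart of TSAM can be written as a log-mean-exp aggregate of losses: as t grows it approaches SAM's worst case (the max), and as t shrinks it approaches the ERM average. This is a sketch of the objective only, not the paper's optimization algorithm.

```python
import numpy as np

def tilted_aggregate(losses, t):
    """Exponentially tilted aggregate (1/t) * log(mean(exp(t * l))).
    Interpolates between the mean (t -> 0) and the max (t -> infinity),
    assigning higher effective weight to larger losses as t increases.
    Computed via log-sum-exp shifting for numerical stability."""
    l = np.asarray(losses, dtype=float)
    m = l.max()
    return float(m + np.log(np.mean(np.exp(t * (l - m)))) / t)
```

For losses [1, 2, 3], a tiny tilt gives roughly the mean (2.0), a large tilt approaches the max (3.0), and the aggregate increases monotonically in t, which is the reweighting behavior the abstract describes.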

[LG-47] FT-PrivacyScore: Personalized Privacy Scoring Service for Machine Learning Participation

链接: https://arxiv.org/abs/2410.22651
作者: Yuechun Gu,Jiajie He,Keke Chen
关键词-EN: Training data privacy, Training data, data, top concern, Training
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Training data privacy has been a top concern in AI modeling. While methods like differentially private learning allow data contributors to quantify acceptable privacy loss, model utility is often significantly damaged. In practice, controlled data access remains a mainstream method for protecting data privacy in many industrial and research environments. In controlled data access, authorized model builders work in a restricted environment to access sensitive data, which can fully preserve data utility with reduced risk of data leak. However, unlike differential privacy, there is no quantitative measure for individual data contributors to tell their privacy risk before participating in a machine learning task. We developed the demo prototype FT-PrivacyScore to show that it’s possible to efficiently and quantitatively estimate the privacy risk of participating in a model fine-tuning task. The demo source code will be available at this https URL.

[LG-48] WaveRoRA: Wavelet Rotary Route Attention for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2410.22649
作者: Aobo Liang,Yan Sun
关键词-EN: achieved significant success, time series forecasting, multivariate time series, Transformer-based models, recent years
类目: Machine Learning (cs.LG)
*备注: The code is coming soon! For sure

点击查看摘要

Abstract:In recent years, Transformer-based models (Transformers) have achieved significant success in multivariate time series forecasting (MTSF). However, previous works focus on extracting features either from the time domain or the frequency domain, which inadequately captures the trends and periodic characteristics. To address this issue, we propose a wavelet learning framework to model complex temporal dependencies of the time series data. The wavelet domain integrates both time and frequency information, allowing for the analysis of local characteristics of signals at different scales. Additionally, the Softmax self-attention mechanism used by Transformers has quadratic complexity, which leads to excessive computational costs when capturing long-term dependencies. Therefore, we propose a novel attention mechanism: Rotary Route Attention (RoRA). Unlike Softmax attention, RoRA utilizes rotary position embeddings to inject relative positional information to sequence tokens and introduces a small number of routing tokens r to aggregate information from the KV matrices and redistribute it to the Q matrix, offering linear complexity. We further propose WaveRoRA, which leverages RoRA to capture inter-series dependencies in the wavelet domain. We conduct extensive experiments on eight real-world datasets. The results indicate that WaveRoRA outperforms existing state-of-the-art models while maintaining lower computational costs.
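
The routing idea (without the rotary embeddings) can be sketched in two stages: r routing tokens first pool the n key/value tokens, then the n queries read from the r summaries, so no n-by-n score matrix is ever formed, only (r x n) and (n x r) ones. This is a sketch of the routing mechanism as described, not RoRA's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def routed_attention(Q, K, V, R):
    """Two-stage routed attention: routing tokens R aggregate information
    from the K/V matrices into r summaries, which are then redistributed
    to the queries -- linear in sequence length n for fixed r."""
    d = Q.shape[-1]
    summaries = softmax(R @ K.T / np.sqrt(d)) @ V             # (r, d)
    return softmax(Q @ summaries.T / np.sqrt(d)) @ summaries  # (n, d)

rng = np.random.default_rng(0)
n, r, d = 6, 2, 4
out = routed_attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)), rng.normal(size=(r, d)))
```

With r fixed as a small constant, both score matrices have O(n * r) entries, which is the linear-complexity claim in the abstract.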

[LG-49] Solving Minimum-Cost Reach Avoid using Reinforcement Learning NEURIPS2024

链接: https://arxiv.org/abs/2410.22600
作者: Oswin So,Cheng Ge,Chuchu Fan
关键词-EN: avoiding unsafe states, Current reinforcement-learning methods, Current reinforcement-learning, unsafe states, cumulative costs subject
类目: Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
*备注: Accepted to NeurIPS 2024

点击查看摘要

Abstract:Current reinforcement-learning methods are unable to directly learn policies that solve the minimum cost reach-avoid problem to minimize cumulative costs subject to the constraints of reaching the goal and avoiding unsafe states, as the structure of this new optimization problem is incompatible with current methods. Instead, a surrogate problem is solved where all objectives are combined with a weighted sum. However, this surrogate objective results in suboptimal policies that do not directly minimize the cumulative cost. In this work, we propose RC-PPO, a reinforcement-learning-based method for solving the minimum-cost reach-avoid problem by using connections to Hamilton-Jacobi reachability. Empirical results demonstrate that RC-PPO learns policies with goal-reaching rates comparable to existing methods while achieving up to 57% lower cumulative costs on a suite of minimum-cost reach-avoid benchmarks on the Mujoco simulator. The project page can be found at this https URL.

[LG-50] Gaussian Derivative Change-point Detection for Early Warnings of Industrial System Failures

链接: https://arxiv.org/abs/2410.22594
作者: Hao Zhao,Rong Pan
关键词-EN: conducting predictive maintenance, enhancing system availability, system, Derivative Change-Point Detection, essential for conducting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An early warning of future system failure is essential for conducting predictive maintenance and enhancing system availability. This paper introduces a three-step framework for assessing system health to predict imminent system breakdowns. First, the Gaussian Derivative Change-Point Detection (GDCPD) algorithm is proposed for detecting changes in the high-dimensional feature space. GDCPD conducts a multivariate Change-Point Detection (CPD) by implementing Gaussian derivative processes for identifying change locations on critical system features, as these changes eventually will lead to system failure. To assess the significance of these changes, Weighted Mahalanobis Distance (WMD) is applied in both offline and online analyses. In the offline setting, WMD helps establish a threshold that determines significant system variations, while in the online setting, it facilitates real-time monitoring, issuing alarms for potential future system breakdowns. Utilizing the insights gained from the GDCPD and monitoring scheme, Long Short-Term Memory (LSTM) network is then employed to estimate the Remaining Useful Life (RUL) of the system. The experimental study of a real-world system demonstrates the effectiveness of the proposed methodology in accurately forecasting system failures well before they occur. By integrating CPD with real-time monitoring and RUL prediction, this methodology significantly advances system health monitoring and early warning capabilities.
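
The Weighted Mahalanobis Distance step can be sketched as follows. The specific weighting here, scaling each feature's deviation by the square root of its weight, is one plausible form since the abstract does not spell out the paper's exact definition; with unit weights it reduces to the ordinary Mahalanobis distance.

```python
import numpy as np

def weighted_mahalanobis(x, mu, cov, w):
    """Weighted Mahalanobis distance: scale each feature deviation by
    sqrt(w_j), then apply the inverse covariance. Features flagged by the
    change-point detector can be up-weighted so their shifts dominate
    the monitoring statistic."""
    z = np.sqrt(np.asarray(w, dtype=float)) * (np.asarray(x) - np.asarray(mu))
    return float(np.sqrt(z @ np.linalg.solve(cov, z)))

# With identity covariance and unit weights this is plain Euclidean distance.
d_all = weighted_mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2), [1.0, 1.0])
d_one = weighted_mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2), [0.0, 1.0])
```

In the monitoring scheme, an offline threshold on this distance defines "significant" variation, and online values above the threshold raise the early-warning alarm.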

[LG-51] Flow Matching for Posterior Inference with Simulator Feedback

链接: https://arxiv.org/abs/2410.22573
作者: Benjamin Holzschuh,Nils Thuerey
关键词-EN: Flow-based generative modeling, Flow-based generative, lower inference times, powerful tool, tool for solving
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Code available at this https URL

点击查看摘要

Abstract:Flow-based generative modeling is a powerful tool for solving inverse problems in physical sciences that can be used for sampling and likelihood evaluation with much lower inference times than traditional methods. We propose to refine flows with additional control signals based on a simulator. Control signals can include gradients and a problem-specific cost function if the simulator is differentiable, or they can be fully learned from the simulator output. In our proposed method, we pretrain the flow network and include feedback from the simulator exclusively for finetuning, therefore requiring only a small amount of additional parameters and compute. We motivate our design choices on several benchmark problems for simulation-based inference and evaluate flow matching with simulator feedback against classical MCMC methods for modeling strong gravitational lens systems, a challenging inverse problem in astronomy. We demonstrate that including feedback from the simulator improves the accuracy by 53%, making it competitive with traditional techniques while being up to 67x faster for inference.
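
The flow-matching backbone being refined here can be sketched with the standard conditional flow-matching loss for straight paths; the paper's contribution, the simulator-feedback control signals added during finetuning, is not modeled in this sketch.

```python
import numpy as np

def cfm_loss(v, x0, x1, t):
    """Conditional flow-matching loss for straight paths
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0.
    `v` is a callable velocity field v(x_t, t) to be regressed onto
    the target."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    return float(np.mean(np.sum((v(xt, t) - (x1 - x0)) ** 2, axis=1)))

x0 = np.array([[0.0, 0.0], [1.0, 1.0]])   # noise samples
x1 = np.array([[1.0, 2.0], [3.0, 4.0]])   # data samples
t = np.array([0.3, 0.7])

# An oracle velocity field attains zero loss; an untrained one does not.
oracle_loss = cfm_loss(lambda xt, tt: x1 - x0, x0, x1, t)
zero_field_loss = cfm_loss(lambda xt, tt: np.zeros_like(xt), x0, x1, t)
```

A pretrained flow minimizes this regression loss; finetuning then augments the learned velocity with simulator-derived control signals.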

[LG-52] Vertical Federated Learning with Missing Features During Training and Inference

链接: https://arxiv.org/abs/2410.22564
作者: Pedro Valdeira,Shiqiang Wang,Yuejie Chi
关键词-EN: Vertical federated learning, local data, feature-partitioned datasets, datasets across multiple, Vertical federated
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Vertical federated learning trains models from feature-partitioned datasets across multiple clients, who collaborate without sharing their local data. Standard approaches assume that all feature partitions are available during both training and inference. Yet, in practice, this assumption rarely holds, as for many samples only a subset of the clients observe their partition. However, not utilizing incomplete samples during training harms generalization, and not supporting them during inference limits the utility of the model. Moreover, if any client leaves the federation after training, its partition becomes unavailable, rendering the learned model unusable. Missing feature blocks are therefore a key challenge limiting the applicability of vertical federated learning in real-world scenarios. To address this, we propose LASER-VFL, a vertical federated learning method for efficient training and inference of split neural network-based models that is capable of handling arbitrary sets of partitions. Our approach is simple yet effective, relying on the strategic sharing of model parameters and on task-sampling to train a family of predictors. We show that LASER-VFL achieves a \mathcal{O}(1/\sqrt{T}) convergence rate for nonconvex objectives in general, \mathcal{O}(1/T) for sufficiently large batch sizes, and linear convergence under the Polyak-Łojasiewicz inequality. Numerical experiments show improved performance of LASER-VFL over the baselines. Remarkably, this is the case even in the absence of missing features. For example, for CIFAR-100, we see an improvement in accuracy of 21.4% when each of four feature blocks is observed with a probability of 0.5 and of 12.2% when all features are observed.

[LG-53] Unsupervised Multimodal Fusion of In-process Sensor Data for Advanced Manufacturing Process Monitoring

链接: https://arxiv.org/abs/2410.22558
作者: Matthew McKinney,Anthony Garland,Dale Cillessen,Jesse Adamczyk,Dan Bolintineanu,Michael Heiden,Elliott Fowler,Brad L. Boyce
关键词-EN: Effective monitoring, maintaining product quality, operational efficiency, crucial for maintaining, maintaining product
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Effective monitoring of manufacturing processes is crucial for maintaining product quality and operational efficiency. Modern manufacturing environments generate vast amounts of multimodal data, including visual imagery from various perspectives and resolutions, hyperspectral data, and machine health monitoring information such as actuator positions, accelerometer readings, and temperature measurements. However, interpreting this complex, high-dimensional data presents significant challenges, particularly when labeled datasets are unavailable. This paper presents a novel approach to multimodal sensor data fusion in manufacturing processes, inspired by the Contrastive Language-Image Pre-training (CLIP) model. We leverage contrastive learning techniques to correlate different data modalities without the need for labeled data, developing encoders for five distinct modalities: visual imagery, audio signals, laser position (x and y coordinates), and laser power measurements. By compressing these high-dimensional datasets into low-dimensional representational spaces, our approach facilitates downstream tasks such as process control, anomaly detection, and quality assurance. We evaluate the effectiveness of our approach through experiments, demonstrating its potential to enhance process monitoring capabilities in advanced manufacturing systems. This research contributes to smart manufacturing by providing a flexible, scalable framework for multimodal data fusion that can adapt to diverse manufacturing environments and sensor configurations.
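The CLIP-inspired objective aligns co-occurring samples from different modalities without labels. A self-contained NumPy sketch of a symmetric InfoNCE loss between two modalities is given below; it is illustrative only (the paper trains five deep encoders in a learning framework, and the embeddings and temperature here are assumptions):

```python
import numpy as np

def info_nce(emb_a, emb_b, temperature=0.1):
    """Symmetric InfoNCE loss: co-occurring samples (the diagonal) should be
    more similar than all mismatched pairs, in both pairing directions."""
    # L2-normalize so the dot product is cosine similarity.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Cross-entropy with the matching sample on the diagonal as the target.
    log_sm_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -0.5 * (np.diag(log_sm_rows).mean() + np.diag(log_sm_cols).mean())

rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 4))
# Aligned modalities (nearly identical embeddings) incur a lower loss
# than unrelated ones, which is what drives the correlation learning.
loss_aligned = info_nce(shared, shared + 0.01 * rng.normal(size=(8, 4)))
loss_random = info_nce(shared, rng.normal(size=(8, 4)))
print(loss_aligned < loss_random)  # prints True
```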

[LG-54] Hindsight Experience Replay Accelerates Proximal Policy Optimization

链接: https://arxiv.org/abs/2410.22524
作者: Douglas C. Crowder,Darrien M. McKenzie,Matthew L. Trappett,Frances S. Chance
关键词-EN: Hindsight experience replay, emit sparse rewards, Hindsight experience, off-policy reinforcement learning, experience replay
类目: Machine Learning (cs.LG)
*备注: 12 pages. 10 Figures

点击查看摘要

Abstract:Hindsight experience replay (HER) accelerates off-policy reinforcement learning algorithms for environments that emit sparse rewards by modifying the goal of the episode post-hoc to be some state achieved during the episode. Because post-hoc modification of the observed goal violates the assumptions of on-policy algorithms, HER is not typically applied to on-policy algorithms. Here, we show that HER can dramatically accelerate proximal policy optimization (PPO), an on-policy reinforcement learning algorithm, when tested on a custom predator-prey environment.
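The core of HER is goal relabeling: transitions from a failed episode are rewritten as if a state actually achieved later had been the goal, turning sparse-reward data into useful supervision. A minimal sketch with the common "future" strategy (the tuple layout and toy environment are assumptions of this sketch):

```python
import random

def her_relabel(episode, num_relabeled=4):
    """Relabel (state, action, next_state, goal, reward) transitions with
    goals sampled from states achieved later in the same episode ("future"
    strategy); reward is 1 only when the relabeled goal is reached."""
    relabeled = []
    for i, (s, a, s_next, goal, _) in enumerate(episode):
        future_states = [t[2] for t in episode[i:]]
        for new_goal in random.sample(
            future_states, min(num_relabeled, len(future_states))
        ):
            reward = 1.0 if s_next == new_goal else 0.0
            relabeled.append((s, a, s_next, new_goal, reward))
    return relabeled

random.seed(0)
# Toy episode on a line: states 0..3; the original goal 10 is never reached,
# so every original reward is 0.
episode = [(i, +1, i + 1, 10, 0.0) for i in range(3)]
augmented = her_relabel(episode)
# Relabeling manufactures successful transitions from a reward-free episode.
print(any(r == 1.0 for *_, r in augmented))  # prints True
```

Note that in the paper's setting these relabeled transitions are fed to PPO, an on-policy algorithm, which is exactly the unconventional step being studied.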

[LG-55] Multimodal Structure Preservation Learning

链接: https://arxiv.org/abs/2410.22520
作者: Chang Liu,Jieshi Chen,Lee H. Harrison,Artur Dubrawski
关键词-EN: build machine learning, machine learning models, acquisition cost, crucial considerations, build machine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When selecting data to build machine learning models in practical applications, factors such as availability, acquisition cost, and discriminatory power are crucial considerations. Different data modalities often capture unique aspects of the underlying phenomenon, making their utilities complementary. On the other hand, some sources of data host structural information that is key to their value. Hence, the utility of one data type can sometimes be enhanced by matching the structure of another. We propose Multimodal Structure Preservation Learning (MSPL) as a novel method of learning data representations that leverages the clustering structure provided by one data modality to enhance the utility of data from another modality. We demonstrate the effectiveness of MSPL in uncovering latent structures in synthetic time series data and recovering clusters from whole genome sequencing and antimicrobial resistance data using mass spectrometry data in support of epidemiology applications. The results show that MSPL can imbue the learned features with external structures and help reap the beneficial synergies occurring across disparate data modalities.

[LG-56] Unlocking Point Processes through Point Set Diffusion

链接: https://arxiv.org/abs/2410.22493
作者: David Lüdke,Enric Rabasseda Raventós,Marcel Kollovieh,Stephan Günnemann
关键词-EN: Point Set Diffusion, random point sets, Point processes, Point processes model, point sets
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Point processes model the distribution of random point sets in mathematical spaces, such as spatial and temporal domains, with applications in fields like seismology, neuroscience, and economics. Existing statistical and machine learning models for point processes are predominantly constrained by their reliance on the characteristic intensity function, introducing an inherent trade-off between efficiency and flexibility. In this paper, we introduce Point Set Diffusion, a diffusion-based latent variable model that can represent arbitrary point processes on general metric spaces without relying on the intensity function. By directly learning to stochastically interpolate between noise and data point sets, our approach enables efficient, parallel sampling and flexible generation for complex conditional tasks defined on the metric space. Experiments on synthetic and real-world datasets demonstrate that Point Set Diffusion achieves state-of-the-art performance in unconditional and conditional generation of spatial and spatiotemporal point processes while providing up to orders of magnitude faster sampling than autoregressive baselines.

[LG-57] Learning Identifiable Factorized Causal Representations of Cellular Responses

链接: https://arxiv.org/abs/2410.22472
作者: Haiyi Mao,Romain Lopez,Kai Liu,Jan-Christian Huetter,David Richmond,Panayiotis Benos,Lin Qiu
关键词-EN: chemical perturbations promises, mathbf, promises to accelerate, accelerate the discovery, therapeutic targets
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:The study of cells and their responses to genetic or chemical perturbations promises to accelerate the discovery of therapeutic targets. However, designing adequate and insightful models for such data is difficult because the response of a cell to perturbations essentially depends on its biological context (e.g., genetic background or cell type). For example, while discovering therapeutic targets, one may want to enrich for drugs that specifically target a certain cell type. This challenge emphasizes the need for methods that explicitly take into account potential interactions between drugs and contexts. Towards this goal, we propose a novel Factorized Causal Representation (FCR) learning method that reveals causal structure in single-cell perturbation data from several cell lines. Based on the framework of identifiable deep generative models, FCR learns multiple cellular representations that are disentangled, comprised of covariate-specific (\mathbf{z}_x), treatment-specific (\mathbf{z}_t), and interaction-specific (\mathbf{z}_{tx}) blocks. Based on recent advances in non-linear ICA theory, we prove the component-wise identifiability of \mathbf{z}_{tx} and block-wise identifiability of \mathbf{z}_t and \mathbf{z}_x. Then, we present our implementation of FCR, and empirically demonstrate that it outperforms state-of-the-art baselines in various tasks across four single-cell datasets.

[LG-58] Power side-channel leakage localization through adversarial training of deep neural networks

链接: https://arxiv.org/abs/2410.22425
作者: Jimmy Gammell,Anand Raghunathan,Kaushik Roy
关键词-EN: Supervised deep learning, Supervised deep, effective tool, tool for carrying, deep learning
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Supervised deep learning has emerged as an effective tool for carrying out power side-channel attacks on cryptographic implementations. While increasingly-powerful deep learning-based attacks are regularly published, comparatively-little work has gone into using deep learning to defend against these attacks. In this work we propose a technique for identifying which timesteps in a power trace are responsible for leaking a cryptographic key, through an adversarial game between a deep learning-based side-channel attacker which seeks to classify a sensitive variable from the power traces recorded during encryption, and a trainable noise generator which seeks to thwart this attack by introducing a minimal amount of noise into the power traces. We demonstrate on synthetic datasets that our method can outperform existing techniques in the presence of common countermeasures such as Boolean masking and trace desynchronization. Results on real datasets are weak because the technique is highly sensitive to hyperparameters and early-stop point, and we lack a holdout dataset with ground truth knowledge of leaking points for model selection. Nonetheless, we believe our work represents an important first step towards deep side-channel leakage localization without relying on strong assumptions about the implementation or the nature of its leakage. An open-source PyTorch implementation of our experiments is provided.

[LG-59] GleanVec: Accelerating vector search with minimalist nonlinear dimensionality reduction

链接: https://arxiv.org/abs/2410.22347
作者: Mariano Tepper,Ishwar Singh Bhati,Cecilia Aguerrebere,Ted Willke
关键词-EN: reflects semantic affinities, similarity reflects semantic, Embedding models, semantic affinities, models can generate
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Embedding models can generate high-dimensional vectors whose similarity reflects semantic affinities. Thus, accurately and timely retrieving those vectors in a large collection that are similar to a given query has become a critical component of a wide range of applications. In particular, cross-modal retrieval (e.g., where a text query is used to find images) is gaining momentum rapidly. Here, it is challenging to achieve high accuracy as the queries often have different statistical distributions than the database vectors. Moreover, the high vector dimensionality puts these search systems under compute and memory pressure, leading to subpar performance. In this work, we present new linear and nonlinear methods for dimensionality reduction to accelerate high-dimensional vector search while maintaining accuracy in settings with in-distribution (ID) and out-of-distribution (OOD) queries. The linear LeanVec-Sphering outperforms other linear methods, trains faster, comes with no hyperparameters, and allows to set the target dimensionality more flexibly. The nonlinear Generalized LeanVec (GleanVec) uses a piecewise linear scheme to further improve the search accuracy while remaining computationally nimble. Initial experimental results show that LeanVec-Sphering and GleanVec push the state of the art for vector search.
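The premise of LeanVec/GleanVec-style acceleration is that a learned linear (or piecewise linear) projection shrinks vectors while largely preserving nearest-neighbor structure. The sketch below uses plain PCA as a stand-in for the learned projections (an assumption of this sketch, not the paper's method) and checks that exact neighbors are still recovered in the reduced space:

```python
import numpy as np

def fit_projection(X, dim):
    """Fit a PCA projection matrix (D, dim). A stand-in for LeanVec-style
    learned projections; illustrative only."""
    _, _, vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return vt[:dim].T

def nearest(q, X):
    return int(np.argmin(((X - q) ** 2).sum(axis=1)))

def top_k(q, X, k=10):
    return set(np.argsort(((X - q) ** 2).sum(axis=1))[:k].tolist())

rng = np.random.default_rng(0)
# Synthetic database whose variance concentrates in the first 8 of 64 dims.
scales = np.r_[np.full(8, 10.0), np.full(56, 0.1)]
db = rng.normal(size=(1000, 64)) * scales
queries = rng.normal(size=(5, 64)) * scales

P = fit_projection(db, 8)            # 8x smaller vectors to scan
db_lo, q_lo = db @ P, queries @ P
# Search in the reduced space; check the exact neighbor lands in its top-10,
# mimicking a candidate-generation + rerank pipeline.
hits = sum(nearest(q, db) in top_k(ql, db_lo) for q, ql in zip(queries, q_lo))
print(hits, "of 5 exact neighbors recovered in the reduced space")
```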

[LG-60] Improving the accuracy of food security predictions by integrating conflict data

链接: https://arxiv.org/abs/2410.22342
作者: Marco Bertetti,Paolo Agnolucci,Alvaro Calzadilla,Licia Capra
关键词-EN: prominent factors driving, driving food crises, factors driving food, emerged as prominent, prominent factors
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Violence and armed conflicts have emerged as prominent factors driving food crises. However, the extent of their impact remains largely unexplored. This paper provides an in-depth analysis of the impact of violent conflicts on food security in Africa. We performed a comprehensive correlation analysis using data from the Famine Early Warning Systems Network (FEWSNET) and the Armed Conflict Location Event Data (ACLED). Our results show that using conflict data to train machine learning models leads to a 1.5% increase in accuracy compared to models that do not incorporate conflict-related information. The key contribution of this study is the quantitative analysis of the impact of conflicts on food security predictions.

[LG-61] Conditional Forecasting of Margin Calls using Dynamic Graph Neural Networks

链接: https://arxiv.org/abs/2410.23275
作者: Matteo Citterio,Marco D’Errico,Gabriele Visentin
关键词-EN: Dynamic Graph Neural, Graph Neural Network, steps ahead forecasting, Graph Neural, ahead forecasting problems
类目: Risk Management (q-fin.RM); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a novel Dynamic Graph Neural Network (DGNN) architecture for solving conditional m-step-ahead forecasting problems in temporal financial networks. The proposed DGNN is validated on simulated data from a temporal financial network model capturing stylized features of Interest Rate Swaps (IRSs) transaction networks, where financial entities trade swap contracts dynamically and the network topology evolves conditionally on a reference rate. The proposed model is able to produce accurate conditional forecasts of net variation margins up to a 21-day horizon by leveraging conditional information under pre-determined stress test scenarios. Our work shows that the network dynamics can be successfully incorporated into stress-testing practices, thus providing regulators and policymakers with a crucial tool for systemic risk monitoring.

[LG-62] Very fast Bayesian Additive Regression Trees on GPU

链接: https://arxiv.org/abs/2410.23244
作者: Giacomo Petrillo
关键词-EN: Bayesian Additive Regression, Additive Regression Trees, Bayesian regression technique, nonparametric Bayesian regression, Bayesian Additive
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Check out the software at this https URL

点击查看摘要

Abstract:Bayesian Additive Regression Trees (BART) is a nonparametric Bayesian regression technique based on an ensemble of decision trees. It is part of the toolbox of many statisticians. The overall statistical quality of the regression is typically higher than other generic alternatives, and it requires less manual tuning, making it a good default choice. However, it is a niche method compared to its natural competitor XGBoost, due to the longer running time, making sample sizes above 10,000-100,000 a nuisance. I present a GPU-enabled implementation of BART, faster by up to 200x relative to a single CPU core, making BART competitive in running time with XGBoost. This implementation is available in the Python package bartz.

[LG-63] Full-waveform earthquake source inversion using simulation-based inference

链接: https://arxiv.org/abs/2410.23238
作者: A. A. Saoulis,D. Piras,A. Spurio Mancini,B. Joachimi,A. M. G. Ferreira
关键词-EN: SBI, paper presents, moment tensor, Gaussian likelihood, Bayesian inference framework
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 22 + 11 pages, 11 + 11 figures

点击查看摘要

Abstract:This paper presents a novel framework for full-waveform seismic source inversion using simulation-based inference (SBI). Traditional probabilistic approaches often rely on simplifying assumptions about data errors, which we show can lead to inaccurate uncertainty quantification. SBI addresses this limitation by building an empirical probabilistic model of the data errors using machine learning models, known as neural density estimators, which can then be integrated into the Bayesian inference framework. We apply the SBI framework to point-source moment tensor inversions as well as joint moment tensor and time-location inversions. We construct a range of synthetic examples to explore the quality of the SBI solutions, as well as to compare the SBI results with standard Gaussian likelihood-based Bayesian inversions. We then demonstrate that under real seismic noise, common Gaussian likelihood assumptions for treating full-waveform data yield overconfident posterior distributions that underestimate the moment tensor component uncertainties by up to a factor of 3. We contrast this with SBI, which produces well-calibrated posteriors that generally agree with the true seismic source parameters, and offers an order-of-magnitude reduction in the number of simulations required to perform inference compared to standard Monte Carlo techniques. Finally, we apply our methodology to a pair of moderate magnitude earthquakes in the North Atlantic. We utilise seismic waveforms recorded by the recent UPFLOW ocean bottom seismometer array as well as by regional land stations in the Azores, comparing full moment tensor and source-time location posteriors between SBI and a Gaussian likelihood approach. We find that our adaptation of SBI can be directly applied to real earthquake sources to efficiently produce high quality posterior distributions that significantly improve upon Gaussian likelihood approaches.

[LG-64] Improved convergence rate of kNN graph Laplacians

链接: https://arxiv.org/abs/2410.23212
作者: Yixuan Tan,Xiuyuan Cheng
关键词-EN: local data densities, nearest neighbor, graph, widely used due, adaptivity to local
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:In graph-based data analysis, k-nearest neighbor (kNN) graphs are widely used due to their adaptivity to local data densities. Allowing weighted edges in the graph, the kernelized graph affinity provides a more general type of kNN graph where the kNN distance is used to set the kernel bandwidth adaptively. In this work, we consider a general class of kNN graph where the graph affinity is W_{ij} = \epsilon^{-d/2} \, k_0( \|x_i - x_j\|^2 / (\epsilon \, \phi(\widehat{\rho}(x_i), \widehat{\rho}(x_j))^2) ), with \widehat{\rho}(x) being the (rescaled) kNN distance at the point x, \phi a symmetric bi-variate function, and k_0 a non-negative function on [0,\infty). Under the manifold data setting, where N i.i.d. samples x_i are drawn from a density p on a d-dimensional unknown manifold embedded in a high dimensional Euclidean space, we prove the point-wise convergence of the kNN graph Laplacian to the limiting manifold operator (depending on p) at the rate of O(N^{-2/(d+6)}), up to a log factor, when k_0 and \phi have C^3 regularity and satisfy other technical conditions. This fast rate is obtained when \epsilon \sim N^{-2/(d+6)} and k \sim N^{6/(d+6)}, both at the optimal order to balance the theoretical bias and variance errors. When k_0 and \phi have lower regularities, including when k_0 is a compactly supported function as in the standard kNN graph, the convergence rate degenerates to O(N^{-1/(d+4)}). Our improved convergence rate is based on a refined analysis of the kNN estimator, which can be of independent interest. We validate our theory by numerical experiments on simulated data.
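The affinity W_{ij} above can be instantiated directly. The sketch below picks a Gaussian k_0(t) = exp(-t) and the arithmetic mean for \phi, which are illustrative choices from the admissible class, not the paper's specific ones:

```python
import numpy as np

def knn_graph_affinity(X, k, eps):
    """Kernelized kNN graph affinity
        W_ij = eps^(-d/2) * k_0(||x_i - x_j||^2 / (eps * phi(rho_i, rho_j)^2))
    with k_0(t) = exp(-t), rho the distance to the k-th nearest neighbor,
    and phi(a, b) = (a + b) / 2 (one admissible symmetric choice)."""
    n, d = X.shape
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    dists = np.sqrt(dist2)
    # Column 0 of the sorted distances is the self-distance 0, so column k
    # is the distance to the k-th nearest *other* point.
    rho = np.sort(dists, axis=1)[:, k]
    phi = 0.5 * (rho[:, None] + rho[None, :])   # adaptive, symmetric bandwidth
    return eps ** (-d / 2) * np.exp(-dist2 / (eps * phi ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
W = knn_graph_affinity(X, k=5, eps=0.5)
print(np.allclose(W, W.T))  # prints True: W is symmetric because phi is
```

From W one would then build the graph Laplacian whose convergence the paper analyzes.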

[LG-65] Uncertainty quantification for fast reconstruction methods using augmented equivariant bootstrap: Application to radio interferometry NEURIPS2024

链接: https://arxiv.org/abs/2410.23178
作者: Mostafa Cherif,Tobías I. Liaudat,Jonathan Kern,Christophe Kervazo,Jérôme Bobin
关键词-EN: Square Kilometer Array, Kilometer Array promises, Square Kilometer, Kilometer Array, astronomy observational capabilities
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures. Accepted at the Machine Learning and the Physical Sciences Workshop, NeurIPS 2024

点击查看摘要

Abstract:The advent of next-generation radio interferometers like the Square Kilometer Array promises to revolutionise our radio astronomy observational capabilities. The unprecedented volume of data these devices generate requires fast and accurate image reconstruction algorithms to solve the ill-posed radio interferometric imaging problem. Most state-of-the-art reconstruction methods lack trustworthy and scalable uncertainty quantification, which is critical for the rigorous scientific interpretation of radio observations. We propose an unsupervised technique based on a conformalized version of a radio-augmented equivariant bootstrapping method, which allows us to quantify uncertainties for fast reconstruction methods. Noticeably, we rely on reconstructions from ultra-fast unrolled algorithms. The proposed method brings more reliable uncertainty estimations to our problem than existing alternatives.

[LG-66] Functional Gradient Flows for Constrained Sampling NEURIPS2024

链接: https://arxiv.org/abs/2410.23170
作者: Shiyue Zhang,Longlin Yu,Ziheng Cheng,Cheng Zhang
关键词-EN: chain Monte Carlo, Markov chain Monte, particle-based variational inference, Monte Carlo, Variational Gradient Descent
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: NeurIPS 2024 camera-ready (30 pages, 26 figures)

点击查看摘要

Abstract:Recently, through a unified gradient flow perspective of Markov chain Monte Carlo (MCMC) and variational inference (VI), particle-based variational inference methods (ParVIs) have been proposed that tend to combine the best of both worlds. While typical ParVIs such as Stein Variational Gradient Descent (SVGD) approximate the gradient flow within a reproducing kernel Hilbert space (RKHS), many attempts have been made recently to replace RKHS with more expressive function spaces, such as neural networks. While successful, these methods are mainly designed for sampling from unconstrained domains. In this paper, we offer a general solution to constrained sampling by introducing a boundary condition for the gradient flow which would confine the particles within the specific domain. This allows us to propose a new functional gradient ParVI method for constrained sampling, called constrained functional gradient flow (CFG), with provable continuous-time convergence in total variation (TV). We also present novel numerical strategies to handle the boundary integral term arising from the domain constraints. Our theory and experiments demonstrate the effectiveness of the proposed framework.

[LG-67] When can classical neural networks represent quantum states?

链接: https://arxiv.org/abs/2410.23152
作者: Tai-Hsuan Yang,Mehdi Soleimanifar,Thiago Bergamaschi,John Preskill
关键词-EN: n-qubit state requires, naive classical representation, requires specifying exponentially, naive classical, neural quantum states
类目: Quantum Physics (quant-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注: 37 pages, 9 figures

点击查看摘要

Abstract:A naive classical representation of an n-qubit state requires specifying exponentially many amplitudes in the computational basis. Past works have demonstrated that classical neural networks can succinctly express these amplitudes for many physically relevant states, leading to computationally powerful representations known as neural quantum states. What underpins the efficacy of such representations? We show that conditional correlations present in the measurement distribution of quantum states control the performance of their neural representations. Such conditional correlations are basis dependent, arise due to measurement-induced entanglement, and reveal features not accessible through conventional few-body correlations often examined in studies of phases of matter. By combining theoretical and numerical analysis, we demonstrate how the state’s entanglement and sign structure, along with the choice of measurement basis, give rise to distinct patterns of short- or long-range conditional correlations. Our findings provide a rigorous framework for exploring the expressive power of neural quantum states.

[LG-68] Graph Integration for Diffusion-Based Manifold Alignment ICML

链接: https://arxiv.org/abs/2410.22978
作者: Jake S. Rhodes,Adam G. Rustad
关键词-EN: Manifold alignment, Semi-supervised manifold alignment, intrinsically linked, manifold alignment methods, Manifold
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, Accepted at ICMLA 2024

点击查看摘要

Abstract:Data from individual observations can originate from various sources or modalities but are often intrinsically linked. Multimodal data integration can enrich information content compared to single-source data. Manifold alignment is a form of data integration that seeks a shared, underlying low-dimensional representation of multiple data sources that emphasizes similarities between alternative representations of the same entities. Semi-supervised manifold alignment relies on partially known correspondences between domains, either through shared features or through other known associations. In this paper, we introduce two semi-supervised manifold alignment methods. The first method, Shortest Paths on the Union of Domains (SPUD), forms a unified graph structure using known correspondences to establish graph edges. By learning inter-domain geodesic distances, SPUD creates a global, multi-domain structure. The second method, MASH (Manifold Alignment via Stochastic Hopping), learns local geometry within each domain and forms a joint diffusion operator using known correspondences to iteratively learn new inter-domain correspondences through a random-walk approach. Through the diffusion process, MASH forms a coupling matrix that links heterogeneous domains into a unified structure. We compare SPUD and MASH with existing semi-supervised manifold alignment methods and show that they outperform competing methods in aligning true correspondences and cross-domain classification. In addition, we show how these methods can be applied to transfer label information between domains.
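The SPUD idea of bridging domain graphs through known correspondences and reading off inter-domain geodesics can be sketched with a small Floyd-Warshall computation. The graphs, bridge weight, and node layout below are toy assumptions, not the paper's experimental setup:

```python
import numpy as np

def union_geodesics(Wa, Wb, correspondences, bridge_weight=1e-3):
    """Join two domain graphs (distance matrices, np.inf = no edge) through
    low-cost bridge edges at known correspondences, then compute all-pairs
    shortest-path (geodesic) distances on the union graph."""
    na, nb = Wa.shape[0], Wb.shape[0]
    n = na + nb
    D = np.full((n, n), np.inf)
    D[:na, :na] = Wa
    D[na:, na:] = Wb
    for i, j in correspondences:  # bridge matched points across domains
        D[i, na + j] = D[na + j, i] = bridge_weight
    np.fill_diagonal(D, 0.0)
    for k in range(n):  # Floyd-Warshall relaxation
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

# Two 3-node chain graphs linked by one known correspondence (node 0 <-> node 0).
inf = np.inf
chain = np.array([[0, 1, inf], [1, 0, 1], [inf, 1, 0]], dtype=float)
D = union_geodesics(chain, chain, [(0, 0)])
# Cross-domain geodesic from node 2 of domain A to node 2 of domain B:
# 2 hops to the bridge, the bridge itself, then 2 hops on the other side.
print(round(D[2, 5], 3))  # prints 4.001
```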

[LG-69] Generalization Bounds via Conditional f-Information NEURIPS2024

链接: https://arxiv.org/abs/2410.22887
作者: Ziqiao Wang,Yongyi Mao
关键词-EN: traditional conditional mutual, conditional mutual information, information-theoretic generalization bounds, traditional conditional, conditional mutual
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:In this work, we introduce novel information-theoretic generalization bounds using the conditional f-information framework, an extension of the traditional conditional mutual information (MI) framework. We provide a generic approach to derive generalization bounds via f-information in the supersample setting, applicable to both bounded and unbounded loss functions. Unlike previous MI-based bounds, our proof strategy does not rely on upper bounding the cumulant-generating function (CGF) in the variational formula of MI. Instead, we set the CGF or its upper bound to zero by carefully selecting the measurable function invoked in the variational formula. Although some of our techniques are partially inspired by recent advances in the coin-betting framework (e.g., Jang et al. (2023)), our results are independent of any previous findings from regret guarantees of online gambling algorithms. Additionally, our newly derived MI-based bound recovers many previous results and improves our understanding of their potential limitations. Finally, we empirically compare various f-information measures for generalization, demonstrating the improvement of our new bounds over the previous bounds.

[LG-70] Hyperparameter Optimization in Machine Learning

链接: https://arxiv.org/abs/2410.22854
作者: Luca Franceschi,Michele Donini,Valerio Perrone,Aaron Klein,Cédric Archambeau,Matthias Seeger,Massimiliano Pontil,Paolo Frasconi
关键词-EN: configuration variables controlling, machine learning algorithms, machine learning, configuration variables, variables controlling
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Hyperparameters are configuration variables controlling the behavior of machine learning algorithms. They are ubiquitous in machine learning and artificial intelligence and the choice of their values determine the effectiveness of systems based on these technologies. Manual hyperparameter search is often unsatisfactory and becomes unfeasible when the number of hyperparameters is large. Automating the search is an important step towards automating machine learning, freeing researchers and practitioners alike from the burden of finding a good set of hyperparameters by trial and error. In this survey, we present a unified treatment of hyperparameter optimization, providing the reader with examples and insights into the state-of-the-art. We cover the main families of techniques to automate hyperparameter search, often referred to as hyperparameter optimization or tuning, including random and quasi-random search, bandit-, model- and gradient- based approaches. We further discuss extensions, including online, constrained, and multi-objective formulations, touch upon connections with other fields such as meta-learning and neural architecture search, and conclude with open questions and future research directions.
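Among the techniques the survey covers, random search is the simplest to state: sample configurations independently from the search space and keep the best. A minimal sketch on a toy objective (the search space and objective are illustrative assumptions):

```python
import math
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Plain random search: draw each hyperparameter uniformly from its
    range, evaluate, and keep the configuration with the lowest objective."""
    rng = random.Random(seed)
    best_cfg, best_val = None, math.inf
    for _ in range(n_trials):
        cfg = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        val = objective(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Toy validation-loss surrogate with optimum at log_lr = -1, log_reg = -2
# (i.e., learning rate 0.1 and regularization 0.01 on a log10 scale).
def objective(cfg):
    return (cfg["log_lr"] + 1) ** 2 + (cfg["log_reg"] + 2) ** 2

space = {"log_lr": (-4.0, 0.0), "log_reg": (-4.0, 0.0)}
best_cfg, best_val = random_search(objective, space)
print(f"best objective after 50 trials: {best_val:.3f}")
```

Searching log-scaled ranges, as here, is a common practical choice for scale-sensitive hyperparameters such as learning rates.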

[LG-71] Dataset of polarimetric images of mechanically generated water surface waves coupled with surface elevation records by wave gauges linear array

Link: https://arxiv.org/abs/2410.22849
Authors: Noam Ginio(1),Michael Lindenbaum(2 and 3),Barak Fishbain(1),Dan Liberzon(1 and 3) ((1) Faculty of Civil and Environmental Engineering, Technion, Haifa, Israel, (2) Faculty of Computer Science, Technion, Haifa, Israel, (3) Interdisciplinary program for Marine Engineering, Technion, Haifa, Israel)
Keywords-EN: Effective spatio-temporal measurements, Effective spatio-temporal, engineering research, water surface elevation, experiments are essential
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*Comments: 15 pages, 7 figures, 3 tables. Data article. Under review in “Data in Brief” journal. The data is available for download on the ScienceDB repository. arXiv admin note: substantial text overlap with arXiv:2410.14988

Click to view abstract

Abstract:Effective spatio-temporal measurements of water surface elevation (water waves) in laboratory experiments are essential for scientific and engineering research. Existing techniques are often cumbersome, computationally heavy and generally suffer from limited wavenumber/frequency response. To address these challenges a novel method was developed, using polarization filter equipped camera as the main sensor and Machine Learning (ML) algorithms for data processing [1,2]. The developed method training and evaluation was based on in-house made supervised dataset. Here we present this supervised dataset of polarimetric images of the water surface coupled with the water surface elevation measurements made by a linear array of resistance-type wave gauges (WG). The water waves were mechanically generated in a laboratory waves basin, and the polarimetric images were captured under an artificial light source. Meticulous camera and WGs calibration and instruments synchronization supported high spatio-temporal resolution. The data set covers several wavefield conditions, from simple monochromatic wave trains of various steepness, to irregular wavefield of JONSWAP prescribed spectral shape and several wave breaking scenarios. The dataset contains measurements repeated in several camera positions relative to the wave field propagation direction.

[LG-72] Machine Learning Nonadiabatic Dynamics: Eliminating Phase Freedom of Nonadiabatic Couplings with the State-Interaction State-Averaged Spin-Restricted Ensemble-Referenced Kohn-Sham Approach

Link: https://arxiv.org/abs/2410.22801
Authors: Sung Wook Moon,Soohaeng Yoo Willow,Tae Hyeon Park,Seung Kyu Min,Chang Woo Myung
Keywords-EN: Excited-state molecular dynamics, Excited-state molecular, pose significant challenges, conical intersections, pose significant
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Excited-state molecular dynamics (ESMD) simulations near conical intersections (CIs) pose significant challenges when using machine learning potentials (MLPs). Although MLPs have gained recognition for their integration into mixed quantum-classical (MQC) methods, such as trajectory surface hopping (TSH), and their capacity to model correlated electron-nuclear dynamics efficiently, difficulties persist in managing nonadiabatic dynamics. Specifically, singularities at CIs and double-valued coupling elements result in discontinuities that disrupt the smoothness of predictive functions. Partial solutions have been provided by learning diabatic Hamiltonians with phaseless loss functions to these challenges. However, a definitive method for addressing the discontinuities caused by CIs and double-valued coupling elements has yet to be developed. Here, we introduce the phaseless coupling term, \Delta^2 , derived from the square of the off-diagonal elements of the diabatic Hamiltonian in the SSR(2,2) formalism. This approach improves the stability and accuracy of the MLP model by addressing the issues arising from CI singularities and double-valued coupling functions. We apply this method to the penta-2,4-dieniminium cation (PSB3), demonstrating its effectiveness in improving MLP training for ML-based nonadiabatic dynamics. Our results show that the \Delta^2 based ML-ESMD method can reproduce ab initio ESMD simulations, underscoring its potential and efficiency for broader applications, particularly in large-scale and long-timescale ESMD simulations.

[LG-73] Unfolding Target Detection with State Space Model

Link: https://arxiv.org/abs/2410.22774
Authors: Luca Jiang-Tao Yu,Chenshu Wu
Keywords-EN: fundamental task, CFAR, radar sensing, detection, Target detection
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Target detection is a fundamental task in radar sensing, serving as the precursor to any further processing for various applications. Numerous detection algorithms have been proposed. Classical methods based on signal processing, e.g., the most widely used CFAR, are challenging to tune and sensitive to environmental conditions. Deep learning-based methods can be more accurate and robust, yet usually lack interpretability and physical relevance. In this paper, we introduce a novel method that combines signal processing and deep learning by unfolding the CFAR detector with a state space model architecture. By reserving the CFAR pipeline yet turning its sophisticated configurations into trainable parameters, our method achieves high detection performance without manual parameter tuning, while preserving model interpretability. We implement a lightweight model of only 260K parameters and conduct real-world experiments for human target detection using FMCW radars. The results highlight the remarkable performance of the proposed method, outperforming CFAR and its variants by 10X in detection rate and false alarm rate. Our code is open-sourced here: this https URL.
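As background for what the paper unfolds, a classical cell-averaging CFAR detector fits in a few lines of pure Python. The window sizes and threshold scale below are exactly the kind of hand-tuned configurations that the paper instead turns into trainable parameters; the values here are illustrative:

```python
def ca_cfar(signal, num_train=8, num_guard=2, scale=3.0):
    """Cell-averaging CFAR: flag cells whose power exceeds `scale` times
    the mean of the surrounding training cells (guard cells excluded)."""
    n = len(signal)
    half = num_train // 2 + num_guard
    detections = []
    for i in range(half, n - half):
        # Training cells on both sides of the cell under test.
        left = signal[i - half : i - num_guard]
        right = signal[i + num_guard + 1 : i + half + 1]
        noise = sum(left) + sum(right)
        threshold = scale * noise / (len(left) + len(right))
        if signal[i] > threshold:
            detections.append(i)
    return detections

# Flat noise floor with one strong target at index 20.
signal = [1.0] * 40
signal[20] = 10.0
hits = ca_cfar(signal)  # only the target cell exceeds its local threshold
```

The unfolding idea in the paper keeps this pipeline but learns the equivalent of `num_train`, `num_guard`, and `scale` from data inside a state space model.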

[LG-74] An Overview of Causal Inference using Kernel Embeddings

Link: https://arxiv.org/abs/2410.22754
Authors: Dino Sejdinovic
Keywords-EN: representing probability measures, statistical inference problems, Kernel embeddings, probability measures, kernel Hilbert space
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:

Click to view abstract

Abstract:Kernel embeddings have emerged as a powerful tool for representing probability measures in a variety of statistical inference problems. By mapping probability measures into a reproducing kernel Hilbert space (RKHS), kernel embeddings enable flexible representations of complex relationships between variables. They serve as a mechanism for efficiently transferring the representation of a distribution downstream to other tasks, such as hypothesis testing or causal effect estimation. In the context of causal inference, the main challenges include identifying causal associations and estimating the average treatment effect from observational data, where confounding variables may obscure direct cause-and-effect relationships. Kernel embeddings provide a robust nonparametric framework for addressing these challenges. They allow for the representations of distributions of observational data and their seamless transformation into representations of interventional distributions to estimate relevant causal quantities. We overview recent research that leverages the expressiveness of kernel embeddings in tandem with causal inference.
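One concrete computation built on kernel mean embeddings is the Maximum Mean Discrepancy (MMD), the RKHS distance between two embedded distributions. The pure-Python sketch below estimates the (biased) squared MMD between two scalar samples with an RBF kernel; the data and the bandwidth are illustrative:

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy, i.e. the
    RKHS distance between the kernel mean embeddings of two samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

same = [0.0, 0.1, -0.1, 0.05]
shifted = [3.0, 3.1, 2.9, 3.05]
# Samples from well-separated distributions give a large MMD;
# identical samples give (up to floating point) zero.
```

In the causal setting the overview describes, the same embedding machinery represents observational and interventional distributions, so quantities like treatment effects reduce to operations on such embeddings.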

[LG-75] Identifying Drift, Diffusion, and Causal Structure from Temporal Snapshots

Link: https://arxiv.org/abs/2410.22729
Authors: Vincent Guan,Joseph Janssen,Hossein Rahmani,Andrew Warren,Stephen Zhang,Elina Robeva,Geoffrey Schiebinger
Keywords-EN: Stochastic differential equations, modelling dynamic processes, including gene regulatory, gene regulatory networks, Stochastic differential
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments:

Click to view abstract

Abstract:Stochastic differential equations (SDEs) are a fundamental tool for modelling dynamic processes, including gene regulatory networks (GRNs), contaminant transport, financial markets, and image generation. However, learning the underlying SDE from observational data is a challenging task, especially when individual trajectories are not observable. Motivated by burgeoning research in single-cell datasets, we present the first comprehensive approach for jointly estimating the drift and diffusion of an SDE from its temporal marginals. Assuming linear drift and additive diffusion, we prove that these parameters are identifiable from marginals if and only if the initial distribution is not invariant to a class of generalized rotations, a condition that is satisfied by most distributions. We further prove that the causal graph of any SDE with additive diffusion can be recovered from the SDE parameters. To complement this theory, we adapt entropy-regularized optimal transport to handle anisotropic diffusion, and introduce APPEX (Alternating Projection Parameter Estimation from X_0 ), an iterative algorithm designed to estimate the drift, diffusion, and causal graph of an additive noise SDE, solely from temporal marginals. We show that each of these steps are asymptotically optimal with respect to the Kullback-Leibler divergence, and demonstrate APPEX’s effectiveness on simulated data from linear additive noise SDEs.
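The model class considered above (linear drift, additive diffusion) can be illustrated with a scalar Euler-Maruyama simulation. APPEX itself estimates the parameters from temporal marginals of the multivariate version, so the sketch below only shows how such trajectories arise; all parameter values are illustrative:

```python
import math
import random

def simulate_linear_sde(x0, A, sigma, dt=0.01, steps=100, seed=0):
    """Euler-Maruyama simulation of the scalar SDE dX = A*X dt + sigma dW.

    This is the one-dimensional instance of the paper's model class
    (linear drift, additive diffusion); it is a simulation sketch, not
    the APPEX estimation algorithm.
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment
        x = x + A * x * dt + sigma * dw
        path.append(x)
    return path

path = simulate_linear_sde(x0=1.0, A=-1.0, sigma=0.1)
# With A < 0 the drift pulls trajectories toward zero on average.
```

Snapshots of many such paths at a few time points are exactly the "temporal marginals" from which the paper proves drift, diffusion, and the causal graph can be recovered.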

[LG-76] Dynamic PET Image Prediction Using a Network Combining Reversible and Irreversible Modules

Link: https://arxiv.org/abs/2410.22674
Authors: Jie Sun,Qian Xia,Chuanfu Sun,Yumei Chen,Huafeng Liu,Wentao Zhu,Qiegen Liu
Keywords-EN: positron emission tomography, dynamic PET, dynamic PET images, dynamic PET imaging, Dynamic positron emission
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Dynamic positron emission tomography (PET) images can reveal the distribution of tracers in the organism and the dynamic processes involved in biochemical reactions, and it is widely used in clinical practice. Despite the high effectiveness of dynamic PET imaging in studying the kinetics and metabolic processes of radiotracers, prolonged scan times can cause discomfort for both patients and medical personnel. This study proposes a dynamic frame prediction method for dynamic PET imaging, reducing dynamic PET scanning time by applying a multi-module deep learning framework composed of reversible and irreversible modules. The network can predict kinetic parameter images based on the early frames of dynamic PET images, and then generate complete dynamic PET images. In validation experiments with simulated data, our network demonstrated good predictive performance for kinetic parameters and was able to reconstruct high-quality dynamic PET images. Additionally, in clinical data experiments, the network exhibited good generalization performance, indicating that the proposed method has promising clinical application prospects.

[LG-77] SleepNetZero: Zero-Burden Zero-Shot Reliable Sleep Staging With Neural Networks Based on Ballistocardiograms

Link: https://arxiv.org/abs/2410.22646
Authors: Shuzhen Li,Yuxin Chen,Xuesong Chen,Ruiyang Gao,Yupeng Zhang,Chao Yu,Yunfei Li,Ziyi Ye,Weijun Huang,Hongliang Yi,Yue Leng,Yi Wu
Keywords-EN: maintaining good health, plays a crucial, crucial role, role in maintaining, maintaining good
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: 25 pages

Click to view abstract

Abstract:Sleep monitoring plays a crucial role in maintaining good health, with sleep staging serving as an essential metric in the monitoring process. Traditional methods, utilizing medical sensors like EEG and ECG, can be effective but often present challenges such as unnatural user experience, complex deployment, and high costs. Ballistocardiography (BCG), a type of piezoelectric sensor signal, offers a non-invasive, user-friendly, and easily deployable alternative for long-term home monitoring. However, reliable BCG-based sleep staging is challenging due to the limited sleep monitoring data available for BCG. A restricted training dataset prevents the model from generalization across populations. Additionally, transferring to BCG faces difficulty ensuring model robustness when migrating from other data sources. To address these issues, we introduce SleepNetZero, a zero-shot learning based approach for sleep staging. To tackle the generalization challenge, we propose a series of BCG feature extraction methods that align BCG components with corresponding respiratory, cardiac, and movement channels in PSG. This allows models to be trained on large-scale PSG datasets that are diverse in population. For the migration challenge, we employ data augmentation techniques, significantly enhancing generalizability. We conducted extensive training and testing on large datasets (12393 records from 9637 different subjects), achieving an accuracy of 0.803 and a Cohen’s Kappa of 0.718. ZeroSleepNet was also deployed in a real prototype (monitoring pads) and tested in actual hospital settings (265 users), demonstrating an accuracy of 0.697 and a Cohen’s Kappa of 0.589. To the best of our knowledge, this work represents the first known reliable BCG-based sleep staging effort and marks a significant step towards in-home health monitoring.

[LG-78] Feature Responsiveness Scores: Model-Agnostic Explanations for Recourse

Link: https://arxiv.org/abs/2410.22598
Authors: Seung Hyun Cheon,Anneke Wernerfelt,Sorelle A. Friedler,Berk Ustun
Keywords-EN: Machine learning models, Machine learning, automate or support, Machine, support decisions
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 11 pages, 3 figures in main body

Click to view abstract

Abstract:Machine learning models are often used to automate or support decisions in applications such as lending and hiring. In such settings, consumer protection rules mandate that we provide a list of “principal reasons” to consumers who receive adverse decisions. In practice, lenders and employers identify principal reasons by returning the top-scoring features from a feature attribution method. In this work, we study how such practices align with one of the underlying goals of consumer protection - recourse - i.e., educating individuals on how they can attain a desired outcome. We show that standard attribution methods can mislead individuals by highlighting reasons without recourse - i.e., by presenting consumers with features that cannot be changed to achieve recourse. We propose to address these issues by scoring features on the basis of responsiveness - i.e., the probability that an individual can attain a desired outcome by changing a specific feature. We develop efficient methods to compute responsiveness scores for any model and any dataset under complex actionability constraints. We present an extensive empirical study on the responsiveness of explanations in lending and demonstrate how responsiveness scores can be used to construct feature-highlighting explanations that lead to recourse and mitigate harm by flagging instances with fixed predictions.
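A toy version of a responsiveness score can be written directly from the definition: the fraction of feasible changes to one feature that flip an adverse decision. The lender rule, applicant, and feasible value sets below are invented for illustration and are not from the paper:

```python
def responsiveness(predict, x, feature, feasible_values):
    """Fraction of feasible single-feature changes that yield a favorable
    outcome (predict(...) == 1) -- a toy version of a responsiveness score."""
    flips = sum(
        1 for v in feasible_values if predict(dict(x, **{feature: v})) == 1
    )
    return flips / len(feasible_values)

# Invented lender rule: approve if income >= 50 and debt <= 20.
def approve(x):
    return 1 if x["income"] >= 50 and x["debt"] <= 20 else 0

applicant = {"income": 40, "debt": 15}
# Raising income alone can reach approval for this applicant...
r_income = responsiveness(approve, applicant, "income", [45, 55, 65, 75])
# ...but no feasible change to debt helps while income stays at 40.
r_debt = responsiveness(approve, applicant, "debt", [25, 20, 15, 10])
```

Ranking features by such scores, rather than by attribution magnitude, is what lets the explanation highlight reasons the individual can actually act on.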

[LG-79] Orb: A Fast Scalable Neural Network Potential

Link: https://arxiv.org/abs/2410.22570
Authors: Mark Neumann,James Gin,Benjamin Rhodes,Steven Bennett,Zhiyi Li,Hitarth Choubisa,Arthur Hussey,Jonathan Godwin
Keywords-EN: universal interatomic potentials, Matbench Discovery benchmark, atomistic modelling, introduce Orb, Matbench Discovery
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We introduce Orb, a family of universal interatomic potentials for atomistic modelling of materials. Orb models are 3-6 times faster than existing universal potentials, stable under simulation for a range of out of distribution materials and, upon release, represented a 31% reduction in error over other methods on the Matbench Discovery benchmark. We explore several aspects of foundation model development for materials, with a focus on diffusion pretraining. We evaluate Orb as a model for geometry optimization, Monte Carlo and molecular dynamics simulations.

[LG-80] Fast Deep Hedging with Second-Order Optimization

Link: https://arxiv.org/abs/2410.22568
Authors: Konrad Mueller,Amira Akkari,Lukas Gonon,Ben Wood
Keywords-EN: risk management task, important risk management, management task, risk management, Hedging
Subjects: Risk Management (q-fin.RM); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*Comments:

Click to view abstract

Abstract:Hedging exotic options in presence of market frictions is an important risk management task. Deep hedging can solve such hedging problems by training neural network policies in realistic simulated markets. Training these neural networks may be delicate and suffer from slow convergence, particularly for options with long maturities and complex sensitivities to market parameters. To address this, we propose a second-order optimization scheme for deep hedging. We leverage pathwise differentiability to construct a curvature matrix, which we approximate as block-diagonal and Kronecker-factored to efficiently precondition gradients. We evaluate our method on a challenging and practically important problem: hedging a cliquet option on a stock with stochastic volatility by trading in the spot and vanilla options. We find that our second-order scheme can optimize the policy in 1/4 of the number of steps that standard adaptive moment-based optimization takes.

[LG-81] Towards Neural-Network-based optical temperature sensing of Semiconductor Membrane External Cavity Laser

Link: https://arxiv.org/abs/2410.22528
Authors: Jakob Mannstadt,Arash Rahimi-Iman
Keywords-EN: trained few-layer neural, few-layer neural net, neural net model, laser gain medium, machine-learning non-contact method
Subjects: Optics (physics.optics); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*Comments:

Click to view abstract

Abstract:A machine-learning non-contact method to determine the temperature of a laser gain medium via its laser emission with a trained few-layer neural net model is presented. The training of the feed-forward Neural Network (NN) enables the prediction of the device’s properties solely from spectral data, here recorded by visible-/nearinfrared-light compact micro-spectrometers for both a diode pump laser and optically-pumped gain membrane of a semiconductor disk laser. Fiber spectrometers are used for the acquisition of large quantities of labelled intensity data, which can afterwards be used for the prediction process. Such pretrained deep NNs enable a fast, reliable and easy way to infer the temperature of a laser system such as our Membrane External Cavity Laser, at a later monitoring stage without the need of additional optical diagnostics or read-out temperature sensors. With the miniature mobile spectrometer and the remote detection ability, the temperature inference capability can be adapted for various laser diodes using transfer learning methods with pretrained models. Here, mean-square-error values for the temperature inference corresponding to sub-percent accuracy of our sensor scheme are reached, while computational cost can be saved by reducing the network depth at the here displayed cost of accuracy, as appropriate for different application scenarios.

[LG-82] Evaluating utility in synthetic banking microdata applications

Link: https://arxiv.org/abs/2410.22519
Authors: Hugo E. Caceres,Ben Moews
Keywords-EN: collect vast amounts, banks collect vast, banking secrecy laws, central banks collect, fine-grained banking microdata
Subjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*Comments: 28 pages, 4 figures

Click to view abstract

Abstract:Financial regulators such as central banks collect vast amounts of data, but access to the resulting fine-grained banking microdata is severely restricted by banking secrecy laws. Recent developments have resulted in mechanisms that generate faithful synthetic data, but current evaluation frameworks lack a focus on the specific challenges of banking institutions and microdata. We develop a framework that considers the utility and privacy requirements of regulators, and apply this to financial usage indices, term deposit yield curves, and credit card transition matrices. Using the Central Bank of Paraguay’s data, we provide the first implementation of synthetic banking microdata using a central bank’s collected information, with the resulting synthetic datasets for all three domain applications being publicly available and featuring information not yet released in statistical disclosure. We find that applications less susceptible to post-processing information loss, which are based on frequency tables, are particularly suited for this approach, and that marginal-based inference mechanisms outperform generative adversarial network models for these applications. Our results demonstrate that synthetic data generation is a promising privacy-enhancing technology for financial regulators seeking to complement their statistical disclosure, while highlighting the crucial role of evaluating such endeavors in terms of utility and privacy requirements.

[LG-83] Bayesian Counterfactual Prediction Models for HIV Care Retention with Incomplete Outcome and Covariate Information

Link: https://arxiv.org/abs/2410.22481
Authors: Arman Oganisian,Joseph Hogan,Edwin Sang,Allison DeLong,Ben Mosong,Hamish Fraser,Ann Mwangi
Keywords-EN: human immunodeficiency virus, chronic diseases, human immunodeficiency, immunodeficiency virus, managed over time
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Like many chronic diseases, human immunodeficiency virus (HIV) is managed over time at regular clinic visits. At each visit, patient features are assessed, treatments are prescribed, and a subsequent visit is scheduled. There is a need for data-driven methods for both predicting retention and recommending scheduling decisions that optimize retention. Prediction models can be useful for estimating retention rates across a range of scheduling options. However, training such models with electronic health records (EHR) involves several complexities. First, formal causal inference methods are needed to adjust for observed confounding when estimating retention rates under counterfactual scheduling decisions. Second, competing events such as death preclude retention, while censoring events render retention missing. Third, inconsistent monitoring of features such as viral load and CD4 count lead to covariate missingness. This paper presents an all-in-one approach for both predicting HIV retention and optimizing scheduling while accounting for these complexities. We formulate and identify causal retention estimands in terms of potential return-time under a hypothetical scheduling decision. Flexible Bayesian approaches are used to model the observed return-time distribution while accounting for competing and censoring events and form posterior point and uncertainty estimates for these estimands. We address the urgent need for data-driven decision support in HIV care by applying our method to EHR from the Academic Model Providing Access to Healthcare (AMPATH) - a consortium of clinics that treat HIV in Western Kenya.

[LG-84] Explainable convolutional neural network model provides an alternative genome-wide association perspective on mutations in SARS-CoV-2

Link: https://arxiv.org/abs/2410.22452
Authors: Parisa Hatami,Richard Annan,Luis Urias Miranda,Jane Gorman,Mengjun Xie,Letu Qingge,Hong Qin
Keywords-EN: Identifying mutations, critical for pandemic, Shapley Additive explanations, Identifying, applied Shapley Additive
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Identifying mutations of SARS-CoV-2 strains associated with their phenotypic changes is critical for pandemic prediction and prevention. We compared an explainable convolutional neural network (CNN) and the traditional genome-wide association study (GWAS) on the mutations associated with WHO labels of SARS-CoV-2, a proxy for virulence phenotypes. We trained a CNN classification model that can predict genomic sequences into Variants of Concern (VOCs), and then applied Shapley Additive explanations (SHAP) model to identify mutations that are important for the correct predictions. For comparison, we performed traditional GWAS to identify mutations associated with VOCs. Comparison of the two approaches shows that the explainable neural network approach can more effectively reveal known nucleotide substitutions associated with VOCs, such as those in the spike gene regions. Our results suggest that explainable neural networks for genomic sequences offer a promising alternative to the traditional genome wide analysis approaches.

[LG-85] ET-Flow: Equivariant Flow-Matching for Molecular Conformer Generation (NeurIPS 2024)

Link: https://arxiv.org/abs/2410.22388
Authors: Majdi Hassan,Nikhil Shenoy,Jungyoon Lee,Hannes Stark,Stephan Thaler,Dominique Beaini
Keywords-EN: Predicting low-energy molecular, computational drug discovery, low-energy molecular conformations, Predicting low-energy, drug discovery
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments: NeurIPS 2024

Click to view abstract

Abstract:Predicting low-energy molecular conformations given a molecular graph is an important but challenging task in computational drug discovery. Existing state-of-the-art approaches either resort to large scale transformer-based models that diffuse over conformer fields, or use computationally expensive methods to generate initial structures and diffuse over torsion angles. In this work, we introduce Equivariant Transformer Flow (ET-Flow). We showcase that a well-designed flow matching approach with equivariance and harmonic prior alleviates the need for complex internal geometry calculations and large architectures, contrary to the prevailing methods in the field. Our approach results in a straightforward and scalable method that directly operates on all-atom coordinates with minimal assumptions. With the advantages of equivariance and flow matching, ET-Flow significantly increases the precision and physical validity of the generated conformers, while being a lighter model and faster at inference. Code is available this https URL.
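The flow-matching ingredient can be sketched generically: sample a point on a linear interpolation path between a prior sample and a target structure, and regress the constant velocity of that path. This is not ET-Flow's exact parameterization (which adds equivariance and a harmonic prior); all numbers below are illustrative:

```python
import random

def flow_matching_pair(x0, x1, t):
    """For the linear-interpolation probability path used in basic flow
    matching, return the point on the path at time t and the target
    velocity the network should regress (x1 - x0 for this path)."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return xt, v_target

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]  # sample from the prior
atom = [0.5, -1.0, 2.0]                          # a target 3-D coordinate
t = rng.random()
xt, v = flow_matching_pair(noise, atom, t)
# A network v_theta(xt, t) would be trained with loss ||v_theta(xt, t) - v||^2;
# sampling then integrates the learned velocity field from t=0 to t=1.
```

Because the regression target is a simple straight-line velocity, training avoids the expensive internal-geometry computations that the abstract contrasts with.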

[LG-86] Representation Learning for Regime detection in Block Hierarchical Financial Markets

Link: https://arxiv.org/abs/2410.22346
Authors: Alexa Orton,Tim Gebbie
Keywords-EN: causal information geometry, information geometry underpinning, geometry underpinning traded, underpinning traded asset, traded asset systems
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
*Comments: 6 pages. Presented at the 2024 IEEE CIFEr conference. Analysis of block-resampled chronology-preserving lead-lag learning dynamics presented at: this https URL

Click to view abstract

Abstract:We consider financial market regime detection from the perspective of deep representation learning of the causal information geometry underpinning traded asset systems using a hierarchical correlation structure to characterise market evolution. We assess the robustness of three toy models: SPDNet, SPD-NetBN and U-SPDNet whose architectures respect the underlying Riemannian manifold of input block hierarchical SPD correlation matrices. Market phase detection for each model is carried out using three data configurations: randomised JSE Top 60 data, synthetically-generated block hierarchical SPD matrices and block-resampled chronology-preserving JSE Top 60 data. We show that using a singular performance metric is misleading in our financial market investment use cases where deep learning models overfit in learning spatio-temporal correlation dynamics.

Information Retrieval

[IR-0] Real-Time Personalization for LLM-based Recommendation with Customized In-Context Learning

Link: https://arxiv.org/abs/2410.23136
Authors: Keqin Bao,Ming Yan,Yang Zhang,Jizhi Zhang,Wenjie Wang,Fuli Feng,Xiangnan He
Keywords-EN: Frequently updating Large, updating Large Language, Large Language Model, Large Language, Frequently updating
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Frequently updating Large Language Model (LLM)-based recommender systems to adapt to new user interests – as done for traditional ones – is impractical due to high training costs, even with acceleration methods. This work explores adapting to dynamic user interests without any model updates by leveraging In-Context Learning (ICL), which allows LLMs to learn new tasks from few-shot examples provided in the input. Using new-interest examples as the ICL few-shot examples, LLMs may learn real-time interest directly, avoiding the need for model updates. However, existing LLM-based recommenders often lose the in-context learning ability during recommendation tuning, while the original LLM’s in-context learning lacks recommendation-specific focus. To address this, we propose RecICL, which customizes recommendation-specific in-context learning for real-time recommendations. RecICL organizes training examples in an in-context learning format, ensuring that in-context learning ability is preserved and aligned with the recommendation task during tuning. Extensive experiments demonstrate RecICL’s effectiveness in delivering real-time recommendations without requiring model updates. Our code is available at this https URL.
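The in-context learning format that RecICL builds on can be illustrated with a toy prompt constructor; the template and item names below are invented for illustration, not the paper's actual format:

```python
def build_icl_prompt(examples, target_history):
    """Assemble a few-shot in-context learning prompt for recommendation.

    Each example is a (recent_items, next_item) pair demonstrating the
    task; the final lines present the target user's recent interactions
    and leave the recommendation slot open for the LLM to fill.
    """
    lines = []
    for history, next_item in examples:
        lines.append(f"User recently interacted with: {', '.join(history)}.")
        lines.append(f"Next recommendation: {next_item}")
    lines.append(f"User recently interacted with: {', '.join(target_history)}.")
    lines.append("Next recommendation:")
    return "\n".join(lines)

prompt = build_icl_prompt(
    examples=[(["Dune", "Foundation"], "Hyperion"),
              (["The Hobbit", "Mistborn"], "The Name of the Wind")],
    target_history=["Neuromancer", "Snow Crash"],
)
```

Because the few-shot examples can be drawn from a user's most recent interactions at inference time, new interests reach the model through the prompt alone, with no parameter updates.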

[IR-1] A Universal Sets-level Optimization Framework for Next Set Recommendation (CIKM 2024)

Link: https://arxiv.org/abs/2410.23023
Authors: Yuli Liu,Min Liu,Christian Walder,Lexing Xie
Keywords-EN: encompassing related tasks, trending research topic, temporal sets prediction, Set Recommendation, encompassing related
Subjects: Information Retrieval (cs.IR)
*Comments: Accepted at CIKM 2024

Click to view abstract

Abstract:Next Set Recommendation (NSRec), encompassing related tasks such as next basket recommendation and temporal sets prediction, stands as a trending research topic. Although numerous attempts have been made on this topic, there are certain drawbacks: (i) Existing studies are still confined to utilizing objective functions commonly found in Next Item Recommendation (NIRec), such as binary cross entropy and BPR, which are calculated based on individual item comparisons; (ii) They place emphasis on building sophisticated learning models to capture intricate dependency relationships across sequential sets, but frequently overlook pivotal dependency in their objective functions; (iii) The diversity factor within sequential sets is frequently overlooked. In this research, we endeavor to unveil a universal and Sets-level optimization framework for Next Set Recommendation (SNSRec), offering a holistic fusion of diversity distribution and intricate dependency relationships within temporal sets. To realize this, the following contributions are made: (i) We directly model the temporal set in a sequence as a cohesive entity, leveraging the Structured Determinantal Point Process (SDPP), wherein the probabilistic DPP distribution prioritizes collections of structures (sequential sets) instead of individual items; (ii) We introduce a co-occurrence representation to discern and acknowledge the importance of different sets; (iii) We propose a sets-level optimization criterion, which integrates the diversity distribution and dependency relations across the entire sequence of sets, guiding the model to recommend relevant and diversified sets. Extensive experiments on real-world datasets show that our approach consistently outperforms previous methods on both relevance and diversity.

[IR-2] DataRec: A Framework for Standardizing Recommendation Data Processing and Analysis

Link: https://arxiv.org/abs/2410.22972
Authors: Alberto Carlo Maria Mancino,Salvatore Bufi,Angela Di Fazio,Daniele Malitesta,Claudio Pomo,Antonio Ferrara,Tommaso Di Noia
Keywords-EN: machine learning applications, great interest posed, researchers and companies, learning applications, great interest
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Thanks to the great interest posed by researchers and companies, recommendation systems became a cornerstone of machine learning applications. However, concerns have arisen recently about the need for reproducibility, making it challenging to identify suitable pipelines. Several frameworks have been proposed to improve reproducibility, covering the entire process from data reading to performance evaluation. Despite this effort, these solutions often overlook the role of data management, do not promote interoperability, and neglect data analysis despite its well-known impact on recommender performance. To address these gaps, we propose DataRec, which facilitates using and manipulating recommendation datasets. DataRec supports reading and writing in various formats, offers filtering and splitting techniques, and enables data distribution analysis using well-known metrics. It encourages a unified approach to data manipulation by allowing data export in formats compatible with several recommendation frameworks.

[IR-3] Understanding and Improving Adversarial Collaborative Filtering for Robust Recommendation

Link: https://arxiv.org/abs/2410.22844
Authors: Kaike Zhang,Qi Cao,Yunfan Wu,Fei Sun,Huawei Shen,Xueqi Cheng
Keywords-EN: Adversarial Collaborative Filtering, Collaborative Filtering, typically applies adversarial, ACF, recommender systems
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Adversarial Collaborative Filtering (ACF), which typically applies adversarial perturbations at user and item embeddings through adversarial training, is widely recognized as an effective strategy for enhancing the robustness of Collaborative Filtering (CF) recommender systems against poisoning attacks. Besides, numerous studies have empirically shown that ACF can also improve recommendation performance compared to traditional CF. Despite these empirical successes, the theoretical understanding of ACF’s effectiveness in terms of both performance and robustness remains unclear. To bridge this gap, in this paper, we first theoretically show that ACF can achieve a lower recommendation error compared to traditional CF with the same training epochs in both clean and poisoned data contexts. Furthermore, by establishing bounds for reductions in recommendation error during ACF’s optimization process, we find that applying personalized magnitudes of perturbation for different users based on their embedding scales can further improve ACF’s effectiveness. Building on these theoretical understandings, we propose Personalized Magnitude Adversarial Collaborative Filtering (PamaCF). Extensive experiments demonstrate that PamaCF effectively defends against various types of poisoning attacks while significantly enhancing recommendation performance.
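The basic ingredient of ACF can be sketched as an FGSM-style perturbation of a user embedding against a dot-product score. This is a generic illustration of adversarial training on embeddings, not PamaCF itself; PamaCF's specific idea, scaling the perturbation magnitude per user by embedding scale, is only noted in a comment:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def adversarial_perturb(user_emb, item_emb, label, epsilon):
    """FGSM-style perturbation of a user embedding against a dot-product
    score. The gradient of the squared error (score - label)^2 w.r.t. the
    user embedding is 2*(score - label)*item_emb; we step epsilon in its
    sign direction to increase the loss."""
    score = dot(user_emb, item_emb)
    grad = [2 * (score - label) * w for w in item_emb]
    sign = [1.0 if g > 0 else -1.0 if g < 0 else 0.0 for g in grad]
    return [u + epsilon * s for u, s in zip(user_emb, sign)]

user = [0.5, -0.2]
item = [0.3, 0.8]
perturbed = adversarial_perturb(user, item, label=1.0, epsilon=0.1)
# Training then minimizes the loss at the perturbed embedding, hardening
# the model; PamaCF would additionally choose epsilon per user based on
# that user's embedding magnitude.
```

The perturbed embedding scores the positive item worse than the clean one does, which is exactly the worst-case direction adversarial training defends against.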

Attachments

Click to download today's full paper list