Today's arXiv Papers | 2024-12-25

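This post lists the latest papers retrieved from Arxiv.org on 2024-12-25. It is updated automatically and groups papers into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.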

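[Quick read]: This paper addresses the challenges of long-form speech generation, in particular generating coherent speech over multiple minutes. Current spoken language models struggle to generate speech beyond tens of seconds because the high temporal resolution of speech tokens causes loss of coherence, because of architectural issues with long-sequence training or extrapolation, and because of high memory costs at inference time. To address these problems, the paper proposes SpeechSSM, the first speech language model based on linear-time sequence modeling that learns from and generates long-form speech (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session, without text intermediates. The paper also proposes new embedding-based and LLM-judged evaluation metrics, quality measurements over length and time, and LibriSpeech-Long, a new benchmark for long-form speech processing and generation, to meet the challenges of long-form speech evaluation.

Link: https://arxiv.org/abs/2412.18603
Authors: Se Jin Park,Julian Salazar,Aren Jansen,Keisuke Kinoshita,Yong Man Ro,RJ Skerry-Ryan
Affiliations: Unknown
Keywords: audio-native voice assistants, voice assistants, audio-native voice, long-form multimedia generation, speech
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: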

Abstract:We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, current spoken language models struggle to generate plausible speech past tens of seconds, from high temporal resolution of speech tokens causing loss of coherence, to architectural issues with long-sequence training or extrapolation, to memory costs at inference time. With these considerations we propose SpeechSSM, the first speech language model to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates, based on recent advances in linear-time sequence modeling. Furthermore, to address growing challenges in spoken language evaluation, especially in this new long-form setting, we propose: new embedding-based and LLM-judged metrics; quality measurements over length and time; and a new benchmark for long-form speech processing and generation, LibriSpeech-Long. Speech samples and the dataset are released at this https URL

[NLP-1] Exploring Embedding Priors in Prompt-Tuning for Improved Interpretability and Control

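[Quick read]: This paper investigates how the embedding collapse phenomenon frequently observed in Prompt-Tuning affects the final performance of the model. To answer this question, the authors designed embedding priors and compared them with the posteriors of converged Soft and Deep Prompt-Tuning methods. The results show that priors strongly affect the position of the tuned embeddings, and that models can work effectively with embeddings from different regions of the activation space, including completely new regions. Because the capabilities of Prompt-Tuning are limited, the authors hypothesize that controllable Prompt-Tuning posteriors may serve as a good starting point for tasks such as chain-of-thought (COT) distillation. The experiments also show that generated trajectories are not localized in the model's activation space, but that activations for distant tasks (e.g., NLP and arithmetic) form distinct clusters, while activations for NLP tasks (e.g., question answering and masked language modeling (MLM)) lie in the same cluster. These findings raise questions about the importance of a single activation cluster for the generalization abilities of large language models.

Link: https://arxiv.org/abs/2412.18582
Authors: Sergey Sedov,Sumanth Bharadwaj Hachalli Karanam,Venu Gopal Kadamba
Affiliations: Unknown
Keywords: minimal computational overhead, modifying prompt embeddings, adapting pre-trained language, adapting pre-trained, minimal computational
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: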

Abstract:Prompt-Tuning is an efficient method for adapting pre-trained language models to new tasks with minimal computational overhead by modifying prompt embeddings. In this work, we investigate how crucial the phenomenon of embedding collapse, frequently observed in Prompt-Tuning, is for the final performance of the model. To address this question, we designed embedding priors and compared them with posteriors of the converged Soft and Deep Prompt-Tuning methods. Our findings suggest that priors strongly affect the position of the tuned embeddings, and models can effectively work with embeddings from different parts of activation spaces, including completely new regions. As the final Prompt-Tuning capabilities are limited, we hypothesize that controllable Prompt-Tuning posteriors may serve as a good starting point for tasks such as chain-of-thought (COT) distillation. Our experiments also show that generated trajectories are not localized in the activation space of the models. However, there are distinct clusters of activations for distant tasks (e.g., NLP and arithmetic), while activations between NLP tasks (e.g., Question-Answering and MLM) lie in the same cluster. These observations raise questions about the importance of a single activation cluster for the generalization abilities of large language models.

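[NLP-2] How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation

[Quick read]: This paper addresses the fact that current code generation benchmarks focus mainly on general-purpose scenarios and do not evaluate code generation performance in specific application domains. The authors propose a new benchmark, MultiCodeBench, whose key elements are: 1) 2,400 programming tasks covering 12 popular software development domains and 15 programming languages; 2) in-depth research to identify these application domains, with the commonly used technical frameworks and platforms in each domain categorized; 3) programming problems sampled from GitHub repositories related to these subdomains, with annotators invited to rewrite the task docstrings to ensure task quality and avoid data leakage; 4) a static analysis-based dependency parsing tool that extracts the dependencies of each task to support deeper performance analysis. With these methods, MultiCodeBench evaluates the code generation performance of mainstream large language models (LLMs) across application domains, giving developers practical insights and giving model developers guidance for improving domain-specific code generation.

Link: https://arxiv.org/abs/2412.18573
Authors: Dewu Zheng,Yanlin Wang,Ensheng Shi,Hongyu Zhang,Zibin Zheng
Affiliations: Sun Yat-sen University; Xi’an Jiaotong University; Chongqing University
Keywords: AI-driven programming assistants, programming assistants powered, significantly boosting developer, boosting developer productivity, significantly boosting
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: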

Abstract:Recently, an increasing number of AI-driven programming assistants powered by code LLMs have been integrated into various real-world software development environments, significantly boosting developer productivity. However, existing code generation benchmarks primarily focus on general-purpose scenarios, leaving the code generation performance of LLMs for specific application domains largely unknown. In this paper, we introduce a new benchmark, MultiCodeBench, to fill this gap. MultiCodeBench comprises 2,400 programming tasks, covering 12 popular software development domains and 15 programming languages. Specifically, we perform in-depth research to identify these 12 application domains. Given that each domain may involve multiple technical frameworks, and that different frameworks present distinct challenges in the coding process, we categorize the commonly used frameworks and platforms within each domain. We then sample programming problems from GitHub repositories related to these subdomains. To ensure the quality of the tasks and mitigate data leakage issues, we invite annotators to rewrite the docstrings for each task in MultiCodeBench. Additionally, we build a static analysis-based dependency parsing tool to extract the dependencies in the ground truth for each task, enabling deeper performance analysis. Through extensive experiments on MultiCodeBench with eleven representative mainstream LLMs, we reveal the code generation performance of the LLMs across different application domains, providing practical insights for developers in downstream fields when selecting LLMs. Furthermore, we analyze the reasons behind the models’ failures in completing software application development tasks, offering guidance for model developers to enhance domain-specific code generation capabilities.
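
The abstract mentions a static analysis-based dependency parsing tool but does not describe its implementation. As a rough illustration of the general idea only, the sketch below uses Python's standard `ast` module to list the top-level modules a snippet imports. The function name `extract_imports` and the sample source are hypothetical and are not taken from MultiCodeBench, which covers 15 languages and likely tracks far more than imports.

```python
import ast


def extract_imports(source):
    """Return the set of top-level module names imported by Python source text.

    Illustrative helper only; not the MultiCodeBench tool.
    """
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import numpy.linalg as la" -> "numpy"
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports such as "from . import utils".
            if node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules


sample = (
    "import os\n"
    "import numpy.linalg as la\n"
    "from torch.nn import Linear\n"
    "from . import utils\n"
)
deps = extract_imports(sample)
print(sorted(deps))
```

Because the source is parsed rather than executed, the dependencies can be read without installing or running the task code, which is the main appeal of a static approach.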

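[NLP-3] Zero-resource Speech Translation and Recognition with LLMs ICASSP2025

[Quick read]: This paper addresses the challenges of zero-resource speech translation (ST) and automatic speech recognition (ASR), in particular for languages in which the model has never seen paired audio-text data. The key to the solution is to leverage a multilingual large language model (LLM): a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps audio representations to the token embedding space of the LLM. The adaptation module enables the model to perform ST and ASR in unseen languages. Experiments show that the model achieves BLEU scores above 23 on CoVoST2 for ST in two unseen languages, and word error rates (WER) of up to 28.2% for ASR. Ultimately, the performance of the system is bounded by the ability of the LLM to output text in the target language.

Link: https://arxiv.org/abs/2412.18566
Authors: Karel Mundnich,Xing Niu,Prashant Mathur,Srikanth Ronanki,Brady Houston,Veera Raghavendra Elluru,Nilaksh Das,Zejiang Hou,Goeric Huybrechts,Anshu Bhatia,Daniel Garcia-Romero,Kyu J. Han,Katrin Kirchhoff
Affiliations: AWS AI Labs
Keywords: remain challenging problems, zero-resource speech translation, automatic speech recognition, remain challenging, challenging problems
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: ICASSP 2025, 5 pages, 2 figures, 2 tables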

Abstract:Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments both in ST and ASR to understand how to best train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model is capable to achieve BLEU scores over 23 in CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. We finally show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
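
For readers unfamiliar with the ASR metric quoted above, word error rate is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a minimal version under that standard definition; the function name and the example sentences are illustrative, and the paper's exact text normalization is not stated in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))
```

Lower is better, and because insertions count as errors the value can exceed 1.0 for very long hypotheses.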

[NLP-4] Distilling Fine-grained Sentiment Understanding from Large Language Models

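[Quick read]: This paper addresses the high inference cost of directly deploying large language models (LLMs) for fine-grained sentiment analysis (FSA). It proposes distilling the fine-grained sentiment understanding of LLMs into small language models (SLMs). Specifically, the authors prompt LLMs to examine and interpret the sentiments of given reviews, then use the generated content to pretrain SLMs. They also develop a comprehensive FSA benchmark to evaluate both SLMs and LLMs. Experiments show that distillation significantly improves SLM performance on FSA tasks, with a 6.00% improvement in F1-score, and that the distilled model can outperform Llama-2-7b with only 220M parameters. Distillation also gives SLMs excellent zero-shot sentiment classification capabilities, enabling them to match or even exceed their teacher models. These results suggest that distillation from LLMs is a highly promising direction for FSA.

Link: https://arxiv.org/abs/2412.18552
Authors: Yice Zhang,Guangyu Xie,Hongling Xu,Kaiheng Hou,Jianzhu Bao,Qianlong Wang,Shiwei Chen,Ruifeng Xu
Affiliations: Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
Keywords: vast opinionated text, summarize user opinions, Fine-grained sentiment analysis, aims to extract, opinionated text
Subjects: Computation and Language (cs.CL)
Comments: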

Abstract:Fine-grained sentiment analysis (FSA) aims to extract and summarize user opinions from vast opinionated text. Recent studies demonstrate that large language models (LLMs) possess exceptional sentiment understanding capabilities. However, directly deploying LLMs for FSA applications incurs high inference costs. Therefore, this paper investigates the distillation of fine-grained sentiment understanding from LLMs into small language models (SLMs). We prompt LLMs to examine and interpret the sentiments of given reviews and then utilize the generated content to pretrain SLMs. Additionally, we develop a comprehensive FSA benchmark to evaluate both SLMs and LLMs. Extensive experiments on this benchmark reveal that: (1) distillation significantly enhances the performance of SLMs in FSA tasks, achieving a 6.00% improvement in F1-score, and the distilled model can outperform Llama-2-7b with only 220M parameters; (2) distillation equips SLMs with excellent zero-shot sentiment classification capabilities, enabling them to match or even exceed their teacher models. These results suggest that distillation from LLMs is a highly promising direction for FSA. We will release our code, data, and pretrained model weights at this https URL.

[NLP-5] Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability

[Quick read]: This paper addresses the imbalance between capability and safety in the evaluation of large language models (LLMs). Traditional approaches typically average performance and safety metrics, which can let a model excel in one dimension while falling short in others. The paper proposes the Libra-Leaderboard framework, which combines a dynamic leaderboard with an interactive LLM arena to encourage joint optimization of capability and safety. Its key innovation is a distance-to-optimal-score method for computing the overall ranking, which incentivizes models to balance performance and safety rather than excel in a single dimension. In its first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations and shows that even state-of-the-art models still face critical safety challenges.

Link: https://arxiv.org/abs/2412.18551
Authors: Haonan Li,Xudong Han,Zenan Zhai,Honglin Mu,Hao Wang,Zhenxuan Zhang,Yilin Geng,Shom Lin,Renxi Wang,Artem Shelmanov,Xiangyu Qi,Yuxia Wang,Donghai Hong,Youliang Yuan,Meng Chen,Haoqin Tu,Fajri Koto,Tatsuki Kuribayashi,Cong Zeng,Rishabh Bhardwaj,Bingchen Zhao,Yawen Duan,Yi Liu,Emad A. Alghamdi,Yaodong Yang,Yinpeng Dong,Soujanya Poria,Pengfei Liu,Zhengzhong Liu,Xuguang Ren,Eduard Hovy,Iryna Gurevych,Preslav Nakov,Monojit Choudhury,Timothy Baldwin
Affiliations: Librai; MBZUAI; Oracle; The University of Melbourne; Tsinghua University; Princeton University; Peking University; The Chinese University of Hong Kong (Shenzhen); Beijing University of Posts and Telecommunications; UCSC; SUTD; University of Edinburgh; Concordia AI; Nanyang Technological University; King Abdulaziz University; Shanghai Jiaotong University
Keywords: comprehensive framework designed, address this gap, comprehensive framework, framework designed, designed to rank
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of some other ones. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
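
The abstract does not give the exact formula behind the distance-to-optimal-score method, so the sketch below is an assumption-laden illustration of why such a score rewards balance. It assumes capability and safety are both normalized to [0, 1], that the optimal point is (1, 1), and that the score is one minus the Euclidean distance to that point divided by sqrt(2). The function names and the two example models are hypothetical.

```python
import math


def average_score(capability, safety):
    """Plain average, which cannot tell balanced and lopsided models apart."""
    return (capability + safety) / 2


def distance_to_optimal_score(capability, safety):
    """Assumed form: 1 - (Euclidean distance to the ideal point (1, 1)) / sqrt(2)."""
    distance = math.hypot(1.0 - capability, 1.0 - safety)
    return 1.0 - distance / math.sqrt(2.0)


balanced = (0.75, 0.75)  # hypothetical model A: equally capable and safe
lopsided = (1.0, 0.5)    # hypothetical model B: very capable, much less safe

print(average_score(*balanced), average_score(*lopsided))
print(distance_to_optimal_score(*balanced), distance_to_optimal_score(*lopsided))
```

Both models have the same average, but the distance-based score ranks the balanced model higher, because a large shortfall in one dimension is penalized more than two small shortfalls.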

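[NLP-6] Token-Budget-Aware LLM Reasoning

[Quick read]: This paper addresses the high token overhead that large language models (LLMs) incur when reasoning with Chain-of-Thought (CoT). Although CoT improves LLM performance by decomposing problems into intermediate steps, the reasoning process is often unnecessarily lengthy, which increases token usage and therefore cost. The paper proposes a token-budget-aware LLM reasoning framework that dynamically estimates a token budget for each problem based on its reasoning complexity and uses the estimated budget to guide the reasoning process. Experiments show that the method reduces token costs in CoT reasoning with only a slight performance reduction, offering a practical way to balance efficiency and accuracy in LLM reasoning. The key is to compress the reasoning process by setting a reasonable token budget, lowering cost while largely preserving performance.

Link: https://arxiv.org/abs/2412.18547
Authors: Tingxu Han,Chunrong Fang,Shiyu Zhao,Shiqing Ma,Zhenyu Chen,Zhenting Wang
Affiliations: Nanjing University; Rutgers University; UMass Amherst
Keywords: large language models, language models, range of tasks, critical for large, large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: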

Abstract:Reasoning is critical for large language models (LLMs) to excel in a wide range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM performance by decomposing problems into intermediate steps, they also incur significant overhead in token usage, leading to increased costs. We find that the reasoning process of current LLMs is unnecessarily lengthy and it can be compressed by including a reasonable token budget in the prompt, but the choice of token budget plays a crucial role in the actual compression effectiveness. We then propose a token-budget-aware LLM reasoning framework, which dynamically estimates token budgets for different problems based on reasoning complexity and uses the estimated token budgets to guide the reasoning process. Experiments show that our method effectively reduces token costs in CoT reasoning with only a slight performance reduction, offering a practical solution to balance efficiency and accuracy in LLM reasoning. Code: this https URL.
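
The abstract says the reasoning can be compressed by including a token budget in the prompt. A minimal sketch of that idea follows; the prompt wording, the function name `build_budget_prompt`, and the example budget are assumptions for illustration and may differ from the authors' released code. The budget estimation step, which the paper treats as crucial, is not modeled here.

```python
def build_budget_prompt(question, token_budget):
    """Append a token-budget instruction to a chain-of-thought style prompt.

    The wording below is hypothetical, not the paper's exact prompt.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    return (
        f"{question}\n"
        f"Let's think step by step and use fewer than {token_budget} tokens."
    )


prompt = build_budget_prompt("What is 17 * 24?", 50)
print(prompt)
```

In a full pipeline the budget would come from an estimator that scores each question's reasoning complexity, since the abstract notes that a poorly chosen budget limits the compression achieved.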

[NLP-7] Consistency Checks for Language Model Forecasters ICLR2025

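[Quick read]: This paper addresses how to benchmark and evaluate LLM forecasters instantaneously. Because the ground truth of a forecasting task can only be known in the future, conventional evaluation cannot verify forecast accuracy immediately. Following the consistency check framework, the paper measures forecaster performance by the consistency of predictions on different logically related questions. The key contribution is a new, general consistency metric based on arbitrage: for example, if a forecaster illogically predicts that both the Democratic and Republican parties have a 60% probability of winning the 2024 US presidential election, an arbitrageur can trade against those predictions and make a profit. The paper also builds an automated evaluation system that generates a set of base questions, instantiates consistency checks from them, elicits the forecaster's predictions, and measures their consistency. In addition, it builds a standard, proper-scoring-rule forecasting benchmark and shows that the instantaneous consistency metrics correlate with Brier scores that are only known in the future. Finally, it releases a consistency benchmark that resolves in 2028, providing a long-term evaluation tool for forecasting.

Link: https://arxiv.org/abs/2412.18544
Authors: Daniel Paleka,Abhimanyu Pallavi Sudhir,Alejandro Alvarez,Vineeth Bhat,Adam Shen,Evan Wang,Florian Tramèr
Affiliations: Cranberry-Lemon University; University of the Witwatersrand
Keywords: consistency, showing LLM forecasters, LLM forecasters rapidly, work showing LLM, Forecasting
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 56 pages, 25 figures. Submitted to ICLR 2025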

Abstract:Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance begs the question: how can we benchmark and evaluate these forecasters instantaneously? Following the consistency check framework, we measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We propose a new, general consistency metric based on arbitrage: for example, if a forecasting AI illogically predicts that both the Democratic and Republican parties have 60% probability of winning the 2024 US presidential election, an arbitrageur can trade against the forecaster’s predictions and make a profit. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We then build a standard, proper-scoring-rule forecasting benchmark, and show that our (instantaneous) consistency metrics correlate with LLM forecasters’ ground truth Brier scores (which are only known in the future). We also release a consistency benchmark that resolves in 2028, providing a long-term evaluation tool for forecasting.
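
The 60%/60% election example can be worked through numerically. The sketch below is a simplified illustration, not the paper's metric: it assumes the outcomes are mutually exclusive and exhaustive, that the forecaster will buy or sell a contract paying 1 unit on each outcome at its stated probability, and it ignores fees. The function name is hypothetical.

```python
def guaranteed_profit(prices):
    """Risk-free profit per round of trades against forecasts on
    mutually exclusive, exhaustive outcomes (simplified illustration)."""
    total = sum(prices)
    if total > 1.0:
        # Sell one contract on every outcome: collect `total` up front,
        # then pay out exactly 1 on the single outcome that occurs.
        return total - 1.0
    # Buy one contract on every outcome: pay `total` up front,
    # then receive exactly 1 on the single outcome that occurs.
    return 1.0 - total


# Forecaster gives both parties a 60% chance of winning.
profit = guaranteed_profit([0.6, 0.6])
print(round(profit, 2))
```

A coherent forecaster whose probabilities sum to 1 leaves no such profit, so the size of the guaranteed profit serves as a measure of inconsistency that can be computed before the question resolves.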

[NLP-8] Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation AAAI’2025

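[Quick read]: This paper addresses hallucination and outdated knowledge in large language models (LLMs) on complex knowledge reasoning tasks, which lead to factually incorrect outputs. It proposes the Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. The framework retrieves knowledge including entities, relations, and subgraphs from large-scale knowledge graphs (KGs) and converts it into prompt embeddings to strengthen LLM reasoning. Amar has two core sub-components: 1) a self-alignment module that aligns commonalities among entities, relations, and subgraphs to reduce noise interference; 2) a relevance gating module that uses a soft gate to learn the relevance score between the question and the multi-aspect retrieved data, determining which information should be used to enhance the LLM's output or filtered out altogether. Experiments show that Amar achieves state-of-the-art performance on WebQSP and CWQ, with a 1.9% improvement in accuracy over its best competitor and a 6.6% improvement in logical form generation over a method that directly uses retrieved text as context prompts, confirming its effectiveness in improving LLM reasoning.

Link: https://arxiv.org/abs/2412.18537
Authors: Derong Xu,Xinhang Li,Ziheng Zhang,Zhenxi Lin,Zhihong Zhu,Zhi Zheng,Xian Wu,Xiangyu Zhao,Tong Xu,Enhong Chen
Affiliations: Unknown
Keywords: Large Language Models, Large Language, Language Models, factually incorrect outputs, demonstrate remarkable capabilities
Subjects: Computation and Language (cs.CL)
Comments: Accepted by AAAI’2025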

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities, yet struggle with hallucination and outdated knowledge when tasked with complex knowledge reasoning, resulting in factually incorrect outputs. Previous studies have attempted to mitigate it by retrieving factual knowledge from large-scale knowledge graphs (KGs) to assist LLMs in logical reasoning and prediction of answers. However, this kind of approach often introduces noise and irrelevant data, especially in situations with extensive context from multiple knowledge aspects. In this way, LLM attention can be potentially misled from question and relevant information. In our study, we introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. The Amar framework comprises two key sub-components: 1) a self-alignment module that aligns commonalities among entities, relations, and subgraphs to enhance retrieved text, thereby reducing noise interference; 2) a relevance gating module that employs a soft gate to learn the relevance score between question and multi-aspect retrieved data, to determine which information should be used to enhance LLMs’ output, or even filtered altogether. Our method has achieved state-of-the-art performance on two common datasets, WebQSP and CWQ, showing a 1.9% improvement in accuracy over its best competitor and a 6.6% improvement in logical form generation over a method that directly uses retrieved text as context prompts. These results demonstrate the effectiveness of Amar in improving the reasoning of LLMs.
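
To make the "soft gate" idea concrete, the sketch below scores each retrieved embedding against the question with a sigmoid and scales the embedding by that score, so irrelevant items are damped rather than hard-dropped. This is a generic gating illustration under assumed details: the concatenation features, the hand-picked weights, and the function names are hypothetical, and Amar's actual module is learned and is not specified in the abstract.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_retrieved(question, retrieved, weights, bias=0.0):
    """Scale each retrieved vector by a relevance gate in (0, 1).

    question: list of d floats; retrieved: list of d-dimensional vectors;
    weights: list of 2*d floats applied to [question; retrieved_vector].
    """
    gated, gates = [], []
    for vec in retrieved:
        features = list(question) + list(vec)
        logit = sum(w * f for w, f in zip(weights, features)) + bias
        gate = sigmoid(logit)
        gates.append(gate)
        gated.append([gate * v for v in vec])
    return gated, gates


question = [1.0, 0.0]
retrieved = [[1.0, 0.0], [0.0, 1.0]]
weights = [0.0, 0.0, 2.0, -2.0]  # hypothetical weights; learned in practice
gated, gates = gate_retrieved(question, retrieved, weights)
print([round(g, 3) for g in gates])
```

A gate near 1 passes the retrieved item through almost unchanged, while a gate near 0 effectively filters it out, matching the abstract's description of information being used or "filtered altogether".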

[NLP-9] Characterizations of Language Generation With Breadth

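[Quick read]: This paper studies the trade-off between consistency and breadth in language generation, in particular the fact that the algorithm of Kleinberg and Mullainathan [KM24] sacrifices breadth when generating from the target language K. The central question is whether this trade-off is inherent, and more broadly when language generation is possible under different notions of breadth. The paper analyzes the three notions introduced in [KVM24] (generation with exact breadth, approximate breadth, and unambiguous generation) together with the exhaustive generation proposed by Charikar and Pabbaraju [CP24a], and fully characterizes language generation for these notions and their natural combinations. It removes a technical condition from [KVM24] to give an unconditional lower bound for generation with exact breadth, proves that approximate breadth and exhaustive generation are equivalent, and shows the central role of Angluin's condition in these generation modes. Finally, by giving unconditional lower bounds for stable generators, it shows a separation between stable and unstable generation with approximate breadth.

Link: https://arxiv.org/abs/2412.18530
Authors: Alkis Kalavasis,Anay Mehrotra,Grigoris Velegkas
Affiliations: Unknown
Keywords: Kleinberg and Mullainathan, generation, breadth, Angluin condition, Angluin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments: Abstract shortened to fix arXiv limit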

Abstract:We study language generation in the limit, introduced by Kleinberg and Mullainathan [KM24], building on classical works of Gold [Gol67] and Angluin [Ang79]. [KM24] proposed an algorithm that generates strings from any countable language collection in the limit. While their algorithm eventually outputs strings from the target language K, it sacrifices breadth, i.e., the ability to generate all strings in K. A key open question in [KM24] is whether this trade-off between consistency and breadth is inherent. Recent works proposed different notions of consistent generation with breadth. Kalavasis, Mehrotra, and Velegkas [KVM24] introduced three definitions: generation with exact breadth, approximate breadth, and unambiguous generation. Concurrently and independently, Charikar and Pabbaraju [CP24a] proposed exhaustive generation. Both works examined when generation with these notions of breadth is possible. Building on [CP24a, KVM24], we fully characterize language generation for these notions and their natural combinations. For exact breadth, we provide an unconditional lower bound, removing a technical condition from [KVM24] and extending the result of [CP24a] that holds for specific collections of languages. We show that generation with exact breadth is characterized by Angluin’s condition for identification. We further introduce a weaker version of Angluin’s condition that tightly characterizes both approximate breadth and exhaustive generation, proving their equivalence. Additionally, we show that unambiguous generation is also characterized by Angluin’s condition as a special case of a broader result. Finally, we strengthen [KVM24] by giving unconditional lower bounds for stable generators, showing that Angluin’s condition characterizes the previous breadth notions for stable generators. This shows a separation between stable and unstable generation with approximate breadth.

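[NLP-10] Think or Remember? Detecting and Directing LLMs Towards Memorization or Generalization

[Quick read]: Inspired by the functional specialization observed in the human brain, this paper explores the foundational mechanisms of memorization and generalization in large language models (LLMs). Using specially designed datasets and experimental-scale LLMs, it first trains LLMs to exhibit both memorization and generalization, then examines whether LLMs show neuron-level spatial differentiation between the two behaviors, predicts these behaviors from the model's internal representations, and steers them through inference-time interventions. The findings show that neuron-wise differentiation of memorization and generalization is observable in LLMs and that targeted interventions can successfully direct model behavior.

Link: https://arxiv.org/abs/2412.18497
Authors: Yi-Fu Fu,Yu-Chieh Tu,Tzu-Ling Cheng,Cheng-Yu Lin,Yi-Ting Yang,Heng-Yi Liu,Keng-Te Liao,Da-Cheng Juan,Shou-De Lin
Affiliations: Unknown
Keywords: Large Language Models, Large Language, functional specialization observed, Language Models, human brain
Subjects: Computation and Language (cs.CL)
Comments: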

Abstract:In this paper, we explore the foundational mechanisms of memorization and generalization in Large Language Models (LLMs), inspired by the functional specialization observed in the human brain. Our investigation serves as a case study leveraging specially designed datasets and experimental-scale LLMs to lay the groundwork for understanding these behaviors. Specifically, we aim to first enable LLMs to exhibit both memorization and generalization by training with the designed dataset, then (a) examine whether LLMs exhibit neuron-level spatial differentiation for memorization and generalization, (b) predict these behaviors using model internal representations, and (c) steer the behaviors through inference-time interventions. Our findings reveal that neuron-wise differentiation of memorization and generalization is observable in LLMs, and targeted interventions can successfully direct their behavior.

[NLP-11] Generating event descriptions under syntactic and semantic constraints

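[Quick read]: With the goal of supporting scalable lexical semantic annotation, analysis, and theorizing, this paper comprehensively evaluates methods for generating event descriptions under syntactic constraints (e.g., desired clause structure) and semantic constraints (e.g., desired verb sense). It compares three methods: (i) manual generation by experts; (ii) sampling from a corpus annotated for syntactic and semantic information; and (iii) sampling from a language model (LM) conditioned on syntactic and semantic information, and evaluates the generated event descriptions along three dimensions: naturalness, typicality, and distinctiveness. All methods reliably produce natural, typical, and distinctive event descriptions, but manual generation still produces descriptions that are more natural, typical, and distinctive than the automated methods. The paper concludes that the automated methods produce event descriptions of sufficient quality for downstream annotation and analysis, provided that the annotation and analysis methods are robust to a small amount of degradation in the descriptions. The key is comparing the generation methods to establish whether automated methods are feasible and applicable under the given syntactic and semantic constraints.

Link: https://arxiv.org/abs/2412.18496
Authors: Angela Cao,Faye Holt,Jonas Chan,Stephanie Richter,Lelia Glass,Aaron Steven White
Affiliations: Unknown
Keywords: desired clause structure, desired verb sense, supporting scalable lexical, scalable lexical semantic, event descriptions
Subjects: Computation and Language (cs.CL)
Comments: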

Abstract:With the goal of supporting scalable lexical semantic annotation, analysis, and theorizing, we conduct a comprehensive evaluation of different methods for generating event descriptions under both syntactic constraints – e.g. desired clause structure – and semantic constraints – e.g. desired verb sense. We compare three different methods – (i) manual generation by experts; (ii) sampling from a corpus annotated for syntactic and semantic information; and (iii) sampling from a language model (LM) conditioned on syntactic and semantic information – along three dimensions of the generated event descriptions: (a) naturalness, (b) typicality, and (c) distinctiveness. We find that all methods reliably produce natural, typical, and distinctive event descriptions, but that manual generation continues to produce event descriptions that are more natural, typical, and distinctive than the automated generation methods. We conclude that the automated methods we consider produce event descriptions of sufficient quality for use in downstream annotation and analysis insofar as the methods used for this annotation and analysis are robust to a small amount of degradation in the resulting event descriptions.

[NLP-12] How “Real” is Your Real-Time Simultaneous Speech-to-Text Translation System? ACL

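[Quick read]: This paper addresses two main problems in simultaneous speech-to-text translation (SimulST): most research is limited to human pre-segmented speech and overlooks the practical challenges of translating unbounded speech, and inconsistent terminology limits the applicability of research outcomes to real-world applications. The paper makes three key contributions: it defines the steps and core components of a SimulST system and proposes a standardized terminology and taxonomy; it analyzes community research trends in depth; and it offers concrete recommendations and future directions, from evaluation frameworks to system architectures, to move the field towards more realistic and effective SimulST solutions.

Link: https://arxiv.org/abs/2412.18495
Authors: Sara Papi,Peter Polak,Ondřej Bojar,Dominik Macháček
Affiliations: Unknown
Keywords: ensuring low latency, target-language text concurrently, translates source-language speech, translates source-language, ensuring low
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at TACL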

Abstract:Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker’s speech, ensuring low latency for better user comprehension. Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges. This narrow focus, coupled with widespread terminological inconsistencies, is limiting the applicability of research outcomes to real-world applications, ultimately hindering progress in the field. Our extensive literature review of 110 papers not only reveals these critical issues in current research but also serves as the foundation for our key contributions. We 1) define the steps and core components of a SimulST system, proposing a standardized terminology and taxonomy; 2) conduct a thorough analysis of community trends, and 3) offer concrete recommendations and future directions to bridge the gaps in existing literature, from evaluation frameworks to system architectures, for advancing the field towards more realistic and effective SimulST solutions.

[NLP-13] Segment-Based Attention Masking for GPTs

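[Quick read]: This paper addresses the unnecessary constraint that causal masking imposes on modern language models (LMs) during the initial “prefill” phase. When a conventional GPT model processes the user input, it applies causal masking to all input tokens step by step, mimicking the generation process, even though the whole prompt is available at once. The paper proposes a Segment-by-Segment scheme that masks attention according to the known block structure of the prompt during prefill, so that the first tokens in each block can access subsequent tokens in the same block in a non-causal manner. The model then generates its output with the conventional token-by-token autoregressive process. The scheme entails no additional computational overhead and achieves state-of-the-art performance when integrated into models such as Llama and Qwen.

Link: https://arxiv.org/abs/2412.18487
Authors: Shahar Katz,Liran Ringel,Yaniv Romano,Lior Wolf
Affiliations: Unknown
Keywords: Generative Pre-Trained Transformer, Modern Language Models, Modern Language, Pre-Trained Transformer, backbone of Generative
Subjects: Computation and Language (cs.CL)
Comments: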

Abstract:Modern Language Models (LMs) owe much of their success to masked causal attention, the backbone of Generative Pre-Trained Transformer (GPT) models. Although GPTs can process the entire user prompt at once, the causal masking is applied to all input tokens step-by-step, mimicking the generation process. This imposes an unnecessary constraint during the initial “prefill” phase when the model processes the input prompt and generates the internal representations before producing any output tokens. In this work, attention is masked based on the known block structure at the prefill phase, followed by the conventional token-by-token autoregressive process after that. For example, in a typical chat prompt, the system prompt is treated as one block, and the user prompt as the next one. Each of these is treated as a unit for the purpose of masking, such that the first tokens in each block can access the subsequent tokens in a non-causal manner. Then, the model answer is generated in the conventional causal manner. This Segment-by-Segment scheme entails no additional computational overhead. When integrating it into models such as Llama and Qwen, state-of-the-art performance is consistently achieved.
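
The masking rule described above can be sketched as a boolean matrix: during prefill, a token may attend to every token in earlier blocks and to every token in its own block, including later ones. The code below is a minimal illustration built only from the abstract's description; the function name and the 2-token system prompt plus 3-token user prompt are made-up examples, and a real implementation would apply this mask inside the model's attention layers.

```python
def segment_prefill_mask(segment_lengths):
    """mask[i][j] is True when prompt token i may attend to prompt token j.

    Tokens see all earlier segments and their whole own segment
    (non-causal within a segment), but never later segments.
    """
    segment_ids = [
        seg for seg, length in enumerate(segment_lengths) for _ in range(length)
    ]
    return [
        [key_seg <= query_seg for key_seg in segment_ids]
        for query_seg in segment_ids
    ]


# Hypothetical prompt: a 2-token system block followed by a 3-token user block.
mask = segment_prefill_mask([2, 3])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under a purely causal mask the first row would allow only the first token; here it also allows the second token of the same block. Tokens generated after prefill would still use the usual causal rule.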

[NLP-14] Is Large Language Model Good at Triple Set Prediction? An Empirical Study

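[Quick read]: This paper explores the strengths and limitations of large language models (LLMs) on the Triple Set Prediction (TSP) task. TSP is a more realistic Knowledge Graph Completion (KGC) task that aims to predict all elements of unknown triples from the information in known triples. The proposed framework consists of LLM-based rule mining and LLM-based triple set prediction. First, the relation list of the knowledge graph, which carries rich semantic information, is used to prompt the LLM to generate rules; this process is efficient and independent of statistical information, making it easier to mine effective and realistic rules. Then, for each subgraph, the specified rule is applied together with the relevant triples to guide the LLM in predicting the missing triples. Finally, the predictions from all subgraphs are consolidated into the complete set of predicted triples. The experimental results show that when LLMs must adhere to a large amount of factual knowledge to predict missing triples, significant hallucination occurs and performance declines noticeably. The paper examines the causes of this phenomenon through a detailed case study.

Link: https://arxiv.org/abs/2412.18443
Authors: Yuan Yuan,Yajing Xu,Wen Zhang
Affiliations: Unknown
Keywords: Knowledge Graph Completion, graph completion task, KGC tasks, Common KGC tasks, Graph Completion
Subjects: Computation and Language (cs.CL)
Comments: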

Abstract:The core of the Knowledge Graph Completion (KGC) task is to predict and complete the missing relations or nodes in a KG. Common KGC tasks are mostly about inferring unknown elements with one or two elements being known in a triple. In comparison, the Triple Set Prediction (TSP) task is a more realistic knowledge graph completion task. It aims to predict all elements of unknown triples based on the information from known triples. In recent years, large language models (LLMs) have exhibited significant advancements in language comprehension, demonstrating considerable potential for KGC tasks. However, the potential of LLM on the TSP task has not yet to be investigated. Thus in this paper we proposed a new framework to explore the strengths and limitations of LLM in the TSP task. Specifically, the framework consists of LLM-based rule mining and LLM-based triple set prediction. The relation list of KG embedded within rich semantic information is first leveraged to prompt LLM in the generation of rules. This process is both efficient and independent of statistical information, making it easier to mine effective and realistic rules. For each subgraph, the specified rule is applied in conjunction with the relevant triples within that subgraph to guide the LLM in predicting the missing triples. Subsequently, the predictions from all subgraphs are consolidated to derive the complete set of predicted triples on KG. Finally, the method is evaluated on the relatively complete CFamily dataset. The experimental results indicate that when LLMs are required to adhere to a large amount of factual knowledge to predict missing triples, significant hallucinations occurs, leading to a noticeable decline in performance. To further explore the causes of this phenomenon, this paper presents a comprehensive analysis supported by a detailed case study.

[NLP-15] Unlocking the Potential of Multiple BERT Models for Bangla Question Answering in NCTB Textbooks

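[Quick read]: This paper addresses the automatic assessment of text comprehension in educational settings, specifically passage-based question answering over Bangla textbooks. It evaluates three language models (RoBERTa Base, Bangla-BERT, and BERT Base) on this task across various hyperparameter configurations. The key steps are compiling a dataset of approximately 3,000 Bangla passage-based question-answering instances and evaluating the models with F1 score and Exact Match (EM). Bangla-BERT performed best, particularly with smaller batch sizes, the inclusion of stop words, and a moderate learning rate, clearly outperforming the other models. The findings underscore the importance of hyperparameter tuning for model performance and offer insights for developing automated evaluation systems in educational institutions.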

Link: https://arxiv.org/abs/2412.18440
Authors: Abdullah Khondoker,Enam Ahmed Taufik,Md Iftekhar Islam Tashik,S M Ishtiak mahmud,Antara Firoz Parsa
Affiliation: unknown
Keywords: improving curricular effectiveness, Bangla passage-based question-answering, understanding student performance, BERT Base-in automatically, Evaluating text comprehension
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Evaluating text comprehension in educational settings is critical for understanding student performance and improving curricular effectiveness. This study investigates the capability of state-of-the-art language models (RoBERTa Base, Bangla-BERT, and BERT Base) in automatically assessing Bangla passage-based question-answering from the National Curriculum and Textbook Board (NCTB) textbooks for classes 6-10. A dataset of approximately 3,000 Bangla passage-based question-answering instances was compiled, and the models were evaluated using F1 Score and Exact Match (EM) metrics across various hyperparameter configurations. Our findings revealed that Bangla-BERT consistently outperformed the other models, achieving the highest F1 (0.75) and EM (0.53) scores, particularly with smaller batch sizes, the inclusion of stop words, and a moderate learning rate. In contrast, RoBERTa Base demonstrated the weakest performance, with the lowest F1 (0.19) and EM (0.27) scores under certain configurations. The results underscore the importance of fine-tuning hyperparameters for optimizing model performance and highlight the potential of machine learning models in evaluating text comprehension in educational contexts. However, limitations such as dataset size, spelling inconsistencies, and computational constraints emphasize the need for further research to enhance the robustness and applicability of these models. This study lays the groundwork for the future development of automated evaluation systems in educational institutions, providing critical insights into model performance in the context of Bangla text comprehension.
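
The two metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how answers are normalized or tokenized, so the version below assumes the common SQuAD-style definitions with plain whitespace tokenization; the function names `exact_match` and `token_f1` are hypothetical and not taken from the paper.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the two answers are identical after trimming whitespace, else 0.0."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer.

    Assumption: tokens are whitespace-separated; the paper may normalize
    Bangla text differently.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions a prediction that contains the reference plus one extra token scores 0 on EM but still earns partial credit on F1, which is why the two numbers are usually reported together.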

[NLP-16] GeAR: Graph-enhanced Agent for Retrieval-augmented Generation

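[Quick read]: This paper addresses the challenges that conventional sparse or dense retrievers face in multi-hop retrieval, where a system must gather information from different documents over several steps. The proposed solution, GeAR, improves retrieval-augmented generation (RAG) through two key innovations: graph expansion, which enhances any conventional base retriever such as BM25, and an agent framework that incorporates graph expansion. Experiments show that GeAR achieves superior retrieval performance on three multi-hop question answering datasets, with improvements exceeding 10% on the MuSiQue dataset, while requiring fewer tokens and iterations.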

Link: https://arxiv.org/abs/2412.18431
Authors: Zhili Shen,Chenxin Diao,Pavlos Vougiouklis,Pascual Merita,Shriram Piramanayagam,Damien Graux,Dandan Tu,Zeren Jiang,Ruofei Lai,Yang Ren,Jeff Z. Pan
Affiliation: unknown
Keywords: Retrieval-augmented generation systems, Retrieval-augmented generation, document retrieval capabilities, effective document retrieval, generation systems rely
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-augmented generation systems rely on effective document retrieval capabilities. By design, conventional sparse or dense retrievers face challenges in multi-hop retrieval scenarios. In this paper, we present GeAR, which advances RAG performance through two key innovations: (i) graph expansion, which enhances any conventional base retriever, such as BM25, and (ii) an agent framework that incorporates graph expansion. Our evaluation demonstrates GeAR’s superior retrieval performance on three multi-hop question answering datasets. Additionally, our system achieves state-of-the-art results with improvements exceeding 10% on the challenging MuSiQue dataset, while requiring fewer tokens and iterations compared to other multi-step retrieval systems.
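
To make the idea of expanding a base retriever's results over a graph concrete, here is a toy sketch. It is not GeAR's algorithm, which the abstract does not describe in detail. It assumes a term-overlap scorer as a stand-in for BM25 and a graph in which two passages are linked when they mention the same entity; `base_retrieve`, `graph_expand`, and the sample data are all hypothetical.

```python
def base_retrieve(query, passages, k):
    """Stand-in for a base retriever such as BM25: rank by shared query terms."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(text.lower().split())), pid)
              for pid, text in passages.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [pid for score, pid in ranked[:k] if score > 0]


def graph_expand(seeds, entities, hops=1):
    """Add passages that share at least one entity with the current frontier."""
    selected = list(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier_entities = set().union(*(entities[p] for p in frontier))
        new = [p for p in sorted(entities)
               if p not in selected and entities[p] & frontier_entities]
        selected.extend(new)
        frontier = set(new)
    return selected


# Hypothetical two-hop example: the second passage shares no words with
# the query and is reachable only through the entity "Acme Corp".
passages = {
    "p1": "Alice founded Acme Corp",
    "p2": "Acme Corp is based in Berlin",
    "p3": "Paris is in France",
}
entities = {
    "p1": {"Alice", "Acme Corp"},
    "p2": {"Acme Corp", "Berlin"},
    "p3": {"Paris", "France"},
}
seeds = base_retrieve("Which city hosts the company founded by Alice", passages, k=1)
expanded = graph_expand(seeds, entities)
```

In this example the lexical retriever alone returns only `p1`, while the expansion step also pulls in `p2`, the passage needed for the second hop.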

[NLP-17] Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent

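[Quick read]: This paper addresses explainable exploration of multi-modal data (databases, text, images, and videos) through natural language queries. Although there has been progress both in multi-modal data exploration and in automatically translating natural language questions into database query languages, querying database systems combined with unstructured modalities such as images remains largely unexplored. The proposed system, XMODE, uses an LLM-based agentic framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis. In experiments on multi-modal datasets, XMODE outperforms existing multi-modal exploration systems in accuracy, query latency, API costs, planning efficiency, and explanation quality, which the authors attribute to more effective use of the reasoning capabilities of LLMs.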

Link: https://arxiv.org/abs/2412.18428
Authors: Farhad Nooralahzadeh,Yi Zhang,Jonathan Furst,Kurt Stockinger
Affiliation: unknown
Keywords: hospitals collect large, collect large amounts, International enterprises, text documents, natural language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:International enterprises, organizations, or hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying database systems combined with other unstructured modalities such as images in natural language is largely unexplored. In this paper, we propose XMODE, a system that enables explainable, multi-modal data exploration in natural language. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) XMODE leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis. (3) Experimental results on multi-modal datasets over relational data and images demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling not only in accuracy but also in various performance metrics such as query latency, API costs, planning efficiency, and explanation quality, thanks to the more effective utilization of the reasoning capabilities of LLMs.

[NLP-18] LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding Reasoning and Locating

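[Quick read]: This paper addresses the limitations of existing document understanding benchmarks on long documents, numerical reasoning, and cross-element locating. Existing benchmarks typically cover only a small number of pages and do not analyze the locating of layout elements comprehensively. The paper first defines three primary task categories, Long Document Understanding, Numerical Reasoning, and Cross-element Locating, and then proposes LongDocURL, a comprehensive benchmark that integrates these three tasks and comprises 20 sub-tasks categorized by primary task and answer evidence. The key contribution is a semi-automated construction pipeline used to collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly surpassing existing benchmarks. The paper also evaluates open-source and closed-source models across 26 configurations, revealing critical performance gaps in this field.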

Link: https://arxiv.org/abs/2412.18424
Authors: Chao Deng,Jiale Yuan,Pi Bu,Peijie Wang,Zhong-Zhi Li,Jian Xu,Xiao-Hui Li,Yuan Gao,Jun Song,Bo Zheng,Cheng-Lin Liu
Affiliation: unknown
Keywords: Large vision language, understanding capabilities remarkably, Large vision, document understanding capabilities, complex document elements
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large vision language models (LVLMs) have improved document understanding capabilities remarkably, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to handling only a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the above three primary tasks and comprising 20 sub-tasks categorized based on different primary tasks and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly outperforming existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.

[NLP-19] Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi and English AAAI2025

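[Quick read]: This paper addresses the weak mathematical reasoning of large language models (LLMs) in non-English languages such as Hindi, focusing on small, resource-efficient open-source models. It improves these models' mathematical reasoning in Hindi and English using zero-shot prompting, few-shot chain-of-thought (CoT) prompting, and supervised fine-tuning. The key elements of the approach are curriculum learning, which trains models on progressively harder problems; a novel Decomposition Strategy that simplifies complex arithmetic operations; and a Structured Solution Design that divides solutions into phases. In the experiments, WizardMath 7B exceeds Gemini's accuracy on English datasets by +6% and matches Gemini's performance on Hindi datasets. A bilingual approach that combines English and Hindi samples achieves results comparable to individual language models, showing that a model can learn mathematical reasoning in both languages.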

Link: https://arxiv.org/abs/2412.18415
Authors: Avinash Anand,Kritarth Prasad,Chhavi Kirtani,Ashwin R Nair,Manvendra Kumar Nema,Raj Jaiswal,Rajiv Ratn Shah
Affiliation: unknown
Keywords: Large Language Models, Large Language, excel in linguistic, linguistic tasks, tasks but struggle
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2025

Abstract:Large Language Models (LLMs) excel in linguistic tasks but struggle with mathematical reasoning, particularly in non-English languages like Hindi. This research aims to enhance the mathematical reasoning skills of smaller, resource-efficient open-source LLMs in both Hindi and English. We evaluate models like OpenHathi 7B, LLaMA-2 7B, WizardMath 7B, Mistral 7B, LLeMMa 7B, MAmmoTH 7B, Gemini Pro, and GPT-4 using zero-shot, few-shot chain-of-thought (CoT) methods, and supervised fine-tuning. Our approach incorporates curriculum learning, progressively training models on increasingly difficult problems, a novel Decomposition Strategy to simplify complex arithmetic operations, and a Structured Solution Design that divides solutions into phases. Our experiments result in notable performance enhancements. WizardMath 7B exceeds Gemini’s accuracy on English datasets by +6% and matches Gemini’s performance on Hindi datasets. Adopting a bilingual approach that combines English and Hindi samples achieves results comparable to individual language models, demonstrating the capability to learn mathematical reasoning in both languages. This research highlights the potential for improving mathematical reasoning in open-source LLMs.

[NLP-20] ChaI-TeA: A Benchmark for Evaluating Autocompletion of Interactions with LLM-based Chatbots

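[Quick read]: This paper addresses autocompletion of user interactions with chatbots based on large language models (LLMs). With the rise of LLMs, a growing share of human-computer interaction has shifted to LLM-based chatbots, yet phrasing long, diverse natural language messages costs users considerable time and effort. The paper introduces the task of chatbot interaction autocomplete and presents ChaI-TeA (CHat InTEraction Autocomplete), a benchmark framework for evaluating autocompletion of LLM-based chatbot interactions. The framework includes a formal definition of the task together with suitable datasets and metrics. Using it, the paper tests 9 existing models and finds that, although current models perform fairly well, there is still much room for improvement, mainly in ranking the generated suggestions. The paper offers practical insights for practitioners, opens new research directions, and releases the framework as a foundation for future work.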

Link: https://arxiv.org/abs/2412.18377
Authors: Shani Goren,Oren Kalinsky,Tomer Stav,Yuri Rapoport,Yaron Fairstein,Ram Yazdy,Nachshon Cohen,Alexander Libov,Guy Kushilevitz
Affiliation: unknown
Keywords: rise of LLMs, LLMs has deflected, deflected a growing, growing portion, portion of human-computer
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The rise of LLMs has deflected a growing portion of human-computer interactions towards LLM-based chatbots. The remarkable abilities of these models allow users to interact using long, diverse natural language text covering a wide range of topics and styles. Phrasing these messages is a time- and effort-consuming task, calling for an autocomplete solution to assist users. We introduce the task of chatbot interaction autocomplete. We present ChaI-TeA: CHat InTEraction Autocomplete, an autocomplete evaluation framework for LLM-based chatbot interactions. The framework includes a formal definition of the task, coupled with suitable datasets and metrics. We use the framework to test 9 models on the defined autocompletion task, finding that while current off-the-shelf models perform fairly, there is still much room for improvement, mainly in ranking of the generated suggestions. We provide insights for practitioners working on this task and open new research directions for researchers in the field. We release our framework to serve as a foundation for future research.

[NLP-21] Bidirectional Topic Matching: Quantifying Thematic Overlap Between Corpora Through Topic Modelling

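[Quick read]: This paper addresses the quantification of thematic overlap and divergence in cross-corpus topic modeling. Conventional methods are limited in handling topic relationships between corpora and struggle to capture the nuances of shared and unique topics. The paper proposes Bidirectional Topic Matching (BTM), whose key idea is a dual-model framework: a separate topic model is trained for each corpus, and the models are applied reciprocally to enable comprehensive cross-corpus comparison. BTM can incorporate various topic modeling techniques, such as BERTopic, Top2Vec, and Latent Dirichlet Allocation (LDA), and validation shows its advantages in handling outlier topics and providing precise analysis of thematic relationships. Its flexibility and precision make it a useful tool for applications ranging from political discourse analysis to interdisciplinary studies.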

Link: https://arxiv.org/abs/2412.18376
Authors: Raven Adam,Marie Lisa Kogler
Affiliation: unknown
Keywords: Bidirectional Topic Matching, introduces Bidirectional Topic, study introduces Bidirectional, Latent Dirichlet Allocation, introduces Bidirectional
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 4 figures

Abstract:This study introduces Bidirectional Topic Matching (BTM), a novel method for cross-corpus topic modeling that quantifies thematic overlap and divergence between corpora. BTM is a flexible framework that can incorporate various topic modeling approaches, including BERTopic, Top2Vec, and Latent Dirichlet Allocation (LDA). BTM employs a dual-model approach, training separate topic models for each corpus and applying them reciprocally to enable comprehensive cross-corpus comparisons. This methodology facilitates the identification of shared themes and unique topics, providing nuanced insights into thematic relationships. Validation against cosine similarity-based methods demonstrates the robustness of BTM, with strong agreement metrics and distinct advantages in handling outlier topics. A case study on climate news articles showcases BTM’s utility, revealing significant thematic overlaps and distinctions between corpora focused on climate change and climate action. BTM’s flexibility and precision make it a valuable tool for diverse applications, from political discourse analysis to interdisciplinary studies. By integrating shared and unique topic analyses, BTM offers a comprehensive framework for exploring thematic relationships, with potential extensions to multilingual and dynamic datasets. This work highlights BTM’s methodological contributions and its capacity to advance discourse analysis across various domains.

[NLP-22] Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset

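[Quick read]: This paper addresses the challenge of translating domain-specific terminology in machine translation, particularly in artificial intelligence (AI). The authors introduce GIST, a large-scale multilingual AI terminology dataset containing 5,000 terms extracted from top AI conference papers from 2000 to 2023 and translated into Arabic, Chinese, French, Japanese, and Russian. The key to the solution is a hybrid framework that combines large language models (LLMs) for term extraction with human expertise for translation. GIST is integrated into translation workflows through post-translation refinement methods that require no retraining, and LLM prompting consistently improves BLEU and COMET scores. The dataset's quality was validated through crowdsourced evaluation, and a web demonstration on the ACL Anthology platform shows its practical application in improving accessibility for non-English speakers, thereby fostering global inclusivity and collaboration in AI research.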

Link: https://arxiv.org/abs/2412.18367
Authors: Jiarui Liu,Iman Ouzzani,Wenkai Li,Lechen Zhang,Tianyue Ou,Houda Bouamor,Zhijing Jin,Mona Diab
Affiliation: unknown
Keywords: achieved significant advancements, remains challenging, significant advancements, field of machine, achieved significant
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The field of machine translation has achieved significant advancements, yet domain-specific terminology translation, particularly in AI, remains challenging. We introduced GIST, a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation. The dataset’s quality was benchmarked against existing resources, demonstrating superior translation accuracy through crowdsourced evaluation. GIST was integrated into translation workflows using post-translation refinement methods that required no retraining, where LLM prompting consistently improved BLEU and COMET scores. A web demonstration on the ACL Anthology platform highlights its practical application, showcasing improved accessibility for non-English speakers. This work aims to address critical gaps in AI terminology resources and fosters global inclusivity and collaboration in AI research.

[NLP-23] Extracting triples from dialogues for conversational social agents

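[Quick read]: This paper addresses how to extract explicit symbolic triples from social conversation in order to build controllable and transparent agents for Hybrid Intelligence collaboration. Social conversation differs markedly as a genre from Wikipedia text: co-reference, ellipsis, coordination, and implicit and explicit negation or confirmation are more prominent, which makes triple extraction from dialogue harder. The key contributions are the release of datasets for training and testing triple extraction from social conversation and the creation and evaluation of five triple extraction models. Although precision on single utterances is relatively high (69.32 for triple elements), scores for conversational triples that span multiple turns are much lower, highlighting the difficulty of extracting knowledge from real conversational data.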

Link: https://arxiv.org/abs/2412.18364
Authors: Piek Vossen,Selene Báez Santamaría,Lenka Bajčetić,Thomas Belluci
Affiliation: unknown
Keywords: Hybrid Intelligence collaboration, Hybrid Intelligence, Natural Language Understanding, Intelligence collaboration, Language Understanding models
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Obtaining an explicit understanding of communication within a Hybrid Intelligence collaboration is essential to create controllable and transparent agents. In this paper, we describe a number of Natural Language Understanding models that extract explicit symbolic triples from social conversation. Triple extraction has mostly been developed and tested for Knowledge Base Completion using Wikipedia text and data for training and testing. However, social conversation is very different as a genre in which interlocutors exchange information in sequences of utterances that involve statements, questions, and answers. Phenomena such as co-reference, ellipsis, coordination, and implicit and explicit negation or confirmation are more prominent in conversation than in Wikipedia text. We therefore describe an attempt to fill this gap by releasing data sets for training and testing triple extraction from social conversation. We also created five triple extraction models and tested them in our evaluation data. The highest precision is 51.14 for complete triples and 69.32 for triple elements when tested on single utterances. However, scores for conversational triples that span multiple turns are much lower, showing that extracting knowledge from true conversational data is much more challenging.
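
The abstract reports precision separately for complete triples and for triple elements. The exact scoring procedure is not given there, so the sketch below is only one plausible reading: a predicted triple counts as complete-correct when it appears verbatim in the gold set, and a predicted element counts as correct when the same value appears in the same slot (subject, predicate, or object) of any gold triple. The function names are hypothetical.

```python
def complete_triple_precision(predicted, gold):
    """Share of predicted (subject, predicate, object) triples found in gold."""
    if not predicted:
        return 0.0
    gold_set = set(gold)
    return sum(t in gold_set for t in predicted) / len(predicted)


def element_precision(predicted, gold):
    """Share of predicted elements whose value occurs in the same slot of some gold triple.

    Assumption: elements are matched slot by slot, independently of the
    other two elements of the triple.
    """
    if not predicted:
        return 0.0
    gold_by_slot = [{t[i] for t in gold} for i in range(3)]
    hits = sum(t[i] in gold_by_slot[i] for t in predicted for i in range(3))
    return hits / (3 * len(predicted))
```

With gold triples `("mary", "like", "cats")` and `("john", "own", "dog")`, a prediction that gets the first triple right and outputs `("john", "own", "cat")` for the second scores 0.5 on complete triples but 5/6 on elements, which shows why element-level precision is typically the higher of the two figures.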

[NLP-24] Multi-Agents Based on Large Language Models for Knowledge-based Visual Question Answering

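[Quick read]: This paper addresses two main challenges for large language models (LLMs) in knowledge-based Visual Question Answering (VQA): the inability to use external tools autonomously and the inability to work in teams. When facing a new question, humans usually decide whether to use external tools (such as search engines) based on how familiar the question is, and they discuss with others to obtain better answers. Inspired by this, the paper proposes a multi-agent voting framework. It designs three LLM-based agents that simulate different levels of staff in a team and assigns the available tools according to level. Each agent provides an answer, and the final answer is determined by voting. Experiments show that the approach outperforms other baselines by 2.2 and 1.0 on the OK-VQA and A-OKVQA datasets, respectively. The key to the solution is simulating human tool use and teamwork through multi-agent collaboration and voting.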

Link: https://arxiv.org/abs/2412.18351
Authors: Zhongjian Hu,Peng Yang,Bing Li,Zhenqi Wang
Affiliation: unknown
Keywords: Large Language Models, Visual Question Answering, knowledge-based Visual Question, Large Language, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have achieved impressive results in knowledge-based Visual Question Answering (VQA). However, existing methods still face two challenges: the inability to use external tools autonomously, and the inability to work in teams. Humans tend to know whether they need to use external tools when they encounter a new question, e.g., they tend to be able to give a direct answer to a familiar question, whereas they tend to use tools such as search engines when they encounter an unfamiliar question. In addition, humans also tend to collaborate and discuss with others to get better answers. Inspired by this, we propose the multi-agent voting framework. We design three LLM-based agents that simulate different levels of staff in a team, and assign the available tools according to the levels. Each agent provides the corresponding answer, and finally all the answers provided by the agents are voted to get the final answer. Experiments on OK-VQA and A-OKVQA show that our approach outperforms other baselines by 2.2 and 1.0, respectively.
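
The final voting step can be sketched as a simple majority vote over the agents' answers. The abstract does not specify how answers are normalized or how ties are broken, so this sketch assumes lowercase string matching and resolves ties in favor of whichever answer comes first in the list; the function name `vote` is hypothetical.

```python
from collections import Counter


def vote(answers):
    """Return the most frequent answer after lowercasing and trimming.

    Assumption: on a tie, the answer that appears earliest in the list
    wins, so the caller decides which agent takes priority by ordering.
    """
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    best = max(counts.values())
    for answer in normalized:
        if counts[answer] == best:
            return answer
```

For example, `vote(["Paris", "paris", "London"])` yields `"paris"`. Ordering the list from the most senior agent to the most junior one is one way to let seniority settle three-way disagreements, though the paper may handle that case differently.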

[NLP-25] M-Ped: Multi-Prompt Ensemble Decoding for Large Language Models

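[Quick read]: This paper addresses how to improve the generation quality of large language models (LLMs) on natural language processing (NLP) tasks. Its core solution is a novel multi-prompt ensemble decoding approach that submits multiple prompt variations in a batch and aggregates their decoding results. Specifically, for each input X, n prompt variations are submitted in batch mode, and the ensemble probability for each token prediction is computed by averaging the n probability distributions; this aggregated probability is then used to generate the token. The technique is called Inner-Batch Ensemble. To make batch inference efficient, the paper uses a Left-Padding strategy so that all prompt variations have the same input length. Extensive experiments on machine translation, code generation, and text simplification show substantial improvements in BLEU scores, pass@k rates, and LENS metrics.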

Link: https://arxiv.org/abs/2412.18299
Authors: Jiaxin Guo,Daimeng Wei,Yuanchang Luo,Shimin Tao,Hengchao Shang,Zongyao Li,Shaojun Li,Jinlong Yang,Zhanglin Wu,Zhiqiang Rao,Hao Yang
Affiliation: unknown
Keywords: Large Language Models, Natural Language Processing, Language Models, Language Processing, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the widespread application of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), enhancing their performance has become a research hotspot. This paper presents a novel multi-prompt ensemble decoding approach designed to bolster the generation quality of LLMs by leveraging the aggregation of outcomes from multiple prompts. Given a unique input X, we submit n variations of prompts with X to LLMs in batch mode to decode and derive probability distributions. For each token prediction, we calculate the ensemble probability by averaging the n probability distributions within the batch, utilizing this aggregated probability to generate the token. This technique is dubbed Inner-Batch Ensemble. To facilitate efficient batch inference, we implement a Left-Padding strategy to maintain uniform input lengths across the n prompts. Through extensive experimentation on diverse NLP tasks, including machine translation, code generation, and text simplification, we demonstrate the efficacy of our method in enhancing LLM performance. The results show substantial improvements in BLEU scores, pass@k rates, and LENS metrics over conventional methods.
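
The two mechanisms described in the abstract, left-padding the n prompts to a common length and averaging the n next-token distributions, can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the distributions are given as plain lists instead of coming from a model, the next token is chosen greedily (the abstract does not say which decoding strategy is used), and the names `left_pad`, `ensemble_distribution`, and `greedy_token` are hypothetical.

```python
def left_pad(token_ids, pad_id):
    """Left-pad each prompt's token ids to the longest length in the batch."""
    width = max(len(ids) for ids in token_ids)
    return [[pad_id] * (width - len(ids)) + ids for ids in token_ids]


def ensemble_distribution(distributions):
    """Average n next-token probability distributions element-wise."""
    n = len(distributions)
    vocab_size = len(distributions[0])
    return [sum(d[i] for d in distributions) / n for i in range(vocab_size)]


def greedy_token(distributions):
    """Pick the token id with the highest averaged probability (assumed greedy decoding)."""
    avg = ensemble_distribution(distributions)
    return max(range(len(avg)), key=avg.__getitem__)


# Three hypothetical prompt variants over a 3-token vocabulary. Taken
# alone they would pick tokens 0, 2, and 1 respectively.
step_distributions = [
    [0.6, 0.4, 0.0],
    [0.2, 0.3, 0.5],
    [0.1, 0.6, 0.3],
]
next_token = greedy_token(step_distributions)
```

Left-padding keeps the last real token of every prompt in the same final position, so the next-token distributions of all n rows line up at each decoding step and can be averaged directly. In the toy example, the averaged distribution favors token 1 even though only one of the three prompts ranks it first.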

[NLP-26] DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation

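[Quick read]: This paper addresses two major challenges facing traditional, text-similarity-based evaluation of automated code review comment generation: the inconsistent reliability of human-authored comments in open-source projects, and the weak correlation between text similarity and goals such as improving code quality and detecting defects. The study proposes a new evaluation framework, DeepCRCEval, which combines human evaluators and large language models (LLMs) and applies a set of criteria informed by prior research and developer interviews to comprehensively reassess existing techniques. It also introduces LLM-Reviewer, a baseline that uses the few-shot learning capabilities of LLMs for target-oriented comment generation. The results show that text similarity metrics have significant limitations, with less than 10% of benchmark comments being suitable for automation, whereas DeepCRCEval effectively distinguishes high-quality from low-quality comments and is a more reliable evaluation mechanism. Using LLM evaluators reduces time and cost by 88.78% and 90.32%, respectively. LLM-Reviewer shows significant potential for focusing comment generation on the task's real targets.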

Link: https://arxiv.org/abs/2412.18291
Authors: Junyi Lu,Xiaojia Li,Zihan Hua,Lei Yu,Shiqi Cheng,Li Yang,Fengjun Zhang,Chun Zuo
Affiliation: unknown
Keywords: automating review comments, generating significant interest, Code review, automating review, vital but demanding
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to the 28th International Conference on Fundamental Approaches to Software Engineering (FASE 2025), part of the 28th European Joint Conferences on Theory and Practice of Software (ETAPS 2025)

Abstract:Code review is a vital but demanding aspect of software development, generating significant interest in automating review comments. Traditional evaluation methods for these comments, primarily based on text similarity, face two major challenges: inconsistent reliability of human-authored comments in open-source projects and the weak correlation of text similarity with objectives like enhancing code quality and detecting defects. This study empirically analyzes benchmark comments using a novel set of criteria informed by prior research and developer interviews. We then similarly revisit the evaluation of existing methodologies. Our evaluation framework, DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a comprehensive reassessment of current techniques based on the criteria set. We also introduce an innovative and efficient baseline, LLM-Reviewer, leveraging the few-shot learning capabilities of LLMs for a target-oriented comparison. Our research highlights the limitations of text similarity metrics, finding that less than 10% of benchmark comments are high quality for automation. In contrast, DeepCRCEval effectively distinguishes between high and low-quality comments, proving to be a more reliable evaluation mechanism. Incorporating LLM evaluators into DeepCRCEval significantly boosts efficiency, reducing time and cost by 88.78% and 90.32%, respectively. Furthermore, LLM-Reviewer demonstrates significant potential for focusing comment generation on the task's real targets.

[NLP-27] GenAI Content Detection Task 2: AI vs. Human – Academic Essay Authenticity Challenge

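[Quick read]: This paper addresses the detection of machine-generated versus human-authored text in the academic domain. The task is defined as: "Given an essay, identify whether it is generated by a machine or authored by a human." Most submitted systems relied on fine-tuned transformer-based models, and one team used Large Language Models (LLMs) such as Llama 2 and Llama 3. The challenge covered two languages, English and Arabic, and the top-performing systems achieved F1 scores exceeding 0.98 for both, clearly outperforming the n-gram-based baseline.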

Link: https://arxiv.org/abs/2412.18274
Authors: Shammur Absar Chowdhury,Hind Almerekhi,Mucahid Kutlu,Kaan Efe Keles,Fatema Ahmad,Tasnim Mohiuddin,George Mikros,Firoj Alam
Affiliation: unknown
Keywords: Academic Essay Authenticity, Essay Authenticity Challenge, GenAI Content Detection, Content Detection shared, collocated with COLING
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: AI Generated Content, Academic Essay, LLMs, Arabic, English

Abstract:This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks collocated with COLING 2025. This challenge focuses on detecting machine-generated vs. human-authored essays for academic purposes. The task is defined as follows: "Given an essay, identify whether it is generated by a machine or authored by a human." The challenge involves two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21 teams for Arabic, reflecting substantial interest in the task. Finally, seven teams submitted system description papers. The majority of submissions utilized fine-tuned transformer-based models, with one team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This paper outlines the task formulation, details the dataset construction process, and explains the evaluation framework. Additionally, we present a summary of the approaches adopted by participating teams. Nearly all submitted systems outperformed the n-gram-based baseline, with the top-performing systems achieving F1 scores exceeding 0.98 for both languages, indicating significant progress in the detection of machine-generated text.

[NLP-28] Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study

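[Quick read]: This paper investigates the effectiveness of large pre-trained language models (LLMs) for code vulnerability detection (CVD), a question that existing research has largely left unexplored. The key to the approach is fine-tuning four widely used open-source LLMs and evaluating them on the CVD task. The paper also implements five previous graph-based or medium-size sequence models for comparison. Experiments are conducted on five commonly used CVD datasets covering both short and long samples. Quantitative experiments further examine the class imbalance issue and model performance on samples of different lengths, which previous work has rarely studied. The authors open-source all code and resources to support the community.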

Link: https://arxiv.org/abs/2412.18260
Authors: Xuefeng Jiang,Lvhua Wu,Sheng Sun,Jia Li,Jingjing Xue,Yuwei Wang,Tingting Wu,Min Liu
Affiliation: unknown
Keywords: preventing system security, ensuring software security, system security issues, playing a crucial, system security
Categories: Computation and Language (cs.CL)
Comments: Under Review

Abstract:Code vulnerability detection (CVD) is essential for addressing and preventing system security issues, playing a crucial role in ensuring software security. Previous learning-based vulnerability detection methods rely on either fine-tuning medium-size sequence models or training smaller neural networks from scratch. Recent advancements in large pre-trained language models (LLMs) have showcased remarkable capabilities in various code intelligence tasks including code understanding and generation. However, the effectiveness of LLMs in detecting code vulnerabilities is largely under-explored. This work aims to investigate the gap by fine-tuning LLMs for the CVD task, involving four widely-used open-source LLMs. We also implement five other previous graph-based or medium-size sequence models for comparison. Experiments are conducted on five commonly-used CVD datasets, covering both short and long samples. In addition, we conduct quantitative experiments to investigate the class imbalance issue and the model’s performance on samples of different lengths, which are rarely studied in previous works. To support the community, we open-source all code and resources of this study at this https URL and this https URL.

[NLP-29] ICM-Assistant: Instruction-tuning Multimodal Large Language Models for Rule-based Explainable Image Content Moderation AAAI2025

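[Quick read]: This paper addresses the failure of traditional Image Content Moderation (ICM) models to produce precise moderation decisions across diverse cultural norms and child protection standards. Existing multimodal large language models (MLLMs), when applied to general rule-based ICM, often produce classification and explanation results that are inconsistent with human moderators. To address this, the paper proposes a novel rule-based dataset generation pipeline that decomposes concise human-defined rules and uses carefully designed multi-stage prompts to enrich short explicit image annotations. The resulting ICM-Instruct dataset contains detailed moderation explanations and moderation question-answer pairs, and the ICM-Assistant model is built on it. Within the rule-based ICM framework, the model shows strong performance and flexibility, improving moderation classification by 36.8% and moderation explanation quality by 26.6% on average over existing MLLMs.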

Link: https://arxiv.org/abs/2412.18216
Authors: Mengyang Wu,Yuzhi Zhao,Jialun Cao,Mingjie Xu,Zhongming Jiang,Xuehui Wang,Qinbin Li,Guangneng Hu,Shengchao Qin,Chi-Wing Fu
Affiliation: unknown
Keywords: Controversial contents largely, contents largely inundate, child protection standards, inundate the Internet, Controversial contents
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: AAAI 2025

Abstract:Controversial contents largely inundate the Internet, infringing various cultural norms and child protection standards. Traditional Image Content Moderation (ICM) models fall short in producing precise moderation decisions for diverse standards, while recent multimodal large language models (MLLMs), when adopted to general rule-based ICM, often produce classification and explanation results that are inconsistent with human moderators. Aiming at flexible, explainable, and accurate ICM, we design a novel rule-based dataset generation pipeline, decomposing concise human-defined rules and leveraging well-designed multi-stage prompts to enrich short explicit image annotations. Our ICM-Instruct dataset includes detailed moderation explanation and moderation Q-A pairs. Built upon it, we create our ICM-Assistant model in the framework of rule-based ICM, making it readily applicable in real practice. Our ICM-Assistant model demonstrates exceptional performance and flexibility. Specifically, it significantly outperforms existing approaches on various sources, improving both the moderation classification (36.8% on average) and moderation explanation quality (26.6% on average) consistently over existing MLLMs. Code/Data is available at this https URL.

[NLP-30] Robustness-aware Automatic Prompt Optimization

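[Quick read]: This paper addresses the performance drop of existing prompt generation methods under input perturbations (such as typos in the input). Conventional methods mainly generate prompts for clean input data and overlook the effect of perturbed inputs on prompt performance. The paper proposes BATprompt (By Adversarial Training prompt), which uses adversarial training techniques to make prompts more robust. BATprompt is a two-step process: adversarial perturbation first, followed by iterative optimization on unperturbed input via a large language model (LLM). Unlike conventional adversarial attack methods, BATprompt does not rely on real gradients or model parameters; instead it uses the LLM's advanced reasoning, language understanding, and self-reflection capabilities to simulate gradients, which guide the generation of adversarial perturbations and the optimization of prompt performance. Experiments show that BATprompt delivers better robustness and performance than existing methods under a variety of perturbation scenarios.

Link: https://arxiv.org/abs/2412.18196
Authors: Zeru Shi,Zhenting Wang,Yongye Su,Weidi Luo,Fan Yang,Yongfeng Zhang
Affiliation: Unknown
Keywords: structural integrity information, Large Language Models, Large Language, input data, semantic and structural
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: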

Abstract:The performance of Large Language Models (LLMs) is based on the quality of the prompts and the semantic and structural integrity information of the input data. However, current prompt generation methods primarily focus on generating prompts for clean input data, often overlooking the impact of perturbed inputs on prompt performance. To address this limitation, we propose BATprompt (By Adversarial Training prompt), a novel method for prompt generation designed to withstand input perturbations (such as typos in the input). Inspired by adversarial training techniques, BATprompt demonstrates strong performance on a variety of perturbed tasks through a two-step process: adversarial perturbation and iterative optimization on unperturbed input via LLM. Unlike conventional adversarial attack methods, BATprompt avoids reliance on real gradients or model parameters. Instead, it leverages the advanced reasoning, language understanding and self reflection capabilities of LLMs to simulate gradients, guiding the generation of adversarial perturbations and optimizing prompt performance. In our experiments, we evaluate BATprompt on multiple datasets across both language understanding and generation tasks. The results indicate that BATprompt outperforms existing prompt generation methods, delivering superior robustness and performance under diverse perturbation scenarios.

[NLP-31] VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

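[Quick read]: This paper addresses the fact that existing benchmarks do not adequately serve foundation-model-based methods, especially Vision-Language-Action models (VLAs), on language-conditioned manipulation (LCM) tasks. The authors present VLABench, an open-source benchmark for evaluating general LCM task learning. Its key innovations are: 1) tasks that require world knowledge and common-sense transfer; 2) natural language instructions with implicit human intentions rather than templated instructions; 3) long-horizon tasks that require multi-step reasoning; and 4) evaluation of both action policies and language model capabilities. VLABench also provides high-quality training data to support downstream fine-tuning, and experiments show that current state-of-the-art pretrained VLAs and VLM-based workflows still struggle on these tasks.

Link: https://arxiv.org/abs/2412.18194
Authors: Shiduo Zhang,Zhe Xu,Peiju Liu,Xiaopeng Yu,Yuan Li,Qinghui Gao,Zhaoye Fei,Zhangyue Yin,Zuxuan Wu,Yu-Gang Jiang,Xipeng Qiu
Affiliation: Unknown
Keywords: General-purposed embodied agents, General-purposed embodied, embodied agents, understand the users’, act precisely
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: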

Abstract:General-purposed embodied agents are designed to understand the users’ natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models especially Vision-Language-Action models (VLAs) have shown a substantial potential to solve language-conditioned manipulation (LCM) tasks well. However, existing benchmarks do not adequately meet the needs of VLAs and relative algorithms. To better define such general-purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed categories of tasks, with strong randomization in each category of task and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies including understanding of mesh & texture, spatial relationship, semantic instruction, physical laws, knowledge transfer and reasoning, etc. To support the downstream finetuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. The experimental results indicate that both the current state-of-the-art pretrained VLAs and the workflow based on VLMs face challenges in our tasks.

[NLP-32] An Analysis on Automated Metrics for Evaluating Japanese-English Chat Translation WWW

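[Quick read]: This paper examines how well traditional baseline metrics (such as BLEU and TER) and neural-based metrics (such as BERTScore and COMET) evaluate neural machine translation (NMT) models on chat translation, and how closely these metrics agree with human-annotated scores. The results show that traditional baseline metrics rank models consistently and are simpler to compute, while neural-based metrics correlate better with human scores, with COMET achieving the highest correlation on chat translation. The study also notes that even the best-performing metric struggles when scoring English translations of Japanese sentences that contain anaphoric zero-pronouns. The summary's takeaway is to combine traditional and neural metrics so as to balance model ranking against correlation with human judgment.

Link: https://arxiv.org/abs/2412.18190
Authors: Andre Rusli,Makoto Shishido
Affiliation: Unknown
Keywords: BLEU and TER, NMT models performance, NMT models, ranking NMT models, paper analyses
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the 29th Annual Meeting of the Association for Natural Language Processing (NLP2023). Published version available at this https URL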

Abstract:This paper analyses how traditional baseline metrics, such as BLEU and TER, and neural-based methods, such as BERTScore and COMET, score several NMT models performance on chat translation and how these metrics perform when compared to human-annotated scores. The results show that for ranking NMT models in chat translations, all metrics seem consistent in deciding which model outperforms the others. This implies that traditional baseline metrics, which are faster and simpler to use, can still be helpful. On the other hand, when it comes to better correlation with human judgment, neural-based metrics outperform traditional metrics, with COMET achieving the highest correlation with the human-annotated score on a chat translation. However, we show that even the best metric struggles when scoring English translations from sentences with anaphoric zero-pronoun in Japanese.
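
As a rough illustration of what "correlation with human judgment" means for a metric, the sketch below computes a Pearson correlation between per-sentence metric scores and human scores. This is a minimal sketch under stated assumptions: the abstract does not say which correlation coefficient the authors use, and the function name `pearson` and all scores in the usage example are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists.

    Assumption: Pearson is used here only for illustration; the paper
    may rely on a different coefficient (e.g. a rank correlation).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-sentence scores, invented for this example only.
    human = [4.0, 2.5, 3.0, 1.0, 4.5]
    metric_a = [0.81, 0.55, 0.60, 0.30, 0.90]
    metric_b = [0.40, 0.42, 0.35, 0.38, 0.41]
    print("metric_a vs human:", round(pearson(metric_a, human), 3))
    print("metric_b vs human:", round(pearson(metric_b, human), 3))
```

A metric whose scores track the human scores more closely yields a coefficient nearer to 1, which is the sense in which one metric "correlates better" than another.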

[NLP-33] On the Applicability of Zero-Shot Cross-Lingual Transfer Learning for Sentiment Classification in Distant Language Pairs WWW

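[Quick read]: This paper explores the applicability of cross-lingual transfer learning from English to Japanese and Indonesian, focusing on the XLM-R pre-trained model. The core question is how well zero-shot transfer learning performs on multilingual tasks, and in particular whether XLM-R can achieve competitive results against existing zero-shot or fully supervised approaches without being trained on target-language data. The approach relies on XLM-R, a multilingual pre-trained model, to handle multiple languages through cross-lingual transfer instead of training a separate model per language. According to the abstract, the models achieve the best result on one Japanese dataset and comparable results on the other Japanese and Indonesian datasets, which points to the potential of multilingual models.

Link: https://arxiv.org/abs/2412.18188
Authors: Andre Rusli,Makoto Shishido
Affiliation: Unknown
Keywords: cross-lingual transfer learning, Japanese and Indonesian, transfer learning, XLM-R pre-trained model, research explores
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the 28th Annual Meeting of the Association for Natural Language Processing (NLP2022). Published version available at this https URL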

Abstract:This research explores the applicability of cross-lingual transfer learning from English to Japanese and Indonesian using the XLM-R pre-trained model. The results are compared with several previous works, either by models using a similar zero-shot approach or a fully-supervised approach, to provide an overview of the zero-shot transfer learning approach’s capability using XLM-R in comparison with existing models. Our models achieve the best result in one Japanese dataset and comparable results in other datasets in Japanese and Indonesian languages without being trained using the target language. Furthermore, the results suggest that it is possible to train a multi-lingual model, instead of one model for each language, and achieve promising results.

[NLP-34] Survey of Pseudonymization Abstractive Summarization Spell Checker for Hindi and Marathi

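[Quick read]: This paper addresses the lack of natural language processing (NLP) technology for Indian regional languages such as Hindi and Marathi. Although NLP applications for widely spoken languages such as English have advanced considerably, NLP research for Indian regional languages is still at a formative stage and has large room for growth. The core proposal is a multilingual platform that supports text anonymization, abstractive text summarization, and spell checking in English, Hindi, and Marathi. The platform targets enterprise and consumer clients who predominantly use Indian regional languages, with the aim of advancing NLP technology and adoption for these languages.

Link: https://arxiv.org/abs/2412.18163
Authors: Rasika Ransing,Mohammed Amaan Dhamaskar,Ayush Rajpurohit,Amey Dhoke,Sanket Dalvi
Affiliation: Unknown
Keywords: Natural Language Processing, India vast linguistic, vast linguistic diversity, linguistic diversity presents, diversity presents unique
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: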

Abstract:India’s vast linguistic diversity presents unique challenges and opportunities for technological advancement, especially in the realm of Natural Language Processing (NLP). While there has been significant progress in NLP applications for widely spoken languages, the regional languages of India, such as Marathi and Hindi, remain underserved. Research in the field of NLP for Indian regional languages is at a formative stage and holds immense significance. The paper aims to build a platform which enables the user to use various features like text anonymization, abstractive text summarization and spell checking in English, Hindi and Marathi language. The aim of these tools is to serve enterprise and consumer clients who predominantly use Indian Regional Languages.

[NLP-35] CoAM: Corpus of All-Type Multiword Expressions

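[Quick read]: This paper addresses the problem that datasets for multiword expression (MWE) identification are inconsistently annotated, limited to a single MWE type, or limited in size. The authors build CoAM (Corpus of All-Type Multiword Expressions), a dataset of 1.3K sentences whose quality is improved through a multi-step process of human annotation, human review, and automated consistency checking. MWEs in CoAM are tagged with their type (such as Noun and Verb) to support fine-grained error analysis. The authors also developed a new annotation interface that allows flexible annotation of MWEs in any form, including discontinuous ones. Experiments show that a large language model fine-tuned on CoAM outperforms the current state-of-the-art approach for MWE identification. The key contribution is a high-quality, comprehensively annotated dataset whose fine-grained type tags reveal differences in identification difficulty across MWE types.

Link: https://arxiv.org/abs/2412.18151
Authors: Yusuke Ide,Joshua Tanner,Adam Nohejl,Jacob Hoffman,Justin Vasselli,Hidetaka Kamigaito,Taro Watanabe
Affiliation: Unknown
Keywords: Multiword expressions, MWE identification, All-Type Multiword Expressions, refer to idiomatic, multiple words
Categories: Computation and Language (cs.CL)
Comments: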

Abstract:Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation. Existing datasets for MWE identification are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process to enhance data quality consisting of human annotation, human review, and automated consistency checking. MWEs in CoAM are tagged with MWE types, such as Noun and Verb, to enable fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form, including discontinuous ones. Through experiments using CoAM, we find that a fine-tuned large language model outperforms the current state-of-the-art approach for MWE identification. Furthermore, analysis using our MWE type tagged data reveals that Verb MWEs are easier than Noun MWEs to identify across approaches.

[NLP-36] Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media

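[Quick read]: This paper aims to quantify, monitor, and analyze AI-Generated Texts (AIGTs) on social media platforms, whose misuse could have profound effects on public opinion, such as spreading misinformation and manipulating narratives. Despite the growing importance of the issue, no systematic study had assessed how prevalent AIGTs are on social media. To fill this gap, the paper first collects a dataset (SM-D) of around 2.4 million posts from three major platforms: Medium, Quora, and Reddit. It then builds a diverse benchmark dataset (AIGTBench), combining open-source datasets with AIGT datasets generated by 12 large language models (LLMs), to train and evaluate AIGT detectors. On this basis the paper identifies the best-performing detector (OSM-Det) and applies it to SM-D to track AIGTs from January 2022 to October 2024. The AI Attribution Rate (AAR) rises sharply on Medium and Quora, from 1.77% to 37.03% and from 2.06% to 38.95% respectively, while on Reddit it grows more slowly, from 1.31% to 2.45%. Further analysis shows that AIGTs differ from human-written texts along several dimensions, including linguistic patterns, topic distributions, engagement levels, and the follower distribution of authors. The findings offer a reference point for future research on AIGTs in social media.

Link: https://arxiv.org/abs/2412.18148
Authors: Zhen Sun,Zongmin Zhang,Xinyue Shen,Ziyi Zhang,Yule Liu,Michael Backes,Yang Zhang,Xinlei He
Affiliation: Unknown
Keywords: Social media platforms, Social media, media platforms, media, Social
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
Comments: 24 pages, 18 figures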

Abstract:Social media platforms are experiencing a growing presence of AI-Generated Texts (AIGTs). However, the misuse of AIGTs could have profound implications for public opinion, such as spreading misinformation and manipulating narratives. Despite its importance, a systematic study to assess the prevalence of AIGTs on social media is still lacking. To address this gap, this paper aims to quantify, monitor, and analyze the AIGTs on online social media platforms. We first collect a dataset (SM-D) with around 2.4M posts from 3 major social media platforms: Medium, Quora, and Reddit. Then, we construct a diverse dataset (AIGTBench) to train and evaluate AIGT detectors. AIGTBench combines popular open-source datasets and our AIGT datasets generated from social media texts by 12 LLMs, serving as a benchmark for evaluating mainstream detectors. With this setup, we identify the best-performing detector (OSM-Det). We then apply OSM-Det to SM-D to track AIGTs over time and observe different trends of AI Attribution Rate (AAR) across social media platforms from January 2022 to October 2024. Specifically, Medium and Quora exhibit marked increases in AAR, rising from 1.77% to 37.03% and 2.06% to 38.95%, respectively. In contrast, Reddit shows slower growth, with AAR increasing from 1.31% to 2.45% over the same period. Our further analysis indicates that AIGTs differ from human-written texts across several dimensions, including linguistic patterns, topic distributions, engagement levels, and the follower distribution of authors. We envision our analysis and findings on AIGTs in social media can shed light on future research in this domain.

[NLP-37] Ensuring Consistency for In-Image Translation

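[Quick read]: This paper addresses consistency in the in-image machine translation task, both during translation and during image generation. Existing methods often neglect this aspect, so the translated result can be inconsistent with the background and style of the image. The paper identifies two consistency requirements: translation consistency and image generation consistency. Translation consistency means incorporating image information during translation; image generation consistency means keeping the style of the generated text-image consistent with the original image while preserving background integrity. To meet these requirements, the paper proposes HCIIT (High-Consistency In-Image Translation), a two-stage framework. The first stage performs text-image translation with a multimodal multilingual large language model and uses chain of thought learning to strengthen the model's use of image information during translation. The second stage uses a diffusion model for image backfilling, which keeps the text-image style consistent and preserves background details. The paper also curates a dataset of 400,000 style-consistent pseudo text-image pairs for model training. Experimental results indicate that the framework is effective at ensuring consistency and producing high-quality translated images.

Link: https://arxiv.org/abs/2412.18139
Authors: Chengpeng Fu,Xiaocheng Feng,Yichong Huang,Wenshuai Huo,Baohang Li,Zhirui Zhang,Yunfei Lu,Dandan Tu,Duyu Tang,Hui Wang,Bing Qin,Ting Liu
Affiliation: Unknown
Keywords: machine translation task, translating text embedded, task involves translating, translation, image
Categories: Computation and Language (cs.CL)
Comments: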

Abstract:The in-image machine translation task involves translating text embedded within images, with the translated results presented in image format. While this task has numerous applications in various scenarios such as film poster translation and everyday scene image translation, existing methods frequently neglect the aspect of consistency throughout this process. We propose the need to uphold two types of consistency in this task: translation consistency and image generation consistency. The former entails incorporating image information during translation, while the latter involves maintaining consistency between the style of the text-image and the original image, ensuring background integrity. To address these consistency requirements, we introduce a novel two-stage framework named HCIIT (High-Consistency In-Image Translation) which involves text-image translation using a multimodal multilingual large language model in the first stage and image backfilling with a diffusion model in the second stage. Chain of thought learning is utilized in the first stage to enhance the model’s ability to leverage image information during translation. Subsequently, a diffusion model trained for style-consistent text-image generation ensures uniformity in text style within images and preserves background details. A dataset comprising 400,000 style-consistent pseudo text-image pairs is curated for model training. Results obtained on both curated test sets and authentic image test sets validate the effectiveness of our framework in ensuring consistency and producing high-quality translated images.

[NLP-38] LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment

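[Quick read]: This paper addresses the problem that, when large language models (LLMs) are deployed on edge devices, existing one-size-fits-all quantization methods cannot dynamically adjust memory consumption to specific hardware characteristics and usage scenarios. To overcome this limitation, the paper proposes LSAQ (Layer-Specific Adaptive Quantization), a system for adaptive quantization and dynamic deployment based on layer importance. LSAQ evaluates layer importance by constructing top-k token sets from the inputs and outputs of each layer and computing their Jaccard coefficient. It then adjusts the quantization strategy in real time according to the resources available on the edge device, assigning different precision levels to layers of different importance. This approach significantly reduces the storage requirements of LLMs while maintaining model performance, enabling efficient deployment across diverse hardware platforms and usage scenarios.

Link: https://arxiv.org/abs/2412.18135
Authors: Binrui Zeng,Bin Ji,Xiaodong Liu,Jie Yu,Shasha Li,Jun Ma,Xiaopeng Li,Shangwen Wang,Xinran Hong
Affiliation: Unknown
Keywords: demonstrate exceptional performance, large language models, edge devices, demonstrate exceptional, large language
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 4 figures, work in progress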

Abstract:As large language models (LLMs) demonstrate exceptional performance across various domains, the deployment of these models on edge devices has emerged as a new trend. Quantization techniques, which reduce the size and memory footprint of LLMs, are effective for enabling deployment on resource-constrained edge devices. However, existing one-size-fits-all quantization methods often fail to dynamically adjust the memory consumption of LLMs based on specific hardware characteristics and usage scenarios. To address this limitation, we propose LSAQ (Layer-Specific Adaptive Quantization), a system for adaptive quantization and dynamic deployment of LLMs based on layer importance. LSAQ evaluates layer importance by constructing top-k token sets from the inputs and outputs of each layer and calculating their Jaccard coefficient. Using this evaluation, the system adaptively adjusts quantization strategies in real time according to the resource availability of edge devices, assigning different precision levels to layers of varying importance. This approach significantly reduces the storage requirements of LLMs while maintaining model performance, enabling efficient deployment across diverse hardware platforms and usage scenarios.
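
The layer-importance idea in this abstract can be sketched with plain set arithmetic. The code below is a minimal sketch, not the authors' implementation. It assumes that each layer's input and output have already been projected to per-token vocabulary scores, and that a layer which changes its top-k token set more (lower Jaccard coefficient) counts as more important. Both points are assumptions, since the abstract only states that top-k token sets and their Jaccard coefficient are used. All function names are hypothetical.

```python
from typing import Sequence, Set


def top_k_tokens(scores: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest scores (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard coefficient |A & B| / |A | B|; 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def layer_importance(
    input_scores: Sequence[float], output_scores: Sequence[float], k: int
) -> float:
    """Assumed importance score: 1 - Jaccard(top-k of input, top-k of output).

    Assumption: the vocabulary scores come from projecting the layer's
    hidden states; how LSAQ obtains them is not stated in the abstract.
    """
    before = top_k_tokens(input_scores, k)
    after = top_k_tokens(output_scores, k)
    return 1.0 - jaccard(before, after)


if __name__ == "__main__":
    # Toy vocabulary of five tokens with made-up scores.
    layer_in = [0.1, 2.0, 1.5, -0.3, 0.9]
    layer_out = [1.2, 2.1, 0.0, 0.4, 1.8]
    print(layer_importance(layer_in, layer_out, k=3))
```

Under this assumed scoring, layers could then be ranked and the less important ones assigned lower precision when device memory is tight.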

[NLP-39] AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models

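[Quick read]: This paper addresses the safety problem of text-to-image (T2I) models being used to generate Not-Safe-for-Work (NSFW) images. Current internal safeguards frequently degrade image quality, while external detection methods suffer from low accuracy and inefficiency. The paper proposes AEIOU, a defense framework that is Adaptable, Efficient, Interpretable, Optimizable, and Unified. AEIOU extracts NSFW features from the hidden states of the model's text encoder and uses the separability of these features to detect NSFW prompts. Detection is efficient, can be optimized through data augmentation techniques, and comes with real-time interpretation of results. Experiments show that AEIOU achieves over 95% accuracy across all datasets, improves efficiency by at least tenfold, counters adaptive attacks effectively, and performs well in few-shot and multi-label scenarios.

Link: https://arxiv.org/abs/2412.18123
Authors: Yiming Wang,Jiahao Chen,Qingming Li,Xing Yang,Shouling Ji
Affiliation: Unknown
Keywords: gain widespread adoption, widespread adoption, increasingly prominent, continue to advance, advance and gain
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: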

Abstract:As text-to-image (T2I) models continue to advance and gain widespread adoption, their associated safety issues are becoming increasingly prominent. Malicious users often exploit these models to generate Not-Safe-for-Work (NSFW) images using harmful or adversarial prompts, highlighting the critical need for robust safeguards to ensure the integrity and compliance of model outputs. Current internal safeguards frequently degrade image quality, while external detection methods often suffer from low accuracy and inefficiency. In this paper, we introduce AEIOU, a defense framework that is Adaptable, Efficient, Interpretable, Optimizable, and Unified against NSFW prompts in T2I models. AEIOU extracts NSFW features from the hidden states of the model’s text encoder, utilizing the separable nature of these features to detect NSFW prompts. The detection process is efficient, requiring minimal inference time. AEIOU also offers real-time interpretation of results and supports optimization through data augmentation techniques. The framework is versatile, accommodating various T2I architectures. Our extensive experiments show that AEIOU significantly outperforms both commercial and open-source moderation tools, achieving over 95% accuracy across all datasets and improving efficiency by at least tenfold. It effectively counters adaptive attacks and excels in few-shot and multi-label scenarios.

[NLP-40] Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm

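[Quick read]: This paper addresses how to interpret poor model performance when cognitive tasks designed for humans are applied to language models. Specifically, when performance declines on 2-back and 3-back tasks, researchers need to tell whether the decline reflects a limit in the cognitive ability being tested (such as working memory capacity) or a failure to understand the task itself. By analyzing a range of open-source language models on these tasks, the paper argues that poor performance mainly reflects limitations in task comprehension and task set maintenance, not a working memory capacity limit. The authors also push the best-performing model to higher n values, try alternative prompting strategies, and analyze model attention. The main contribution is to clarify, through experiments and analysis, the root cause of poor performance and to suggest improvements to methods for the cognitive evaluation of language models.

Link: https://arxiv.org/abs/2412.18120
Authors: Xiaoyang Hu,Richard L. Lewis
Affiliation: Unknown
Keywords: tasks originally developed, originally developed, Cognitive tasks originally, Cognitive, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: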

Abstract:Cognitive tasks originally developed for humans are now increasingly used to study language models. While applying these tasks is often straightforward, interpreting their results can be challenging. In particular, when a model underperforms, it’s often unclear whether this results from a limitation in the cognitive ability being tested or a failure to understand the task itself. A recent study argued that GPT 3.5’s declining performance on 2-back and 3-back tasks reflects a working memory capacity limit similar to humans. By analyzing a range of open-source language models of varying performance levels on these tasks, we show that the poor performance instead reflects a limitation in task comprehension and task set maintenance. In addition, we push the best performing model to higher n values and experiment with alternative prompting strategies, before analyzing model attentions. Our larger aim is to contribute to the ongoing conversation around refining methodologies for the cognitive evaluation of language models.
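
For readers unfamiliar with the n-back paradigm, the sketch below generates the expected answers for a letter sequence: at each position the correct response is "match" if the current stimulus equals the one presented n steps earlier. This is a minimal sketch of the standard task definition only; the function names and the scoring helper are hypothetical and do not reflect the paper's prompts or evaluation code.

```python
from typing import List, Sequence


def n_back_targets(stimuli: Sequence[str], n: int) -> List[bool]:
    """True at position i when stimuli[i] equals stimuli[i - n].

    The first n positions have no stimulus n steps back, so they are
    always non-matches.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [i >= n and stimuli[i] == stimuli[i - n] for i in range(len(stimuli))]


def accuracy(predictions: Sequence[bool], targets: Sequence[bool]) -> float:
    """Fraction of positions where the predicted response is correct."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


if __name__ == "__main__":
    letters = list("ABABCBC")
    print(n_back_targets(letters, 2))
    print(n_back_targets(letters, 3))
```

A model that has understood the task should track the index offset n, so comparing its responses against these targets for several values of n helps separate task comprehension from memory load.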

[NLP-41] Molly: Making Large Language Model Agents Solve Python Problem More Logically

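[Quick read]: This paper addresses the problem that, when large language models (LLMs) are applied to computer programming education, the generated content may be inaccurate or may not match what learners need. Existing approaches such as fine-tuning and retrieval augmented generation (RAG) improve model behavior to some extent, but fine-tuning is resource-intensive and may weaken the model's generalization, while RAG reduces hallucination yet can produce factual content that is irrelevant during reasoning, which confuses learners. To address these issues, the paper proposes the Molly agent. It automatically parses the learner's questioning intent through scenario-based interaction, retrieves relevant documents precisely from a constructed knowledge base, and, at the generation stage, reflects on the generated response to ensure that it both aligns with factual content and effectively answers the user's query. Experiments indicate that the Molly agent provides more useful responses to Python programming questions.

Link: https://arxiv.org/abs/2412.18093
Authors: Rui Xiao,Jiong Wang,Lu Han,Na Zong,Han Wu
Affiliation: Unknown
Keywords: Applying large language, Applying large, large language models, teaching assists, assists has attracted
Categories: Computation and Language (cs.CL)
Comments: arXiv admin note: text overlap with arXiv:2402.07913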

Abstract:Applying large language models (LLMs) as teaching assists has attracted much attention as an integral part of intelligent education, particularly in computing courses. To reduce the gap between the LLMs and the computer programming education expert, fine-tuning and retrieval augmented generation (RAG) are the two mainstream methods in existing researches. However, fine-tuning for specific tasks is resource-intensive and may diminish the model's generalization capabilities. RAG can perform well on reducing the illusion of LLMs, but the generation of irrelevant factual content during reasoning can cause significant confusion for learners. To address these problems, we introduce the Molly agent, focusing on solving the proposed problem encountered by learners when learning Python programming language. Our agent automatically parses the learners’ questioning intent through a scenario-based interaction, enabling precise retrieval of relevant documents from the constructed knowledge base. At generation stage, the agent reflects on the generated responses to ensure that they not only align with factual content but also effectively answer the user’s queries. Extensive experimentation on a constructed Chinese Python QA dataset shows the effectiveness of the Molly agent, indicating an enhancement in its performance for providing useful responses to Python questions.

[NLP-42] Generating Traffic Scenarios via In-Context Learning to Learn Better Motion Planner

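[Quick read]: This paper addresses a challenge in motion planning for autonomous driving: motion planners trained on meticulously curated datasets struggle to capture rare but safety-critical scenarios, which may lead to incidents during testing. The paper proposes an inexpensive method for generating diverse critical traffic scenarios to train more robust motion planners. The approach has two parts. First, traffic scenarios are represented as scripts, which a simulator (such as CARLA) uses to generate the scenarios. Second, a Large Language Model (LLM) translates user-specified text descriptions into scripts through in-context learning, and the simulator produces the corresponding traffic scenarios. By generating a large number of safety-critical scenarios as synthetic training data, the paper evaluates how much the method helps motion planners. Experiments show that planners trained on a combination of synthetic and real-world data significantly outperform those trained on real-world data alone.

Link: https://arxiv.org/abs/2412.18086
Authors: Aizierjiang Aiersilan
Affiliation: Unknown
Keywords: motion planners, autonomous driving, scenarios, Motion, crucial component
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: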

Abstract:Motion planning is a crucial component in autonomous driving. State-of-the-art motion planners are trained on meticulously curated datasets, which are not only expensive to annotate but also insufficient in capturing rarely seen critical scenarios. Failing to account for such scenarios poses a significant risk to motion planners and may lead to incidents during testing. An intuitive solution is to manually compose such scenarios by programming and executing a simulator (e.g., CARLA). However, this approach incurs substantial human costs. Motivated by this, we propose an inexpensive method for generating diverse critical traffic scenarios to train more robust motion planners. First, we represent traffic scenarios as scripts, which are then used by the simulator to generate traffic scenarios. Next, we develop a method that accepts user-specified text descriptions, which a Large Language Model (LLM) translates into scripts using in-context learning. The output scripts are sent to the simulator that produces the corresponding traffic scenarios. As our method can generate abundant safety-critical traffic scenarios, we use them as synthetic training data for motion planners. To demonstrate the value of generated scenarios, we train existing motion planners on our synthetic data, real-world datasets, and a combination of both. Our experiments show that motion planners trained with our data significantly outperform those trained solely on real-world data, showing the usefulness of our synthetic data and the effectiveness of our data generation method. Our source code is available at this https URL.

[NLP-43] MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

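[Quick read]: This paper addresses the limitation that no single model can handle every visual task a potential user might need. Advances in foundation models, vision-language models, and fine-tuning techniques have produced many general and special-purpose models, yet despite their flexibility and accessibility, none of them covers all tasks and applications. Recent approaches such as visual programming and tool-integrated multimodal LLMs tackle complex visual tasks through program synthesis, but they overlook user constraints (such as performance and computational needs), produce solutions that are hard to deploy, and may require users to write low-level instructions. The paper proposes MMFactory, a framework whose model and metrics routing components act as a solution search engine. Given a task description, a few sample input-output pairs, and (optionally) resource and performance constraints, MMFactory instantiates and combines vision-language tools from its model repository to suggest a diverse pool of programmatic solutions. It also proposes metrics and benchmarks performance and resource characteristics, helping users pick a solution that meets their design constraints. On the technical side, MMFactory introduces a committee-based solution proposer that uses multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications.

Link: https://arxiv.org/abs/2412.18072
Authors: Wan-Cyuan Fan,Tanzila Rahman,Leonid Sigal
Affiliation: Unknown
Keywords: effective fine-tuning techniques, fine-tuning techniques, advances in foundational, foundational and vision-language, effective fine-tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: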

Abstract:With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools aim to tackle complex visual tasks, by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test-time sample-specific solutions that are difficult to deploy, and, sometimes, require low-level instructions that maybe beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description and few sample input-output pairs and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduced a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. Project page is available at this https URL.

[NLP-44] Improving Factuality with Explicit Working Memory

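[Quick read]: This paper addresses the problem of large language models producing factually inaccurate content (hallucination) when generating long-form text. Prior methods based on retrieval-augmented generation (RAG) improve factuality through iterative prompting but are limited by the traditional RAG design. The paper proposes EWE (Explicit Working Memory), which improves factuality in long-form generation by adding a working memory module. The memory receives real-time feedback from external resources and is updated based on online fact-checking and retrieval feedback, so false claims can be corrected during generation and the output becomes more accurate and reliable. Experiments show that EWE outperforms strong baselines on four fact-seeking long-form generation datasets, raising the factuality metric VeriScore by 2 to 10 absolute points without sacrificing the helpfulness of responses. Further analysis shows that the design of the memory update rules, the configuration of memory units, and the quality of the retrieval datastore are key factors in model performance.

Link: https://arxiv.org/abs/2412.18069
Authors: Mingda Chen,Yang Li,Karthik Padthe,Rulin Shao,Alicia Sun,Luke Zettlemoyer,Gargi Gosh,Wen-tau Yih
Affiliation: Unknown
Keywords: factually inaccurate content, Large language models, generate factually inaccurate, Large language, inaccurate content
Categories: Computation and Language (cs.CL)
Comments: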

Abstract:Large language models can generate factually inaccurate content, a problem known as hallucination. Recent works have built upon retrieved-augmented generation to improve factuality through iterative prompting but these methods are limited by the traditional RAG design. To address these challenges, we introduce EWE (Explicit Working Memory), a novel approach that enhances factuality in long-form text generation by integrating a working memory that receives real-time feedback from external resources. The memory is refreshed based on online fact-checking and retrieval feedback, allowing EWE to rectify false claims during the generation process and ensure more accurate and reliable outputs. Our experiments demonstrate that Ewe outperforms strong baselines on four fact-seeking long-form generation datasets, increasing the factuality metric, VeriScore, by 2 to 10 points absolute without sacrificing the helpfulness of the responses. Further analysis reveals that the design of rules for memory updates, configurations of memory units, and the quality of the retrieval datastore are crucial factors for influencing model performance.

[NLP-45] Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction

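[Quick read]: This paper addresses turn-taking prediction in conversation: anticipating when the current speaker will yield the turn so that another speaker can begin. It proposes a multimodal ensemble that combines large language models (LLMs) with voice activity projection (VAP) models. By pairing the linguistic capabilities of LLMs with the temporal precision of VAP models, the approach aims to identify transition relevance places (TRPs) more accurately and efficiently in both scripted and unscripted conversations. The methods are evaluated on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models and proposing a potentially more robust prediction framework.

Link: https://arxiv.org/abs/2412.18061
Authors: Hyunbae Jeon, Frederic Guintu, Rayvant Sahni
Affiliations: Unknown
Keywords: begin speaking, Turn-taking prediction, task of anticipating, conversation will yield, yield their turn
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
Comments: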

Abstract:Turn-taking prediction is the task of anticipating when the speaker in a conversation will yield their turn to another speaker to begin speaking. This project expands on existing strategies for turn-taking prediction by employing a multi-modal ensemble approach that integrates large language models (LLMs) and voice activity projection (VAP) models. By combining the linguistic capabilities of LLMs with the temporal precision of VAP models, we aim to improve the accuracy and efficiency of identifying TRPs in both scripted and unscripted conversational scenarios. Our methods are evaluated on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models while proposing a potentially more robust framework for enhanced prediction.

[NLP-46] Neuron Empirical Gradient: Connecting Neurons Linear Controllability and Representational Capacity

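[Quick read]: This paper studies how neurons in the feed-forward layers of pre-trained language models (PLMs) store and represent knowledge, in particular the poorly understood quantitative relationship between neuron activations and model output. Using neuron-wise interventions on factual probing datasets, the authors reveal a linear relationship between neuron activations and output token probabilities and call its gradient the "neuron empirical gradient". The key contribution is NeurGrad, an efficient method for computing these gradients to support quantitative neuron analysis. The paper also introduces MCEval8k, a multi-choice knowledge evaluation benchmark, to confirm that neuron empirical gradients capture knowledge, and examines the efficiency, generality, inclusivity, and interdependency of skill neurons. These findings link knowledge to PLM outputs via neuron empirical gradients and shed light on how PLMs store knowledge.

Link: https://arxiv.org/abs/2412.18053
Authors: Xin Zhao, Zehui Jiang, Naoki Yoshinaga
Affiliations: Unknown
Keywords: pre-trained language models, analyses remain qualitative, prior analyses remain, model output poorly, output poorly understood
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 29 pages, 18 figures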

Abstract:Although neurons in the feed-forward layers of pre-trained language models (PLMs) can store factual knowledge, most prior analyses remain qualitative, leaving the quantitative relationship among knowledge representation, neuron activations, and model output poorly understood. In this study, by performing neuron-wise interventions using factual probing datasets, we first reveal the linear relationship between neuron activations and output token probabilities. We refer to the gradient of this linear relationship as "neuron empirical gradients" and propose NeurGrad, an efficient method for their calculation to facilitate quantitative neuron analysis. We next investigate whether neuron empirical gradients in PLMs encode general task knowledge by probing skill neurons. To this end, we introduce MCEval8k, a multi-choice knowledge evaluation benchmark spanning six genres and 22 tasks. Our experiments confirm that neuron empirical gradients effectively capture knowledge, while skill neurons exhibit efficiency, generality, inclusivity, and interdependency. These findings link knowledge to PLM outputs via neuron empirical gradients, shedding light on how PLMs store knowledge. The code and dataset are released.
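
The linear relationship between a neuron's activation and an output token probability can be illustrated with a toy estimate. The sketch below is not NeurGrad and does not touch a real model: `toy_prob` is a hypothetical stand-in for "probability of the target token as a function of one neuron's intervened activation", and the intervention values are assumed. It only shows how a slope could be read off a handful of interventions by least squares.

```python
# Toy sketch: estimate an "empirical gradient" as the least-squares slope of
# output probability against intervened neuron activation.
# `toy_prob` is a hypothetical stand-in for a real model, not the paper's code.


def toy_prob(activation: float) -> float:
    """Pretend output-token probability, linear in the neuron activation."""
    return 0.30 + 0.05 * activation


def empirical_gradient(prob_fn, activations):
    """Least-squares slope of prob_fn over the given activation values."""
    probs = [prob_fn(a) for a in activations]
    n = len(activations)
    mean_a = sum(activations) / n
    mean_p = sum(probs) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(activations, probs))
    var = sum((a - mean_a) ** 2 for a in activations)
    return cov / var


shifts = [-2.0, -1.0, 0.0, 1.0, 2.0]  # intervention values (assumed range)
slope = empirical_gradient(toy_prob, shifts)
```

With a real model, `toy_prob` would be replaced by a forward pass that overrides one neuron's activation, which is far more expensive and is presumably the cost an efficient method aims to avoid.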

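[NLP-47] Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations

[Quick read]: This paper evaluates the factual accuracy and citation performance of modern large language models (LLMs) on complex, realistic tasks, specifically question answering (QA) in ambiguous settings. It evaluates two leading LLMs (GPT-4o-mini and Claude-3.5) on three recently published datasets (DisentQA-DupliCite, DisentQA-ParaCite, and AmbigQA-Cite). The models predict at least one correct answer in ambiguous contexts but struggle with multiple valid answers, and they perform poorly at citation generation, with citation accuracy consistently at 0. The key remedy is conflict-aware prompting, which substantially improves handling of multiple valid answers and citation accuracy while preserving the ability to predict correct answers. The findings highlight challenges and opportunities in building LLMs that handle ambiguity and provide reliable citations, and lay a foundation for more trustworthy and interpretable QA systems.

Link: https://arxiv.org/abs/2412.18051
Authors: Maya Patel, Aditi Anand
Affiliations: Unknown
Keywords: modern large language, advancing their development, complex and realistic, Question Answering, Benchmarking modern large
Subjects: Computation and Language (cs.CL)
Comments: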

Abstract:Benchmarking modern large language models (LLMs) on complex and realistic tasks is critical to advancing their development. In this work, we evaluate the factual accuracy and citation performance of state-of-the-art LLMs on the task of Question Answering (QA) in ambiguous settings with source citations. Using three recently published datasets (DisentQA-DupliCite, DisentQA-ParaCite, and AmbigQA-Cite) featuring a range of real-world ambiguities, we analyze the performance of two leading LLMs, GPT-4o-mini and Claude-3.5. Our results show that larger, recent models consistently predict at least one correct answer in ambiguous contexts but fail to handle cases with multiple valid answers. Additionally, all models perform equally poorly in citation generation, with citation accuracy consistently at 0. However, introducing conflict-aware prompting leads to large improvements, enabling models to better address multiple valid answers and improve citation accuracy, while maintaining their ability to predict correct answers. These findings highlight the challenges and opportunities in developing LLMs that can handle ambiguity and provide reliable source citations. Our benchmarking study provides critical insights and sets a foundation for future improvements in trustworthy and interpretable QA systems.

[NLP-48] Emoji Retrieval from Gibberish or Garbled Social Media Text: A Novel Methodology and A Case Study

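[Quick read]: This paper addresses the loss of emojis in noisy or garbled social media text, which poses challenges for data analysis and machine learning. Conventional preprocessing recommends removing such text, risking the loss of emojis and their contextual meaning. The paper proposes a three-step reverse-engineering methodology for retrieving emojis from garbled text in social media posts and identifies why such text is generated during social media data mining. Retrieving the emojis improves text readability and coherence, as shown by readability metrics such as Flesch Reading Ease and Flesch-Kincaid Grade Level. The paper also analyzes the frequency and usage patterns of individual emojis.

Link: https://arxiv.org/abs/2412.18046
Authors: Shuqi Cui, Nirmalya Thakur, Audrey Poon
Affiliations: Unknown
Keywords: posing challenges, machine learning, lost in noisy, analysis and machine, social media platforms
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: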

Abstract:Emojis are widely used across social media platforms but are often lost in noisy or garbled text, posing challenges for data analysis and machine learning. Conventional preprocessing approaches recommend removing such text, risking the loss of emojis and their contextual meaning. This paper proposes a three-step reverse-engineering methodology to retrieve emojis from garbled text in social media posts. The methodology also identifies reasons for the generation of such text during social media data mining. To evaluate its effectiveness, the approach was applied to 509,248 Tweets about the Mpox outbreak, a dataset referenced in about 30 prior works that failed to retrieve emojis from garbled text. Our method retrieved 157,748 emojis from 76,914 Tweets. Improvements in text readability and coherence were demonstrated through metrics such as Flesch Reading Ease, Flesch-Kincaid Grade Level, Coleman-Liau Index, Automated Readability Index, Dale-Chall Readability Score, Text Standard, and Reading Time. Additionally, the frequency of individual emojis and their patterns of usage in these Tweets were analyzed, and the results are presented.
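
Of the readability metrics listed above, Flesch Reading Ease has a simple closed form, so a small sketch may help show what is being measured. The formula itself is the standard one; the syllable counter is a rough vowel-group heuristic assumed here for illustration, and nothing below is taken from the paper's own pipeline.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula computed from raw counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def score_text(text: str) -> float:
    """Score non-empty English text; higher values mean easier reading."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)
```

Comparing `score_text` on a post before and after garbled spans are repaired is one way such a metric could be used; the function assumes the text contains at least one word and one sentence.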

[NLP-49] Aligning AI Research with the Needs of Clinical Coding Workflows: Eight Recommendations Based on US Data Analysis and Critical Review

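[Quick read]: This paper addresses the mismatch between evaluation methods in clinical coding research and real clinical settings. Existing evaluations are often oversimplified, for example focusing on the 50 most common codes when thousands of codes are used in practice, which makes it hard to apply automated coding research in real clinical work. The paper offers eight specific recommendations for improving current evaluation methods. It also proposes new AI-based methods that go beyond automated coding and offer alternative ways to assist clinical coders in their workflows.

Link: https://arxiv.org/abs/2412.18043
Authors: Yidong Gan, Maciej Rybinski, Ben Hachey, Jonathan K. Kummerfeld
Affiliations: Unknown
Keywords: crucial for healthcare, healthcare billing, billing and data, Clinical coding, coding
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: We received a meta-review score of 5 in ARR October 2024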

Abstract:Clinical coding is crucial for healthcare billing and data analysis. Manual clinical coding is labour-intensive and error-prone, which has motivated research towards full automation of the process. However, our analysis, based on US English electronic health records and automated coding research using these records, shows that widely used evaluation methods are not aligned with real clinical contexts. For example, evaluations that focus on the top 50 most common codes are an oversimplification, as there are thousands of codes used in practice. This position paper aims to align AI coding research more closely with practical challenges of clinical coding. Based on our analysis, we offer eight specific recommendations, suggesting ways to improve current evaluation methods. Additionally, we propose new AI-based methods beyond automated coding, suggesting alternative approaches to assist clinical coders in their workflows.

[NLP-50] Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers

[Quick read]: This paper examines the theoretical limitations of Tensor Attention and Rotary Position Embedding (RoPE)-based Tensor Attention through the lens of circuit complexity. It shows that, with polynomial precision, constant-depth layers, and linear or sublinear hidden dimension, these models cannot solve fixed membership problems or $(A_{F,r})^*$ closure problems, assuming $\mathsf{TC}^0 \neq \mathsf{NC}^1$. The result exposes a gap between the empirical performance and the theoretical constraints of these architectures and offers insights for designing and scaling Transformer models on firmer theoretical ground.

Link: https://arxiv.org/abs/2412.18040
Authors: Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Mingda Wan
Affiliations: Unknown
Keywords: capturing high-order correlations, Rotary Position Embedding, Tensor Attention, Tensor Attention extends, Attention extends traditional
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
Comments:

Abstract:Tensor Attention extends traditional attention mechanisms by capturing high-order correlations across multiple modalities, addressing the limitations of classical matrix-based attention. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has shown superior performance in encoding positional information in long-context scenarios, significantly enhancing transformer models’ expressiveness. Despite these empirical successes, the theoretical limitations of these technologies remain underexplored. In this study, we analyze the circuit complexity of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention, showing that with polynomial precision, constant-depth layers, and linear or sublinear hidden dimension, they cannot solve fixed membership problems or $(A_{F,r})^*$ closure problems, under the assumption that $\mathsf{TC}^0 \neq \mathsf{NC}^1$. These findings highlight a gap between the empirical performance and theoretical constraints of Tensor Attention and $\mathsf{RoPE}$-based Tensor Attention Transformers, offering insights that could guide the development of more theoretically grounded approaches to Transformer model design and scaling.

[NLP-51] Explainability in Neural Networks for Natural Language Processing Tasks

[Quick read]: This paper addresses the "black-box" nature of neural networks in natural language processing (NLP) applications, where the inner workings are hard to understand. It applies Local Interpretable Model-Agnostic Explanations (LIME) to interpret a multi-layer perceptron (MLP) trained on a text classification task, analyzing how individual features contribute to predictions. LIME provides localized explanations that help explain the model's decisions, but it is limited in capturing global patterns and feature interactions. The study discusses LIME's strengths and shortcomings and proposes directions for future work toward more comprehensive interpretability.

Link: https://arxiv.org/abs/2412.18036
Authors: Melkamu Mersha, Mingiziem Bitewa, Tsion Abay, Jugal Kalita
Affiliations: Unknown
Keywords: creating significant challenges, natural language processing, Local Interpretable Model-Agnostic, creating significant, language processing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neural networks are widely regarded as black-box models, creating significant challenges in understanding their inner workings, especially in natural language processing (NLP) applications. To address this opacity, model explanation techniques like Local Interpretable Model-Agnostic Explanations (LIME) have emerged as essential tools for providing insights into the behavior of these complex systems. This study leverages LIME to interpret a multi-layer perceptron (MLP) neural network trained on a text classification task. By analyzing the contribution of individual features to model predictions, the LIME approach enhances interpretability and supports informed decision-making. Despite its effectiveness in offering localized explanations, LIME has limitations in capturing global patterns and feature interactions. This research highlights the strengths and shortcomings of LIME and proposes directions for future work to achieve more comprehensive interpretability in neural NLP models.

[NLP-52] Same Company, Same Signal: The Role of Identity in Earnings Call Transcripts

[Quick read]: This paper studies post-earnings volatility prediction, and in particular how much earnings call transcripts actually contribute to it. It introduces DEC, a dataset that uses the previously overlooked beforeAfterMarket attribute for accurate volatility calculation and provides 20 earnings records per ticker, far denser than established benchmarks. The key proposal is two training-free baselines, Post-earnings Volatility (PEV) and Same-ticker Post-earnings Volatility (STPEV), which use historical volatility to capture ticker-specific patterns and surpass all transcript-based models. The paper further shows that current transcript representations mainly capture ticker identity rather than financially meaningful insights specific to each earnings call.

Link: https://arxiv.org/abs/2412.18029
Authors: Ding Yu, Zhuo Liu, Hangfeng He
Affiliations: Unknown
Keywords: rich semantics contribute, Post-earnings volatility, leveraging earnings call, semantics contribute significantly, volatility
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Post-earnings volatility prediction is critical for investors, with previous works often leveraging earnings call transcripts under the assumption that their rich semantics contribute significantly. To further investigate how transcripts impact volatility, we introduce DEC, a dataset featuring accurate volatility calculations enabled by the previously overlooked beforeAfterMarket attribute and dense ticker coverage. Unlike established benchmarks, where each ticker has only around two earnings, DEC provides 20 earnings records per ticker. Using DEC, we reveal that post-earnings volatility undergoes significant shifts, with each ticker displaying a distinct volatility distribution. To leverage historical post-earnings volatility and capture ticker-specific patterns, we propose two training-free baselines: Post-earnings Volatility (PEV) and Same-ticker Post-earnings Volatility (STPEV). These baselines surpass all transcripts-based models on DEC as well as on established benchmarks. Additionally, we demonstrate that current transcript representations predominantly capture ticker identity rather than offering financially meaningful insights specific to each earnings. This is evidenced by two key observations: earnings representations from the same ticker exhibit significantly higher similarity compared to those from different tickers, and predictions from transcript-based models show strong correlations with prior post-earnings volatility.
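
The abstract does not spell out how PEV and STPEV are computed, so the following is only a guess at the flavor of a training-free baseline. It assumes that PEV predicts the mean of all historical post-earnings volatilities and that STPEV predicts the mean over the same ticker's history; the function names, the fallback for unseen tickers, and the data values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (ticker, post-earnings volatility) pairs; values are made up for illustration.
history = [("AAA", 0.10), ("AAA", 0.14), ("BBB", 0.40), ("BBB", 0.44)]


def pev(history):
    """Assumed PEV: mean post-earnings volatility over all past earnings."""
    return mean(v for _, v in history)


def stpev(history, ticker):
    """Assumed STPEV: mean over the same ticker's past earnings only.

    Falls back to the global mean for a ticker with no history (an assumption).
    """
    by_ticker = defaultdict(list)
    for t, v in history:
        by_ticker[t].append(v)
    return mean(by_ticker[ticker]) if ticker in by_ticker else pev(history)
```

In this toy data the two tickers have very different volatility levels, so the same-ticker average lands much closer to each ticker's own values than the global average does, which is the intuition behind using ticker identity as a signal.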

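[NLP-53] StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

[Quick read]: This paper targets three problems in evaluating large language models (LLMs): the high cost of human annotation, the bias of model-based evaluation, and the vulnerability of target-answer benchmarks to data contamination and cheating. It proposes StructTest, a benchmark that evaluates LLMs on their ability to produce compositionally specified structured outputs, giving an unbiased, cheap-to-run, and difficult-to-cheat measure. Evaluation is deterministic, using a rule-based evaluator that is easy to extend to new tasks. Tests across Summarization, Code, HTML, and Math suggest that StructTest is a good proxy for general reasoning ability and a useful complement to existing evaluation.

Link: https://arxiv.org/abs/2412.18011
Authors: Hailin Chen, Fangkai Jiao, Mathieu Ravaut, Nawshad Farruque, Xuan Phi Nguyen, Chengwei Qin, Manan Dey, Bosheng Ding, Caiming Xiong, Shafiq Joty, Yingbo Zhou
Affiliations: Unknown
Keywords: large language models, evaluating their capabilities, rapid development, development of large, large language
Subjects: Computation and Language (cs.CL)
Comments: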

Abstract:The rapid development of large language models (LLMs) necessitates robust, unbiased, and scalable methods for evaluating their capabilities. However, human annotations are expensive to scale, model-based evaluations are prone to biases in answer style, while target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to produce compositionally specified structured outputs as an unbiased, cheap-to-run and difficult-to-cheat measure. The evaluation is done deterministically by a rule-based evaluator, which can be easily extended to new tasks. By testing structured outputs across diverse task domains – including Summarization, Code, HTML and Math – we demonstrate that StructTest serves as a good proxy for general reasoning abilities, as producing structured outputs often requires internal logical reasoning. We believe that StructTest offers a critical, complementary approach to objective and robust model evaluation.
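
To make "deterministic, rule-based evaluation of a structured output" concrete, here is a minimal sketch of one such check. The rule is hypothetical and not taken from StructTest: it assumes an instruction of the form "answer with exactly N bullet points, each starting with `- ` and containing at most K words".

```python
def check_bullets(output: str, n_points: int, max_words: int) -> bool:
    """Return True only if the output follows the (hypothetical) bullet format."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if len(lines) != n_points:
        return False  # wrong number of bullet points
    for ln in lines:
        if not ln.startswith("- "):
            return False  # line is not a bullet
        if len(ln[2:].split()) > max_words:
            return False  # bullet exceeds the word limit
    return True
```

A check like this needs no reference answer and no judge model, which is why format rules are cheap to run and hard to game by memorizing test data.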

[NLP-54] Correctness is not Faithfulness in RAG Attributions

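[Quick read]: This paper addresses the reliability of the citations that language models attach to generated answers. Prior work mostly evaluates citation correctness, that is, whether cited documents support the generated statements, but correctness alone is not enough to establish trust. The paper distinguishes citation correctness from citation faithfulness, which requires that the model genuinely relies on the cited documents rather than post-rationalizing, meaning superficially aligning citations with its prior beliefs. Experiments reveal that post-rationalization is prevalent, with up to 57 percent of citations lacking faithfulness. Trustworthy attribution therefore requires evaluating both correctness and faithfulness.

Link: https://arxiv.org/abs/2412.18004
Authors: Jonas Wallat, Maria Heuss, Maarten de Rijke, Avishek Anand
Affiliations: Unknown
Keywords: Retrieving relevant context, Retrieving relevant, enhance answer reliability, relevant context, common approach
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 3 figures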

Abstract:Retrieving relevant context is a common approach to reduce hallucinations and enhance answer reliability. Explicitly citing source documents allows users to verify generated responses and increases trust. Prior work largely evaluates citation correctness - whether cited documents support the corresponding statements. But citation correctness alone is insufficient. To establish trust in attributed answers, we must examine both citation correctness and citation faithfulness. In this work, we first disentangle the notions of citation correctness and faithfulness, which have been applied inconsistently in previous studies. Faithfulness ensures that the model’s reliance on cited documents is genuine, reflecting actual reference use rather than superficial alignment with prior beliefs, which we call post-rationalization. We design an experiment that reveals the prevalent issue of post-rationalization, which undermines reliable attribution and may result in misplaced trust. Our findings suggest that current attributed answers often lack citation faithfulness (up to 57 percent of the citations), highlighting the need to evaluate correctness and faithfulness for trustworthy attribution in language models.

[NLP-55] CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models

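[Quick read]: This paper addresses the lack of benchmarks for the causal reasoning capabilities of large language models (LLMs). Existing benchmarks rely mainly on conversational tasks, academic math tests, and coding tests, which are limited in assessing the ability to solve real-world problems. The authors propose CARL-GT, a benchmark that evaluates causal reasoning using graphs and tabular data. It covers causal graph reasoning, knowledge discovery, and decision-making, with zero-shot learning prompts developed for the tasks. Experiments show that LLMs are still weak in causal reasoning, especially when using tabular data to discover new insights. Analysis of task relationships finds that performance correlates more strongly across tasks in different categories than across tasks in the same category.

Link: https://arxiv.org/abs/2412.17970
Authors: Ruibo Tu, Hedvig Kjellström, Gustav Eje Henter, Cheng Zhang
Affiliations: Unknown
Keywords: education and healthcare, Causal reasoning capabilities, large language models, LLMs, reasoning
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: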

Abstract:Causal reasoning capabilities are essential for large language models (LLMs) in a wide range of applications, such as education and healthcare. But there is still a lack of benchmarks for a better understanding of such capabilities. Current LLM benchmarks are mainly based on conversational tasks, academic math tests, and coding tests. Such benchmarks evaluate LLMs in well-regularized settings, but they are limited in assessing the skills and abilities to solve real-world problems. In this work, we provide a benchmark, named CARL-GT, which evaluates CAusal Reasoning capabilities of large Language models using Graphs and Tabular data. The benchmark has a diverse range of tasks for evaluating LLMs from causal graph reasoning, knowledge discovery, and decision-making aspects. In addition, effective zero-shot learning prompts are developed for the tasks. In our experiments, we leverage the benchmark for evaluating open-source LLMs and provide a detailed comparison of LLMs for causal reasoning abilities. We found that LLMs are still weak in causal reasoning, especially with tabular data to discover new insights. Furthermore, we investigate and discuss the relationships of different benchmark tasks by analyzing the performance of LLMs. The experimental results show that LLMs have different strengths over different tasks and that their performance on tasks in different categories, i.e., causal graph reasoning, knowledge discovery, and decision-making, shows stronger correlation than tasks in the same category.

[NLP-56] Path-of-Thoughts: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models

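[Quick read]: This paper addresses the weak performance of large language models (LLMs) on complex reasoning, particularly relational reasoning such as kinship or spatial reasoning. It proposes Path-of-Thoughts (PoT), a framework that decomposes relational reasoning into three stages: graph extraction, path identification, and reasoning. PoT extracts a task-agnostic graph of the key entities, relations, and attributes in the problem context, then identifies reasoning chains in the graph relevant to the question in order to infer candidate answers. On four benchmark datasets that demand long reasoning chains, PoT surpasses state-of-the-art baselines by up to 21.3% without fine-tuning or extensive LLM calls. By exploiting the compositional nature of graphs, PoT is also more resilient to LLM errors than prior neuro-symbolic methods.

Link: https://arxiv.org/abs/2412.17963
Authors: Ge Zhang, Mohammad Ali Alomrani, Hongjian Gu, Jiaming Zhou, Yaochen Hu, Bin Wang, Qun Liu, Mark Coates, Yingxue Zhang, Jianye Hao
Affiliations: Unknown
Keywords: Large language models, possess vast semantic, vast semantic knowledge, Large language, language models
Subjects: Computation and Language (cs.CL)
Comments: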

Abstract:Large language models (LLMs) possess vast semantic knowledge but often struggle with complex reasoning tasks, particularly in relational reasoning problems such as kinship or spatial reasoning. In this paper, we present Path-of-Thoughts (PoT), a novel framework designed to tackle relation reasoning by decomposing the task into three key stages: graph extraction, path identification, and reasoning. Unlike previous approaches, PoT efficiently extracts a task-agnostic graph that identifies crucial entities, relations, and attributes within the problem context. Subsequently, PoT identifies relevant reasoning chains within the graph corresponding to the posed question, facilitating inference of potential answers. Experimental evaluations on four benchmark datasets, demanding long reasoning chains, demonstrate that PoT surpasses state-of-the-art baselines by a significant margin (maximum 21.3%) without necessitating fine-tuning or extensive LLM calls. Furthermore, as opposed to prior neuro-symbolic methods, PoT exhibits improved resilience against LLM errors by leveraging the compositional nature of graphs.
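
The path identification stage can be pictured as a graph search between the entities named in the question. The sketch below is an illustration under stated assumptions, not the PoT implementation: the triples are hand-written stand-ins for what an LLM-based graph extraction step might return, the relation names are hypothetical, and plain breadth-first search is used as the path finder.

```python
from collections import deque

# Hand-written triples standing in for the graph-extraction stage (assumed).
triples = [
    ("Alice", "mother_of", "Bob"),
    ("Bob", "father_of", "Carol"),
    ("Carol", "sister_of", "Dan"),
]


def build_graph(triples):
    """Adjacency list: entity -> list of (relation, neighbor)."""
    graph = {}
    for head, rel, tail in triples:
        graph.setdefault(head, []).append((rel, tail))
        graph.setdefault(tail, [])
    return graph


def find_path(graph, source, target):
    """BFS; returns the relations along a shortest path, or None if unreachable."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, rels = queue.popleft()
        if node == target:
            return rels
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, rels + [rel]))
    return None


graph = build_graph(triples)
chain = find_path(graph, "Alice", "Carol")  # relation chain to reason over
```

The resulting relation chain (here, mother of a father) is what a final reasoning step, whether an LLM or a symbolic rule table, would compose into an answer such as "grandmother".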

[NLP-57] IITR-CIOL@NLU of Devanagari Script Languages 2025: Multilingual Hate Speech Detection and Target Identification in Devanagari-Scripted Languages

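[Quick read]: This paper tackles two subtasks for Devanagari-scripted languages (Hindi, Marathi, Nepali, Bhojpuri, and Sanskrit): hate speech detection (Subtask B) and hate speech target identification (Subtask C). Detection identifies hate speech in online text; target identification determines the specific target, such as individuals, organizations, or communities. The proposed MultilingualRobertaClass model is built on the pretrained multilingual transformer ia-multilingual-transliterated-roberta, uses contextualized embeddings to handle linguistic diversity, and adds a classifier head for binary classification. On the test set it reaches 88.40% accuracy on Subtask B and 66.11% on Subtask C.

Link: https://arxiv.org/abs/2412.17947
Authors: Siddhant Gupta, Siddh Singhal, Azmine Toushik Wasi
Affiliations: Unknown
Keywords: specifically Hindi, Devanagari-scripted languages, hate speech detection, identification in Devanagari-scripted, hate speech
Subjects: Computation and Language (cs.CL)
Comments: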

Abstract:This work focuses on two subtasks related to hate speech detection and target identification in Devanagari-scripted languages, specifically Hindi, Marathi, Nepali, Bhojpuri, and Sanskrit. Subtask B involves detecting hate speech in online text, while Subtask C requires identifying the specific targets of hate speech, such as individuals, organizations, or communities. We propose the MultilingualRobertaClass model, a deep neural network built on the pretrained multilingual transformer model ia-multilingual-transliterated-roberta, optimized for classification tasks in multilingual and transliterated contexts. The model leverages contextualized embeddings to handle linguistic diversity, with a classifier head for binary classification. We achieved 88.40% accuracy in Subtask B and 66.11% accuracy in Subtask C on the test set.

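[NLP-58] BenCzechMark: A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism

[Quick read]: This paper addresses the lack of a comprehensive benchmark for large language models (LLMs) in Czech. The authors present BenCzechMark (BCM), the first comprehensive Czech benchmark, with 50 challenging tasks spanning 8 categories and domains such as historical news, pupil essays, and spoken word. BCM offers diverse tasks, multiple task formats, and multiple evaluation metrics; its scoring system is grounded in statistical significance theory and aggregates across tasks in a way inspired by social preference theory. The authors also collect and clean the BUT-Large Czech Collection, the largest publicly available clean Czech corpus, and use it for contamination analysis and for continued pretraining of a Czech-centric 7B language model. This model serves as a baseline against publicly available multilingual models, and a leaderboard with 44 existing model submissions accepts new submissions.

Link: https://arxiv.org/abs/2412.17933
Authors: Martin Fajcik, Martin Docekal, Jan Dolezal, Karel Ondrej, Karel Beneš, Jan Kapsa, Pavel Smrz, Alexander Polok, Michal Hradis, Zuzana Neverilova, Ales Horak, Radoslav Sabol, Michal Stefanik, Adam Jirkovsky, David Adamczyk, Petr Hyner, Jan Hula, Hynek Kydlicek
Affiliations: Unknown
Keywords: multiple evaluation metrics, multiple task formats, comprehensive Czech language, multiple evaluation, present BenCzechMark
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: first version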

Abstract:We present BenCzechMark (BCM), the first comprehensive Czech language benchmark designed for large language models, offering diverse tasks, multiple task formats, and multiple evaluation metrics. Its scoring system is grounded in statistical significance theory and uses aggregation across tasks inspired by social preference theory. Our benchmark encompasses 50 challenging tasks, with corresponding test datasets, primarily in native Czech, with 11 newly collected ones. These tasks span 8 categories and cover diverse domains, including historical Czech news, essays from pupils or language learners, and spoken word. Furthermore, we collect and clean BUT-Large Czech Collection, the largest publicly available clean Czech language corpus, and use it for (i) contamination analysis, (ii) continuous pretraining of the first Czech-centric 7B language model, with Czech-specific tokenization. We use our model as a baseline for comparison with publicly available multilingual models. Lastly, we release and maintain a leaderboard, with existing 44 model submissions, where new model submissions can be made at this https URL.

[NLP-59] VITRO: Vocabulary Inversion for Time-series Representation Optimization ICASSP2025

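[Quick read]: This paper addresses the limitations of large language models (LLMs) on time series data. Although LLMs excel at processing and generating text, their pre-trained vocabularies are ill-suited to the nuanced temporal dynamics and patterns of time series, because the discrete, symbolic nature of language tokens does not match the continuous, numerical nature of time series. The paper proposes VITRO, which adapts textual inversion optimization from the vision-language domain to learn a new per-dataset time series vocabulary that bridges this gap. With learnable time series-specific pseudo-word embeddings, VITRO-enhanced methods achieve state-of-the-art long-term forecasting performance on most datasets.

Link: https://arxiv.org/abs/2412.17921
Authors: Filippos Bellos, Nam H. Nguyen, Jason J. Corso
Affiliations: Unknown
Keywords: demonstrated remarkable capabilities, nuanced temporal dynamics, time series data, time series, series data
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to ICASSP 2025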

Abstract:Although LLMs have demonstrated remarkable capabilities in processing and generating textual data, their pre-trained vocabularies are ill-suited for capturing the nuanced temporal dynamics and patterns inherent in time series. The discrete, symbolic nature of natural language tokens, which these vocabularies are designed to represent, does not align well with the continuous, numerical nature of time series data. To address this fundamental limitation, we propose VITRO. Our method adapts textual inversion optimization from the vision-language domain in order to learn a new time series per-dataset vocabulary that bridges the gap between the discrete, semantic nature of natural language and the continuous, numerical nature of time series data. We show that learnable time series-specific pseudo-word embeddings represent time series data better than existing general language model vocabularies, with VITRO-enhanced methods achieving state-of-the-art performance in long-term forecasting across most datasets.

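[NLP-60] A Multimodal Emotion Recognition System: Integrating Facial Expressions, Body Movement, Speech, and Spoken Language

[Quick read]: This paper addresses the subjectivity, bias, fatigue, and inconsistency that arise when traditional psychological evaluations depend on human observation and interpretation. It presents a multimodal emotion recognition system that integrates analysis of facial expressions, speech, spoken language, and body movement to capture subtle emotional cues that human evaluators often overlook. Combining these modalities yields a more robust and comprehensive assessment of emotional state, reducing the risk of misdiagnosis and overdiagnosis. Preliminary testing under simulated real-world conditions suggests the system can provide reliable emotional insights that improve diagnostic accuracy, as a complement to traditional evaluation in clinical and therapeutic settings.

Link: https://arxiv.org/abs/2412.17907
Authors: Kris Kraack
Affiliations: Unknown
Keywords: evaluations rely heavily, observation and interpretation, prone to subjectivity, psychological evaluations rely, rely heavily
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 6 figures, 3 tables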

Abstract:Traditional psychological evaluations rely heavily on human observation and interpretation, which are prone to subjectivity, bias, fatigue, and inconsistency. To address these limitations, this work presents a multimodal emotion recognition system that provides a standardised, objective, and data-driven tool to support evaluators, such as psychologists, psychiatrists, and clinicians. The system integrates recognition of facial expressions, speech, spoken language, and body movement analysis to capture subtle emotional cues that are often overlooked in human evaluations. By combining these modalities, the system provides more robust and comprehensive emotional state assessment, reducing the risk of mis- and overdiagnosis. Preliminary testing in a simulated real-world condition demonstrates the system’s potential to provide reliable emotional insights to improve the diagnostic accuracy. This work highlights the promise of automated multimodal analysis as a valuable complement to traditional psychological evaluation practices, with applications in clinical and therapeutic settings.

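[NLP-61] The Power of Adaptation: Boosting In-Context Learning through Adaptive Prompting

[Quick read]: This paper addresses how to select exemplars for in-context learning with large language models (LLMs). Current studies usually select exemplars all at once using uncertainty- or diversity-based strategies, and this non-adaptive approach can yield exemplars that are highly redundant in the knowledge they cover, reducing their overall informativeness. The paper proposes Adaptive-Prompt, which selects exemplars adaptively by using model feedback from previously chosen exemplars. Experiments show that the method significantly improves LLM performance across a variety of reasoning tasks.

Link: https://arxiv.org/abs/2412.17891
Authors: Shuzhang Cai, Twumasi Mensah-Boateng, Xander Kuksov, Jing Yuan, Shaojie Tang
Affiliations: Unknown
Keywords: Large Language Models, Large Language, demonstrated exceptional abilities, including generating solutions, complex reasoning problems
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: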

Abstract:Large Language Models (LLMs) have demonstrated exceptional abilities across a broad range of language-related tasks, including generating solutions to complex reasoning problems. An effective technique to enhance LLM performance is in-context learning, which encourages a step-by-step reasoning process by including explanatory examples to guide the model’s responses. However, selecting appropriate exemplars for the model poses a challenge, as each dataset demands a distinct set of exemplars to enable the LLM to learn effectively and perform well on the test set. Current studies often rely on uncertainty- or diversity-based selection strategies to select exemplars for annotation and to improve model learning. However, these studies typically employ a non-adaptive approach, selecting a set of exemplars all at once. We argue that this non-adaptive strategy may result in a set of exemplars with high redundancy in terms of the knowledge covered, ultimately reducing their overall informativeness. To address this limitation, we propose Adaptive-Prompt, a novel method that adaptively selects exemplars by leveraging model feedback from previously chosen exemplars. Experimental results show that Adaptive-Prompt significantly enhances LLM performance across a variety of reasoning tasks.
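
The difference between one-shot and adaptive exemplar selection can be shown with a small greedy loop. This is a sketch of the general idea only, not the paper's algorithm: `toy_uncertainty` is a hypothetical stand-in for model feedback (in practice it would come from querying the LLM with the exemplars chosen so far), and the question pool and scores are made up.

```python
def adaptive_select(pool, budget, uncertainty):
    """Greedy loop: re-score remaining questions given exemplars chosen so far."""
    chosen = []
    remaining = list(pool)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda q: uncertainty(q, chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def toy_uncertainty(question, chosen):
    """Toy feedback: a question becomes much less uncertain once an exemplar
    on the same topic has already been chosen (an assumption for illustration)."""
    topic, base = question
    covered = any(t == topic for t, _ in chosen)
    return base * (0.1 if covered else 1.0)


# Each question is (topic, base uncertainty); values are invented.
pool = [("algebra", 0.9), ("algebra", 0.8), ("geometry", 0.5)]
picked = adaptive_select(pool, 2, toy_uncertainty)
```

A non-adaptive top-2 selection by base uncertainty would take both algebra questions; re-scoring after the first pick steers the second pick to geometry, which is the redundancy-reduction effect the abstract argues for.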

[NLP-62] Evaluating LLM Reasoning in the Operations Research Domain with ORQA AAAI25

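Quick read: This paper examines how well large language models (LLMs) generalize to the specialized technical domain of Operations Research (OR). It introduces and applies the Operations Research Question Answering (ORQA) benchmark to evaluate whether LLMs can emulate the knowledge and reasoning skills of OR experts when faced with diverse and complex optimization problems. The key to the approach is a dataset built by OR experts that contains real-world optimization problems requiring multistep reasoning to construct their mathematical models. Evaluations of several open-source LLMs (such as LLaMA 3.1, DeepSeek, and Mixtral) reveal a notable gap in their ability to generalize to specialized technical domains, offering useful insights for future research.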

Link: https://arxiv.org/abs/2412.17874
Authors: Mahdi Mostajabdaveh,Timothy T. Yu,Samarendra Chandan Bindu Dash,Rindranirina Ramamonjison,Jabo Serge Byusa,Giuseppe Carenini,Zirui Zhou,Yong Zhang
Affiliations: Unknown
Keywords: Large Language Models, Research Question Answering, Operations Research Question, apply Operations Research, Question Answering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 10 figures. Accepted and to be published in AAAI25

Abstract:In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark designed to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when confronted with diverse and complex optimization problems. The dataset, developed by OR experts, features real-world optimization problems that demand multistep reasoning to construct their mathematical models. Our evaluations of various open source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, highlighting a gap in their ability to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs generalization capabilities, offering valuable insights for future research in this area. The dataset and evaluation code are publicly available.

[NLP-63] Joint Knowledge Editing for Information Enrichment and Probability Promotion

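Quick read: This paper addresses knowledge updating in large language models, specifically how to edit a model effectively so that it reflects the dynamic nature of real-world information. Most existing knowledge editing methods focus on the low layers, because probing studies indicate that answer information is enriched there. However, those probes can only reveal the critical recall stages for the original answers, whereas the goal of editing is to correct the model's prediction for the target answers; this inconsistency indicates that both the existing probing approaches and the associated editing methods are deficient. To resolve it, the paper proposes a contrast-based probing approach and identifies two critical stages where model behavior diverges between the original and target answers: Information Enrichment in low layers and Probability Promotion in high layers. Building on this finding, it develops Joint knowledge Editing for information Enrichment and probability Promotion (JEEP), which edits the low and high layers jointly to modify both critical recall stages. Because dual modifications can cause mutual interference and increased forgetting, JEEP is designed so that updates to the different regions share the same objectives and are complementary. Editing up to thousands of facts on models such as GPT-J (6B) and LLaMA (7B), with diverse editing objectives (adding factual and counterfactual knowledge), JEEP achieves the best performance in all tested scenarios, validating both the probing approach and the editing design.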

Link: https://arxiv.org/abs/2412.17872
Authors: Wenhang Shi,Yiren Chen,Shuqing Bian,Xinyi Zhang,Zhe Zhao,Pengfei Hu,Wei Lu,Xiaoyong Du
Affiliations: Unknown
Keywords: requires timely updates, language models requires, models requires timely, large language models, editing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Knowledge stored in large language models requires timely updates to reflect the dynamic nature of real-world information. To update the knowledge, most knowledge editing methods focus on the low layers, since recent probes into the knowledge recall process reveal that the answer information is enriched in low layers. However, these probes only and could only reveal critical recall stages for the original answers, while the goal of editing is to rectify model’s prediction for the target answers. This inconsistency indicates that both the probe approaches and the associated editing methods are deficient. To mitigate the inconsistency and identify critical editing regions, we propose a contrast-based probe approach, and locate two crucial stages where the model behavior diverges between the original and target answers: Information Enrichment in low layers and Probability Promotion in high layers. Building upon the insights, we develop the Joint knowledge Editing for information Enrichment and probability Promotion (JEEP) method, which jointly edits both the low and high layers to modify the two critical recall stages. Considering the mutual interference and growing forgetting due to dual modifications, JEEP is designed to ensure that updates to distinct regions share the same objectives and are complementary. We rigorously evaluate JEEP by editing up to thousands of facts on various models, i.e., GPT-J (6B) and LLaMA (7B), and addressing diverse editing objectives, i.e., adding factual and counterfactual knowledge. In all tested scenarios, JEEP achieves best performances, validating the effectiveness of the revealings of our probe approach and the designs of our editing method. Our code and data are available at this https URL.

[NLP-64] Evaluating and Enhancing LLMs for Multi-turn Text-to-SQL with Multiple Question Types

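Quick read: This paper addresses the limitations of current LLM-based text-to-SQL systems when handling real-world conversational queries. Existing methods tend to focus narrowly on SQL generation and neglect the complexity of conversational queries, particularly ambiguous questions that cannot be answered directly with SQL, which leads to unreliable responses. To address this, the paper proposes MMSQL, a comprehensive test suite that evaluates the question classification and SQL generation capabilities of LLMs by simulating real-world scenarios with diverse question types and multi-turn interactions. It also introduces an LLM-based multi-agent framework that uses specialized agents to identify question types and determine appropriate answering strategies. Experiments show that the framework significantly improves the model's ability to handle complex conversational dynamics and diverse user queries.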

Link: https://arxiv.org/abs/2412.17867
Authors: Ziming Guo,Chao Ma,Yinggang Sun,Tiancheng Zhao,Guangyao Wang,Hai Huang
Affiliations: Unknown
Keywords: Recent advancements, large language models, advancements in large, large language, SQL generation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 3 figures

Abstract: Recent advancements in large language models (LLMs) have significantly advanced text-to-SQL systems. However, most LLM-based methods often narrowly focus on SQL generation, neglecting the complexities of real-world conversational queries. This oversight can lead to unreliable responses, particularly for ambiguous questions that cannot be directly addressed with SQL. To bridge this gap, we propose MMSQL, a comprehensive test suite designed to evaluate the question classification and SQL generation capabilities of LLMs by simulating real-world scenarios with diverse question types and multi-turn Q&A interactions. Using MMSQL, we assessed the performance of popular LLMs, including both open-source and closed-source models, and identified key factors impacting their performance in such scenarios. Moreover, we introduce an LLM-based multi-agent framework that employs specialized agents to identify question types and determine appropriate answering strategies. Our experiments demonstrate that this approach significantly enhances the model’s ability to navigate the complexities of conversational dynamics, effectively handling the diverse and complex nature of user queries.
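The classify-then-route idea behind such a framework might be sketched as below. Everything here is an assumption made for illustration: the three question-type labels, the keyword rules standing in for an LLM classifier, and the handler names are hypothetical and are not taken from MMSQL.

```python
from typing import Callable, Dict

def classify(question: str) -> str:
    """Hypothetical keyword rules standing in for an LLM question-type agent."""
    q = question.lower()
    if "best" in q or "good" in q:
        return "ambiguous"      # subjective wording needs clarification first
    if "weather" in q:
        return "unanswerable"   # assumed to be outside the database schema
    return "answerable"

def answer_with_sql(question: str) -> str:
    # Placeholder: a real system would call a SQL-generation agent here.
    return f"SQL: <query generated for '{question}'>"

def ask_clarification(question: str) -> str:
    return f"CLARIFY: what do you mean by '{question}'?"

def decline(question: str) -> str:
    return "DECLINE: the database cannot answer this question."

# Map each (hypothetical) question type to an answering strategy.
STRATEGIES: Dict[str, Callable[[str], str]] = {
    "answerable": answer_with_sql,
    "ambiguous": ask_clarification,
    "unanswerable": decline,
}

def route(question: str) -> str:
    """Pick the answering strategy from the predicted question type."""
    return STRATEGIES[classify(question)](question)

for q in ["List all singers", "Who is the best singer?", "What is the weather?"]:
    print(route(q))
```

Separating classification from answering means an ambiguous question never reaches the SQL generator, which is the failure mode the abstract describes.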

[NLP-65] Overview of the 2024 ALTA Shared Task: Detect Automatic AI-Generated Sentences for Human-AI Hybrid Articles ALT

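Quick read: This paper addresses the detection of machine-generated text in a hybrid setting, where a text may contain both human-written and machine-generated portions. The approach is a shared task with defined evaluation criteria used to assess the participating systems, with the aim of improving detection in such mixed scenarios. The ALTA shared tasks have run annually since 2010, and the 2024 task focuses specifically on hybrid-text detection.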

Link: https://arxiv.org/abs/2412.17848
Authors: Diego Mollá,Qiongkai Xu,Zijie Zeng,Zhuang Li
Affiliations: Unknown
Keywords: ALTA shared tasks, ALTA shared, running annually, detect machine-generated text, ALTA
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 3 tables, published in ALTA 2024

Abstract:The ALTA shared tasks have been running annually since 2010. In 2024, the purpose of the task is to detect machine-generated text in a hybrid setting where the text may contain portions of human text and portions machine-generated. In this paper, we present the task, the evaluation criteria, and the results of the systems participating in the shared task.

[NLP-66] Bridging the Data Provenance Gap Across Text, Speech and Video

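Quick read: This paper addresses the lack of systematic analysis of the attributes and provenance of AI training data, particularly for multimodal datasets (text, speech, and video). Through a longitudinal audit of nearly 4000 public datasets from 1990 to 2024, the authors examine key attributes such as geography, language, sources, and use restrictions. They find that multimodal machine learning applications rely mainly on web-crawled, synthetic, and social media platform data (such as YouTube), and that these sources have eclipsed all others since 2019. Moreover, although less than 33% of datasets are restrictively licensed, over 80% of the source content in widely used text, speech, and video datasets carries non-commercial restrictions. The study also notes that, despite the growing number of languages and geographies represented in public AI training datasets, measures of relative geographical and multilingual representation have not improved significantly since 2013. The key contribution is the release of the full multimodal audit, which lets practitioners trace data provenance and supports ongoing improvements in dataset transparency and responsible use.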

Link: https://arxiv.org/abs/2412.17847
Authors: Shayne Longpre,Nikhil Singh,Manuel Cherep,Kushagra Tiwary,Joanna Materzynska,William Brannon,Robert Mahari,Manan Dey,Mohammed Hamdy,Nayan Saxena,Ahmad Mustafa Anis,Emad A. Alghamdi,Vu Minh Chien,Naana Obeng-Marnu,Da Yin,Kun Qian,Yizhi Li,Minnie Liang,An Dinh,Shrestha Mohanty,Deividas Mataciunas,Tobin South,Jianguo Zhang,Ariel N. Lee,Campbell S. Lund,Christopher Klamm,Damien Sileo,Diganta Misra,Enrico Shippole,Kevin Klyman,Lester JV Miranda,Niklas Muennighoff,Seonghyeon Ye,Seungone Kim,Vipul Gupta,Vivek Sharma,Xuhui Zhou,Caiming Xiong,Luis Villa,Stella Biderman,Alex Pentland,Sara Hooker,Jad Kabbara
Affiliations: Unknown
Keywords: driven largely, scale and quality, datasets, text, video datasets
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 10 pages, 5 figures (main paper)

Abstract:Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities–popular text, speech, and video datasets–from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets between 1990-2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Secondly, tracing the chain of dataset derivations we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely-used text, speech, and video datasets, carry non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem-level, and that visibility into these questions are essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.

[NLP-67] Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting

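Quick read: This paper addresses the difficulty of deploying large language models (LLMs) in resource-constrained environments, which stems from their high computational requirements. It proposes a knowledge distillation (KD) approach that introduces a set of novel response-priming prompting strategies to improve student model performance. Specifically, a quantized Llama 3.1 405B Instruct model serves as the teacher, a smaller Llama 3.1 8B Instruct model is fine-tuned with LoRA, and evaluation is performed on the GSM8K benchmark. Experiments show that integrating reasoning-eliciting prompting into the KD pipeline significantly improves student performance; in particular, Ground Truth prompting yields a 55% performance increase on GSM8K for the distilled Llama 3.1 8B Instruct compared with the same model distilled without prompting. An analysis of the student models' self-attention layers further indicates that the more successful prompted models exhibit certain positive behaviors in their attention heads that can be tied to their increased accuracy.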

Link: https://arxiv.org/abs/2412.17846
Authors: Vijay Goyal,Mustafa Khan,Aprameya Tirupati,Harveer Saini,Michael Lam,Kevin Zhu
Affiliations: Unknown
Keywords: Large language models, natural language processing, Large language, demonstrated remarkable performance, language processing
Categories: Computation and Language (cs.CL)
Comments: Accepted to SoCal NLP Symposium 2024

Abstract:Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing (NLP) tasks. However, these models are often difficult to deploy due to significant computational requirements and resource constraints. Knowledge distillation (KD) is an effective technique for transferring the performance of larger LLMs to smaller models. Traditional KD methods primarily focus on the direct output of the teacher model, with little emphasis on the role of prompting during knowledge transfer. In this paper, we propose a set of novel response-priming prompting strategies applied in the knowledge distillation pipeline to enhance the performance of student models. Our approach fine-tunes a smaller Llama 3.1 8B Instruct model by distilling knowledge from a quantized Llama 3.1 405B Instruct teacher model. We apply LoRA optimization and evaluate on the GSM8K benchmark. Experimental results demonstrate that integrating reasoning-eliciting prompting into the proposed KD pipeline significantly improves student model performance, offering an efficient way to deploy powerful models in resource-constrained environments. We find that Ground Truth prompting results in a 55% performance increase on GSM8K for a distilled Llama 3.1 8B Instruct compared to the same model distilled without prompting. A thorough investigation into the self-attention layers of the student models indicates that the more successful prompted models tend to exhibit certain positive behaviors inside their attention heads which can be tied to their increased accuracy. Our implementation can be found at this https URL.

[NLP-68] Evaluating the Capabilities of Large Language Models for Multi-label Emotion Understanding COLING2025

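Quick read: This paper addresses the shortcomings of large language models (LLMs) on multilingual, multi-label emotion classification, especially for low-resource languages. It presents EthioEmo, a dataset covering four Ethiopian languages (Amharic, Afan Oromo, Somali, and Tigrinya), and runs additional experiments with the English multi-label emotion dataset from SemEval 2018 Task 1. The key to the approach is evaluating LLMs in zero-shot and few-shot settings and comparing them with fine-tuned smaller language models. The results show that multi-label emotion classification remains insufficiently accurate even for high-resource languages such as English, and that there is a large performance gap between high-resource and low-resource languages. EthioEmo is publicly available to further improve the understanding of emotions in language models and of how people convey emotions through various languages.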

Link: https://arxiv.org/abs/2412.17837
Authors: Tadesse Destaw Belay,Israel Abebe Azime,Abinew Ali Ayele,Grigori Sidorov,Dietrich Klakow,Philipp Slusallek,Olga Kolesnikova,Seid Muhie Yimam
Affiliations: Unknown
Keywords: show promising learning, reasoning abilities, multi-label emotion, promising learning, learning and reasoning
Categories: Computation and Language (cs.CL)
Comments: COLING 2025, main conference, long

Abstract:Large Language Models (LLMs) show promising learning and reasoning abilities. Compared to other NLP tasks, multilingual and multi-label emotion evaluation tasks are under-explored in LLMs. In this paper, we present EthioEmo, a multi-label emotion classification dataset for four Ethiopian languages, namely, Amharic (amh), Afan Oromo (orm), Somali (som), and Tigrinya (tir). We perform extensive experiments with an additional English multi-label emotion dataset from SemEval 2018 Task 1. Our evaluation includes encoder-only, encoder-decoder, and decoder-only language models. We compare zero and few-shot approaches of LLMs to fine-tuning smaller language models. The results show that accurate multi-label emotion classification is still insufficient even for high-resource languages such as English, and there is a large gap between the performance of high-resource and low-resource languages. The results also show varying performance levels depending on the language and model type. EthioEmo is available publicly to further improve the understanding of emotions in language models and how people convey emotions through various languages.

[NLP-69] Look Ahead Text Understanding and LLM Stitching AAAI

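Quick read: This paper addresses the look ahead text understanding problem, using look ahead section identification (LASI) as an example. The problem matters in generative AI and in human interactions, where the goal is to understand the direction of a developing text or conversation. The paper tackles it with transformer-based large language models (LLMs) and shows that LASI is more challenging than classic section identification (SI). The key idea is to combine bidirectional contextual information (e.g., BERT) with unidirectional predictive ability (e.g., GPT), and two approaches for stitching BERT and GPT together are proposed. Experiments show that the approach outperforms established models, especially when the text is noisy, which is often the case in generative AI. The paper also discusses other look ahead text understanding tasks, such as look ahead sentiment classification, and points out opportunities to leverage pre-trained LLMs through stitching.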

Link: https://arxiv.org/abs/2412.17836
Authors: Junlin Julian Jiang(Piedmont High School, Piedmont, CA, USA),Xin Li(College of Business, City University of Hong Kong, Hong Kong, China)
Affiliations: Unknown
Keywords: section identification, ahead section identification, text, ahead text understanding, ahead
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 6 figures, 4 tables, published in Vol. 18 (2024): Proceedings of the Eighteenth International AAAI Conference on Web and Social Media

Abstract:This paper proposes a look ahead text understanding problem with look ahead section identification (LASI) as an example. This problem may appear in generative AI as well as human interactions, where we want to understand the direction of a developing text or conversation. We tackle the problem using transformer-based LLMs. We show that LASI is more challenging than classic section identification (SI). We argue that both bidirectional contextual information (e.g., BERT) and unidirectional predictive ability (e.g., GPT) will benefit the task. We propose two approaches to stitch together BERT and GPT. Experiments show that our approach outperforms the established models, especially when there is noise in the text (which is often the case for developing text in generative AI). Our paper sheds light on other look ahead text understanding tasks that are important to social media, such as look ahead sentiment classification, and points out the opportunities to leverage pre-trained LLMs through stitching.

[NLP-70] Leveraging Sentiment for Offensive Text Classification

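Quick read: This paper investigates whether sentiment can help models classify offensive texts better. Experiments use the OLID dataset from SemEval 2019 Task 6. Pre-trained language models first predict the sentiment of each instance; the model that performs best on the OLID test set is then trained on the sentiment-augmented OLID set and its performance is analyzed. The results show that using sentiment increases the model's overall performance. The key to the approach is augmenting the data with sentiment information so that the classifier can draw on it when identifying offensive text.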

Link: https://arxiv.org/abs/2412.17825
Authors: Khondoker Ittehadul Islam
Affiliations: Unknown
Keywords: classify offensive texts, classify offensive, offensive texts, conduct experiment, OLID
Categories: Computation and Language (cs.CL)
Comments:

Abstract: In this paper, we conduct an experiment to analyze whether models can classify offensive texts better with the help of sentiment. We conduct this experiment on the OLID dataset from SemEval 2019 Task 6. First, we utilize pre-trained language models to predict the sentiment of each instance. Later we pick the model that achieved the best performance on the OLID test set, and train it on the augmented OLID set to analyze the performance. Results show that utilizing sentiment increases the overall performance of the model.
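One way to picture a sentiment-augmented training instance is shown below. The abstract does not say how the predicted sentiment is attached to each instance, so the bracketed-tag format is purely an assumption, and `toy_sentiment` is a hypothetical word-list stand-in for the pre-trained sentiment model.

```python
from typing import Callable, List

# Tiny invented lexicons; a real pipeline would use a pre-trained model.
NEGATIVE_WORDS = {"hate", "stupid", "awful"}
POSITIVE_WORDS = {"love", "great", "thanks"}

def toy_sentiment(text: str) -> str:
    """Hypothetical sentiment predictor based on word overlap."""
    tokens = set(text.lower().split())
    if tokens & NEGATIVE_WORDS:
        return "negative"
    if tokens & POSITIVE_WORDS:
        return "positive"
    return "neutral"

def augment(text: str, predict: Callable[[str], str] = toy_sentiment) -> str:
    """Prefix the text with its predicted sentiment (assumed format)."""
    return f"[{predict(text)}] {text}"

examples: List[str] = ["I love this team", "you are stupid", "see you at noon"]
augmented = [augment(t) for t in examples]
for line in augmented:
    print(line)
```

The offensive-language classifier would then be trained on `augmented` instead of `examples`, so the sentiment signal is available as part of its input.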

[NLP-71] The Rosetta Paradox: Domain-Specific Performance Inversions in Large Language Models

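Quick read: This paper addresses the inconsistent performance of large language models (LLMs) across knowledge domains, which it terms the Rosetta Paradox. The paradox describes the counterintuitive phenomenon that LLMs can excel in highly specialized fields yet perform poorly on tasks requiring general, everyday knowledge. The paper formalizes the definition of the Rosetta Paradox and introduces a panoramic analysis framework, comprising a Domain Specificity Index (DSI) and a Performance Inversion Metric (PIM), to quantify domain-specific behavior in LLMs consistently. Through extensive experiments across diverse models and knowledge domains, it finds that the Rosetta Paradox is likely not a mere artifact of data distribution but an intrinsic architectural and emergent property of deep neural networks. It also compares different model architectures, sizes, and training methodologies, showing how the paradox manifests and challenging standard evaluation metrics.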

Link: https://arxiv.org/abs/2412.17821
Authors: Basab Jha,Ujjwal Puri
Affiliations: Unknown
Keywords: GPT and BERT, demonstrated unprecedented skills, Rosetta Paradox, Rosetta Paradox characterizes, natural language processing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures

Abstract: While large language models, such as GPT and BERT, have already demonstrated unprecedented skills in everything from natural language processing to domain-specific applications, there came an unexplored phenomenon we term the Rosetta Paradox. The Rosetta Paradox characterizes the counterintuitive performance inversions across domains of knowledge. This paradox captures how such LLMs can excel in highly specialized fields but do poorly on tasks which require general, everyday knowledge. This paper formalizes the definition of the Rosetta Paradox and introduces a panoramic analysis framework that includes both a Domain Specificity Index (DSI) and a Performance Inversion Metric (PIM) for consistent quantification of domain-specific behavior in LLMs. We adopt this paradox and conduct a series of investigations through extensive experiments across diverse models and knowledge domains, ranging from rich technical areas to common-sense reasoning. Our findings indicate that the Rosetta Paradox is likely not a mere artifact of data distribution but an intrinsic architectural and emergent property of deep neural networks. We present comparative analyses across different model architectures, sizes, and training methodologies that shed light into the peculiar ways this paradox manifests itself and challenge the standard evaluation metrics.

[NLP-72] Inductive Linguistic Reasoning with Large Language Models

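Quick read: This paper evaluates the linguistic reasoning capabilities of large language models (LLMs), specifically abstract multilingual reasoning on extremely low-resource languages. Using linguistic puzzles, it studies whether diverse auxiliary demonstrations can be induced automatically from seed exemplars through analogical prompting to support inductive and deductive reasoning. The key to the approach is a two-stage procedure: a language model first generates analogical exemplars, which are then applied in-context together with the provided target-language exemplars. Results on the modeLing dataset show that analogical prompting effectively elicits models' knowledge of language grammar similarities, boosting GPT-4o by as much as 8.1% and Llama-3.1-405B-Instruct by 5.9% over chain-of-thought approaches, and the method also generalizes to other tasks from Linguistics Olympiad competitions.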

Link: https://arxiv.org/abs/2412.17819
Authors: Raghav Ramji,Keshav Ramji
Affiliations: Unknown
Keywords: Evaluating large language, Evaluating large, large-scale adoption, understand the gaps, surface during large-scale
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluating large language models (LLMs) on their linguistic reasoning capabilities is an important task to understand the gaps in their skills that may surface during large-scale adoption. In this work, we investigate the abilities of such models to perform abstract multilingual reasoning through the lens of linguistic puzzles on extremely low-resource languages. As these translation tasks involve inductive and deductive reasoning from reference instances, we examine whether diverse auxiliary demonstrations can be automatically induced from seed exemplars, through analogical prompting. We employ a two-stage procedure, first generating analogical exemplars with a language model, and then applying them in-context along with provided target language exemplars. Our results on the modeLing dataset show that analogical prompting is effective in eliciting models’ knowledge of language grammar similarities, boosting the performance of GPT-4o by as much as 8.1% and Llama-3.1-405B-Instruct by 5.9% over chain-of-thought approaches. These gains are attributable to the analogical demonstrations, both when self-generated as well as when produced by weaker multilingual models. Furthermore, we demonstrate that our method generalizes to other tasks present in Linguistics Olympiad competitions, achieving sizable improvements across all problem types and difficulty levels included in the LINGOLY dataset with GPT-4o. We also report several findings about interesting phenomena which drive linguistic reasoning performance, suggesting that such puzzles are a valuable benchmark for new reasoning methods.
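The two-stage procedure can be sketched as two prompt builders around a generic model callable. This is a rough illustration under stated assumptions: the prompt wording, the function names, the made-up puzzle strings, and `fake_llm` (a canned stand-in so the example runs without any API) are all hypothetical and not taken from the paper.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source phrase, translation)

def stage1_prompt(seed: List[Pair]) -> str:
    """Stage 1: ask the model for analogical exemplars from a similar language."""
    lines = ["Translation pairs from a puzzle language:"]
    lines += [f"{src} -> {tgt}" for src, tgt in seed]
    lines.append("Write a few analogous pairs from a language with similar grammar.")
    return "\n".join(lines)

def stage2_prompt(analogies: str, seed: List[Pair], query: str) -> str:
    """Stage 2: apply the analogies in-context next to the target exemplars."""
    lines = ["Analogical demonstrations:", analogies, "Target-language exemplars:"]
    lines += [f"{src} -> {tgt}" for src, tgt in seed]
    lines.append(f"Translate: {query}")
    return "\n".join(lines)

def solve(query: str, seed: List[Pair], llm: Callable[[str], str]) -> str:
    analogies = llm(stage1_prompt(seed))          # first model call
    return llm(stage2_prompt(analogies, seed, query))  # second model call

# Canned stand-in for a language model; it only records the prompts it sees.
calls: List[str] = []
def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return "casa -> house" if len(calls) == 1 else "<answer>"

seed = [("ka-lu", "the dog"), ("ka-mi", "the cat")]  # invented puzzle data
answer = solve("ka-ro", seed, fake_llm)
print(calls[1])
```

Passing the model as a callable keeps the sketch independent of any particular provider; the abstract notes the analogies can be self-generated or produced by a weaker multilingual model, which here would mean passing different callables for the two stages.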

[NLP-73] Large Language Model Safety: A Holistic Survey

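Quick read: This survey examines the safety concerns raised by the rapid development and deployment of large language models (LLMs), in particular their potential risks in critical applications and the corresponding mitigation strategies. It gives a comprehensive overview of the current LLM safety landscape, focusing on four major risk categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. It further explores four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps for LLM safety proposed by AI companies and institutes, and AI governance aimed at LLM safety, including discussions of international cooperation, policy proposals, and prospective regulatory directions. The central message is the need for a proactive, multifaceted approach that integrates technical solutions, ethical considerations, and robust governance frameworks to ensure the safe and beneficial development of LLMs.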

Link: https://arxiv.org/abs/2412.17686
Authors: Dan Shi,Tianhao Shen,Yufei Huang,Zhigen Li,Yongqi Leng,Renren Jin,Chuang Liu,Xinwei Wu,Zishan Guo,Linhao Yu,Ling Shi,Bojian Jiang,Deyi Xiong
Affiliations: Unknown
Keywords: LLM safety, natural language understanding, large language models, LLM, marked by unprecedented
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 158 pages, 18 figures

Abstract:The rapid development and deployment of large language models (LLMs) have introduced a new frontier in artificial intelligence, marked by unprecedented capabilities in natural language understanding and generation. However, the increasing integration of these models into critical applications raises substantial safety concerns, necessitating a thorough examination of their potential risks and associated mitigation strategies. This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks. In addition to the comprehensive review of the mitigation methodologies and evaluation resources on these four aspects, we further explore four topics related to LLM safety: the safety implications of LLM agents, the role of interpretability in enhancing LLM safety, the technology roadmaps proposed and abided by a list of AI companies and institutes for LLM safety, and AI governance aimed at LLM safety with discussions on international cooperation, policy proposals, and prospective regulatory directions. Our findings underscore the necessity for a proactive, multifaceted approach to LLM safety, emphasizing the integration of technical solutions, ethical considerations, and robust governance frameworks. This survey is intended to serve as a foundational resource for academy researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. Ultimately, it seeks to contribute to the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being. A curated list of related papers has been publicly available at this https URL. 

[NLP-74] scReader: Prompting Large Language Models to Interpret scRNA-seq Data ICDM2024

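Quick read: This paper addresses challenges in interpreting cross-species single-cell omics data, in particular the difficulty of building a comprehensive model when data scales differ across species. It proposes a hybrid approach that integrates the general knowledge capabilities of large language models (LLMs) with domain-specific representation models for single-cell omics data. The key to the approach is treating genes as the fundamental unit of representation: gene representations are initialized from functional descriptions using a mature language model such as LLaMA-2, and cellular representations are modeled by inputting single-cell gene expression data together with prompts. Experiments on developmental cells from humans and mice, targeting cells that are challenging to annotate, evaluate the method on basic tasks such as cell annotation and visualization analysis. The results show significant improvements in accuracy and interoperability over other LLM-based methods, providing a robust framework for cross-species genetic analysis.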

Link: https://arxiv.org/abs/2412.18156
Authors: Cong Li,Qingqing Long,Yuanchun Zhou,Meng Xiao
Affiliations: Unknown
Keywords: demonstrated remarkable advancements, Large language models, Large language, remarkable advancements, primarily due
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, Accepted by ICDM 2024

Abstract:Large language models (LLMs) have demonstrated remarkable advancements, primarily due to their capabilities in modeling the hidden relationships within text sequences. This innovation presents a unique opportunity in the field of life sciences, where vast collections of single-cell omics data from multiple species provide a foundation for training foundational models. However, the challenge lies in the disparity of data scales across different species, hindering the development of a comprehensive model for interpreting genetic data across diverse organisms. In this study, we propose an innovative hybrid approach that integrates the general knowledge capabilities of LLMs with domain-specific representation models for single-cell omics data interpretation. We begin by focusing on genes as the fundamental unit of representation. Gene representations are initialized using functional descriptions, leveraging the strengths of mature language models such as LLaMA-2. By inputting single-cell gene-level expression data with prompts, we effectively model cellular representations based on the differential expression levels of genes across various species and cell types. In the experiments, we constructed developmental cells from humans and mice, specifically targeting cells that are challenging to annotate. We evaluated our methodology through basic tasks such as cell annotation and visualization analysis. The results demonstrate the efficacy of our approach compared to other methods using LLMs, highlighting significant improvements in accuracy and interoperability. Our hybrid approach enhances the representation of single-cell data and offers a robust framework for future research in cross-species genetic analysis.

[NLP-75] GeneSUM: Large Language Model-based Gene Summary Extraction

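Quick read: This paper addresses the inefficiency of extracting gene-related information from the biomedical literature, with three specific challenges: the overwhelming volume of literature, the complexity of gene functions, and the difficulty of automated integration and generation. It proposes GeneSUM, a two-stage automated gene summary extractor based on a large language model (LLM). The framework first retrieves the target gene literature and eliminates redundancy, then fine-tunes the LLM to refine and streamline the summarization process. Experiments show that the LLM significantly enhances the integration of gene-specific information, supporting more efficient decision-making in research.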

Link: https://arxiv.org/abs/2412.18154
Authors: Zhijian Chen,Chuan Hu,Min Wu,Qingqing Long,Xuezhi Wang,Yuanchun Zhou,Meng Xiao
Affiliations: Unknown
Keywords: Emerging topics, continuously expanding, providing a wealth, topics in biomedical, Emerging
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, Accepted by BIBM 2024

Abstract:Emerging topics in biomedical research are continuously expanding, providing a wealth of information about genes and their function. This rapid proliferation of knowledge presents unprecedented opportunities for scientific discovery and formidable challenges for researchers striving to keep abreast of the latest advancements. One significant challenge is navigating the vast corpus of literature to extract vital gene-related information, a time-consuming and cumbersome task. To enhance the efficiency of this process, it is crucial to address several key challenges: (1) the overwhelming volume of literature, (2) the complexity of gene functions, and (3) the automated integration and generation. In response, we propose GeneSUM, a two-stage automated gene summary extractor utilizing a large language model (LLM). Our approach retrieves and eliminates redundancy of target gene literature and then fine-tunes the LLM to refine and streamline the summarization process. We conducted extensive experiments to validate the efficacy of our proposed framework. The results demonstrate that LLM significantly enhances the integration of gene-specific information, allowing more efficient decision-making in ongoing research.

[NLP-76] Ensemble Machine Learning Model for Inner Speech Recognition: A Subject-Specific Investigation

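Quick read: This paper addresses the challenge of inner speech recognition, which has applications in rehabilitation, assistive technology development, and cognitive assessment. The task is difficult because language and speech production is complex and identifying speech components remains hard. Using the publicly available Thinking Out Loud Dataset, the authors develop a machine learning (ML) technique that classifies inner speech from 128-channel surface electroencephalography (EEG) signals. The pipeline detects and removes motion artifacts from the EEG signals with statistical methods, extracts a large number of time-, frequency-, and time-frequency-domain features, explores eight feature selection algorithms to choose the best one, and evaluates six ML algorithms. The proposed ensemble model, which stacks the five best logistic regression models, achieves an overall accuracy of 81.13% and an F1 score of 81.12% in classifying four inner speech words, showing the promise of surface EEG signals for inner speech classification.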

Link: https://arxiv.org/abs/2412.17824
Authors: Shahamat Mustavi Tasin,Muhammad E. H. Chowdhury,Shona Pedersen,Malek Chabbouh,Diala Bushnaq,Raghad Aljindi,Saidul Kabir,Anwarul Hasan
Affiliations: Unknown
Keywords: developing assistive technology, gained enormous interest, recent years due, surface EEG signals, EEG signals
Categories: Signal Processing (eess.SP); Computation and Language (cs.CL)
Comments: 13 Figures, 3 Tables

Abstract: Inner speech recognition has gained enormous interest in recent years due to its applications in rehabilitation, developing assistive technology, and cognitive assessment. However, because language and speech production is a complex process, identifying speech components has remained a challenging task. Different approaches were taken previously to reach this goal, but new approaches remain to be explored. Also, a subject-oriented analysis is necessary to understand the underlying brain dynamics during inner speech production, which can bring novel methods to neurological research. A publicly available dataset, Thinking Out Loud Dataset, has been used to develop a Machine Learning (ML)-based technique to classify inner speech using 128-channel surface EEG signals. The dataset was collected from a Spanish cohort of ten subjects, each of whom uttered four words (Arriba, Abajo, Derecha, and Izquierda). Statistical methods were employed to detect and remove motion artifacts from the Electroencephalography (EEG) signals. A large number (191 per channel) of time-, frequency- and time-frequency-domain features were extracted. Eight feature selection algorithms are explored, and the best feature selection technique is selected for subsequent evaluations. The performance of six ML algorithms is evaluated, and an ensemble model is proposed. Deep Learning (DL) models are also explored, and the results are compared with the classical ML approach. The proposed ensemble model, by stacking the five best logistic regression models, generated an overall accuracy of 81.13% and an F1 score of 81.12% in the classification of four inner speech words using surface EEG signals. The proposed framework with the proposed ensemble of classical ML models shows promise in the classification of inner speech using surface EEG signals.
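For readers unfamiliar with how accuracy and F1 are computed for a four-word classification task, the sketch below works through both on invented labels. The predictions are made up, and macro averaging of the per-word F1 scores is an assumption; the abstract does not state which averaging the authors used.

```python
from typing import List

WORDS = ["Arriba", "Abajo", "Derecha", "Izquierda"]

def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of trials whose predicted word matches the true word."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true: List[str], y_pred: List[str], labels: List[str]) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented toy trials: two per word, with two misclassifications.
y_true = ["Arriba", "Arriba", "Abajo", "Abajo",
          "Derecha", "Derecha", "Izquierda", "Izquierda"]
y_pred = ["Arriba", "Arriba", "Abajo", "Arriba",
          "Derecha", "Derecha", "Izquierda", "Derecha"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, WORDS)
print(f"accuracy={acc:.4f} macro-F1={f1:.4f}")
```

On balanced classes the two numbers tend to sit close together, which is consistent with the nearly identical accuracy and F1 figures reported in the abstract.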

Computer Vision

[CV-0] Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

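[Quick Read]: This paper addresses the heavy computational burden that video-language understanding incurs when it relies on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters). It proposes an efficient encoder-free approach built around a new Spatio-Temporal Alignment Block (STAB) that processes video inputs directly, substantially reducing computational overhead. STAB combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention, and separate modeling of frame-level and video-level relationships. The method uses only 45M parameters for visual processing, at least a 6.5× reduction compared with traditional approaches, while matching or exceeding encoder-based methods on open-ended video question answering, particularly in correctness and temporal understanding. It also processes video 3-4× faster than previous methods.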

Link: https://arxiv.org/abs/2412.18609
Authors: Jinhui Yi,Syed Talal Wasim,Yanan Luo,Muzammal Naseer,Juergen Gall
Affiliations: Unknown
Keywords: reducing computational overhead, significantly reducing computational, significantly reducing, computational overhead, Spatio-Temporal Alignment Block
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders while using only 45M parameters for visual processing - at least a 6.5× reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention and separate mechanisms for modeling frame-level and video-level relationships. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks. The fine-grained video question-answering evaluation demonstrates our model’s effectiveness, outperforming the encoder-based approaches Video-ChatGPT and Video-LLaVA in key aspects like correctness and temporal understanding. Extensive ablation studies validate our architectural choices and demonstrate the effectiveness of our spatio-temporal modeling approach while achieving 3-4× faster processing speeds than previous methods. Code is available at this https URL.

[CV-1] PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models

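[Quick Read]: This paper addresses the fact that 3D assets produced by current text- or image-to-3D generators and 3D scanners are usually a single fused representation (such as an implicit neural field, a Gaussian mixture, or a mesh) with no meaningful structure, whereas most applications and creative workflows need assets made of several parts that can be manipulated independently. The proposed method, PartGen, works in two steps. First, given multiple views of a generated or rendered 3D object, a multi-view diffusion model extracts a set of plausible, view-consistent part segmentations that divide the object into parts. Second, another multi-view diffusion model takes each part separately, fills in the occlusions, and feeds the completed views to a 3D reconstruction network. The generative completion model compensates for information lost to occlusion and, in extreme cases, can hallucinate entirely invisible parts based on the input 3D asset. Evaluated on generated and real 3D assets, the method substantially outperforms segmentation and part-extraction baselines and enables downstream applications such as 3D part editing.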

Link: https://arxiv.org/abs/2412.18608
Authors: Minghao Chen,Roman Shapovalov,Iro Laina,Tom Monnier,Jianyuan Wang,David Novotny,Andrea Vedaldi
Affiliations: Unknown
Keywords: shapes and textures, high-quality shapes, Gaussian mixture, assets, parts
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses those completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.

[CV-2] DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

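[Quick Read]: This paper addresses the limited ability of current driving world models, which are based mainly on video diffusion models, to integrate multimodal data such as actions. These models specialize in visual generation but lack the flexibility to incorporate other modalities. The paper proposes a unified framework based on autoregressive transformers that casts driving model simulation and trajectory planning as a single sequence modeling problem. The key is a multimodal driving language built from interleaved image and action tokens; the DrivingGPT model learns world modeling and planning jointly through standard next-token prediction. The method performs strongly on action-conditioned video generation and end-to-end planning, and outperforms strong baselines on the large-scale nuPlan and NAVSIM benchmarks.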

Link: https://arxiv.org/abs/2412.18607
Authors: Yuntao Chen,Yuqi Wang,Zhaoxiang Zhang
Affiliations: Unknown
Keywords: human-level physical intelligence, World model-based searching, physical intelligence, model-based searching, widely recognized
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.
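
The "interleaved image and action tokens" idea can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names, the `action_offset` used to keep the two vocabularies apart, and the image-then-action ordering within a step are hypothetical, since the abstract does not specify the tokenizers or the sequence layout.

```python
def interleave(frames, actions, action_offset):
    """Flatten per-step image tokens and action tokens into one sequence.

    frames:  list of lists of image token ids, one list per time step
    actions: list of lists of action token ids, one list per time step
    action_offset: added to action ids so they do not collide with image ids
                   (a hypothetical way of sharing one vocabulary)
    """
    if len(frames) != len(actions):
        raise ValueError("need one action group per frame")
    sequence = []
    for image_tokens, action_tokens in zip(frames, actions):
        sequence.extend(image_tokens)                              # image tokens of this step
        sequence.extend(a + action_offset for a in action_tokens)  # then its action tokens
    return sequence


def next_token_pairs(sequence):
    """Inputs and targets for next-token prediction: targets are inputs shifted by one."""
    return sequence[:-1], sequence[1:]


# Two time steps, two image tokens and one action token each.
seq = interleave([[1, 2], [3, 4]], [[0], [1]], action_offset=100)
inputs, targets = next_token_pairs(seq)
```

Because images and actions live in one sequence, the same shifted-by-one objective covers both predicting the next frame's tokens (world modeling) and predicting the next action tokens (planning).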

[CV-3] Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

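[Quick Read]: This paper addresses accurate estimation of object orientation from a single image. Orientation is a key attribute for understanding the spatial pose and arrangement of objects, yet practical solutions remain underexplored. The authors propose Orient Anything, the first foundation model designed specifically to estimate object orientation from single, free-view images. Because labeled data is scarce, they extract knowledge from the 3D world: a pipeline annotates the front face of 3D objects and renders images from random views, yielding 2M images with precise orientation annotations. To make full use of this dataset, they design a robust training objective that models 3D orientation as probability distributions over three angles and predicts orientation by fitting these distributions. They also apply several strategies to improve synthetic-to-real transfer. The model achieves state-of-the-art orientation estimation accuracy on both rendered and real images, shows strong zero-shot ability across scenarios, and improves applications such as the comprehension and generation of complex spatial concepts and 3D object pose adjustment.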

Link: https://arxiv.org/abs/2412.18605
Authors: Zehan Wang,Ziang Zhang,Tianyu Pang,Chao Du,Hengshuang Zhao,Zhou Zhao
Affiliations: Unknown
Keywords: crucial for understanding, key attribute, Orientation, images, object orientation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Orientation is a key attribute of objects, crucial for understanding their spatial pose and arrangement in images. However, practical solutions for accurate orientation estimation from a single image remain underexplored. In this work, we introduce Orient Anything, the first expert and foundational model designed to estimate object orientation in a single- and free-view image. Due to the scarcity of labeled data, we propose extracting knowledge from the 3D world. By developing a pipeline to annotate the front face of 3D objects and render images from random views, we collect 2M images with precise orientation annotations. To fully leverage the dataset, we design a robust training objective that models the 3D orientation as probability distributions of three angles and predicts the object orientation by fitting these distributions. Besides, we employ several strategies to improve synthetic-to-real transfer. Our model achieves state-of-the-art orientation estimation accuracy in both rendered and real images and exhibits impressive zero-shot ability in various scenarios. More importantly, our model enhances many applications, such as comprehension and generation of complex spatial concepts and 3D object pose adjustment.
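
To make "models the 3D orientation as probability distributions of three angles" more concrete, here is a rough sketch for a single angle. It is not the authors' implementation: the bin count, the Gaussian shape, the `sigma_deg` width, the wrap-around treatment, and the argmax decoding are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def angle_target(angle_deg, num_bins=360, sigma_deg=20.0):
    """Discretised circular Gaussian centred on angle_deg, normalised to sum to 1."""
    bin_width = 360.0 / num_bins
    weights = []
    for k in range(num_bins):
        centre = k * bin_width
        # Shortest angular distance, so that 359 and 1 degrees are 2 degrees apart.
        d = abs(centre - angle_deg) % 360.0
        d = min(d, 360.0 - d)
        weights.append(math.exp(-0.5 * (d / sigma_deg) ** 2))
    total = sum(weights)
    return [w / total for w in weights]


def decode_angle(probs):
    """Return the centre (in degrees) of the most probable bin."""
    num_bins = len(probs)
    best = max(range(num_bins), key=lambda k: probs[k])
    return best * 360.0 / num_bins


target = angle_target(350.0)       # soft label a network could be trained to fit
estimate = decode_angle(target)    # decoding a fitted distribution back to an angle
```

A soft target like this penalizes a prediction that is off by a few degrees much less than one that is off by 180 degrees, which a plain one-hot classification label would not do.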

[CV-4] Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models

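[Quick Read]: This paper addresses the challenge of understanding how classifiers make decisions. Classifiers are foundational components of many computer vision models, yet their decision process is hard to interpret. Traditional GAN-based explainability models are typically limited to single-concept analyses and require training a new model for each classifier, which restricts their scope. The proposed method, DiffEx, uses text-to-image diffusion models to explain classifier decisions. Its key idea is to use vision-language models to generate a hierarchical list of semantics, so that users can identify not only high-level semantic influences on a classifier (for example, the ‘beard’ semantic in a facial classifier) but also their sub-types (such as ‘goatee’ or ‘Balbo’ beard). Compared with GAN-based models, DiffEx covers a broader range of semantics and gives a more detailed, fine-grained explanation of classifier decisions.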

Link: https://arxiv.org/abs/2412.18604
Authors: Tahira Kazimi,Ritika Allada,Pinar Yanardag
Affiliations: Unknown
Keywords: computer vision tasks, vision tasks, diverse applications, important components, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Classifiers are important components in many computer vision tasks, serving as the foundational backbone of a wide variety of models employed across diverse applications. However, understanding the decision-making process of classifiers remains a significant challenge. We propose DiffEx, a novel method that leverages the capabilities of text-to-image diffusion models to explain classifier decisions. Unlike traditional GAN-based explainability models, which are limited to simple, single-concept analyses and typically require training a new model for each classifier, our approach can explain classifiers that focus on single concepts (such as faces or animals) as well as those that handle complex scenes involving multiple concepts. DiffEx employs vision-language models to create a hierarchical list of semantics, allowing users to identify not only the overarching semantic influences on classifiers (e.g., the ‘beard’ semantic in a facial classifier) but also their sub-types, such as ‘goatee’ or ‘Balbo’ beard. Our experiments demonstrate that DiffEx is able to cover a significantly broader spectrum of semantics compared to its GAN counterparts, providing a hierarchical tool that delivers a more detailed and fine-grained understanding of classifier decisions.

[CV-5] ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation

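[Quick Read]: This paper addresses the dependence of Human-Scene Interaction (HSI) generation on paired 3D scene and motion capture data, which is expensive and time-consuming to collect. Existing methods can synthesize realistic human motion in 3D scenes and generate plausible human-object interactions, but their reliance on paired datasets spanning diverse environments and interactions limits wider use. The proposed approach, ZeroHSI, achieves zero-shot 4D human-scene interaction synthesis by integrating video generation with neural human rendering. The key is to exploit the rich motion priors learned by state-of-the-art video generation models, which have been trained on vast amounts of natural human movement and interaction, and to reconstruct human-scene interactions with differentiable rendering. ZeroHSI synthesizes realistic human motion in both static scenes and environments with dynamic objects, without requiring any ground-truth motion data.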

Link: https://arxiv.org/abs/2412.18600
Authors: Hongjie Li,Hong-Xing Yu,Jiaman Li,Jiajun Wu
Affiliations: Unknown
Keywords: virtual reality, crucial for applications, applications in embodied, synthesize realistic human, realistic human motions
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project website: this https URL

Abstract:Human-scene interaction (HSI) generation is crucial for applications in embodied AI, virtual reality, and robotics. While existing methods can synthesize realistic human motions in 3D scenes and generate plausible human-object interactions, they heavily rely on datasets containing paired 3D scene and motion capture data, which are expensive and time-consuming to collect across diverse environments and interactions. We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis by integrating video generation and neural human rendering. Our key insight is to leverage the rich motion priors learned by state-of-the-art video generation models, which have been trained on vast amounts of natural human movements and interactions, and use differentiable rendering to reconstruct human-scene interactions. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects, without requiring any ground-truth motion data. We evaluate ZeroHSI on a curated dataset of different types of various indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions.

[CV-6] DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

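[Quick Read]: This paper addresses the challenges current video generation models face in multi-prompt video generation, including strict training data requirements, weak prompt following, and unnatural scene transitions. It proposes DiTCtrl, a training-free multi-prompt video generation method built on the Multi-Modal Diffusion Transformer (MM-DiT) architecture. The key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. An analysis of MM-DiT's attention mechanism shows that its 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models, which enables precise semantic control across prompts and multi-prompt generation through attention sharing. Without additional training, the method generates videos with smooth transitions and consistent object motion, and it achieves state-of-the-art performance on the newly designed MPVBench benchmark.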

Link: https://arxiv.org/abs/2412.18597
Authors: Minghong Cai,Xiaodong Cun,Xiaoyu Li,Wenze Liu,Zhaoyang Zhang,Yong Zhang,Ying Shan,Xiangyu Yue
Affiliations: Unknown
Keywords: Multi-Modal Diffusion Transformer, Sora-like video generation, achieved remarkable progress, multi-prompt video generation, Diffusion Transformer MM-DiT
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 19 pages, 19 figures, Project page: this https URL ; GitHub repository: this https URL

Abstract:Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer MM-DiT architecture. However, the current video generation models predominantly focus on single-prompt, struggling to generate coherent scenes with multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. Our key idea is to take the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT’s attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. Besides, we also present MPVBench, a new benchmark specially designed for multi-prompt video generation to evaluate the performance of multi-prompt generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.

[CV-7] LatentCRF: Continuous CRF for Efficient Latent Diffusion

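[Quick Read]: This paper addresses the latency of Latent Diffusion Models (LDMs), which produce high-quality, photo-realistic images but require multiple computationally intensive inference iterations, limiting their practical use. The proposed solution is LatentCRF, a continuous Conditional Random Field (CRF) model implemented as a neural network layer that models the spatial and semantic relationships among the latent vectors in the LDM. Replacing some of the costly LDM inference iterations with the lightweight LatentCRF gives a better balance between image quality, speed, and diversity. Inference efficiency improves by 33% with no loss in image quality or diversity, and the LDM itself does not need to be modified.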

Link: https://arxiv.org/abs/2412.18596
Authors: Kanchana Ranasinghe,Sadeep Jayasumana,Andreas Veit,Ayan Chakrabarti,Daniel Glasner,Michael S Ryoo,Srikumar Ramalingam,Sanjiv Kumar
Affiliations: Unknown
Keywords: Latent Diffusion Models, Conditional Random Field, Latent Diffusion, multiple costly inference, Diffusion Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images; however, the latency incurred by multiple costly inference iterations can restrict their applicability. We introduce LatentCRF, a continuous Conditional Random Field (CRF) model, implemented as a neural network layer, that models the spatial and semantic relationships among the latent vectors in the LDM. By replacing some of the computationally-intensive LDM inference iterations with our lightweight LatentCRF, we achieve a superior balance between quality, speed and diversity. We increase inference efficiency by 33% with no loss in image quality or diversity compared to the full LDM. LatentCRF is an easy add-on, which does not require modifying the LDM.

[CV-8] ClassifyViStA: WCE Classification with Visual understanding through Segmentation and Attention

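[Quick Read]: This paper addresses the automated detection and classification of gastrointestinal (GI) bleeding in Wireless Capsule Endoscopy (WCE) video. WCE analysis usually depends on time-consuming manual review by experienced gastroenterologists, which is error-prone and inefficient, so the paper proposes an AI-based framework, ClassifyViStA. The framework combines a standard classification path with two specialized branches: an implicit attention branch and a segmentation branch. The implicit attention branch focuses on bleeding regions, while the segmentation branch generates accurate segmentation masks that are used for both classification and interpretability. The model is built on an ensemble of ResNet18 and VGG16 architectures to improve classification performance. For bleeding region detection, Soft Non-Maximum Suppression (Soft NMS) is combined with YOLOv8 to improve the handling of overlapping bounding boxes, making bleeding region detection more accurate and nuanced. Using the segmentation masks to explain classification results improves interpretability and offers a decision process similar to the way a gastroenterologist identifies bleeding regions. The solution both automates GI bleeding detection and provides interpretable results that can ease the burden on healthcare professionals and improve diagnostic efficiency.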

Link: https://arxiv.org/abs/2412.18591
Authors: S. Balasubramanian,Ammu Abhishek,Yedu Krishna,Darshan Gera
Affiliations: Unknown
Keywords: presents significant diagnostic, significant diagnostic challenges, Wireless Capsule Endoscopy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gastrointestinal (GI) bleeding is a serious medical condition that presents significant diagnostic challenges, particularly in settings with limited access to healthcare resources. Wireless Capsule Endoscopy (WCE) has emerged as a powerful diagnostic tool for visualizing the GI tract, but it requires time-consuming manual analysis by experienced gastroenterologists, which is prone to human error and inefficient given the increasing number of [...]. To address this challenge, we propose ClassifyViStA, an AI-based framework designed for the automated detection and classification of bleeding and non-bleeding frames from WCE videos. The model consists of a standard classification path, augmented by two specialized branches: an implicit attention branch and a segmentation branch. The attention branch focuses on the bleeding regions, while the segmentation branch generates accurate segmentation masks, which are used for classification and interpretability. The model is built upon an ensemble of ResNet18 and VGG16 architectures to enhance classification performance. For the bleeding region detection, we implement a Soft Non-Maximum Suppression (Soft NMS) approach with YOLOv8, which improves the handling of overlapping bounding boxes, resulting in more accurate and nuanced [...]. The system’s interpretability is enhanced by using the segmentation masks to explain the classification results, offering insights into the decision-making process similar to the way a gastroenterologist identifies bleeding regions. Our approach not only automates the detection of GI bleeding but also provides an interpretable solution that can ease the burden on healthcare professionals and improve diagnostic efficiency. Our code is available at ClassifyViStA.
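
The abstract mentions Soft NMS for handling overlapping bounding boxes without giving details. The sketch below shows the Gaussian variant of Soft-NMS in plain Python as one possible reading; the decay function, the `sigma` and `score_threshold` defaults, and the function names are assumptions, and the integration with YOLOv8 is not shown.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of dropping them.

    Returns a list of (box, decayed_score) in the order boxes were selected.
    """
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Select the remaining box with the highest current score.
        best_idx = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best_idx)
        kept.append((best_box, best_score))
        # Down-weight every remaining box by its overlap with the selected one.
        decayed = []
        for box, score in candidates:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_threshold:
                decayed.append((box, score))
        candidates = decayed
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
result = soft_nms(boxes, [0.9, 0.8, 0.7])
```

Unlike hard NMS, the heavily overlapping second box is not discarded: it survives with a reduced score, so two adjacent regions are less likely to be merged into one detection.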

[CV-9] Resolution-Robust 3D MRI Reconstruction with 2D Diffusion Priors: Diverse-Resolution Training Outperforms Interpolation

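[Quick Read]: This paper addresses deep learning-based 3D magnetic resonance imaging (MRI) reconstruction, where 3D training data is limited and existing methods are tied to a fixed voxel size, so performance degrades when the voxel size varies, as it often does in clinical practice. The paper proposes and studies several approaches to resolution-robust 3D MRI reconstruction with 2D diffusion priors. The main contributions are: 1) a simple resolution-robust variational 3D reconstruction approach based on diffusion-guided regularization of randomly sampled 2D slices; 2) an investigation of model-based approaches, including Gaussian splatting, neural representations, and infinite-dimensional diffusion models, which fail to close the performance gap in 3D MRI reconstruction; and 3) a data-centric approach that trains the diffusion model on several resolutions, which provides resolution robustness without compromising accuracy. Experiments show that the data-centric approach is the more effective way to handle resolution shifts.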

Link: https://arxiv.org/abs/2412.18584
Authors: Anselm Krainovic,Stefan Ruschke,Reinhard Heckel
Affiliations: Unknown
Keywords: magnetic resonance imaging, Deep learning-based, resonance imaging, MRI, MRI reconstruction
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Deep learning-based 3D imaging, in particular magnetic resonance imaging (MRI), is challenging because of limited availability of 3D training data. Therefore, 2D diffusion models trained on 2D slices are starting to be leveraged for 3D MRI reconstruction. However, as we show in this paper, existing methods pertain to a fixed voxel size, and performance degrades when the voxel size is varied, as it is often the case in clinical practice. In this paper, we propose and study several approaches for resolution-robust 3D MRI reconstruction with 2D diffusion priors. As a result of this investigation, we obtain a simple resolution-robust variational 3D reconstruction approach based on diffusion-guided regularization of randomly sampled 2D slices. This method provides competitive reconstruction quality compared to posterior sampling baselines. Towards resolving the sensitivity to resolution-shifts, we investigate state-of-the-art model-based approaches including Gaussian splatting, neural representations, and infinite-dimensional diffusion models, as well as a simple data-centric approach of training the diffusion model on several resolutions. Our experiments demonstrate that the model-based approaches fail to close the performance gap in 3D MRI. In contrast, the data-centric approach of training the diffusion model on various resolutions effectively provides a resolution-robust method without compromising accuracy.

[CV-10] 3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement

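[Quick Read]: This paper addresses the low resolution and suboptimal multi-view consistency of view synthesis and 3D model generation in neural rendering, which stem from the scarcity of high-quality 3D datasets and the inherent limitations of multi-view diffusion models. It proposes a new 3D enhancement pipeline, 3DEnhancer, whose core elements are: 1) a multi-view latent diffusion model that enhances coarse 3D inputs while preserving multi-view consistency; 2) a pose-aware encoder and a diffusion-based denoiser that refine low-quality multi-view images; and 3) data augmentation together with a multi-view attention module with epipolar aggregation, which keep the high-quality 3D outputs consistent across views. Unlike existing video-based approaches, 3DEnhancer supports seamless multi-view enhancement with improved coherence across viewing angles. Experiments show that it significantly outperforms existing methods on both multi-view enhancement and per-instance 3D optimization.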

Link: https://arxiv.org/abs/2412.18565
Authors: Yihang Luo,Shangchen Zhou,Yushi Lan,Xingang Pan,Chen Change Loy
Affiliations: Unknown
Keywords: suboptimal multi-view consistency, neural rendering, advances in neural, inherent limitations, generation are restricted
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Despite advances in neural rendering, due to the scarcity of high-quality 3D datasets and the inherent limitations of multi-view diffusion models, view synthesis and 3D model generation are restricted to low resolutions with suboptimal multi-view consistency. In this study, we present a novel 3D enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency. Our method includes a pose-aware encoder and a diffusion-based denoiser to refine low-quality multi-view images, along with data augmentation and a multi-view attention module with epipolar aggregation to maintain consistent, high-quality 3D outputs across views. Unlike existing video-based approaches, our model supports seamless multi-view enhancement with improved coherence across diverse viewing angles. Extensive evaluations show that 3DEnhancer significantly outperforms existing methods, boosting both multi-view enhancement and per-instance 3D optimization tasks.

[CV-11] The Key of Understanding Vision Tasks: Explanatory Instructions

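[Quick Read]: This paper addresses the gap between Computer Vision (CV) and Natural Language Processing (NLP) in zero-shot task generalization. Although CV has adopted many NLP milestones, such as large transformer models, extensive pre-training, and the auto-regression paradigm, its zero-shot task generalization remains limited. The paper argues that the discrete, terminological task definitions used in CV (such as "image segmentation") may be a key barrier. To address this, it introduces Explanatory Instructions, which define CV task objectives intuitively through detailed linguistic descriptions of the transformation from input image to output. The authors build a large-scale dataset of 12 million "image input → explanatory instruction → output" triplets and train an auto-regressive vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capability on previously seen tasks and shows strong zero-shot generalization to unseen CV tasks.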

Link: https://arxiv.org/abs/2412.18525
Authors: Yang Shen,Xiu-Shen Wei,Yifan Sun,Yuxin Song,Tao Yuan,Jian Jin,Heyang Xu,Yazhou Yao,Errui Ding
Affiliations: Unknown
Keywords: Natural Language Processing, Computer Vision, Language Processing, Natural Language, observed in Natural
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 40 pages

Abstract:Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we explore the idea that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), which may be a key barrier to zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks--due to these terminological definitions--deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input → explanatory instruction → output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be openly available on our GitHub repository.

[CV-12] HTR-JAND: Handwritten Text Recognition with Joint Attention Network and Knowledge Distillation

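[Quick Read]: This paper addresses the difficulties current Handwritten Text Recognition (HTR) systems have with historical documents, including diverse writing styles, degraded text quality, and the need for computational efficiency across languages and time periods. It proposes the HTR-JAND framework, which has three core components: (1) a CNN architecture that integrates FullGatedConv2d layers with Squeeze-and-Excitation blocks for adaptive feature extraction; (2) a Combined Attention mechanism that fuses Multi-Head Self-Attention with Proxima Attention for robust sequence modeling; and (3) a Knowledge Distillation framework that compresses the model while preserving accuracy through curriculum-based training. HTR-JAND also uses a multi-stage training strategy that combines curriculum learning, synthetic data generation, and multi-task learning for cross-dataset knowledge transfer, and it improves recognition accuracy further with context-aware T5 post-processing. Experiments report Character Error Rates (CER) of 1.23%, 1.02%, and 2.02% on the IAM, RIMES, and Bentham datasets respectively, and the Student model remains competitive with 48% fewer parameters.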

Link: https://arxiv.org/abs/2412.18524
Authors: Mohammed Hamdan,Abderrahmane Rahiche,Mohamed Cheriet
Affiliations: Unknown
Keywords: Handwritten Text Recognition, current Handwritten Text, degraded text quality, including diverse writing, diverse writing styles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite significant advances in deep learning, current Handwritten Text Recognition (HTR) systems struggle with the inherent complexity of historical documents, including diverse writing styles, degraded text quality, and computational efficiency requirements across multiple languages and time periods. This paper introduces HTR-JAND (HTR-JAND: Handwritten Text Recognition with Joint Attention Network and Knowledge Distillation), an efficient HTR framework that combines advanced feature extraction with knowledge distillation. Our architecture incorporates three key components: (1) a CNN architecture integrating FullGatedConv2d layers with Squeeze-and-Excitation blocks for adaptive feature extraction, (2) a Combined Attention mechanism fusing Multi-Head Self-Attention with Proxima Attention for robust sequence modeling, and (3) a Knowledge Distillation framework enabling efficient model compression while preserving accuracy through curriculum-based training. The HTR-JAND framework implements a multi-stage training approach combining curriculum learning, synthetic data generation, and multi-task learning for cross-dataset knowledge transfer. We enhance recognition accuracy through context-aware T5 post-processing, particularly effective for historical documents. Comprehensive evaluations demonstrate HTR-JAND’s effectiveness, achieving state-of-the-art Character Error Rates (CER) of 1.23%, 1.02%, and 2.02% on IAM, RIMES, and Bentham datasets respectively. Our Student model achieves a 48% parameter reduction (0.75M versus 1.5M parameters) while maintaining competitive performance through efficient knowledge transfer. Source code and pre-trained models are available on GitHub at this https URL.
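
For readers unfamiliar with the Character Error Rate figures quoted above, the following sketch computes CER in its common form: character-level edit distance divided by the number of reference characters, aggregated over a corpus. The abstract does not state the exact normalization or text preprocessing used in the paper, so this is an assumed definition, and the function names are hypothetical.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character edits divided by total reference characters."""
    edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    if chars == 0:
        raise ValueError("references contain no characters")
    return edits / chars


# One substituted character out of five reference characters.
example = cer(["hello"], ["hallo"])
```

Under this definition, a CER of 1.23% corresponds to roughly one character edit per 81 reference characters.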

[CV-13] VORTEX: A Spatial Computing Framework for Optimized Drone Telemetry Extraction from First-Person View Flight Data

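[Quick Read]: This paper addresses the extraction and analysis of drone telemetry from First Person View (FPV) Uncrewed Aerial System (UAS) footage. The core of the solution is the Visual Optical Recognition Telemetry EXtraction (VORTEX) system, which uses MMOCR, a PyTorch-based Optical Character Recognition (OCR) toolbox, together with image preprocessing techniques such as CLAHE enhancement and adaptive thresholding to extract telemetry variables from drone Heads Up Display (HUD) recordings. The study systematically investigates temporal sampling rates (1s, 5s, 10s, 15s, 20s) and coordinate processing methods to optimize spatial accuracy and computational efficiency. The 5-second sampling rate retains 64% of data points, reduces computational overhead by 80.5%, and keeps mean speed accuracy within 4.2% of the 1-second baseline. UTM Zone 33N projection and Haversine calculations give consistent results (within 0.1% of each other), whereas raw WGS84 coordinates underestimate distances and speeds. The authors describe the study as the first of its kind, providing quantitative benchmarks for a robust drone telemetry extraction and analysis framework built on open-source tools and spatial libraries.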

Link: https://arxiv.org/abs/2412.18505
Authors: James E. Gallagher,Edward J. Oughton
Affiliations: Unknown
Keywords: Uncrewed Aerial System, Visual Optical Recognition, Optical Character Recognition, Optical Recognition Telemetry, Person View
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents the Visual Optical Recognition Telemetry EXtraction (VORTEX) system for extracting and analyzing drone telemetry data from First Person View (FPV) Uncrewed Aerial System (UAS) footage. VORTEX employs MMOCR, a PyTorch-based Optical Character Recognition (OCR) toolbox, to extract telemetry variables from drone Heads Up Display (HUD) recordings, utilizing advanced image preprocessing techniques, including CLAHE enhancement and adaptive thresholding. The study optimizes spatial accuracy and computational efficiency through systematic investigation of temporal sampling rates (1s, 5s, 10s, 15s, 20s) and coordinate processing methods. Results demonstrate that the 5-second sampling rate, utilizing 4.07% of available frames, provides the optimal balance with a point retention rate of 64% and mean speed accuracy within 4.2% of the 1-second baseline while reducing computational overhead by 80.5%. Comparative analysis of coordinate processing methods reveals that while UTM Zone 33N projection and Haversine calculations provide consistently similar results (within 0.1% difference), raw WGS84 coordinates underestimate distances by 15-30% and speeds by 20-35%. Altitude measurements showed unexpected resilience to sampling rate variations, with only 2.1% variation across all intervals. This research is the first of its kind, providing quantitative benchmarks for establishing a robust framework for drone telemetry extraction and analysis using open-source tools and spatial libraries.
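
The Haversine calculation that the abstract compares against UTM projection can be sketched as follows. This is a generic illustration rather than VORTEX code: the mean Earth radius, the `(t_seconds, lat, lon)` layout of a telemetry fix, and the function names are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumed constant


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speed_mps(fix_a, fix_b):
    """Mean ground speed between two fixes of the form (t_seconds, lat, lon)."""
    (t1, la1, lo1), (t2, la2, lo2) = fix_a, fix_b
    dt = t2 - t1
    if dt <= 0:
        raise ValueError("fixes must be in increasing time order")
    return haversine_m(la1, lo1, la2, lo2) / dt


# Two hypothetical fixes sampled 5 seconds apart on the equator.
v = speed_mps((0, 0.0, 0.0), (5, 0.0, 0.001))
```

With a longer sampling interval, the straight-line distance between consecutive fixes smooths over any turns flown in between, which is one reason speed estimates can shift as the interval grows.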

[CV-14] A region-wide multi-year set of crop field boundary labels for Africa

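[Quick Read]: This paper addresses the lack of annual crop field maps for Africa, where agriculture is transforming rapidly. Understanding this transformation requires machine learning models trained on high-resolution remote sensing imagery. Using a custom labeling platform, the authors delineated field boundaries in 33,746 Planet images captured between 2017 and 2023 and collected 42,403 labels: Class 1 labels from tasks dedicated to assessing label quality, Class 2 labels from sites mapped once by a single labeller, and Class 4 labels from sites mapped by three or more labellers. Label uncertainty was further evaluated with a Bayesian risk metric. Label quality is lower for small-scale fields in 3-5 m resolution imagery, but previous work shows that such labels can still train effective field mapping models. The data also provide insight into regional agricultural characteristics, highlighting variation in field size and density. The imagery and vectorized labels, along with quality information, can be downloaded from two public repositories.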

Link: https://arxiv.org/abs/2412.18483
Authors: L.D. Estes,A. Wussah,M. Asipunu,M. Gathigi,P. Kovačič,J. Muhando,B.V. Yeboah,F.K. Addai,E.S. Akakpo,M.K. Allotey,P. Amkoya,E. Amponsem,K.D. Donkoh,N. Ha,E. Heltzel,C. Juma,R. Mdawida,A. Miroyo,J. Mucha,J. Mugami,F. Mwawaza,D.A. Nyarko,P. Oduor,K.N. Ohemeng,S.I.D. Segbefia,T. Tumbula,F. Wambua,G.H. Xeflide,S. Ye,F. Yeboah
Affiliations: Unknown
Keywords: undergoing rapid transformation, African agriculture, Class, agriculture is undergoing, undergoing rapid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 8 figures

Abstract:African agriculture is undergoing rapid transformation. Annual maps of crop fields are key to understanding the nature of this transformation, but such maps are currently lacking and must be developed using advanced machine learning models trained on high resolution remote sensing imagery. To enable the development of such models, we delineated field boundaries in 33,746 Planet images captured between 2017 and 2023 across the continent using a custom labeling platform with built-in procedures for assessing and mitigating label error. We collected 42,403 labels, including 7,204 labels arising from tasks dedicated to assessing label quality (Class 1 labels), 32,167 from sites mapped once by a single labeller (Class 2) and 3,032 labels from sites where 3 or more labellers were tasked to map the same location (Class 4). Class 1 labels were used to calculate labeller-specific quality scores, while Class 1 and 4 sites mapped by at least 3 labellers were used to further evaluate label uncertainty using a Bayesian risk metric. Quality metrics showed that label quality was moderately high (0.75) for measures of total field extent, but low regarding the number of individual fields delineated (0.33), and the position of field edges (0.05). These values are expected when delineating small-scale fields in 3-5 m resolution imagery, which can be too coarse to reliably distinguish smaller fields, particularly in dense croplands, and therefore requires substantial labeller judgement. Nevertheless, previous work shows that such labels can train effective field mapping models. Furthermore, this large, probabilistic sample on its own provides valuable insight into regional agricultural characteristics, highlighting variations in the median field size and density. The imagery and vectorized labels along with quality information is available for download from two public repositories.

[CV-15] Underwater Image Restoration via Polymorphic Large Kernel CNNs ICASSP2025

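[Quick read]: This paper addresses Underwater Image Restoration (UIR), a challenging computer vision problem caused by the complex degradation of images in underwater environments. Existing methods have achieved notable gains using Transformers and complex, parameter-heavy deep learning models; this paper instead proposes a lightweight model built on a pure convolutional neural network (CNN) architecture that achieves comparable results. The key to the solution is UIR-PolyKernel, which uses Polymorphic Large Kernel CNNs, combining large-kernel convolutions of different sizes and shapes to capture long-range dependencies in underwater images. The paper also proposes a Hybrid Domain Attention module that integrates frequency-domain and spatial-domain attention to enhance feature importance. By exploiting frequency-domain information, the model can capture hidden features that are hard for humans to perceive but crucial for identifying patterns in both underwater and on-air images, which improves generalization and robustness. Experiments show that UIR-PolyKernel achieves state-of-the-art performance on several benchmark datasets, demonstrating the potential of pure CNN architectures for complex image restoration tasks and striking a balance between performance and computational efficiency.

Link: https://arxiv.org/abs/2412.18459
Authors: Xiaojiao Guo,Yihang Dong,Xuhang Chen,Weiwen Chen,Zimeng Li,FuChen Zheng,Chi-Man Pun
Affiliation: Unknown
Keywords: computer vision due, Underwater Image Restoration, Image Restoration, Large Kernel CNNs, Polymorphic Large Kernel
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by ICASSP2025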

Abstract:Underwater Image Restoration (UIR) remains a challenging task in computer vision due to the complex degradation of images in underwater environments. While recent approaches have leveraged various deep learning techniques, including Transformers and complex, parameter-heavy models to achieve significant improvements in restoration effects, we demonstrate that pure CNN architectures with lightweight parameters can achieve comparable results. In this paper, we introduce UIR-PolyKernel, a novel method for underwater image restoration that leverages Polymorphic Large Kernel CNNs. Our approach uniquely combines large kernel convolutions of diverse sizes and shapes to effectively capture long-range dependencies within underwater imagery. Additionally, we introduce a Hybrid Domain Attention module that integrates frequency and spatial domain attention mechanisms to enhance feature importance. By leveraging the frequency domain, we can capture hidden features that may not be perceptible to humans but are crucial for identifying patterns in both underwater and on-air images. This approach enhances the generalization and robustness of our UIR model. Extensive experiments on benchmark datasets demonstrate that UIR-PolyKernel achieves state-of-the-art performance in underwater image restoration tasks, both quantitatively and qualitatively. Our results show that well-designed pure CNN architectures can effectively compete with more complex models, offering a balance between performance and computational efficiency. This work provides new insights into the potential of CNN-based approaches for challenging image restoration tasks in underwater environments. The code is available at this https URL.

[CV-16] 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

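[Quick read]: This paper addresses how to use semantic relationship information in 3D scenes to improve the performance of large language models (LLMs) on vision-language tasks. Existing methods rely mainly on object coordinates and ignore the semantic relationships between objects, which limits how well LLMs can understand and respond to queries about 3D scenes. The proposed solution, 3DGraphLLM, constructs a learnable representation of a 3D scene graph so that objects and their semantic relationships are fed to the LLM as input, improving its performance on 3D vision-language tasks. Experiments show that the method outperforms baselines that do not use semantic relationship information on several datasets (ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap).

Link: https://arxiv.org/abs/2412.18450
Authors: Tatiana Zemskova,Dmitry Yudin
Affiliation: Unknown
Keywords: compact scene model, Large Language Models, represents a compact, promising for robotic, scene graph represents
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: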

Abstract:A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, making its use promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLMs responses by adapting to the 3D world. However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose a method 3DGraphLLM for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at this https URL.

[CV-17] Fashionability-Enhancing Outfit Image Editing with Conditional Diffusion Models

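[Quick read]: This paper addresses the insufficient fashionability of output images in fashion image generation. Existing methods focus mainly on preserving body characteristics or following input prompts and neglect improving the inherent fashionability of the generated images. The paper proposes a new diffusion model-based approach with three key components: 1) fashionability enhancement, which ensures the generated image is more fashionable than the input; 2) preservation of body characteristics, which encourages the generated image to keep the original shape and proportions of the input; and 3) automatic fashion optimization, which does not rely on manual input or external prompts. The paper also uses two methods to collect training data, with multiple fashion experts scoring outfit images for fashionability through OpenSkill-based and five critical aspect-based pairwise comparisons. Experiments show that the method outperforms the baseline Fashion++ at generating more fashionable images.

Link: https://arxiv.org/abs/2412.18421
Authors: Qice Qin,Yuki Hirakawa,Ryotaro Shimizu,Takuya Furusawa,Edgar Simo-Serra
Affiliation: Unknown
Keywords: preserving body characteristics, images, domain has predominantly, predominantly focused, focused on preserving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures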

Abstract:Image generation in the fashion domain has predominantly focused on preserving body characteristics or following input prompts, but little attention has been paid to improving the inherent fashionability of the output images. This paper presents a novel diffusion model-based approach that generates fashion images with improved fashionability while maintaining control over key attributes. Key components of our method include: 1) fashionability enhancement, which ensures that the generated images are more fashionable than the input; 2) preservation of body characteristics, encouraging the generated images to maintain the original shape and proportions of the input; and 3) automatic fashion optimization, which does not rely on manual input or external prompts. We also employ two methods to collect training data for guidance while generating and evaluating the images. In particular, we rate outfit images using fashionability scores annotated by multiple fashion experts through OpenSkill-based and five critical aspect-based pairwise comparisons. These methods provide complementary perspectives for assessing and improving the fashionability of the generated images. The experimental results show that our approach outperforms the baseline Fashion++ in generating images with superior fashionability, demonstrating its effectiveness in producing more stylish and appealing fashion images.

[CV-18] Re-assessing ImageNet: How aligned is its single-label assumption with its multi-label nature?

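[Quick read]: This paper addresses the failure of the traditional single-label classification evaluation of ImageNet to capture the complex semantics of its images. The traditional approach assumes each image is described by a single concept or label, which may limit how well deep neural network (DNN) models learn image details. The paper advocates moving from single-label to multi-label benchmarking to assess DNN models more comprehensively. The key is a re-examination of the characteristics of ImageNet and its variant ImageNetV2, in particular the proportion of images with multiple labels. The study shows that the 11% to 14% accuracy drops on ImageNetV2 reported in the literature are largely attributable to the multi-label nature of the dataset rather than to a substantial degradation in model effectiveness. The paper also proposes a new evaluation approach to augment existing approaches for assessing how well models capture multiple labels. The study stresses the importance of accounting for the multi-label nature of ImageNet during benchmarking, to avoid incorrect conclusions about DNN effectiveness and to keep research focused on substantive challenges such as model reliability and robustness.

Link: https://arxiv.org/abs/2412.18409
Authors: Esla Timothy Anzaku,Seyed Amir Mousavi,Arnout Van Messem,Wesley De Neve
Affiliation: Unknown
Keywords: computer vision, traditionally evaluated, single concept, ImageNet, single-label classification
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures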

Abstract:ImageNet, an influential dataset in computer vision, is traditionally evaluated using single-label classification, which assumes that an image can be adequately described by a single concept or label. However, this approach may not fully capture the complex semantics within the images available in ImageNet, potentially hindering the development of models that effectively learn these intricacies. This study critically examines the prevalent single-label benchmarking approach and advocates for a shift to multi-label benchmarking for ImageNet. This shift would enable a more comprehensive assessment of the capabilities of deep neural network (DNN) models. We analyze the effectiveness of pre-trained state-of-the-art DNNs on ImageNet and one of its variants, ImageNetV2. Studies in the literature have reported unexpected accuracy drops of 11% to 14% on ImageNetV2. Our findings show that these reported declines are largely attributable to a characteristic of the dataset that has not received sufficient attention – the proportion of images with multiple labels. Taking this characteristic into account, the results of our experiments provide evidence that there is no substantial degradation in effectiveness on ImageNetV2. Furthermore, we acknowledge that ImageNet pre-trained models exhibit some capability at capturing the multi-label nature of the dataset even though they were trained under the single-label assumption. Consequently, we propose a new evaluation approach to augment existing approaches that assess this capability. Our findings highlight the importance of considering the multi-label nature of the ImageNet dataset during benchmarking. Failing to do so could lead to incorrect conclusions regarding the effectiveness of DNNs and divert research efforts from addressing other substantial challenges related to the reliability and robustness of these models.
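
To make the single-label versus multi-label distinction concrete, here is a minimal sketch of how top-1 accuracy shifts once a prediction counts as correct when it matches any acceptable label for the image. The image IDs, labels, and function name are hypothetical, and this is not the paper's evaluation protocol, only an illustration of the effect the abstract describes.

```python
def top1_accuracy(predictions, labels):
    """Fraction of images whose top-1 prediction is in that image's accepted label set."""
    hits = sum(1 for image_id, pred in predictions.items() if pred in labels[image_id])
    return hits / len(predictions)

# Hypothetical top-1 predictions for four images.
predictions = {"img1": "desk", "img2": "laptop", "img3": "cat", "img4": "mouse"}

# Single-label ground truth: exactly one accepted label per image.
single_label = {
    "img1": {"desk"},
    "img2": {"desk"},
    "img3": {"dog"},
    "img4": {"keyboard"},
}

# Multi-label ground truth: every object actually visible is accepted.
multi_label = {
    "img1": {"desk"},
    "img2": {"desk", "laptop"},
    "img3": {"dog"},
    "img4": {"keyboard", "mouse", "monitor"},
}

single_acc = top1_accuracy(predictions, single_label)
multi_acc = top1_accuracy(predictions, multi_label)
print(f"single-label top-1: {single_acc:.2f}, multi-label top-1: {multi_acc:.2f}")
```

In this toy data the model is wrong only on `img3`; the other two single-label "errors" are predictions of objects that really are in the image, which is the kind of apparent accuracy drop the study attributes to multi-label images.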

[CV-19] Extract Free Dense Misalignment from CLIP AAAI2025

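[Quick read]: This paper addresses outputs of current vision-language foundation models that are misaligned with their inputs, as seen in object hallucination in image captioning and prompt misalignment in text-to-image generation. Existing methods rely mainly on large foundation models used zero-shot or on models fine-tuned with human annotations, and their high computational cost limits scalability. The paper proposes CLIP4DM, a new method that detects dense misalignments with a pre-trained CLIP model, specifically pinpointing the misaligned words between an image and a text. The key contributions are: 1) a revamped gradient-based attribution computation in which the negative gradient of an individual text token indicates misalignment; and 2) F-CLIPScore, which aggregates misaligned attributions with a global alignment score. The method performs strongly on several dense misalignment detection benchmarks, is particularly good at detecting entity-level objects, intangible objects, and attributes, and remains efficient.

Link: https://arxiv.org/abs/2412.18404
Authors: JeongYeon Nam,Jinbae Im,Wonjae Kim,Taeho Kil
Affiliation: Unknown
Keywords: frequently produce outputs, Recent vision-language foundation, produce outputs misaligned, Recent vision-language, vision-language foundation models
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 14 figures, AAAI 2025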

Abstract:Recent vision-language foundation models still frequently produce outputs misaligned with their inputs, evidenced by object hallucination in captioning and prompt misalignment in the text-to-image generation model. Recent studies have explored methods for identifying misaligned elements, aiming not only to enhance interpretability but also to improve model performance. However, current approaches primarily rely on large foundation models in a zero-shot manner or fine-tuned models with human annotations, which limits scalability due to significant computational costs. This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP, specifically focusing on pinpointing misaligned words between image and text. We carefully revamp the gradient-based attribution computation method, enabling negative gradient of individual text tokens to indicate misalignment. We also propose F-CLIPScore, which aggregates misaligned attributions with a global alignment score. We evaluate our method on various dense misalignment detection benchmarks, covering various image and text domains and misalignment types. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models while maintaining superior efficiency. Our qualitative examples show that our method has a unique strength to detect entity-level objects, intangible objects, and attributes that can not be easily detected for existing works. We conduct ablation studies and analyses to highlight the strengths and limitations of our approach. Our code is publicly available at this https URL.

[CV-20] RDPM: Solve Diffusion Probabilistic Models via Recurrent Token Prediction

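[Quick read]: This paper addresses the gap between Diffusion Probabilistic Models (DPMs), which operate in the continuous latent space of a variational autoencoder (VAE), and the text generation methods used by large language models (LLMs). It proposes a new generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), whose key innovation is to enhance the diffusion process with a recurrent token prediction mechanism, pioneering the field of Discrete Diffusion. RDPM progressively introduces Gaussian noise into the latent representations of images and encodes them recurrently into vector-quantized tokens, yielding a distinctive diffusion process on a discrete-valued domain. The process iteratively predicts the token codes for subsequent timesteps, transforming initial standard Gaussian noise into the source data distribution, and its loss function aligns with that of GPT-style models. RDPM not only uses the diffusion process to ensure high-quality generation but also converts continuous signals into a series of high-fidelity discrete tokens, keeping a unified optimization strategy with other discrete tokens such as text. The approach offers a new direction for multimodal generative models, particularly for integrating continuous signal domains such as images, video, and audio with text.

Link: https://arxiv.org/abs/2412.18390
Authors: Wu Xiaoping,Hu Jie,Wei Xiaoming
Affiliation: Unknown
Keywords: Large Language Models, Large Language, Recurrent Diffusion Probabilistic, operating diffusion processes, Diffusion Probabilistic Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 8 pages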

Abstract:Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach for high-fidelity image synthesis, operating diffusion processes on continuous VAE latent, which significantly differ from the text generation methods employed by Large Language Models (LLMs). In this paper, we introduce a novel generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which enhances the diffusion process through a recurrent token prediction mechanism, thereby pioneering the field of Discrete Diffusion. By progressively introducing Gaussian noise into the latent representations of images and encoding them into vector-quantized tokens in a recurrent manner, RDPM facilitates a unique diffusion process on discrete-value domains. This process iteratively predicts the token codes for subsequent timesteps, transforming the initial standard Gaussian noise into the source data distribution, aligning with GPT-style models in terms of the loss function. RDPM demonstrates superior performance while benefiting from the speed advantage of requiring only a few inference steps. This model not only leverages the diffusion process to ensure high-quality generation but also converts continuous signals into a series of high-fidelity discrete tokens, thereby maintaining a unified optimization strategy with other discrete tokens, such as text. We anticipate that this work will contribute to the development of a unified model for multimodal generation, specifically by integrating continuous signal domains such as images, videos, and audio with text. We will release the code and model weights to the open-source community.
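
The idea of noising a latent and encoding each noised version as a vector-quantized token can be pictured with a toy sketch. Everything here is an assumption made for illustration: the four-entry 2D codebook, the noise levels, and the function names are hypothetical, the noise is drawn independently per level as a simplification, and none of this reflects RDPM's actual encoder, schedule, or recurrent predictor.

```python
import random

# Hypothetical codebook of four 2D code vectors; a real VQ codebook is learned.
codebook = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]

def quantize(vector):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, codebook[i])),
    )

def noisy_token_sequence(latent, noise_levels, seed=0):
    """Add Gaussian noise of each standard deviation to the latent, then quantize."""
    rng = random.Random(seed)
    tokens = []
    for sigma in noise_levels:
        noised = [x + rng.gauss(0.0, sigma) for x in latent]
        tokens.append(quantize(noised))
    return tokens

# One token per noise level, from clean (sigma = 0) to heavily noised.
tokens = noisy_token_sequence([0.9, -1.1], noise_levels=[0.0, 0.5, 1.0, 2.0])
print(tokens)
```

A model trained on such sequences in reverse order would predict the token of the next, less noisy step from the previous ones, which is what lets the objective take the same cross-entropy form as GPT-style next-token prediction.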

[CV-21] Switch-a-View: Few-Shot View Selection Learned from Edited Videos

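[Quick read]: This paper addresses how to automatically select which viewpoint to display at each timepoint when creating a how-to video. The key to the solution is training the model on unlabeled but human-edited video samples. The paper poses a pretext task that pseudo-labels segments of the training videos with their primary viewpoint (egocentric or exocentric) and discovers the patterns linking view-switch moments to the visual and spoken content of the video. With this predictor, the model takes an unseen multi-view video as input and decides which viewpoint to display when. The paper also introduces a few-shot training setting that lets the model adapt to a new data domain.

Link: https://arxiv.org/abs/2412.18386
Authors: Sagnik Majumder,Tushar Nagarajan,Ziad Al-Halah,Kristen Grauman
Affiliation: Unknown
Keywords: learns to automatically, automatically select, timepoint when creating, how-to video, video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: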

Abstract:We introduce Switch-a-View, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled–but human-edited–video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between those view-switch moments on the one hand and the visual and spoken content in the how-to video on the other hand. Armed with this predictor, our model then takes an unseen multi-view video as input and orchestrates which viewpoint should be displayed when. We further introduce a few-shot training setting that permits steering the model towards a new data domain. We demonstrate our idea on a variety of real-world video from HowTo100M and Ego-Exo4D and rigorously validate its advantages.

[CV-22] RSGaussian: 3D Gaussian Splatting with LiDAR for Aerial Remote Sensing Novel View Synthesis

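[Quick read]: This paper addresses the overgrowth and floater problems that 3D Gaussian Splatting exhibits in novel view synthesis (NVS) for aerial remote sensing scenes, using LiDAR point clouds as constraints. The key elements of the solution are as follows. First, LiDAR point clouds are introduced into 3D Gaussian Splatting as geometric benchmarks, ensuring that Gaussians grow and split along them, which suppresses overgrowth and floaters. Second, coordinate transformations with distortion parameters for camera models achieve pixel-level alignment between LiDAR point clouds and 2D images, facilitating heterogeneous data fusion and meeting the high-precision geo-alignment requirements of aerial remote sensing. Third, depth and plane consistency losses are added to the loss function to guide Gaussians toward real depth and plane representations, significantly improving depth estimation accuracy. Experiments show that the method achieves novel view synthesis on aerial remote sensing datasets that balances photo-realistic visual quality with high-precision geometric estimation.

Link: https://arxiv.org/abs/2412.18380
Authors: Yiling Yao,Wenjuan Zhang,Bing Zhang,Bocheng Li,Yaning Wang,Bowen Wang
Affiliation: Unknown
Keywords: Gaussian Splatting method, study presents RSGaussian, floaters issues occurs, aerial remote sensing, LiDAR point cloud
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: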

Abstract:This study presents RSGaussian, an innovative novel view synthesis (NVS) method for aerial remote sensing scenes that incorporates LiDAR point clouds as constraints into the 3D Gaussian Splatting method, which ensures that Gaussians grow and split along geometric benchmarks, addressing the overgrowth and floater issues that otherwise occur. Additionally, the approach introduces coordinate transformations with distortion parameters for camera models to achieve pixel-level alignment between LiDAR point clouds and 2D images, facilitating heterogeneous data fusion and achieving the high-precision geo-alignment required in aerial remote sensing. Depth and plane consistency losses are incorporated into the loss function to guide Gaussians towards real depth and plane representations, significantly improving depth estimation accuracy. Experimental results indicate that our approach has achieved novel view synthesis that balances photo-realistic visual quality and high-precision geometric estimation under aerial remote sensing datasets. Finally, we have also established and open-sourced a dense LiDAR point cloud dataset along with its corresponding aerial multi-view images, AIR-LONGYAN.
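
As a rough illustration of aligning LiDAR points with image pixels under lens distortion, the sketch below projects a 3D point that is already expressed in the camera frame through a pinhole model with two radial distortion coefficients, then compares rendered depth against LiDAR depth. The abstract does not specify the camera model or the loss, so the two-coefficient radial model, the L1 form of the depth term, and all names and numbers are assumptions.

```python
def project_point(point_cam, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Project a 3D point (camera frame, z forward) to pixel coordinates.

    Radial distortion is applied to the normalised image coordinates
    before the intrinsics. Returns (u, v, depth).
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    xn, yn = x / z, y / z
    r2 = xn * xn + yn * yn
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    u = fx * xn * radial + cx
    v = fy * yn * radial + cy
    return u, v, z

def depth_l1_loss(rendered, lidar):
    """Mean absolute depth error over pixels that have a LiDAR return (None = no return)."""
    pairs = [(r, d) for r, d in zip(rendered, lidar) if d is not None]
    return sum(abs(r - d) for r, d in pairs) / len(pairs)

# Hypothetical intrinsics and a point 10 m in front of the camera.
u, v, depth = project_point((1.0, 0.5, 10.0), fx=1000.0, fy=1000.0, cx=500.0, cy=400.0, k1=-0.1)
print(round(u, 4), round(v, 4), depth)

# Hypothetical rendered depths versus sparse LiDAR depths for three pixels.
print(depth_l1_loss([10.0, 12.0, 9.0], [10.5, None, 8.0]))
```

The world-to-camera rotation and translation are left out to keep the sketch short; in practice each LiDAR point would be transformed into the camera frame before this projection step.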

[CV-23] Addressing Spatial-Temporal Data Heterogeneity in Federated Continual Learning via Tail Anchor

[Quick read]: This paper addresses spatial and temporal data heterogeneity in Federated Continual Learning (FCL), which causes severe spatial-temporal catastrophic forgetting of local and previous knowledge. The paper proposes Federated Tail Anchor (FedTA), whose key idea is to mix a trainable Tail Anchor with the frozen output features to adjust their position in the feature space, thereby overcoming parameter-forgetting and output-forgetting. FedTA also includes three novel components: Input Enhancement, which improves the performance of pre-trained models on downstream tasks; Selective Input Knowledge Fusion, which fuses heterogeneous local knowledge on the server side; and Best Global Prototype Selection, which finds the best anchor point for each class in the feature space. Experiments show that FedTA not only outperforms existing FCL methods but also preserves the relative positions of features, unaffected by spatial and temporal changes.

Link: https://arxiv.org/abs/2412.18355
Authors: Hao Yu,Xin Yang,Le Zhang,Hanlin Gu,Tianrui Li,Lixin Fan,Qiang Yang
Affiliation: Unknown
Keywords: Federated continual learning, continual learning, Federated Tail Anchor, enhancing the applicability, real-world scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Federated continual learning (FCL) allows each client to continually update its knowledge from task streams, enhancing the applicability of federated learning in real-world scenarios. However, FCL needs to address not only spatial data heterogeneity between clients but also temporal data heterogeneity between tasks. In this paper, empirical experiments demonstrate that such input-level heterogeneity significantly affects the model’s internal parameters and outputs, leading to severe spatial-temporal catastrophic forgetting of local and previous knowledge. To this end, we propose Federated Tail Anchor (FedTA) to mix trainable Tail Anchor with the frozen output features to adjust their position in the feature space, thereby overcoming parameter-forgetting and output-forgetting. Moreover, three novel components are also included in FedTA: Input Enhancement for improving the performance of pre-trained models on downstream tasks; Selective Input Knowledge Fusion for fusion of heterogeneous local knowledge on the server side; and Best Global Prototype Selection for finding the best anchor point for each class in the feature space. Extensive experiments demonstrate that FedTA not only outperforms existing FCL methods but also effectively preserves the relative positions of features, remaining unaffected by spatial and temporal changes.

[CV-24] Mitigating Label Noise using Prompt-Based Hyperbolic Meta-Learning in Open-Set Domain Generalization

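[Quick read]: This paper addresses the challenge that label noise poses for Open-Set Domain Generalization (OSDG). Label noise is common in real-world datasets and can mislead model optimization, making open-set recognition in novel domains harder. The paper is the first to pose the task of Open-Set Domain Generalization under Noisy Labels (OSDG-NL) and builds dedicated benchmarks from widely used OSDG datasets, including PACS and DigitsDG. It evaluates baseline strategies that combine label denoising with OSDG methods and identifies the limitations of existing approaches in handling label noise. To address these limitations, it proposes HyProMeta, a framework that integrates hyperbolic category prototypes for label noise-aware meta-learning with a learnable new-category agnostic prompt that improves generalization to unseen classes. Experiments show that HyProMeta outperforms state-of-the-art methods on the newly established benchmarks.

Link: https://arxiv.org/abs/2412.18342
Authors: Kunyu Peng,Di Wen,Sarfraz M. Saquib,Yufan Chen,Junwei Zheng,David Schneider,Kailun Yang,Jiamin Wu,Alina Roitberg,Rainer Stiefelhagen
Affiliation: Unknown
Keywords: predict familiar categories, challenging task requiring, accurately predict familiar, task requiring models, Open-Set Domain Generalization
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: The source code of this work is released at this https URL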

Abstract:Open-Set Domain Generalization (OSDG) is a challenging task requiring models to accurately predict familiar categories while minimizing confidence for unknown categories to effectively reject them in unseen domains. While the OSDG field has seen considerable advancements, the impact of label noise–a common issue in real-world datasets–has been largely overlooked. Label noise can mislead model optimization, thereby exacerbating the challenges of open-set recognition in novel domains. In this study, we take the first step towards addressing Open-Set Domain Generalization under Noisy Labels (OSDG-NL) by constructing dedicated benchmarks derived from widely used OSDG datasets, including PACS and DigitsDG. We evaluate baseline approaches by integrating techniques from both label denoising and OSDG methodologies, highlighting the limitations of existing strategies in handling label noise effectively. To address these limitations, we propose HyProMeta, a novel framework that integrates hyperbolic category prototypes for label noise-aware meta-learning alongside a learnable new-category agnostic prompt designed to enhance generalization to unseen classes. Our extensive experiments demonstrate the superior performance of HyProMeta compared to state-of-the-art methods across the newly established benchmarks. The source code of this work is released at this https URL.
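
One way to picture how hyperbolic category prototypes could help spot noisy labels is to measure distances in the Poincaré ball and flag samples whose given label disagrees with their nearest prototype. The abstract does not say how HyProMeta uses its prototypes, so the flagging rule, the 2D embeddings, and the class names below are purely hypothetical; only the distance formula is the standard Poincaré-ball geodesic distance.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

# Hypothetical class prototypes embedded in a 2D Poincare ball.
prototypes = {"dog": (0.6, 0.0), "cat": (-0.6, 0.0)}

def nearest_prototype(embedding):
    """Name of the prototype closest to the embedding in hyperbolic distance."""
    return min(prototypes, key=lambda name: poincare_distance(embedding, prototypes[name]))

# Hypothetical (embedding, given label) pairs; the second label looks wrong.
samples = [((0.5, 0.1), "dog"), ((-0.4, 0.2), "dog"), ((-0.5, -0.1), "cat")]
suspect = [i for i, (emb, label) in enumerate(samples) if nearest_prototype(emb) != label]
print(suspect)
```

Distances in the Poincaré ball grow quickly near the boundary, which is one reason hyperbolic embeddings are often used for hierarchical or fine-grained category structure.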

[CV-25] FloNa: Floor Plan Guided Embodied Visual Navigation AAAI2025

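[Quick read]: This paper addresses the fact that existing visual navigation tasks overlook the floor plan, an important source of prior knowledge, and proposes a new navigation task: Floor Plan Visual Navigation (FloNa). Floor plans offer significant advantages for navigation but raise two key challenges: handling the spatial inconsistency between the floor plan and the actual scene layout to achieve collision-free navigation, and aligning observed images with the floor plan sketch despite their different modalities. To address these challenges, the paper proposes FloDiff, a diffusion policy framework that incorporates a localization module to facilitate alignment between the current observation and the floor plan. The authors also collected 20,000 navigation episodes across 117 scenes in the iGibson simulator for training and evaluation. Experiments show that the framework is effective and efficient at navigating unfamiliar scenes using floor plan knowledge.

Link: https://arxiv.org/abs/2412.18335
Authors: Jiaxin Li,Weiqi Huang,Zan Wang,Wei Liang,Huijun Di,Feng Liu
Affiliation: Unknown
Keywords: Humans naturally rely, rich geometrical guidance, provide rich geometrical, Floor Plan, Humans naturally
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025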

Abstract:Humans naturally rely on floor plans to navigate in unfamiliar environments, as they are readily available, reliable, and provide rich geometrical guidance. However, existing visual navigation settings overlook this valuable prior knowledge, leading to limited efficiency and accuracy. To eliminate this gap, we introduce a novel navigation task: Floor Plan Visual Navigation (FloNa), the first attempt to incorporate floor plan into embodied visual navigation. While the floor plan offers significant advantages, two key challenges emerge: (1) handling the spatial inconsistency between the floor plan and the actual scene layout for collision-free navigation, and (2) aligning observed images with the floor plan sketch despite their distinct modalities. To address these challenges, we propose FloDiff, a novel diffusion policy framework incorporating a localization module to facilitate alignment between the current observation and the floor plan. We further collect 20k navigation episodes across 117 scenes in the iGibson simulator to support the training and evaluation. Extensive experiments demonstrate the effectiveness and efficiency of our framework in unfamiliar scenes using floor plan knowledge. Project website: this https URL.

[CV-26] HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images

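[Quick read]: This paper addresses the significant limitations of existing Vision Question Answering (VQA) models in understanding human annotations on text-heavy images. The authors propose the Human Annotation Understanding and Recognition (HAUR) task and introduce the HAUR-5 dataset, which covers five common types of human annotations. The key to the solution is the development and training of the OCR-Mix model, which comprehensive cross-model comparisons show outperforms other models on the HAUR task.

Link: https://arxiv.org/abs/2412.18327
Authors: Yuchen Yang,Haoran Yan,Yanhao Chen,Qingqiang Wu,Qingqi Hong
Affiliation: Unknown
Keywords: Vision Question Answering, Question Answering, answer text-based questions, Vision Question, convey critical information
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: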

Abstract:Vision Question Answering (VQA) tasks use images to convey critical information to answer text-based questions, which is one of the most common forms of question answering in real-world scenarios. Numerous vision-text models exist today and have performed well on certain VQA tasks. However, these models exhibit significant limitations in understanding human annotations on text-heavy images. To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task. As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which encompasses five common types of human annotations. Additionally, we developed and trained our model, OCR-Mix. Through comprehensive cross-model comparisons, our results demonstrate that OCR-Mix outperforms other models in this task. Our dataset and model will be released soon.

[CV-27] Computer Vision-Driven Gesture Recognition: Toward Natural and Intuitive Human-Computer

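[Quick read]: This paper explores the application of computer vision-based natural gesture recognition in human-computer interaction, aiming to improve the fluency and naturalness of interaction through gesture recognition technology. In fields such as virtual reality, augmented reality, and smart homes, traditional input methods increasingly fail to meet users' expectations for interactive experience, and gestures, as an intuitive and convenient way to interact, have drawn growing attention. The paper proposes a gesture recognition method based on a three-dimensional hand skeleton model: it simulates the three-dimensional spatial distribution of hand joints to build a simplified hand skeleton structure and connects the palm to each finger joint to form dynamic and static gesture models of the hand, which improves the accuracy and efficiency of gesture recognition. Experiments show that the method recognizes a variety of gestures and maintains high recognition accuracy and real-time response in different environments. Combining it with multimodal technologies such as eye tracking could further raise the intelligence of the gesture recognition system and bring a richer, more intuitive user experience. Looking ahead, as computer vision, deep learning, and multimodal interaction technologies continue to develop, gesture-based natural interaction is expected to play an important role in a wider range of application scenarios.

Link: https://arxiv.org/abs/2412.18321
Authors: Fenghua Shao,Tong Zhang,Shang Gao,Qi Sun,Liuqingqing Yang
Affiliation: Unknown
Keywords: study mainly explores, fluency and naturalness, gesture recognition, interaction, human-computer interaction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: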

Abstract:This study mainly explores the application of natural gesture recognition based on computer vision in human-computer interaction, aiming to improve the fluency and naturalness of human-computer interaction through gesture recognition technology. In the fields of virtual reality, augmented reality and smart home, traditional input methods have gradually failed to meet the needs of users for interactive experience. As an intuitive and convenient interaction method, gestures have received more and more attention. This paper proposes a gesture recognition method based on a three-dimensional hand skeleton model. By simulating the three-dimensional spatial distribution of hand joints, a simplified hand skeleton structure is constructed. By connecting the palm and each finger joint, a dynamic and static gesture model of the hand is formed, which further improves the accuracy and efficiency of gesture recognition. Experimental results show that this method can effectively recognize various gestures and maintain high recognition accuracy and real-time response capabilities in different environments. In addition, combined with multimodal technologies such as eye tracking, the intelligence level of the gesture recognition system can be further improved, bringing a richer and more intuitive user experience. In the future, with the continuous development of computer vision, deep learning and multimodal interaction technology, natural interaction based on gestures will play an important role in a wider range of application scenarios and promote revolutionary progress in human-computer interaction.

[CV-28] Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

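[Quick read]: This paper aims to develop a multimodal large language model (MLLM) that understands and solves questions by learning to create each intermediate step of the reasoning involved. To this end, the authors propose Collective Monte Carlo Tree Search (CoMCTS), a new learning-to-reason method that brings the concept of collective learning into tree search for effective and efficient reasoning-path searching and learning. The core idea of CoMCTS is to leverage the collective knowledge of multiple models to collaboratively conjecture, search, and identify effective reasoning paths toward correct answers through four iterative operations: Expansion, Simulation and Error Positioning, Backpropagation, and Selection. Using CoMCTS, the authors construct Mulberry-260k, a dataset that provides a tree of rich, explicit, and well-defined reasoning nodes for each question. With Mulberry-260k, they perform collective supervised fine-tuning (collective SFT) to train Mulberry, a model with o1-like step-by-step reasoning and reflection capabilities. Experiments show the superiority of the proposed method on several benchmarks.

Link: https://arxiv.org/abs/2412.18319
Authors: Huanjin Yao,Jiaxing Huang,Wenhao Wu,Jingyi Zhang,Yibo Wang,Shunyu Liu,Yingjie Wang,Yuxin Song,Haocheng Feng,Li Shen,Dacheng Tao
Affiliation: Unknown
Keywords: reasoning involved till, Monte Carlo Tree, Collective Monte Carlo, Carlo Tree Search, aim to develop
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical report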

Abstract:In this work, we aim to develop an MLLM that understands and solves questions by learning to create each intermediate step of the reasoning involved till the final answer. To this end, we propose Collective Monte Carlo Tree Search (CoMCTS), a new learning-to-reason method for MLLMs, which introduces the concept of collective learning into "tree search" for effective and efficient reasoning-path searching and learning. The core idea of CoMCTS is to leverage collective knowledge from multiple models to collaboratively conjecture, search and identify effective reasoning paths toward correct answers via four iterative operations including Expansion, Simulation and Error Positioning, Backpropagation, and Selection. Using CoMCTS, we construct Mulberry-260k, a multimodal dataset with a tree of rich, explicit and well-defined reasoning nodes for each question. With Mulberry-260k, we perform collective SFT to train our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and Reflection capabilities. Extensive experiments demonstrate the superiority of our proposed methods on various benchmarks. Code will be available at this https URL
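
For readers unfamiliar with tree search over reasoning steps, the sketch below shows the generic expansion, selection (UCB), and backpropagation loop of plain Monte Carlo Tree Search. It is not CoMCTS: the pooling of candidates from several models is reduced to a list of strings, the Simulation and Error Positioning step is replaced by a hypothetical fixed reward table, and the class and function names are made up for this example.

```python
import math

class Node:
    def __init__(self, step, parent=None):
        self.step = step        # text of one reasoning step
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0        # running mean of backed-up rewards

    def ucb(self, c=1.4):
        """Upper confidence bound; unvisited nodes are tried first."""
        if self.visits == 0:
            return math.inf
        return self.value + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def expand(node, candidate_steps):
    """Attach candidate next steps (e.g. proposed by different models) as children."""
    for step in candidate_steps:
        node.children.append(Node(step, parent=node))

def select(node):
    """Descend from the node to a leaf, always following the highest UCB score."""
    while node.children:
        node = max(node.children, key=lambda child: child.ucb())
    return node

def backpropagate(node, reward):
    """Update visit counts and mean values from the leaf back to the root."""
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent

root = Node("question")
expand(root, ["step proposed by model A", "step proposed by model B"])
# Hypothetical rewards standing in for an evaluation of each reasoning step.
rewards = {"step proposed by model A": 1.0, "step proposed by model B": 0.0}
for _ in range(20):
    leaf = select(root)
    backpropagate(leaf, rewards[leaf.step])

best = max(root.children, key=lambda child: child.visits)
print(best.step, best.visits)
```

The search concentrates its visits on the step with the higher reward while still revisiting the weaker one occasionally, which is the exploration and exploitation trade-off that UCB encodes.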

[CV-29] Efficient and Context-Aware Label Propagation for Zero-/Few-Shot Training-Free Adaptation of Vision-Language Model

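[Quick read]: This paper addresses two shortcomings of current vision-language models (VLMs): the need for task-specific hyperparameter tuning and the under-use of test samples. Although existing VLMs have improved in label, training, and data efficiency, they still require per-task hyperparameter tuning and fail to fully exploit test samples. The paper therefore proposes a graph-based approach for label-efficient adaptation and inference. The method dynamically builds a graph over text prompts, few-shot examples, and test samples, and performs inference by label propagation without task-specific tuning. Unlike existing zero-shot label propagation techniques, it needs no additional unlabeled support set and exploits the test-sample manifold through dynamic graph expansion. The paper also introduces a context-aware feature re-weighting mechanism to improve task adaptation accuracy, and supports efficient graph expansion for real-time inductive inference. Extensive evaluation on downstream tasks such as fine-grained categorization and out-of-distribution generalization validates the method.

Link: https://arxiv.org/abs/2412.18303
Authors: Yushu Li, Yongyi Su, Adam Goodge, Kui Jia, Xun Xu
Affiliation: Unknown
Keywords: large pre-trained models, revolutionized machine learning, leveraging large pre-trained, Vision-language models, pre-trained models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: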

Abstract:Vision-language models (VLMs) have revolutionized machine learning by leveraging large pre-trained models to tackle various downstream tasks. Despite improvements in label, training, and data efficiency, many state-of-the-art VLMs still require task-specific hyperparameter tuning and fail to fully exploit test samples. To overcome these challenges, we propose a graph-based approach for label-efficient adaptation and inference. Our method dynamically constructs a graph over text prompts, few-shot examples, and test samples, using label propagation for inference without task-specific tuning. Unlike existing zero-shot label propagation techniques, our approach requires no additional unlabeled support set and effectively leverages the test sample manifold through dynamic graph expansion. We further introduce a context-aware feature re-weighting mechanism to improve task adaptation accuracy. Additionally, our method supports efficient graph expansion, enabling real-time inductive inference. Extensive evaluations on downstream tasks, such as fine-grained categorization and out-of-distribution generalization, demonstrate the effectiveness of our approach.
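
To make the label-propagation idea concrete, here is a minimal sketch of the textbook iteration F <- alpha * S * F + (1 - alpha) * Y on a small dense graph. It is an assumption-laden illustration, not the authors' formulation: the graph weights, the `alpha` value, and the function name `label_propagation` are hypothetical, and the paper's dynamic graph expansion and feature re-weighting are not modeled.

```python
def label_propagation(weights, seeds, num_classes, alpha=0.9, iters=100):
    """Spread seed labels over a graph and return one class index per node.

    weights: symmetric n x n list of non-negative edge weights (hypothetical affinities)
    seeds:   dict mapping a node index to its known class (e.g. prompts, few-shot examples)
    """
    n = len(weights)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}; isolated nodes get degree 1.
    deg = [sum(row) or 1.0 for row in weights]
    S = [[weights[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]
    # One-hot seed matrix Y; unlabeled nodes (test samples) are all zeros.
    Y = [[1.0 if seeds.get(i) == c else 0.0 for c in range(num_classes)] for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iters):
        # Each node mixes its neighbours' scores with its own seed label.
        F = [[alpha * sum(S[i][k] * F[k][c] for k in range(n)) + (1 - alpha) * Y[i][c]
              for c in range(num_classes)] for i in range(n)]
    return [max(range(num_classes), key=lambda c: F[i][c]) for i in range(n)]
```

For example, with two labeled nodes and two test nodes, each test node strongly connected to one labeled node and weakly to the other test node, every test node takes the class of its strongly connected neighbour.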

[CV-30] FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models

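[Quick read]: This paper examines how text-to-image (T2I) diffusion models, while generating high-quality images, can be misused to spread propaganda and for other malicious purposes. Specifically, attackers can embed biases into these models through simple fine-tuning so that specific trigger phrases produce targeted images, turning the models into propaganda tools. Building on this threat, the paper proposes FameBias, a T2I biasing attack that manipulates the embedding vectors of input prompts to generate images featuring specific public figures. Unlike prior methods, FameBias operates only on the input embedding vectors and requires no additional model training. In a comprehensive evaluation on Stable Diffusion V2, FameBias achieves a high attack success rate across multiple trigger-target pairs while preserving the semantic context of the original prompts. The key property is that it is training-free, which makes the attack more flexible and harder to notice in practice.

Link: https://arxiv.org/abs/2412.18302
Authors: Jaechul Roh, Andrew Yuan, Jinsong Mao
Affiliation: Unknown
Keywords: rapidly advanced, enabling the generation, textual descriptions, generation of high-quality, align closely
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: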

Abstract:Text-to-Image (T2I) diffusion models have rapidly advanced, enabling the generation of high-quality images that align closely with textual descriptions. However, this progress has also raised concerns about their misuse for propaganda and other malicious activities. Recent studies reveal that attackers can embed biases into these models through simple fine-tuning, causing them to generate targeted imagery when triggered by specific phrases. This underscores the potential for T2I models to act as tools for disseminating propaganda, producing images aligned with an attacker’s objective for end-users. Building on this concept, we introduce FameBias, a T2I biasing attack that manipulates the embeddings of input prompts to generate images featuring specific public figures. Unlike prior methods, FameBias operates solely on the input embedding vectors without requiring additional model training. We evaluate FameBias comprehensively using Stable Diffusion V2, generating a large corpus of images based on various trigger nouns and target public figures. Our experiments demonstrate that FameBias achieves a high attack success rate while preserving the semantic context of the original prompts across multiple trigger-target pairs.

[CV-31] Quo Vadis Anomaly Detection? LLMs and VLMs in the Spotlight

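[Quick read]: This paper is concerned with key challenges in video anomaly detection (VAD): interpretability, temporal reasoning, and generalization in dynamic, open-world scenarios. It reviews methods that integrate large language models (LLMs) and vision-language models (VLMs) along four key aspects: (i) enhancing interpretability through semantic insights and textual explanations, making visual anomalies easier to understand; (ii) capturing intricate temporal relationships to detect and localize dynamic anomalies across video frames; (iii) enabling few-shot and zero-shot detection to reduce reliance on large annotated datasets; and (iv) handling open-world and class-agnostic anomalies by using semantic understanding and motion features for spatiotemporal coherence. The common thread is exploiting the synergy between the visual and textual modalities offered by LLMs and VLMs, which the paper argues can redefine the direction of VAD research.

Link: https://arxiv.org/abs/2412.18298
Authors: Xi Ding, Lei Wang
Affiliation: Unknown
Keywords: large language models, witnessed significant advancements, addressing critical challenges, Video anomaly detection, language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Research report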

Abstract:Video anomaly detection (VAD) has witnessed significant advancements through the integration of large language models (LLMs) and vision-language models (VLMs), addressing critical challenges such as interpretability, temporal reasoning, and generalization in dynamic, open-world scenarios. This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024, focusing on four key aspects: (i) enhancing interpretability through semantic insights and textual explanations, making visual anomalies more understandable; (ii) capturing intricate temporal relationships to detect and localize dynamic anomalies across video frames; (iii) enabling few-shot and zero-shot detection to minimize reliance on large, annotated datasets; and (iv) addressing open-world and class-agnostic anomalies by using semantic understanding and motion features for spatiotemporal coherence. We highlight their potential to redefine the landscape of VAD. Additionally, we explore the synergy between visual and textual modalities offered by LLMs and VLMs, highlighting their combined strengths and proposing future directions to fully exploit the potential in enhancing video anomaly detection.

[CV-32] Towards understanding how attention mechanism works in deep learning

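[Quick read]: This paper seeks a deeper understanding of what the attention mechanism is and how it relates to traditional machine learning algorithms. By analyzing similarity computation and information propagation in manifold learning, clustering, and supervised learning, it shows that the self-attention mechanism in deep learning follows the same principles but is more flexible and adaptive. The paper decomposes self-attention into a learnable pseudo-metric function and an information propagation process based on similarity computation. It proves that, when the pseudo-metric is a transformation of a metric and certain assumptions hold, self-attention converges under continuous modeling to a drift-diffusion process, whose equation can be transformed into a heat equation under a new metric. The paper also gives a first-order analysis of attention with a general pseudo-metric function and proposes metric-attention, a modified attention mechanism based on metric learning; experiments show that it outperforms self-attention in training efficiency, accuracy, and robustness.

Link: https://arxiv.org/abs/2412.18288
Authors: Tianyu Ruan, Shihua Zhang
Affiliation: Unknown
Keywords: neural network architectures, mainstream neural network, Transformers and graph, graph attention networks, network architectures
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 38 pages, 6 figures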

Abstract:Attention mechanism has been extensively integrated within mainstream neural network architectures, such as Transformers and graph attention networks. Yet, its underlying working principles remain somewhat elusive. What is its essence? Are there any connections between it and traditional machine learning algorithms? In this study, we inspect the process of computing similarity using classic metrics and vector space properties in manifold learning, clustering, and supervised learning. We identify the key characteristics of similarity computation and information propagation in these methods and demonstrate that the self-attention mechanism in deep learning adheres to the same principles but operates more flexibly and adaptively. We decompose the self-attention mechanism into a learnable pseudo-metric function and an information propagation process based on similarity computation. We prove that the self-attention mechanism converges to a drift-diffusion process through continuous modeling provided the pseudo-metric is a transformation of a metric and certain reasonable assumptions hold. This equation could be transformed into a heat equation under a new metric. In addition, we give a first-order analysis of attention mechanism with a general pseudo-metric function. This study aids in understanding the effects and principle of attention mechanism through physical intuition. Finally, we propose a modified attention mechanism called metric-attention by leveraging the concept of metric learning to facilitate the ability to learn desired metrics more effectively. Experimental results demonstrate that it outperforms self-attention regarding training efficiency, accuracy, and robustness.
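
The decomposition described above, a similarity function followed by information propagation, can be sketched in a few lines. The code below is only an illustration under stated assumptions: it leaves out learned projections, `neg_sq_distance` is a hypothetical stand-in for a distance-based similarity, and it is not the paper's metric-attention.

```python
import math


def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, similarity):
    """Step 1: score every key against the query. Step 2: average values by those weights."""
    out = []
    for q in queries:
        w = softmax([similarity(q, k) for k in keys])
        out.append([sum(wi * v[d] for wi, v in zip(w, values))
                    for d in range(len(values[0]))])
    return out


def dot_similarity(q, k):
    """Scaled dot product, the similarity used in standard self-attention."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def neg_sq_distance(q, k):
    """Hypothetical distance-based similarity: closer points get higher scores."""
    return -sum((a - b) ** 2 for a, b in zip(q, k))
```

Swapping `dot_similarity` for `neg_sq_distance` changes only the similarity step; the propagation step (a weighted average of values) stays the same, which is the separation the abstract describes.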

[CV-33] Improved Feature Generating Framework for Transductive Zero-shot Learning

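[Quick read]: This paper addresses the harmful effect of unseen class priors on model performance in transductive zero-shot learning (TZSL). It finds that even a marginal prior bias causes a substantial drop in accuracy, and traces the root cause to the unconditional unseen discriminator used in existing TZSL frameworks. The paper proposes an improved feature generation framework, I-VAEGAN, whose core contributions are Pseudo-conditional Feature Adversarial (PFA) learning and Variational Embedding Regression (VER). PFA removes the need for prior estimation by explicitly injecting predicted semantics as pseudo conditions; VER uses reconstructive pre-training to learn class statistics and thereby obtain better semantic regression. I-VAEGAN achieves state-of-the-art TZSL accuracy across multiple benchmarks and priors.

Link: https://arxiv.org/abs/2412.18282
Authors: Zihan Ye, Xinyuan Ru, Shiming Chen, Yaochu Jin, Kaizhu Huang, Xiaobo Jin
Affiliation: Unknown
Keywords: Generative Adversarial Networks, powerful generative models, Feature Generative Adversarial, producing high-quality representations, Networks have emerged
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: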

Abstract:Feature Generative Adversarial Networks have emerged as powerful generative models in producing high-quality representations of unseen classes within the scope of Zero-shot Learning (ZSL). This paper delves into the pivotal influence of unseen class priors within the framework of transductive ZSL (TZSL) and illuminates the finding that even a marginal prior bias can result in substantial accuracy declines. Our extensive analysis uncovers that this inefficacy fundamentally stems from the utilization of an unconditional unseen discriminator - a core component in existing TZSL. We further establish that the detrimental effects of this component are inevitable unless the generator perfectly fits class-specific distributions. Building on these insights, we introduce our Improved Feature Generation Framework, termed I-VAEGAN, which incorporates two novel components: Pseudo-conditional Feature Adversarial (PFA) learning and Variational Embedding Regression (VER). PFA circumvents the need for prior estimation by explicitly injecting the predicted semantics as pseudo conditions for unseen classes premised by precise semantic regression. Meanwhile, VER utilizes reconstructive pre-training to learn class statistics, obtaining better semantic regression. Our I-VAEGAN achieves state-of-the-art TZSL accuracy across various benchmarks and priors. Our code would be released upon acceptance.

[CV-34] Towards Modality Generalization: A Benchmark and Prospective Analysis

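[Quick read]: This paper addresses how multi-modal learning models generalize to modalities unseen during training. Because of resource and privacy constraints, real-world scenarios often present novel modalities that were not seen in training, which existing methods struggle to handle. The paper introduces Modality Generalization (MG) and distinguishes two cases: weak MG, where seen and unseen modalities can be mapped into a joint embedding space via existing perceptors, and strong MG, where no such mapping exists. To support research in this area, it proposes a comprehensive benchmark featuring multi-modal algorithms and adapts existing generalization-focused methods. Extensive experiments highlight the complexity of MG, expose the limitations of existing methods, and identify key directions for future research. The work lays a foundation for robust, adaptable multi-modal models that can handle unseen modalities in realistic scenarios.

Link: https://arxiv.org/abs/2412.18277
Authors: Xiaohao Liu, Xiaobo Xia, Zhuo Huang, Tat-Seng Chua
Affiliation: Unknown
Keywords: achieving superior performance, achieved remarkable success, achieving superior, uni-modal approaches, learning has achieved
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: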

Abstract:Multi-modal learning has achieved remarkable success by integrating information from various modalities, achieving superior performance in tasks like recognition and retrieval compared to uni-modal approaches. However, real-world scenarios often present novel modalities that are unseen during training due to resource and privacy constraints, a challenge current methods struggle to address. This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities. We define two cases: weak MG, where both seen and unseen modalities can be mapped into a joint embedding space via existing perceptors, and strong MG, where no such mappings exist. To facilitate progress, we propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization. Extensive experiments highlight the complexity of MG, exposing the limitations of existing methods and identifying key directions for future research. Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.

[CV-35] UNet--: Memory-Efficient and Feature-Enhanced Network Architecture based on U-Net with Reduced Skip-Connections ACCV2024

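[Quick read]: This paper addresses the high memory consumption of U-Net models on resource-constrained devices, in particular the burden of keeping the feature maps used by skip-connections in memory until the decoding stage. It proposes a general method and architecture built on a Multi-Scale Information Aggregation Module (MSIAM) and an Information Enhancement Module (IEM), which reduce memory consumption and generate enhanced feature maps to improve network performance. MSIAM aggregates multi-scale feature maps into a single-scale feature map to reduce memory use, and IEM then expands and enhances the aggregated feature maps back into multi-scale feature maps. Applying the method to NAFNet, a SOTA image restoration model, yields UNet--, a memory-efficient and feature-enhanced architecture that reduces the memory demand of the skip-connections by 93.3% while improving performance over NAFNet. The method also improves both memory consumption and accuracy across multiple vision tasks.

Link: https://arxiv.org/abs/2412.18276
Authors: Lingxiao Yin, Wei Tao, Dongyue Zhao, Tadayuki Ito, Kinya Osa, Masami Kato, Tse-Wei Chen
Affiliation: Unknown
Keywords: feature maps, Information Aggregation Module, Information Enhancement Module, components have demonstrated, demonstrated effectiveness
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 17 pages, 7 figures, accepted by ACCV2024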

Abstract:U-Net models with encoder, decoder, and skip-connections components have demonstrated effectiveness in a variety of vision tasks. The skip-connections transmit fine-grained information from the encoder to the decoder. It is necessary to maintain the feature maps used by the skip-connections in memory before the decoding stage. Therefore, they are not friendly to devices with limited resources. In this paper, we propose a universal method and architecture to reduce the memory consumption and meanwhile generate enhanced feature maps to improve network performance. To this end, we design a simple but effective Multi-Scale Information Aggregation Module (MSIAM) in the encoder and an Information Enhancement Module (IEM) in the decoder. The MSIAM aggregates multi-scale feature maps into single-scale with less memory. After that, the aggregated feature maps can be expanded and enhanced to multi-scale feature maps by the IEM. By applying the proposed method on NAFNet, a SOTA model in the field of image restoration, we design a memory-efficient and feature-enhanced network architecture, UNet--. The memory demand by the skip-connections in the UNet-- is reduced by 93.3%, while the performance is improved compared to NAFNet. Furthermore, we show that our proposed method can be generalized to multiple visual tasks, with consistent improvements in both memory consumption and network accuracy compared to the existing efficient architectures.

[CV-36] Sampling Bag of Views for Open-Vocabulary Object Detection

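[Quick read]: This paper addresses the weakness of existing open-vocabulary object detection (OVD) methods in capturing the compositional structure of semantic concepts in an image. Existing methods detect unseen categories by aligning object region embeddings with vision-language model (VLM) features, and a recent approach based on bags of region embeddings often fails to capture the contextual concepts of each region, leading to noisy compositional structures, only marginal performance gains, and reduced efficiency. The paper proposes a concept-based alignment method that groups contextually related "concepts" into a bag and adjusts the scale of the concepts within the bag for more effective embedding alignment. Combined with Faster R-CNN, the method improves on prior work by 2.6 box AP50 and 0.5 mask AP on novel categories in the open-vocabulary COCO and LVIS benchmarks, and reduces CLIP computation in FLOPs by 80.3%. Experiments show that it outperforms previous state-of-the-art models on the OVD datasets.

Link: https://arxiv.org/abs/2412.18273
Authors: Hojun Choi, Junsuk Choe, Hyunjung Shim
Affiliation: Unknown
Keywords: Existing open-vocabulary object, aligning object region, open-vocabulary object detection, testing unseen categories, object detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages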

Abstract:Existing open-vocabulary object detection (OVD) develops methods for testing unseen categories by aligning object region embeddings with corresponding VLM features. A recent study leverages the idea that VLMs implicitly learn compositional structures of semantic concepts within the image. Instead of using an individual region embedding, it utilizes a bag of region embeddings as a new representation to incorporate compositional structures into the OVD task. However, this approach often fails to capture the contextual concepts of each region, leading to noisy compositional structures. This results in only marginal performance improvements and reduced efficiency. To address this, we propose a novel concept-based alignment method that samples a more powerful and efficient compositional structure. Our approach groups contextually related "concepts" into a bag and adjusts the scale of concepts within the bag for more effective embedding alignment. Combined with Faster R-CNN, our method achieves improvements of 2.6 box AP50 and 0.5 mask AP over prior work on novel categories in the open-vocabulary COCO and LVIS benchmarks. Furthermore, our method reduces CLIP computation in FLOPs by 80.3% compared to previous research, significantly enhancing efficiency. Experimental results demonstrate that the proposed method outperforms previous state-of-the-art models on the OVD datasets.

[CV-37] AdaCo: Overcoming Visual Foundation Model Noise in 3D Semantic Segmentation via Adaptive Label Correction AAAI

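[Quick read]: This paper addresses the limited performance of Visual Foundation Models (VFMs) for 3D semantic segmentation on large-scale outdoor datasets, which is caused by the scarcity of accurate supervision signals, noise from variable outdoor conditions, and the abundance of unknown objects. It proposes a label-free learning method, Adaptive Label Correction (AdaCo). The key components are: 1) a Cross-modal Label Generation Module (CLGM), which provides cross-modal supervision using the interpretive capabilities of VFMs; 2) an Adaptive Noise Corrector (ANC), which iteratively updates and adjusts noisy samples during training; and 3) an Adaptive Robust Loss (ARL), which modulates each sample's sensitivity to noisy supervision and prevents the underfitting that robust losses can cause. Together these improve the performance of label-free learning networks on 3D semantic segmentation.

Link: https://arxiv.org/abs/2412.18255
Authors: Pufan Zou, Shijia Zhao, Weijie Huang, Qiming Xia, Chenglu Wen, Wei Li, Cheng Wang
Affiliation: Unknown
Keywords: Visual Foundation Models, Visual Foundation, Foundation Models, remarkable generalization performance, Adaptive Label Correction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 AAAI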

Abstract:Recently, Visual Foundation Models (VFMs) have shown a remarkable generalization performance in 3D perception tasks. However, their effectiveness in large-scale outdoor datasets remains constrained by the scarcity of accurate supervision signals, the extensive noise caused by variable outdoor conditions, and the abundance of unknown objects. In this work, we propose a novel label-free learning method, Adaptive Label Correction (AdaCo), for 3D semantic segmentation. AdaCo first introduces the Cross-modal Label Generation Module (CLGM), providing cross-modal supervision with the formidable interpretive capabilities of the VFMs. Subsequently, AdaCo incorporates the Adaptive Noise Corrector (ANC), updating and adjusting the noisy samples within this supervision iteratively during training. Moreover, we develop an Adaptive Robust Loss (ARL) function to modulate each sample’s sensitivity to noisy supervision, preventing potential underfitting issues associated with robust loss. Our proposed AdaCo can effectively mitigate the performance limitations of label-free learning networks in 3D semantic segmentation tasks. Extensive experiments on two outdoor benchmark datasets highlight the superior performance of our method.

[CV-38] RaCMC: Residual-Aware Compensation Network with Multi-Granularity Constraints for Fake News Detection

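[Quick read]: This paper addresses cross-modal feature fusion and refinement in multimodal fake news detection, where existing methods still struggle with cross-modal feature interaction and refinement for classification. The authors propose a Residual-aware Compensation Network with Multi-granularity Constraints (RaCMC), which aims to sufficiently interact and fuse cross-modal features while amplifying the differences between real and fake news. The key components are: 1) a multiscale residual-aware compensation module that interacts and fuses features at different scales and ensures both the consistency and exclusivity of feature interaction, yielding high-quality features; 2) a multi-granularity constraints module that limits the distribution of both the news as a whole and the image-text pairs within it, amplifying the differences between real and fake news at the news and feature levels; and 3) a dominant feature fusion reasoning module that evaluates news authenticity from the perspectives of both consistency and inconsistency. Experiments on three public datasets, Weibo17, Politifact, and GossipCop, show the superiority of the method.

Link: https://arxiv.org/abs/2412.18254
Authors: Xinquan Yu, Ziqi Sheng, Wei Lu, Xiangyang Luo, Jiantao Zhou
Affiliation: Unknown
Keywords: adverse effects caused, automatically identify real, Multimodal fake, automatically identify, mitigating the adverse
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures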

Abstract:Multimodal fake news detection aims to automatically identify real or fake news, thereby mitigating the adverse effects caused by such misinformation. Although prevailing approaches have demonstrated their effectiveness, challenges persist in cross-modal feature fusion and refinement for classification. To address this, we present a residual-aware compensation network with multi-granularity constraints (RaCMC) for fake news detection, that aims to sufficiently interact and fuse cross-modal features while amplifying the differences between real and fake news. First, a multiscale residual-aware compensation module is designed to interact and fuse features at different scales, and ensure both the consistency and exclusivity of feature interaction, thus acquiring high-quality features. Second, a multi-granularity constraints module is implemented to limit the distribution of both the news overall and the image-text pairs within the news, thus amplifying the differences between real and fake news at the news and feature levels. Finally, a dominant feature fusion reasoning module is developed to comprehensively evaluate news authenticity from the perspectives of both consistency and inconsistency. Experiments on three public datasets, including Weibo17, Politifact and GossipCop, reveal the superiority of the proposed method.

[CV-39] Band Prompting Aided SAR and Multi-Spectral Data Fusion Framework for Local Climate Zone Classification ICASSP2025

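[Quick read]: This paper addresses the challenge of fusing synthetic aperture radar (SAR) and multi-spectral data for local climate zone (LCZ) classification. Because the two data types have distinct physical properties and effective fusion guidance is lacking, classification performance still has room to improve. The paper proposes a band prompting aided data fusion framework, BP-LCZ. Its key idea is a band group prompting (BGP) strategy that uses textual prompts associated with band groups to guide the model in learning the physical attributes of different bands and the semantics of the categories, which augments the fused feature and improves LCZ classification. The paper also introduces a multivariate supervised matrix (MSM) based training strategy that completes the supervised information to alleviate confusion between positive and negative samples. Experimental results demonstrate the effectiveness and superiority of the framework.

Link: https://arxiv.org/abs/2412.18235
Authors: Haiyan Lan, Shujun Li, Mingjie Xie, Xuanjia Zhao, Hongning Liu, Pengming Feng, Dongli Xu, Guangjun He, Jian Guan
Affiliation: Unknown
Keywords: Local climate zone, LCZ classification performance, Local climate, LCZ classification, improve LCZ classification
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025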

Abstract:Local climate zone (LCZ) classification is of great value for understanding the complex interactions between urban development and local climate. Recent studies have increasingly focused on the fusion of synthetic aperture radar (SAR) and multi-spectral data to improve LCZ classification performance. However, it remains challenging due to the distinct physical properties of these two types of data and the absence of effective fusion guidance. In this paper, a novel band prompting aided data fusion framework is proposed for LCZ classification, namely BP-LCZ, which utilizes textual prompts associated with band groups to guide the model in learning the physical attributes of different bands and semantics of various categories inherent in SAR and multi-spectral data to augment the fused feature, thus enhancing LCZ classification performance. Specifically, a band group prompting (BGP) strategy is introduced to align the visual representation effectively at the level of band groups, which also facilitates a more adequate extraction of semantic information of different bands with textual information. In addition, a multivariate supervised matrix (MSM) based training strategy is proposed to alleviate the problem of positive and negative sample confusion by completing the supervised information. The experimental results demonstrate the effectiveness and superiority of the proposed data fusion framework.

[CV-40] Efficient Detection Framework Adaptation for Edge Computing: A Plug-and-play Neural Network Toolbox Enabling Edge Deployment

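[Quick read]: This paper addresses three main challenges for deep learning-based object detection in edge computing: 1) balancing detection precision against lightweight models; 2) the limited adaptability of generalized deployment designs; and 3) insufficient real-world validation. The authors propose the Edge Detection Toolbox (ED-TOOLBOX), whose core components are: 1) a lightweight Reparameterized Dynamic Convolutional Network (Rep-DConvNet) that uses weighted multi-shape convolutional branches to improve detection performance; 2) a Sparse Cross-Attention (SC-A) network that uses a localized-mapping-assisted self-attention mechanism for adaptive feature transfer; and 3) an Efficient Head integrated into the YOLO framework to accelerate edge model optimization. The authors also create the Helmet Band Detection Dataset (HBDD) to cover band fastening, a critical safety factor overlooked in helmet detection. Experiments show that ED-TOOLBOX-optimized models outperform six state-of-the-art methods in visual surveillance simulations while achieving real-time, accurate detection.

Link: https://arxiv.org/abs/2412.18230
Authors: Jiaqi Wu, Shihao Zhang, Simin Chen, Lixu Wang, Zehua Wang, Wei Chen, Fangyuan He, Zijian Tian, F. Richard Yu, Victor C. M. Leung
Affiliation: Unknown
Keywords: deploying deep learning-based, deep learning-based object, detection, learning-based object detection, time-sensitive scenarios
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: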

Abstract:Edge computing has emerged as a key paradigm for deploying deep learning-based object detection in time-sensitive scenarios. However, existing edge detection methods face challenges: 1) difficulty balancing detection precision with lightweight models, 2) limited adaptability of generalized deployment designs, and 3) insufficient real-world validation. To address these issues, we propose the Edge Detection Toolbox (ED-TOOLBOX), which utilizes generalizable plug-and-play components to adapt object detection models for edge environments. Specifically, we introduce a lightweight Reparameterized Dynamic Convolutional Network (Rep-DConvNet) featuring weighted multi-shape convolutional branches to enhance detection performance. Additionally, we design a Sparse Cross-Attention (SC-A) network with a localized-mapping-assisted self-attention mechanism, enabling a well-crafted joint module for adaptive feature transfer. For real-world applications, we incorporate an Efficient Head into the YOLO framework to accelerate edge model optimization. To demonstrate practical impact, we identify a gap in helmet detection – overlooking band fastening, a critical safety factor – and create the Helmet Band Detection Dataset (HBDD). Using ED-TOOLBOX-optimized models, we address this real-world task. Extensive experiments validate the effectiveness of ED-TOOLBOX, with edge detection models outperforming six state-of-the-art methods in visual surveillance simulations, achieving real-time and accurate performance. These results highlight ED-TOOLBOX as a superior solution for edge object detection.

[CV-41] Expand VSR Benchmark for VLLM to Expertize in Spatial Rules

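[Quick read]: This paper addresses the weak performance of Vision Large Language Models (VLLMs) on visual spatial reasoning (VSR). Specifically, current VLLMs are over-sensitive to language instructions yet under-sensitive to visual positional information. The paper first diagnoses current VLLMs with the VSR dataset and proposes a unified test set. The solution has two parts: controllably expanding spatially positioned image data with diffusion models, and integrating the original visual encoder (CLIP) with three other strong visual encoders (SigLIP, SAM, and DINO). With this combination of data expansion and model-structure changes, the paper obtains VSR Expert (VSRE), a VLLM that improves accuracy on the VSR test set by more than 27% and also performs well on relevant subsets of other evaluation benchmarks.

Link: https://arxiv.org/abs/2412.18224
Authors: Peijin Xie, Lin Sun, Bingquan Liu, Dexin Wang, Xiangzheng Zhang, Chengjie Sun, Jiajia Zhang
Affiliation: Unknown
Keywords: Distinguishing spatial relations, requires fine-grained perception, Distinguishing spatial, Vision Large Language, perception on cross-instance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: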

Abstract:Distinguishing spatial relations is a basic part of human cognition that requires fine-grained cross-instance perception. Although benchmarks like MME, MMBench, and SEED have comprehensively evaluated various capabilities, including visual spatial reasoning (VSR), there is still a lack of evaluation and optimization datasets of sufficient quantity and quality for Vision Large Language Models (VLLMs) that specifically target visual positional reasoning. To handle this, we first diagnosed current VLLMs with the VSR dataset and proposed a unified test set. We found that current VLLMs exhibit a contradiction of over-sensitivity to language instructions and under-sensitivity to visual positional information. By expanding the original benchmark in two respects, tuning data and model structure, we mitigated this phenomenon. To our knowledge, we are the first to controllably expand spatially positioned image data using diffusion models, and we integrated the original visual encoder (CLIP) with three other powerful visual encoders (SigLIP, SAM, and DINO). After conducting combination experiments on scaling data and models, we obtained a VLLM VSR Expert (VSRE) that not only generalizes better to different instructions but also accurately distinguishes differences in visual positional information. VSRE achieved over a 27% increase in accuracy on the VSR test set. It becomes a performant VLLM on the position reasoning of both the VSR dataset and relevant subsets of other evaluation benchmarks. We open-sourced the expanded model with data and Appendix at this https URL and hope it will accelerate advancements in VLLM on VSR learning.

[CV-42] GIMS: Image Matching System Based on Adaptive Graph Construction and Graph Neural Network

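[Quick read]: This paper addresses feature-based image matching, in particular the challenges of building precise and robust graph structures and improving matching performance. It proposes an adaptive graph construction method that uses a filtering mechanism based on distance and dynamic threshold similarity to adjust the criteria for incorporating new vertices, which avoids redundancy and yields more precise graph structures. It further combines the vertex processing capabilities of graph neural networks (GNNs) with the global awareness of Transformers to strengthen the model's representation of spatial and feature information within graph structures. This hybrid model captures the interrelationships between vertices and their contributions to the matching process. Finally, the paper uses the Sinkhorn algorithm to iteratively solve for the optimal matching and uses multi-GPU technology to accelerate training. Experiments show an average improvement of 3.8x to 40.3x in overall matching performance.

Link: https://arxiv.org/abs/2412.18221
Authors: Xianfeng Song, Yi Zou, Zheng Shi, Zheng Liu
Affiliation: Unknown
Keywords: Feature-based image matching, Graph Neural Networks, Feature-based image, computer vision, applications in computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: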

Abstract:Feature-based image matching has extensive applications in computer vision. Keypoints detected in images can be naturally represented as graph structures, and Graph Neural Networks (GNNs) have been shown to outperform traditional deep learning techniques. Consequently, the paradigm of image matching via GNNs has gained significant prominence in recent academic research. In this paper, we first introduce an innovative adaptive graph construction method that utilizes a filtering mechanism based on distance and dynamic threshold similarity. This method dynamically adjusts the criteria for incorporating new vertices based on the characteristics of existing vertices, allowing for the construction of more precise and robust graph structures while avoiding redundancy. We further combine the vertex processing capabilities of GNNs with the global awareness capabilities of Transformers to enhance the model’s representation of spatial and feature information within graph structures. This hybrid model provides a deeper understanding of the interrelationships between vertices and their contributions to the matching process. Additionally, we employ the Sinkhorn algorithm to iteratively solve for optimal matching results. Finally, we validate our system using extensive image datasets and conduct comprehensive comparative experiments. Experimental results demonstrate that our system achieves an average improvement of 3.8x-40.3x in overall matching performance. Additionally, the number of vertices and edges significantly impacts training efficiency and memory usage; therefore, we employ multi-GPU technology to accelerate the training process. Our code is available at this https URL.
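
The Sinkhorn step mentioned above can be illustrated with a small sketch: exponentiate a score matrix, then alternately normalize rows and columns until the result is close to doubly stochastic. This is the generic algorithm under simple assumptions (a square, dense score matrix, no unmatched "dustbin" entries); the function name `sinkhorn`, the `eps` temperature, and the example scores are hypothetical and not taken from the paper.

```python
import math


def sinkhorn(scores, iters=100, eps=1.0):
    """Alternately normalize rows and columns of exp(scores / eps).

    scores: n x n list of matching scores between keypoints of two images
            (hypothetical values; higher means more similar).
    Returns a soft assignment matrix whose rows and columns each sum to about 1.
    """
    n, m = len(scores), len(scores[0])
    P = [[math.exp(scores[i][j] / eps) for j in range(m)] for i in range(n)]
    for _ in range(iters):
        # Row step: make every row sum to 1.
        row_sums = [sum(row) for row in P]
        P = [[P[i][j] / row_sums[i] for j in range(m)] for i in range(n)]
        # Column step: make every column sum to 1.
        col_sums = [sum(P[i][j] for i in range(n)) for j in range(m)]
        P = [[P[i][j] / col_sums[j] for j in range(m)] for i in range(n)]
    return P


def hard_matches(P):
    """Pick, for each row, the column with the largest soft assignment."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in P]
```

A smaller `eps` sharpens the soft assignment toward a hard matching, at the cost of slower convergence; practical implementations usually work in the log domain for numerical stability.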

[CV-43] Adapter Merging with Centroid Prototype Mapping for Scalable Class-Incremental Learning

【速读】：该论文旨在解决类增量学习（Class-Incremental Learning, CIL）中的两个核心问题：灾难性遗忘（catastrophic forgetting）和可扩展性（scalability）。现有方法通常在推理时间和准确性之间进行权衡，而本文提出的 Adapter Merging with Centroid Prototype Mapping (ACMap) 框架通过将任务特定的适配器（adapters）合并为一个单一适配器，确保跨任务的恒定推理时间，同时不牺牲准确性。关键解决方案包括：1）使用适配器合并（adapter merging）构建共享子空间，以对齐任务表示并缓解遗忘；2）通过质心原型映射（centroid prototype mapping）在共享子空间中进行一致适应，以保持高准确性；3）采用早停策略（early stopping strategy）限制适配器合并，进一步提升可扩展性。实验结果表明，ACMap 在五个基准数据集上达到了最先进的准确性，同时保持了与现有最快方法相当的推理时间。

Link: https://arxiv.org/abs/2412.18219
Authors: Takuma Fukuda,Hiroshi Kera,Kazuhiko Kawamoto
Affiliation: Unknown
Keywords: Centroid Prototype Mapping, Centroid Prototype, Prototype Mapping, class-incremental learning, propose Adapter Merging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages (main text), 6 pages (supplementary material)

Abstract:We propose Adapter Merging with Centroid Prototype Mapping (ACMap), an exemplar-free framework for class-incremental learning (CIL) that addresses both catastrophic forgetting and scalability. While existing methods trade off inference time against accuracy, ACMap consolidates task-specific adapters into a single adapter, ensuring constant inference time across tasks without compromising accuracy. The framework employs adapter merging to build a shared subspace that aligns task representations and mitigates forgetting, while centroid prototype mapping maintains high accuracy through consistent adaptation in the shared subspace. To further improve scalability, an early stopping strategy limits adapter merging as tasks increase. Extensive experiments on five benchmark datasets demonstrate that ACMap matches state-of-the-art accuracy while maintaining inference time comparable to the fastest existing methods. The code is available at this https URL
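The two ideas in the abstract, merging adapters and classifying against centroid prototypes, can be illustrated with a small sketch. The abstract does not state the merging rule or the prototype mapping, so the uniform weight averaging and the nearest-centroid classifier below are assumptions chosen for illustration, and all names and values are hypothetical:

```python
def merge_adapters(adapters):
    """Average task-specific adapter weights element-wise (assumed merge rule)."""
    n = len(adapters)
    return [sum(ws) / n for ws in zip(*adapters)]


def centroid(features):
    """Class prototype as the mean feature vector of that class."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def classify(x, prototypes):
    """Return the label whose prototype is closest to x (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(x, prototypes[label]))


# Two hypothetical task adapters, flattened to weight vectors.
merged = merge_adapters([[1.0, 2.0], [3.0, 4.0]])

# Hypothetical features extracted with the merged adapter.
prototypes = {
    "cat": centroid([[0.0, 0.0], [2.0, 0.0]]),
    "dog": centroid([[10.0, 10.0], [12.0, 10.0]]),
}
label = classify([1.5, 0.5], prototypes)
```

Because only one merged adapter is kept, inference cost in this sketch does not grow with the number of tasks, which is the property the abstract emphasises.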

[CV-44] SDM-Car: A Dataset for Small and Dim Moving Vehicles Detection in Satellite Videos

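[Quick Read]: This paper addresses the poor detection of moving vehicles in satellite video when the vehicles have low radiation intensity and limited contrast against the background. Such dim vehicles are rarely annotated in existing datasets, so existing methods perform poorly under low radiation conditions. To address this, the paper builds the SDM-Car (Small and Dim Moving Cars) dataset, which comprises 99 high-quality videos collected by the Luojia 3-01 satellite and contains a large number of annotations for dim vehicles. It also proposes a method based on image enhancement and attention mechanisms to improve detection accuracy for dim vehicles, which serves as a benchmark for evaluating the dataset. Finally, the paper assesses several representative methods on SDM-Car and reports its findings.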

Link: https://arxiv.org/abs/2412.18214
Authors: Zhen Zhang,Tao Peng,Liang Liao,Jing Xiao,Mi Wang
Affiliation: Unknown
Keywords: remote sensing, dim vehicles, essential in remote, Luojia 3-01 satellite, low radiation conditions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 7 figures, IEEE Geoscience and Remote Sensing Letters

Abstract:Vehicle detection and tracking in satellite video is essential in remote sensing (RS) applications. However, a statistical analysis of existing datasets shows that dim vehicles with low radiation intensity and limited contrast against the background are rarely annotated, which leads to poor performance of existing approaches in detecting moving vehicles under low radiation conditions. In this paper, we address the challenge by building a Small and Dim Moving Cars (SDM-Car) dataset with a multitude of annotations for dim vehicles in satellite videos, which is collected by the Luojia 3-01 satellite and comprises 99 high-quality videos. Furthermore, we propose a method based on image enhancement and attention mechanisms to improve the detection accuracy of dim vehicles, serving as a benchmark for evaluating the dataset. Finally, we assess the performance of several representative methods on SDM-Car and present insightful findings. The dataset is openly available at this https URL.

[CV-45] BoxMAC – A Boxing Dataset for Multi-label Action Classification

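[Quick Read]: This paper addresses multi-label action classification of boxers' performance in boxing bouts, in particular the case where the two boxers may execute different punches within the same timestamp. The key to the solution is the BoxMAC dataset, which contains over 60,000 frames featuring 15 professional boxers, annotated with 13 distinct action labels with input from a boxing coach. The paper also proposes a novel architecture for jointly recognizing multiple actions in both individual images and videos, and investigates baselines based on deep neural networks for both tasks. The realism and diversity of BoxMAC make it a valuable resource for boxing performance analysis and model development.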

Link: https://arxiv.org/abs/2412.18204
Authors: Shashikanta Sahoo
Affiliation: Unknown
Keywords: competitive combat sports, delivered during bouts, competitive combat, statics is crucial, crucial for evaluating
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures

Abstract:In competitive combat sports like boxing, analyzing a boxer’s performance statistics is crucial for evaluating the quantity and variety of punches delivered during bouts. These statistics provide valuable data and feedback, which are routinely used for coaching and performance enhancement. We introduce BoxMAC, a real-world boxing dataset featuring 15 professional boxers and encompassing 13 distinct action labels. Comprising over 60,000 frames, our dataset has been meticulously annotated for multiple actions per frame with inputs from a boxing coach. Since two boxers can execute different punches within a single timestamp, this problem falls under the domain of multi-label action classification. We propose a novel architecture for jointly recognizing multiple actions in both individual images and videos. We investigate baselines using deep neural network architectures to address both tasks. We believe that BoxMAC will enable researchers and practitioners to develop and evaluate more efficient models for performance analysis. With its realistic and diverse nature, BoxMAC can serve as a valuable resource for the advancement of boxing as a sport.
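Multi-label classification differs from the usual single-label setup in that each label gets its own independent probability, so several actions can be predicted for one frame. A minimal sketch of that decision rule, assuming per-label logits from some model (the label names, logits, and the 0.5 threshold are hypothetical; the abstract does not list the 13 labels):

```python
import math


def predict_labels(logits, label_names, threshold=0.5):
    """Apply an independent sigmoid per label and keep those above threshold."""
    probs = [1.0 / (1.0 + math.exp(-v)) for v in logits]
    return [name for name, p in zip(label_names, probs) if p >= threshold]


# Hypothetical labels and logits for a single frame.
label_names = ["jab", "cross", "hook"]
frame_labels = predict_labels([2.0, -1.0, 0.3], label_names)
```

With a softmax, the labels would compete and only one could win; the per-label sigmoid is what allows two boxers' different punches to be reported for the same frame.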

[CV-46] Leveraging Deep Learning with Multi-Head Attention for Accurate Extraction of Medicine from Handwritten Prescriptions

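[Quick Read]: This paper addresses the difficulty of extracting medicine names from handwritten doctor prescriptions, which stems mainly from the wide variability in handwriting styles and prescription formats. The key to the solution is the combination of Mask R-CNN with Transformer-based Optical Character Recognition (TrOCR), together with Multi-Head Attention and Positional Embeddings. Specifically, Mask R-CNN segments the prescription image to focus on the medicinal sections, while TrOCR, enhanced by Multi-Head Attention and Positional Embeddings, transcribes the isolated text. The transcribed text is then matched against a pre-existing database for accurate identification of the medicine name. The method achieves a character error rate (CER) of 1.4% on standard benchmarks, indicating its reliability and efficiency as a tool for automated medicine name extraction.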

Link: https://arxiv.org/abs/2412.18199
Authors: Usman Ali,Sahil Ranmbail,Muhammad Nadeem,Hamid Ishfaq,Muhammad Umer Ramzan,Waqas Ali
Affiliation: Unknown
Keywords: Optical Character Recognition, Transformer-based Optical Character, Positional Embeddings, Attention and Positional, handwritten doctor prescriptions
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Extracting medication names from handwritten doctor prescriptions is challenging due to the wide variability in handwriting styles and prescription formats. This paper presents a robust method for extracting medicine names using a combination of Mask R-CNN and Transformer-based Optical Character Recognition (TrOCR) with Multi-Head Attention and Positional Embeddings. A novel dataset, featuring diverse handwritten prescriptions from various regions of Pakistan, was utilized to fine-tune the model on different handwriting styles. The Mask R-CNN model segments the prescription images to focus on the medicinal sections, while the TrOCR model, enhanced by Multi-Head Attention and Positional Embeddings, transcribes the isolated text. The transcribed text is then matched against a pre-existing database for accurate identification. The proposed approach achieved a character error rate (CER) of 1.4% on standard benchmarks, highlighting its potential as a reliable and efficient tool for automating medicine name extraction.
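Two steps in the abstract lend themselves to a short illustration: the character error rate (CER) metric and the matching of transcribed text against a database. The sketch below assumes CER is the Levenshtein distance divided by the reference length (the usual definition) and uses the standard-library `difflib.get_close_matches` for the lookup; the database contents and the 0.6 cutoff are hypothetical, and the paper's actual matching procedure is not described in the abstract:

```python
import difflib


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalised by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def match_medicine(transcribed, database, cutoff=0.6):
    """Return the closest database entry, or None if nothing is similar enough."""
    hits = difflib.get_close_matches(transcribed.lower(), database, n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Hypothetical medicine database and OCR output.
database = ["paracetamol", "ibuprofen", "amoxicillin"]
error_rate = cer("paracetamol", "paracetmol")    # one missing character
best = match_medicine("Paracetmol", database)
```

The database lookup acts as a correction layer: a transcription with a small character error can still resolve to the right medicine name.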

[CV-47] TextMatch: Enhancing Image-Text Consistency Through Multimodal Optimization

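[Quick Read]: This paper addresses the alignment and consistency between generated images and input text in text-to-image (T2I) generative models. Existing generative models are good at creating images from text but still struggle to keep the output semantically consistent with the prompt. The paper proposes the TextMatch framework, whose key idea is to reduce image-text discrepancies through multimodal optimization. TextMatch uses large language models (LLMs) and visual question-answering (VQA) models to score the semantic consistency between generated images and prompts. By combining multimodal in-context learning with chain of thought reasoning, the method dynamically refines prompts so that the generated images better capture user intent, improving fidelity and relevance. Experiments show that TextMatch significantly improves text-image consistency across multiple benchmarks, providing a reliable framework for further development of T2I generative models.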

Link: https://arxiv.org/abs/2412.18185
Authors: Yucong Luo,Mingyue Cheng,Jie Ouyang,Xiaoyu Tao,Qi Liu
Affiliation: Unknown
Keywords: generative models excel, excel in creating, text but struggle, struggle with ensuring, ensuring alignment
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-image generative models excel in creating images from text but struggle with ensuring alignment and consistency between outputs and prompts. This paper introduces TextMatch, a novel framework that leverages multimodal optimization to address image-text discrepancies in text-to-image (T2I) generation and editing. TextMatch employs a scoring strategy powered by large language models (LLMs) and visual question-answering (VQA) models to evaluate semantic consistency between prompts and generated images. By integrating multimodal in-context learning and chain of thought reasoning, our method dynamically refines prompts through iterative optimization. This process ensures that the generated images better capture user intent, resulting in higher fidelity and relevance. Extensive experiments demonstrate that TextMatch significantly improves text-image consistency across multiple benchmarks, establishing a reliable framework for advancing the capabilities of text-to-image generative models. Our code is available at this https URL.

[CV-48] VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis

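[Quick Read]: This paper addresses the high computational cost of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on high-resolution images. CNNs excel at multi-scale feature extraction and ViTs capture global dependencies effectively, but both incur high compute and memory overhead at high resolution. The paper proposes VisionGRU, a new architecture based on recurrent neural networks (RNNs), with three key elements: 1) a simplified Gated Recurrent Unit (minGRU) that processes large-scale image features with linear complexity; 2) division of the image into small patches, with the sequence length progressively reduced and the channel depth increased to facilitate multi-scale feature extraction; 3) a hierarchical 2DGRU module with bidirectional scanning that captures both local and global context and improves long-range dependency modeling, particularly for tasks such as semantic segmentation. Experiments on ImageNet and ADE20K show that VisionGRU outperforms ViTs while significantly reducing memory usage and computational cost, especially for high-resolution images. These findings highlight the potential of RNN-based approaches for efficient and scalable computer vision.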

Link: https://arxiv.org/abs/2412.18178
Authors: Shicheng Yin,Kaixuan Yin,Weixing Chen,Enbo Huang,Yang Liu
Affiliation: Unknown
Keywords: Convolutional Neural Networks, Convolutional Neural, recurrent neural networks, Neural Networks, Vision Transformers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Codes will be available at this https URL

Abstract:Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two dominant models for image analysis. While CNNs excel at extracting multi-scale features and ViTs effectively capture global dependencies, both suffer from high computational costs, particularly when processing high-resolution images. Recently, state-space models (SSMs) and recurrent neural networks (RNNs) have attracted attention due to their efficiency. However, their performance in image classification tasks remains limited. To address these challenges, this paper introduces VisionGRU, a novel RNN-based architecture designed for efficient image classification. VisionGRU leverages a simplified Gated Recurrent Unit (minGRU) to process large-scale image features with linear complexity. It divides images into smaller patches and progressively reduces the sequence length while increasing the channel depth, thus facilitating multi-scale feature extraction. A hierarchical 2DGRU module with bidirectional scanning captures both local and global contexts, improving long-range dependency modeling, particularly for tasks like semantic segmentation. Experimental results on the ImageNet and ADE20K datasets demonstrate that VisionGRU outperforms ViTs, significantly reducing memory usage and computational costs, especially for high-resolution images. These findings underscore the potential of RNN-based approaches for developing efficient and scalable computer vision solutions. Codes will be available at this https URL.
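The linear complexity claimed for minGRU comes from a recurrence whose gate and candidate state depend only on the current input, not on the previous hidden state. A minimal sketch of that recurrence as it is commonly stated for minGRU, with biases omitted and a plain sequential loop instead of a parallel scan (the weight matrices and inputs are hypothetical, and VisionGRU's own 2DGRU block is not reproduced here):

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def min_gru(xs, w_z, w_h, h0):
    """Run h_t = (1 - z_t) * h_{t-1} + z_t * h~_t over a sequence.

    z_t = sigmoid(W_z x_t) and h~_t = W_h x_t use only the current input.
    """
    h = list(h0)
    states = []
    for x in xs:
        z = [sigmoid(v) for v in linear(w_z, x)]
        h_tilde = linear(w_h, x)
        h = [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, h_tilde)]
        states.append(h)
    return states


# Hypothetical 2-dimensional example: zero gate weights give z = 0.5 everywhere.
w_z = [[0.0, 0.0], [0.0, 0.0]]
w_h = [[1.0, 0.0], [0.0, 1.0]]
states = min_gru([[1.0, 2.0], [3.0, 4.0]], w_z, w_h, h0=[0.0, 0.0])
```

Each step costs a fixed amount of work, so the total cost grows linearly with the number of patches, in contrast to the quadratic cost of full self-attention.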

[CV-49] Enhancing Online Continual Learning with Plug-and-Play State Space Model and Class-Conditional Mixture of Discretization

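[Quick Read]: This paper addresses insufficient model adaptability in online continual learning (OCL). Existing methods mainly rely on replay and strengthen memory retention through regularization or distillation, but they often overlook the adaptability of the model, which limits its ability to learn generalizable and discriminative features incrementally from online training data. The paper proposes S6MOD, a plug-and-play module that can be integrated into most existing methods and directly improves adaptability. S6MOD adds an extra branch after the backbone in which a mixture of discretization selectively adjusts the parameters of a selective state space model, enriching the selective scan patterns so that the model can adaptively choose the discretization method most sensitive to the current dynamics. The paper also designs a class-conditional routing algorithm for dynamic, uncertainty-based adjustment and implements a contrastive discretization loss to optimize it. Experiments show that S6MOD significantly enhances model adaptability, yields substantial performance gains, and achieves state-of-the-art results.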

Link: https://arxiv.org/abs/2412.18177
Authors: Sihao Liu,Yibo Yang,Xiaojie Li,David A. Clifton,Bernard Ghanem
Affiliation: Unknown
Keywords: previously learned tasks, Online continual learning, continual learning, learned tasks, retaining knowledge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Online continual learning (OCL) seeks to learn new tasks from data streams that appear only once, while retaining knowledge of previously learned tasks. Most existing methods rely on replay, focusing on enhancing memory retention through regularization or distillation. However, they often overlook the adaptability of the model, limiting the ability to learn generalizable and discriminative features incrementally from online training data. To address this, we introduce a plug-and-play module, S6MOD, which can be integrated into most existing methods and directly improve adaptability. Specifically, S6MOD introduces an extra branch after the backbone, where a mixture of discretization selectively adjusts parameters in a selective state space model, enriching selective scan patterns such that the model can adaptively select the most sensitive discretization method for current dynamics. We further design a class-conditional routing algorithm for dynamic, uncertainty-based adjustment and implement a contrastive discretization loss to optimize it. Extensive experiments combining our module with various models demonstrate that S6MOD significantly enhances model adaptability, leading to substantial performance gains and achieving the state-of-the-art results.

[CV-50] Parallel Neural Computing for Scene Understanding from LiDAR Perception in Autonomous Racing

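[Quick Read]: This paper addresses real-time perception and decision-making for autonomous driving in high-speed racing, where the track environment changes rapidly and traditional sequential network approaches struggle to meet the real-time knowledge and decision-making demands of a fast-moving autonomous agent. The paper proposes a novel baseline architecture, the Parallel Perception Network (PPN), whose key idea is to use hardware-enabled parallelism to reach neural processing speeds that match the agent's high velocity. PPN consists of two independent neural networks, a segmentation network and a reconstruction network, which run in parallel on separate accelerated hardware. The model takes raw 3D point cloud data from the LiDAR sensor as input and converts it into a 2D Bird's Eye View Map on both devices. Each network independently extracts input features along the space and time dimensions and produces its output in parallel. Trained on a system with two NVIDIA T4 GPUs using a combination of loss functions, including edge preservation, the model achieves a 2x speedup in inference time compared to a sequential configuration.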

Link: https://arxiv.org/abs/2412.18165
Authors: Suwesh Prasad Sah
Affiliation: Unknown
Keywords: presents significant challenges, scene understanding due, urban environments, track environment, high-speed racing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE/ISED 2024

Abstract:Autonomous driving in high-speed racing, as opposed to urban environments, presents significant challenges in scene understanding due to rapid changes in the track environment. Traditional sequential network approaches may struggle to meet the real-time knowledge and decision-making demands of an autonomous agent covering large displacements in a short time. This paper proposes a novel baseline architecture for developing sophisticated models capable of true hardware-enabled parallelism, achieving neural processing speeds that mirror the agent’s high velocity. The proposed model, the Parallel Perception Network (PPN), consists of two independent neural networks, a segmentation network and a reconstruction network, running in parallel on separate accelerated hardware. The model takes raw 3D point cloud data from the LiDAR sensor as input and converts it into a 2D Bird’s Eye View Map on both devices. Each network independently extracts its input features along space and time dimensions and produces outputs in parallel. The model is trained on a system with two NVIDIA T4 GPUs, using a combination of loss functions, including edge preservation, and demonstrates a 2x speedup in model inference time compared to a sequential configuration. Implementation is available at: this https URL. Learned parameters of the trained networks are provided at: this https URL.
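The conversion from a raw 3D point cloud to a 2D Bird's Eye View map can be sketched as a simple binning of points onto a ground-plane grid. The version below stores a point count per cell; the grid extents, cell size, and occupancy encoding are assumptions for illustration, since the abstract does not specify the BEV representation used by PPN:

```python
def points_to_bev(points, x_range, y_range, cell_size):
    """Bin (x, y, z) points into a 2D grid of per-cell point counts."""
    n_cols = int((x_range[1] - x_range[0]) / cell_size)
    n_rows = int((y_range[1] - y_range[0]) / cell_size)
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y, _z in points:
        # Drop points that fall outside the region of interest.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        col = min(int((x - x_range[0]) / cell_size), n_cols - 1)
        row = min(int((y - y_range[0]) / cell_size), n_rows - 1)
        grid[row][col] += 1
    return grid


# Hypothetical LiDAR points in metres; the last one lies outside the grid.
points = [(0.5, 0.5, 0.1), (0.7, 0.2, 0.3), (1.5, 0.5, 0.0), (5.0, 5.0, 0.0)]
bev = points_to_bev(points, x_range=(0.0, 2.0), y_range=(0.0, 2.0), cell_size=1.0)
```

Collapsing the height axis turns an unordered point set into a fixed-size image, which is what lets ordinary 2D networks consume it.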

[CV-51] Semantics Disentanglement and Composition for Versatile Codec toward both Human-eye Perception and Machine Vision Task

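[Quick Read]: This paper addresses the fact that existing learned image compression methods are usually specialized for only one domain, either human visual perception or machine vision tasks. This limits their versatility and generalizability across scenarios and requires retraining for new applications, which adds complexity and cost in practice. The paper proposes a semantics DISentanglement and COmposition VERsatile codec (DISCOVER). Multimodal large models derive a set of labels for each task, and grounding models are then applied for precise localization, giving a comprehensive understanding and disentanglement of image components at the encoder side. At the decoding stage, the image is reconstructed from these encoded components together with priors from generative models, optimizing performance for both human visual perception and machine analysis tasks. The key to the approach is semantic disentanglement and composition, which improves compression performance for human and machine vision at the same time; experiments confirm its robustness and effectiveness.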

Link: https://arxiv.org/abs/2412.18158
Authors: Jinming Liu,Yuntao Wei,Junyan Lin,Shengyang Zhao,Heming Sun,Zhibo Chen,Wenjun Zeng,Xin Jin
Affiliation: Unknown
Keywords: achieved impressive results, learned image compression, image compression methods, machine vision tasks, compression methods
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:While learned image compression methods have achieved impressive results in either human visual perception or machine vision tasks, they are often specialized only for one domain. This drawback limits their versatility and generalizability across scenarios and also requires retraining to adapt to new applications, a process that adds significant complexity and cost in real-world scenarios. In this study, we introduce an innovative semantics DISentanglement and COmposition VERsatile codec (DISCOVER) to simultaneously enhance human-eye perception and machine vision tasks. The approach derives a set of labels per task through multimodal large models, to which grounding models are then applied for precise localization, enabling a comprehensive understanding and disentanglement of image components at the encoder side. At the decoding stage, a comprehensive reconstruction of the image is achieved by leveraging these encoded components alongside priors from generative models, thereby optimizing performance for both human visual perception and machine-based analytical tasks. Extensive experimental evaluations substantiate the robustness and effectiveness of DISCOVER, demonstrating superior performance in fulfilling the dual objectives of human and machine vision requirements.

[CV-52] DepthLab: From Partial to Complete

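[Quick Read]: This paper addresses the missing values that are common in depth data and arise from causes such as incomplete data acquisition and perspective alteration. It proposes DepthLab, a foundation depth inpainting model whose key idea is to use image diffusion priors for depth completion. The model has two notable strengths: first, it is resilient to depth-deficient regions and provides reliable completion for both continuous areas and isolated points; second, when filling in missing values it faithfully preserves scale consistency with the known depth it is conditioned on. Building on these strengths, DepthLab performs well on a range of downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion, exceeding existing solutions in both numerical performance and visual quality.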

Link: https://arxiv.org/abs/2412.18153
Authors: Zhiheng Liu,Ka Leong Cheng,Qiuyu Wang,Shuzhe Wang,Hao Ouyang,Bin Tan,Kai Zhu,Yujun Shen,Qifeng Chen,Ping Luo
Affiliation: Unknown
Keywords: incomplete data acquisition, range of applications, perspective alteration, incomplete data, data acquisition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page and code: this https URL

Abstract:Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it demonstrates resilience to depth-deficient regions, providing reliable completion for both continuous areas and isolated points, and (2) it faithfully preserves scale consistency with the conditioned known depth when filling in missing values. Drawing on these advantages, our approach proves its worth in various downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion, exceeding current solutions in both numerical performance and visual quality. Our project page with source code is available at this https URL.

[CV-53] EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation

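[Quick Read]: This paper addresses two main problems in the evaluation of text-to-image (T2I) generation models: existing small datasets limit the performance comparison of automated metrics, and those datasets cannot assess automated metrics at a fine-grained level. The paper proposes the EvalMuse-40K benchmark, which contains 40K image-text pairs with fine-grained human annotations and uses strategies such as balanced prompt sampling and data re-annotation to ensure diversity and reliability. It also introduces two new evaluation methods: FGA-BLIP2, which fine-tunes a vision-language model end to end to produce fine-grained image-text alignment scores, and PN-VQA, which uses a positive-negative visual question answering (VQA) scheme for zero-shot fine-grained evaluation. Both methods perform well in image-text alignment evaluation and are used to rank current generative AI models, providing a reference for future work on T2I generation.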

Link: https://arxiv.org/abs/2412.18150
Authors: Shuhao Han,Haotian Fan,Jiachen Fu,Liang Li,Tao Li,Junhui Cui,Yunqiu Wang,Yang Tai,Jingwei Sun,Chunle Guo,Chongyi Li
Affiliation: Unknown
Keywords: achieved significant advancements, significant advancements, image-text alignment, achieved significant, Recently
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, Text-to-Image (T2I) generation models have achieved significant advancements. Correspondingly, many automated metrics have emerged to evaluate the image-text alignment capabilities of generative models. However, the performance comparison among these automated metrics is limited by existing small datasets. Additionally, these datasets lack the capacity to assess the performance of automated metrics at a fine-grained level. In this study, we contribute an EvalMuse-40K benchmark, gathering 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. In the construction process, we employ various strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of our benchmark. This allows us to comprehensively evaluate the effectiveness of image-text alignment metrics for T2I models. Meanwhile, we introduce two new methods to evaluate the image-text alignment capabilities of T2I models: FGA-BLIP2 which involves end-to-end fine-tuning of a vision-language model to produce fine-grained image-text alignment scores and PN-VQA which adopts a novel positive-negative VQA manner in VQA models for zero-shot fine-grained evaluation. Both methods achieve impressive performance in image-text alignment evaluations. We also use our methods to rank current AIGC models, in which the results can serve as a reference source for future study and promote the development of T2I generation. The data and code will be made publicly available.

[CV-54] Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction

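[Quick Read]: This paper addresses the difficulty existing text-to-image (T2I) personalization methods have in generating face images that both align closely with a given text caption and keep a consistent identity. Existing methods either require test-time fine-tuning or fail to generate images that align well with the caption. The paper proposes Dense-Face, a new T2I personalization diffusion model. Its key components are a pose-controllable adapter, which enables high-fidelity image generation while preserving the text-based editing ability of the pre-trained Stable Diffusion (SD) model, and the use of internal features of the SD UNet to predict dense face annotations, which gives the model domain knowledge in face generation. Experiments show that the method achieves state-of-the-art or competitive generation performance in image-text alignment, identity preservation, and pose control.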

Link: https://arxiv.org/abs/2412.18149
Authors: Xiao Guo,Manh Tran,Jiaxin Cheng,Xiaoming Liu
Affiliation: Unknown
Keywords: input text caption, user input text, personalization diffusion model, text caption, concept based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 figures, 5 tables

Abstract:The text-to-image (T2I) personalization diffusion model can generate images of the novel concept based on the user input text caption. However, existing T2I personalized methods either require test-time fine-tuning or fail to generate images that align well with the given text caption. In this work, we propose a new T2I personalization diffusion model, Dense-Face, which can generate face images with a consistent identity as the given reference subject and align well with the text caption. Specifically, we introduce a pose-controllable adapter for the high-fidelity image generation while maintaining the text-based editing ability of the pre-trained stable diffusion (SD). Additionally, we use internal features of the SD UNet to predict dense face annotations, enabling the proposed method to gain domain knowledge in face generation. Empirically, our method achieves state-of-the-art or competitive generation performance in image-text alignment, identity preservation, and pose control.

[CV-55] Accelerating Post-Tornado Disaster Assessment Using Advanced Deep Learning Models

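[Quick Read]: This paper addresses the automation of post-disaster assessment of buildings and infrastructure, with the aim of improving recovery efficiency and long-term resilience planning. The key to the solution is a system built on advanced deep learning models, specifically computer vision techniques (YOLOv11 and ResNet50), that rapidly analyzes images and videos from disaster sites and extracts critical information about building characteristics, including the damage level of structural components and the extent of damage. Experimental results show that ResNet50 reaches 90.28% accuracy on multiclass damage classification with an inference time of 1529ms per image. The study offers the disaster management field a scalable, efficient, and objective tool for post-disaster analysis that could change how communities and authorities respond to and learn from catastrophic events.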

Link: https://arxiv.org/abs/2412.18147
Authors: Robinson Umeike,Thang Dao,Shane Crawford
Affiliation: Unknown
Keywords: long-term resilience planning, automating post-disaster assessments, resilience planning, Post-disaster assessments, infrastructure are crucial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 3 pages, 4 Figures, 1 Table

Abstract:Post-disaster assessments of buildings and infrastructure are crucial for both immediate recovery efforts and long-term resilience planning. This research introduces an innovative approach to automating post-disaster assessments through advanced deep learning models. Our proposed system employs state-of-the-art computer vision techniques (YOLOv11 and ResNet50) to rapidly analyze images and videos from disaster sites, extracting critical information about building characteristics, including damage level of structural components and the extent of damage. Our experimental results show promising performance, with ResNet50 achieving 90.28% accuracy and an inference time of 1529ms per image on multiclass damage classification. This study contributes to the field of disaster management by offering a scalable, efficient, and objective tool for post-disaster analysis, potentially capable of transforming how communities and authorities respond to and learn from catastrophic events.

[CV-56] ERVD: An Efficient and Robust ViT-Based Distillation Framework for Remote Sensing Image Retrieval

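[Quick Read]: This paper addresses efficiency and robustness in remote sensing image retrieval. Its core solution is ERVD, an efficient and robust distillation framework based on the Vision Transformer (ViT). The framework uses distillation to transfer the knowledge of a complex ViT model to a more lightweight model, improving computational efficiency and robustness while maintaining high retrieval accuracy.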

Link: https://arxiv.org/abs/2412.18136
Authors: Le Dong,Qixuan Cao,Lei Pu,Fangfang Wu,Weisheng Dong,Xin Li,Guangming Shi
Affiliation: Unknown
Keywords: Sensing Image Retrieval, Remote Sensing Image, Robust ViT-Based Distillation, ViT-Based Distillation Framework, Image Retrieval
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:ERVD: An Efficient and Robust ViT-Based Distillation Framework for Remote Sensing Image Retrieval

[CV-57] UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision

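[Quick Read]: This paper addresses unified learning over multimodal data (point clouds, images, and text) for open-world 3D scene understanding. The framework uses the image modality as a bridge to co-embed point clouds with pre-aligned images and text in a shared feature space, which avoids the need for carefully crafted point cloud text pairs. To achieve multi-modal alignment, it proposes two key strategies: logit and feature distillation modules between images and point clouds, and a vision-point matching module that explicitly corrects the misalignment caused by point-to-pixel projection. To further improve performance, it adopts four task-specific losses and a two-stage training strategy. Experiments show that the method outperforms state-of-the-art methods on semantic segmentation by an average of 15.6% on Base-Annotated tasks and 14.8% on Annotation-Free tasks.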

Link: https://arxiv.org/abs/2412.18131
Authors: Yuru Wang,Songtao Wang,Zehan Zhang,Xinyan Lu,Changwei Cai,Hao Li,Fu Liu,Peng Jia,Xianpeng Lang
Affiliation: Unknown
Keywords: single learning paradigm, unifies point clouds, point cloud text, scene understanding, cloud text pairs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present UniPLV, a powerful framework that unifies point clouds, images and text in a single learning paradigm for open-world 3D scene understanding. UniPLV employs the image modality as a bridge to co-embed 3D points with pre-aligned images and text in a shared feature space without requiring carefully crafted point cloud text pairs. To accomplish multi-modal alignment, we propose two key strategies: (i) logit and feature distillation modules between images and point clouds, and (ii) a vision-point matching module that explicitly corrects the misalignment caused by point-to-pixel projection. To further improve the performance of our unified framework, we adopt four task-specific losses and a two-stage training strategy. Extensive experiments show that our method outperforms the state-of-the-art methods by an average of 15.6% and 14.8% for semantic segmentation over Base-Annotated and Annotation-Free tasks, respectively. The code will be released later.
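The point-to-pixel projection mentioned in the abstract is what pairs each 3D point with an image location, and small errors in it are the source of the misalignment the matching module corrects. A minimal sketch of the projection itself under a standard pinhole camera model, assuming points are already in the camera frame (the intrinsics `fx`, `fy`, `cx`, `cy` and the image size are hypothetical; the paper's calibration and matching module are not shown):

```python
import math


def project_point(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point to continuous pixel coordinates (u, v)."""
    x, y, z = point
    if z <= 0:
        return None  # behind the camera, no valid projection
    return (fx * x / z + cx, fy * y / z + cy)


def pair_points_with_pixels(points, fx, fy, cx, cy, width, height):
    """Return (point_index, row, col) for points landing inside the image."""
    pairs = []
    for idx, point in enumerate(points):
        uv = project_point(point, fx, fy, cx, cy)
        if uv is None:
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        if 0 <= col < width and 0 <= row < height:
            pairs.append((idx, row, col))
    return pairs


# Hypothetical points: one visible, one behind the camera, one off-image.
points = [(1.0, 2.0, 4.0), (0.0, 0.0, -1.0), (100.0, 0.0, 1.0)]
pairs = pair_points_with_pixels(points, fx=100.0, fy=100.0, cx=50.0, cy=60.0,
                                width=200, height=200)
```

Each resulting pair tells the framework which image feature a given point can be supervised by, so pixel-level image and text knowledge can be transferred to the point cloud branch.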

[CV-58] VisionLLM-based Multimodal Fusion Network for Glottic Carcinoma Early Detection

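[Quick Read]: This paper addresses the suboptimal accuracy of early glottic carcinoma detection, which is caused by the morphological similarity between glottic carcinoma and vocal cord dysplasia. The authors propose MMGC-Net, a vision large language model-based (VisionLLM-based) multimodal fusion network that integrates image and text modalities to capture complementary information and improve the accuracy and robustness of detection. The key to the solution is to extract vision embeddings with an image encoder and a Q-Former, obtain text embeddings with the Large Language Model Meta AI (Llama3), and then integrate image and text features through a laryngeal feature fusion block, which improves glottic carcinoma identification performance.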

Link: https://arxiv.org/abs/2412.18124
Authors: Zhaohui Jin,Yi Shuai,Yongcheng Li,Lingcong Cai,Yun Li,Huifen Liu,Xiaomao Fan
Affiliation: Unknown
Keywords: enables timely intervention, preserves vocal function, improving patient outcomes, glottic carcinoma, patient outcomes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The early detection of glottic carcinoma is critical for improving patient outcomes, as it enables timely intervention, preserves vocal function, and significantly reduces the risk of tumor progression and metastasis. However, the similarity in morphology between glottic carcinoma and vocal cord dysplasia results in suboptimal detection accuracy. To address this issue, we propose a vision large language model-based (VisionLLM-based) multimodal fusion network for glottic carcinoma detection, known as MMGC-Net. By integrating image and text modalities, multimodal models can capture complementary information, leading to more accurate and robust predictions. In this paper, we collect a private real glottic carcinoma dataset named SYSU1H from the First Affiliated Hospital of Sun Yat-sen University, with 5,799 image-text pairs. We leverage an image encoder and additional Q-Former to extract vision embeddings and the Large Language Model Meta AI (Llama3) to obtain text embeddings. These modalities are then integrated through a laryngeal feature fusion block, enabling a comprehensive integration of image and text features, thereby improving the glottic carcinoma identification performance. Extensive experiments on the SYSU1H dataset demonstrate that MMGC-Net can achieve state-of-the-art performance, which is superior to previous multimodal models.

[CV-59] Spectrum-oriented Point-supervised Saliency Detector for Hyperspectral Images

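[Quick Read]: This paper addresses the difficulty of obtaining pixel-level annotations for hyperspectral salient object detection (HSOD). Existing deep learning methods can achieve good detection results but generally depend on pixel-level annotations, which are hard to acquire for hyperspectral images. The paper introduces point supervision into HSOD and incorporates Spectral Saliency as a key spectral representation in the framework, leading to a new Spectrum-oriented Point-supervised Saliency Detector (SPSD). At its core is a pseudo-label generation pipeline designed specifically for hyperspectral images, which mitigates the performance decline caused by point supervision. Spectral Saliency is also used to counteract information loss during model supervision and saliency refinement, preserving the structural integrity and edge accuracy of detected objects. In addition, a Spectrum-transformed Spatial Gate focuses more precisely on salient regions while reducing feature redundancy. Comprehensive experiments on the HSOD-BIT and HS-SOD datasets validate the method, and detailed ablation studies verify the contribution of each module.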

Link: https://arxiv.org/abs/2412.18112
Authors: Peifu Liu,Tingfa Xu,Guokai Shi,Jingxuan Xu,Huan Chen,Jianan Li
Affiliation: Unknown
Keywords: hyperspectral images, aims to extract, Point-supervised Saliency Detector, extract targets, significantly different spectra
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TIM. Code: this https URL

Abstract:Hyperspectral salient object detection (HSOD) aims to extract targets or regions with significantly different spectra from hyperspectral images. While existing deep learning-based methods can achieve good detection results, they generally necessitate pixel-level annotations, which are notably challenging to acquire for hyperspectral images. To address this issue, we introduce point supervision into HSOD, and incorporate Spectral Saliency, derived from conventional HSOD methods, as a pivotal spectral representation within the framework. This integration leads to the development of a novel Spectrum-oriented Point-supervised Saliency Detector (SPSD). Specifically, we propose a novel pipeline, specifically designed for HSIs, to generate pseudo-labels, effectively mitigating the performance decline associated with point supervision strategy. Additionally, Spectral Saliency is employed to counteract information loss during model supervision and saliency refinement, thereby maintaining the structural integrity and edge accuracy of the detected objects. Furthermore, we introduce a Spectrum-transformed Spatial Gate to focus more precisely on salient regions while reducing feature redundancy. We have carried out comprehensive experiments on both HSOD-BIT and HS-SOD datasets to validate the efficacy of our proposed method, using mean absolute error (MAE), E-measure, F-measure, Area Under Curve, and Cross Correlation as evaluation metrics. For instance, on the HSOD-BIT dataset, our SPSD achieves a MAE of 0.031 and an F-measure of 0.878. Thorough ablation studies have substantiated the effectiveness of each individual module and provided insights into the model’s working mechanism. Further evaluations on RGB-thermal salient object detection datasets highlight the versatility of our approach.
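Two of the reported metrics, MAE and F-measure, are straightforward to compute from a predicted saliency map and a binary ground-truth mask. The sketch below assumes the convention common in salient object detection of weighting precision with beta squared equal to 0.3 and binarising at a fixed threshold of 0.5; whether the paper uses exactly these settings is an assumption, and the small maps are made up for illustration:

```python
def mean_absolute_error(pred, gt):
    """Average absolute difference between saliency map and ground truth."""
    n = sum(len(row) for row in gt)
    total = sum(abs(p - g) for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return total / n


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall after thresholding."""
    tp = fp = fn = 0
    for pr, gr in zip(pred, gt):
        for p, g in zip(pr, gr):
            positive = p >= threshold
            if positive and g == 1:
                tp += 1
            elif positive:
                fp += 1
            elif g == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical 2x2 saliency prediction and binary ground truth.
pred = [[0.9, 0.2], [0.6, 0.1]]
gt = [[1, 0], [0, 0]]
mae = mean_absolute_error(pred, gt)
f_score = f_measure(pred, gt)
```

Lower MAE and higher F-measure are better, which matches the direction of the figures quoted in the abstract (MAE of 0.031, F-measure of 0.878).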

[CV-60] Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach

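[Quick Read]: This paper examines how Multimodal Large Language Models (MLLMs) can effectively interpret and process visual content even though the underlying language models were initially trained only on linguistic data. Through a systematic study across 4 model families and 4 model scales, it uncovers a distinct class of attention heads that focus specifically on visual content. The analysis shows a strong correlation between the behavior of these heads, the distribution of their attention weights, and their concentration on visual tokens in the input. The findings deepen our understanding of how LLMs adapt to multimodal tasks, demonstrate their potential to bridge textual and visual understanding, and lay the groundwork for AI systems that handle diverse modalities.

Link: https://arxiv.org/abs/2412.18108
Authors: Jing Bi,Junjia Guo,Yunlong Tang,Lianggong Bruce Wen,Zhang Liu,Chenliang Xu
Affiliations: unknown
Keywords: Large Language Models, Multimodal Large Language, demonstrated remarkable progress, Recent advancements, Large Language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: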

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This impressive leap raises a compelling question: how can language models, initially trained solely on linguistic data, effectively interpret and process visual content? This paper aims to address this question with systematic investigation across 4 model families and 4 model scales, uncovering a unique class of attention heads that focus specifically on visual content. Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input. These findings enhance our understanding of how LLMs adapt to multimodal tasks, demonstrating their potential to bridge the gap between textual and visual understanding. This work paves the way for the development of AI systems capable of engaging with diverse modalities.
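
One plausible way to quantify a head's "concentration on visual tokens" is the share of its attention mass that lands on visual positions. The sketch below is illustrative only: the function names, the boolean `is_visual` mask, and the toy attention rows are assumptions, and the abstract does not say which statistic the authors actually use.

```python
from typing import List, Sequence

def visual_attention_share(attn_row: Sequence[float], is_visual: Sequence[bool]) -> float:
    """Fraction of one attention row's mass that lands on visual tokens."""
    total = sum(attn_row)
    if total == 0:
        return 0.0
    visual = sum(w for w, v in zip(attn_row, is_visual) if v)
    return visual / total

def rank_heads(head_rows: List[Sequence[float]], is_visual: Sequence[bool]) -> List[int]:
    """Head indices sorted from most to least focused on visual tokens."""
    shares = [visual_attention_share(row, is_visual) for row in head_rows]
    return sorted(range(len(shares)), key=lambda h: shares[h], reverse=True)

# Toy input (hypothetical): 3 heads over 5 tokens, the first 3 of which are visual.
is_visual = [True, True, True, False, False]
heads = [
    [0.1, 0.1, 0.1, 0.4, 0.3],    # mostly attends to text tokens
    [0.4, 0.3, 0.2, 0.05, 0.05],  # mostly attends to visual tokens
    [0.2, 0.2, 0.2, 0.2, 0.2],    # uniform attention
]
ranking = rank_heads(heads, is_visual)
```

In a real model each row would come from one head's attention weights for a given query position, averaged over inputs as needed.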

[CV-61] Beyond the Known: Enhancing Open Set Domain Adaptation with Unknown Exploration

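[Quick Read]: This paper addresses Open Set Domain Adaptation (OSDA), the setting in which model accuracy degrades in non-controllable environments because unlabeled datasets exhibit both domain shift and category shift. Existing OSDA methods typically either align only the known classes or use supervised training to treat all unknown classes as a single new category, and they do not handle the recognition of unknown classes well. The paper proposes extracting a set of high-confidence unknown instances and using it as a hard constraint to tighten the classification boundaries. Specifically, it introduces a new loss constraint evaluated in three ways: (1) using pristine negative instances directly; (2) using data augmentation to create randomly transformed negatives; (3) generating synthetic negatives that contain adversarial features. It also analyzes strategies for improving the discriminator and the training of the Generative Adversarial Network (GAN) so that the synthetic negatives are more effective. In experiments on three widely used public benchmarks (Office-31, Office-Home, and VisDA), the method reaches an H-score similar to other state-of-the-art methods while increasing accuracy on unknown categories.

Link: https://arxiv.org/abs/2412.18105
Authors: Lucas Fernando Alvarenga e Silva,Samuel Felipe dos Santos,Nicu Sebe,Jurandy Almeida
Affiliations: unknown
Keywords: Convolutional neural networks, Convolutional neural, resulting in exceptional, research areas, exceptional performance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: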

Abstract:Convolutional neural networks (CNNs) can learn directly from raw data, resulting in exceptional performance across various research areas. However, factors present in non-controllable environments such as unlabeled datasets with varying levels of domain and category shift can reduce model accuracy. The Open Set Domain Adaptation (OSDA) is a challenging problem that arises when both of these issues occur together. Existing OSDA approaches in literature only align known classes or use supervised training to learn unknown classes as a single new category. In this work, we introduce a new approach to improve OSDA techniques by extracting a set of high-confidence unknown instances and using it as a hard constraint to tighten the classification boundaries. Specifically, we use a new loss constraint that is evaluated in three different ways: (1) using pristine negative instances directly; (2) using data augmentation techniques to create randomly transformed negatives; and (3) with generated synthetic negatives containing adversarial features. We analyze different strategies to improve the discriminator and the training of the Generative Adversarial Network (GAN) used to generate synthetic negatives. We conducted extensive experiments and analysis on OVANet using three widely-used public benchmarks, the Office-31, Office-Home, and VisDA datasets. We were able to achieve similar H-score to other state-of-the-art methods, while increasing the accuracy on unknown categories.
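
The abstract reports results as an H-score without defining it. In the OSDA literature the H-score is commonly the harmonic mean of the accuracy on known classes and the accuracy on the unknown class; the sketch below assumes that definition, and the function name and sample accuracies are illustrative.

```python
def h_score(acc_known: float, acc_unknown: float) -> float:
    """Harmonic mean of known-class accuracy and unknown-class accuracy."""
    if acc_known + acc_unknown == 0:
        return 0.0
    return 2 * acc_known * acc_unknown / (acc_known + acc_unknown)

# Hypothetical accuracies: a balanced model versus one that neglects unknowns.
balanced = h_score(0.80, 0.80)
lopsided = h_score(0.95, 0.40)
```

Because the harmonic mean is pulled toward the smaller value, a model cannot score well by excelling on known classes alone. Under this definition, raising unknown-class accuracy at a similar H-score means the gain did not come at a large cost to the known classes.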

[CV-62] Multi-Point Positional Insertion Tuning for Small Object Detection

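[Quick Read]: This paper targets the heavy computational and memory cost of finetuning large pretrained models for small object detection. It proposes a parameter-efficient finetuning method called multi-point positional insertion (MPI) tuning. The key idea is to insert multiple positional embeddings into a frozen pretrained model, which supplies precise positional information to the latent features and thereby enables efficient detection of small objects. Experiments on the SODA-D dataset show that MPI performs comparably to conventional parameter-efficient finetuning methods such as CoOp and VPT while significantly reducing the number of parameters that need to be tuned.

Link: https://arxiv.org/abs/2412.18090
Authors: Kanoko Goto,Takumi Karasawa,Takumi Hirose,Rei Kawakami,Nakamasa Inoue
Affiliations: unknown
Keywords: classify small objects, aims to localize, localize and classify, object detection aims, Small object detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: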

Abstract:Small object detection aims to localize and classify small objects within images. With recent advances in large-scale vision-language pretraining, finetuning pretrained object detection models has emerged as a promising approach. However, finetuning large models is computationally and memory expensive. To address this issue, this paper introduces multi-point positional insertion (MPI) tuning, a parameter-efficient finetuning (PEFT) method for small object detection. Specifically, MPI incorporates multiple positional embeddings into a frozen pretrained model, enabling the efficient detection of small objects by providing precise positional information to latent features. Through experiments, we demonstrated the effectiveness of the proposed method on the SODA-D dataset. MPI performed comparably to conventional PEFT methods, including CoOp and VPT, while significantly reducing the number of parameters that need to be tuned.

[CV-63] Convolutional Prompting for Broad-Domain Retinal Vessel Segmentation ICASSP2025

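[Quick Read]: This paper takes on the challenging task of broad-domain retinal vessel segmentation (BD-RVS): developing a single unified model that works across varied image domains, including color fundus photography (CFP), scanning laser ophthalmoscopy (SLO), ultra-widefield fundus imaging (UWF), optical coherence tomography angiography (OCTA), and fundus fluorescein angiography (FFA). To that end it proposes Dual Convolutional Prompting (DCP), which learns to extract domain-specific features through localized prompting along both the position and channel dimensions. DCP is designed as a plug-in module that turns an R2AU-Net based vessel segmentation network into a unified model without modifying its network structure. For evaluation, the authors build a broad-domain set from five public datasets (ROSSA, FIVES, IOSTAR, PRIME-FP20, and VAMPIRE) and re-purpose a number of existing methods to produce eight baselines. Experiments show that the proposed method compares favorably against these baselines on BD-RVS.

Link: https://arxiv.org/abs/2412.18089
Authors: Qijie Wei,Weihong Yu,Xirong Li
Affiliations: unknown
Keywords: color fundus photography, retinal vessel segmentation, specific image domain, Previous research, domains including CFP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025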

Abstract:Previous research on retinal vessel segmentation is targeted at a specific image domain, mostly color fundus photography (CFP). In this paper we make a brave attempt to attack a more challenging task of broad-domain retinal vessel segmentation (BD-RVS), which is to develop a unified model applicable to varied domains including CFP, SLO, UWF, OCTA and FFA. To that end, we propose Dual Convolutional Prompting (DCP) that learns to extract domain-specific features by localized prompting along both position and channel dimensions. DCP is designed as a plug-in module that can effectively turn an R2AU-Net based vessel segmentation network into a unified model, yet without the need of modifying its network structure. For evaluation we build a broad-domain set using five public domain-specific datasets including ROSSA, FIVES, IOSTAR, PRIME-FP20 and VAMPIRE. In order to benchmark BD-RVS on the broad-domain dataset, we re-purpose a number of existing methods originally developed in other contexts, producing eight baseline methods in total. Extensive experiments show that the proposed method compares favorably against the baselines for BD-RVS.

[CV-64] COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection

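[Quick Read]: This paper addresses the misalignment between images captured by different sensors in multimodal object detection, which makes direct matching difficult and hinders establishing strong correlations for the same object across modalities. It proposes the CrOss-Mamba interaction and Offset-guided fusion (COMO) framework. COMO uses the cross-mamba technique to formulate feature interaction equations, enabling multimodal serialized state computation and producing interactive fusion outputs while reducing computational overhead and improving efficiency. It also leverages high-level features, which are less affected by misalignment, to facilitate interaction and transfer complementary information between modalities, addressing the positional offsets caused by variations in camera angle and capture time. In addition, a global and local scanning mechanism captures features with local correlation, particularly in remote sensing images, and an offset-guided fusion mechanism makes effective use of multiscale features by constructing a multiscale fusion data cube, which improves detection performance.

Link: https://arxiv.org/abs/2412.18076
Authors: Chang Liu,Xin Ma,Xiaochen Yang,Yuxiang Zhang,Yanni Dong
Affiliations: unknown
Keywords: Single-modal object detection, encountering diverse scenarios, object detection tasks, Single-modal object, multimodal object detection
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: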

Abstract:Single-modal object detection tasks often experience performance degradation when encountering diverse scenarios. In contrast, multimodal object detection tasks can offer more comprehensive information about object features by integrating data from various modalities. Current multimodal object detection methods generally use various fusion techniques, including conventional neural networks and transformer-based models, to implement feature fusion strategies and achieve complementary information. However, since multimodal images are captured by different sensors, there are often misalignments between them, making direct matching challenging. This misalignment hinders the ability to establish strong correlations for the same object across different modalities. In this paper, we propose a novel approach called the CrOss-Mamba interaction and Offset-guided fusion (COMO) framework for multimodal object detection tasks. The COMO framework employs the cross-mamba technique to formulate feature interaction equations, enabling multimodal serialized state computation. This results in interactive fusion outputs while reducing computational overhead and improving efficiency. Additionally, COMO leverages high-level features, which are less affected by misalignment, to facilitate interaction and transfer complementary information between modalities, addressing the positional offset challenges caused by variations in camera angles and capture times. Furthermore, COMO incorporates a global and local scanning mechanism in the cross-mamba module to capture features with local correlation, particularly in remote sensing images. To preserve low-level features, the offset-guided fusion mechanism ensures effective multiscale feature utilization, allowing the construction of a multiscale fusion data cube that enhances detection performance.

[CV-65] BIG-MoE: Bypass Isolated Gating MoE for Generalized Multimodal Face Anti-Spoofing ICASSP2025

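[Quick Read]: This paper addresses modality bias, modality imbalance, and domain shift in multimodal Face Anti-Spoofing (FAS). It identifies three limitations of traditional Mixture of Experts (MoE) approaches in multimodal FAS: coarse-grained experts cannot capture subtle spoofing cues; gating networks are susceptible to input noise, which affects decision-making; and MoE is sensitive to prompt tokens, which leads to overfitting under conventional learning methods. To address these, the paper proposes the Bypass Isolated Gating MoE (BIG-MoE) framework, whose key components are: fine-grained experts for better detection of subtle spoofing cues; an isolation gating mechanism to counteract input noise; and a novel differential convolutional prompt bypass that enriches the gating network with critical local features, improving its perceptual capability. Experiments on four benchmark datasets show a significant improvement in generalization performance on the multimodal FAS task.

Link: https://arxiv.org/abs/2412.18065
Authors: Yingjie Ma,Zitong Yu,Xun Lin,Weicheng Xie,Linlin Shen
Affiliations: unknown
Keywords: facial recognition security, countering presentation attacks, multimodal Face Anti-Spoofing, Face Anti-Spoofing, recognition security
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025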

Abstract:In the domain of facial recognition security, multimodal Face Anti-Spoofing (FAS) is essential for countering presentation attacks. However, existing technologies encounter challenges due to modality biases and imbalances, as well as domain shifts. Our research introduces a Mixture of Experts (MoE) model to address these issues effectively. We identified three limitations in traditional MoE approaches to multimodal FAS: (1) Coarse-grained experts’ inability to capture nuanced spoofing indicators; (2) Gated networks’ susceptibility to input noise affecting decision-making; (3) MoE’s sensitivity to prompt tokens leading to overfitting with conventional learning methods. To mitigate these, we propose the Bypass Isolated Gating MoE (BIG-MoE) framework, featuring: (1) Fine-grained experts for enhanced detection of subtle spoofing cues; (2) An isolation gating mechanism to counteract input noise; (3) A novel differential convolutional prompt bypass enriching the gating network with critical local features, thereby improving perceptual capabilities. Extensive experiments on four benchmark datasets demonstrate significant generalization performance improvement in multimodal FAS task. The code is released at this https URL.

[CV-66] An Ensemble Approach to Short-form Video Quality Assessment Using Multimodal LLM ICASSP2025

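[Quick Read]: This paper addresses the challenges that short-form videos pose for learning-based blind video quality assessment (BVQA) models, in particular the limited generalization caused by the diversity of content, editing styles, and artifacts. It proposes leveraging a pretrained multimodal large language model (MLLM) to improve short-form video quality assessment. The key steps are: first, studying how frame pre-processing and sampling techniques affect the MLLM's performance; second, introducing a lightweight learning-based ensemble method that adaptively combines the predictions of the MLLM and state-of-the-art BVQA models. Experiments show that the ensemble generalizes better, and an analysis of the content-aware ensemble weights reveals that some video characteristics are not fully represented by existing BVQA models, pointing to potential directions for improving them.

Link: https://arxiv.org/abs/2412.18060
Authors: Wen Wen,Yilin Wang,Neil Birkbeck,Balu Adsumilli
Affiliations: unknown
Keywords: poses substantial challenges, video quality assessment, BVQA models, editing styles, blind video quality
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2025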

Abstract:The rise of short-form videos, characterized by diverse content, editing styles, and artifacts, poses substantial challenges for learning-based blind video quality assessment (BVQA) models. Multimodal large language models (MLLMs), renowned for their superior generalization capabilities, present a promising solution. This paper focuses on effectively leveraging a pretrained MLLM for short-form video quality assessment, regarding the impacts of pre-processing and response variability, and insights on combining the MLLM with BVQA models. We first investigated how frame pre-processing and sampling techniques influence the MLLM’s performance. Then, we introduced a lightweight learning-based ensemble method that adaptively integrates predictions from the MLLM and state-of-the-art BVQA models. Our results demonstrated superior generalization performance with the proposed ensemble approach. Furthermore, the analysis of content-aware ensemble weights highlighted that some video characteristics are not fully represented by existing BVQA models, revealing potential directions to improve BVQA models further.
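
The adaptive ensemble described above can be pictured as a content-dependent convex combination of per-model scores. The sketch below is a minimal illustration under stated assumptions: the softmax gate, the function names, and the sample numbers are hypothetical, and in practice the gate logits would come from a small learned model over content features. The abstract does not specify the authors' actual ensemble architecture.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_score(predictions: Sequence[float], gate_logits: Sequence[float]) -> float:
    """Convex combination of per-model quality predictions."""
    weights = softmax(gate_logits)
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical per-video inputs: [MLLM score, BVQA score] on the same scale.
predictions = [3.0, 4.0]
even = ensemble_score(predictions, [0.0, 0.0])           # equal trust in both models
mllm_heavy = ensemble_score(predictions, [10.0, -10.0])  # gate strongly favors the MLLM
```

Inspecting the gate weights per video is what makes such an ensemble informative: videos where the weight shifts toward the MLLM are the ones the BVQA model represents poorly.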

[CV-67] AA-SGAN: Adversarially Augmented Social GAN with Synthetic Data

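[Quick Read]: This paper addresses accurate pedestrian trajectory prediction for applications such as autonomous driving and service robotics. Existing methods depend on large amounts of labelled trajectories for training. Although large amounts of synthetically generated trajectories exist (e.g., from video games), they often do not reflect pedestrian motion realistically and are therefore ineffective for training. The paper proposes a method and an architecture that augment synthetic trajectories at training time with an adversarial approach. Experiments show that this trajectory augmentation yields significant gains when a state-of-the-art generative model is evaluated on real-world trajectories.

Link: https://arxiv.org/abs/2412.18038
Authors: Mirko Zaffaroni,Federico Signoretta,Marco Grangetto,Attilio Fiandrotti
Affiliations: unknown
Keywords: Accurately predicting pedestrian, Accurately predicting, service robotics, crucial in applications, autonomous driving
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: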

Abstract:Accurately predicting pedestrian trajectories is crucial in applications such as autonomous driving or service robotics, to name a few. Deep generative models achieve top performance in this task, assuming enough labelled trajectories are available for training. To this end, large amounts of synthetically generated, labelled trajectories exist (e.g., generated by video games). However, such trajectories are not meant to represent pedestrian motion realistically and are ineffective at training a predictive model. We propose a method and an architecture to augment synthetic trajectories at training time and with an adversarial approach. We show that trajectory augmentation at training time unleashes significant gains when a state-of-the-art generative model is evaluated over real-world trajectories.

[CV-68] LayerDropBack: A Universally Applicable Approach for Accelerating Training of Deep Networks

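[Quick Read]: This paper addresses the large computational resources and time needed to train very deep convolutional networks. Existing acceleration methods often depend on specific architectures or require network modifications, which limits their generality. The authors propose LayerDropBack (LDB), a simple yet effective method for accelerating training across a wide range of deep networks. Its key idea is to introduce randomness only in the backward pass while keeping the forward pass intact, so that the same network is used during training and inference. LDB requires no architectural changes and can be integrated into the training process of any model, making it suitable for various network topologies. Experiments across multiple architectures (ViT, Swin Transformer, EfficientNet, DLA) and datasets (CIFAR-100, ImageNet) show training time reductions of 16.93% to 23.97% while preserving or even improving model accuracy.

Link: https://arxiv.org/abs/2412.18027
Authors: Evgeny Hershkovitch Neiterman,Gil Ben-Artzi
Affiliations: unknown
Keywords: requiring significant computational, significant computational resources, computational resources, LDB, deep convolutional networks
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: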

Abstract:Training very deep convolutional networks is challenging, requiring significant computational resources and time. Existing acceleration methods often depend on specific architectures or require network modifications. We introduce LayerDropBack (LDB), a simple yet effective method to accelerate training across a wide range of deep networks. LDB introduces randomness only in the backward pass, maintaining the integrity of the forward pass, guaranteeing that the same network is used during both training and inference. LDB can be seamlessly integrated into the training process of any model without altering its architecture, making it suitable for various network topologies. Our extensive experiments across multiple architectures (ViT, Swin Transformer, EfficientNet, DLA) and datasets (CIFAR-100, ImageNet) show significant training time reductions of 16.93% to 23.97%, while preserving or even enhancing model accuracy. Code is available at this https URL.
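
To make "randomness only in the backward pass" concrete, here is a toy sketch on a chain of scalar residual layers. It is not the authors' implementation: the layer form `h <- h + w * h`, the squared-error loss, the helper names, and the uniform choice of dropped layers are all assumptions for illustration. The forward pass always runs every layer; the backward pass skips the gradient work for a random subset and lets the gradient pass through their identity shortcuts.

```python
import random
from typing import List, Set

def forward(x: float, weights: List[float]) -> List[float]:
    """Full forward pass through scalar residual layers h <- h + w * h."""
    acts = [x]
    for w in weights:
        acts.append(acts[-1] * (1.0 + w))
    return acts

def backward(acts: List[float], weights: List[float],
             target: float, dropped: Set[int]) -> List[float]:
    """Weight gradients of 0.5 * (out - target)^2, skipping dropped layers."""
    grads = [0.0] * len(weights)
    g = acts[-1] - target               # dLoss/d(output)
    for l in reversed(range(len(weights))):
        if l in dropped:
            continue                    # no weight gradient; g passes via the shortcut only
        grads[l] = g * acts[l]          # dLoss/dw_l
        g = g * (1.0 + weights[l])      # dLoss/d(acts[l])
    return grads

rng = random.Random(0)
weights = [0.1, -0.2, 0.3, 0.05]
acts = forward(2.0, weights)            # identical no matter which layers are dropped
dropped = set(rng.sample(range(len(weights)), k=2))
grads = backward(acts, weights, target=1.0, dropped=dropped)
```

Dropped layers receive no update in that step, so the saved work is entirely in the backward pass, and the network evaluated at inference is the same one used in the training forward pass.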

[CV-69] Online Adaptation for Myographic Control of Natural Dexterous Hand and Finger Movements

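[Quick Read]: This paper addresses a long-standing problem in myographic prosthesis control: reliably decoding continuous positions simultaneously across multiple degrees-of-freedom (DoF), so as to achieve dexterous, natural, biomimetic finger and wrist control. The authors combine sequential temporal regression models with reinforcement learning, using myographic signals to predict continuous, simultaneous positions of 7 finger and wrist DoF for 9 non-amputee subjects in a minimally constrained freeform training process. The results demonstrate highly dexterous 7-DoF position-based regression for prosthesis control, with significantly lower error rates than traditional approaches (p < 0.001) and nearly zero prediction response time delay (p < 0.001). Performance can be improved continuously at any time through the freeform reinforcement process. The key is the training approach: it abandons standard training protocols and lets the subject move in any desired way while the models adapt, which the authors report as the most dexterous, biomimetic, and natural prosthesis control performance obtained to date.

Link: https://arxiv.org/abs/2412.17991
Authors: Joseph L. Betthauser,Rebecca Greene,Ananya Dhawan,John T. Krall,Christopher L. Hunt,Gyorgy Levay,Rahul R. Kaliki,Matthew S. Fifer,Siddhartha Sikdar,Nitish V. Thakor
Affiliations: unknown
Keywords: Modular Prosthetic Limb, decode continuous positions, continuous positions simultaneously, reliably decode continuous, robotic Modular Prosthetic
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Modified from Chapter 5 of J. L. Betthauser, “Robust Adaptive Strategies for Myographic Prosthesis Movement Decoding,” Doctoral Dissertation, Dept. of Electrical and Computer Engr, Johns Hopkins University, 2020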

Abstract:One of the most elusive goals in myographic prosthesis control is the ability to reliably decode continuous positions simultaneously across multiple degrees-of-freedom. Goal: To demonstrate dexterous, natural, biomimetic finger and wrist control of the highly advanced robotic Modular Prosthetic Limb. Methods: We combine sequential temporal regression models and reinforcement learning using myographic signals to produce continuous simultaneous predictions of 7 finger and wrist degrees-of-freedom for 9 non-amputee human subjects in a minimally-constrained freeform training process. Results: We demonstrate highly dexterous 7 DoF position-based regression for prosthesis control from EMG signals, with significantly lower error rates than traditional approaches (p < 0.001) and nearly zero prediction response time delay (p < 0.001). Their performance can be continuously improved at any time using our freeform reinforcement process. Significance: We have demonstrated the most dexterous, biomimetic, and natural prosthesis control performance ever obtained from the surface EMG signal. Our reinforcement approach allowed us to abandon standard training protocols and simply allow the subject to move in any desired way while our models adapt. Conclusions: This work redefines the state-of-the-art in myographic decoding in terms of the reliability, responsiveness, and movement complexity available from prosthesis control systems. The present-day emergence and convergence of advanced algorithmic methods, experiment protocols, dexterous robotic prostheses, and sensor modalities represents a unique opportunity to finally realize our ultimate goal of achieving fully restorative natural upper-limb function for amputees.

[CV-70] ICPR 2024 Competition on Domain Adaptation and GEneralization for Character Classification (DAGECC) ICPR2024

[Quick Read]: This paper concerns domain adaptation and generalization in character classification. Its core aim is to foster interest and facilitate progress on these topics by providing a high-quality, lightweight, real-world dataset that supports fast prototyping and validation of new ideas. It does so through the DAGECC competition, presenting the tasks proposed to the community and the data prepared for them, summarizing the results, and describing the top three winning entries.

Link: https://arxiv.org/abs/2412.17984
Authors: Sofia Marino,Jennifer Vandoni,Emanuel Aldea,Ichraq Lemghari,Sylvie Le Hégarat-Mascle,Frédéric Jurie
Affiliations: unknown
Keywords: Character Classification, Domain Adaptation, Adaptation and GEneralization, winning entries, companion paper
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Companion paper for the ICPR 2024 Competition on Domain Adaptation and GEneralization for Character Classification (DAGECC)

Abstract:In this companion paper for the DAGECC (Domain Adaptation and GEneralization for Character Classification) competition organized within the frame of the ICPR 2024 conference, we present the general context of the tasks we proposed to the community, we introduce the data that were prepared for the competition and we provide a summary of the results along with a description of the top three winning entries. The competition was centered around domain adaptation and generalization, and our core aim is to foster interest and facilitate advancement on these topics by providing a high-quality, lightweight, real world dataset able to support fast prototyping and validation of novel ideas.

[CV-71] Unsupervised learning of spatially varying regularization for diffeomorphic image registration

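[Quick Read]: This paper addresses the spatially invariant regularization commonly used by deep learning-based deformable image registration, where a uniform regularization strength is applied across the entire image and may disregard the differing deformations of local anatomical structures. It proposes a hierarchical probabilistic model that places a prior distribution on the deformation regularization strength, enabling end-to-end learning of a spatially varying regularizer directly from the data. Hyperparameters are tuned automatically through Bayesian optimization, which efficiently identifies optimal values for a given registration task. The method significantly improves registration performance and enhances the interpretability of deep learning-based registration while maintaining smooth deformations.

Link: https://arxiv.org/abs/2412.17982
Authors: Junyu Chen,Shuwen Wei,Yihao Liu,Zhangxing Bian,Yufan He,Aaron Carass,Harrison Bai,Yong Du
Affiliations: unknown
Keywords: Spatially varying regularization, varying regularization accommodates, Spatially varying, regions during deformable, harnessed spatially varying
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at this http URL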

Abstract:Spatially varying regularization accommodates the deformation variations that may be necessary for different anatomical regions during deformable image registration. Historically, optimization-based registration models have harnessed spatially varying regularization to address anatomical subtleties. However, most modern deep learning-based models tend to gravitate towards spatially invariant regularization, wherein a homogenous regularization strength is applied across the entire image, potentially disregarding localized variations. In this paper, we propose a hierarchical probabilistic model that integrates a prior distribution on the deformation regularization strength, enabling the end-to-end learning of a spatially varying deformation regularizer directly from the data. The proposed method is straightforward to implement and easily integrates with various registration network architectures. Additionally, automatic tuning of hyperparameters is achieved through Bayesian optimization, allowing efficient identification of optimal hyperparameters for any given registration task. Comprehensive evaluations on publicly available datasets demonstrate that the proposed method significantly improves registration performance and enhances the interpretability of deep learning-based registration, all while maintaining smooth deformations.
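
The contrast between spatially invariant and spatially varying regularization can be written schematically as below. The notation is an assumption made for illustration: fixed image $I_f$, moving image $I_m$, deformation $\phi = \mathrm{Id} + u$ with displacement field $u$, voxel domain $\Omega$, and a term $\mathcal{R}(\lambda)$ standing in for whatever the prior on the regularization strength contributes. The diffusion-style penalty is one common choice, and the abstract does not state the paper's exact objective.

```latex
% Spatially invariant: one global strength \lambda shared by every voxel x
\mathcal{L}_{\text{invariant}}
  = \mathcal{L}_{\text{sim}}(I_f,\, I_m \circ \phi)
  + \lambda \sum_{x \in \Omega} \lVert \nabla u(x) \rVert^2

% Spatially varying: a learned weight map \lambda(x), constrained by a prior term
\mathcal{L}_{\text{varying}}
  = \mathcal{L}_{\text{sim}}(I_f,\, I_m \circ \phi)
  + \sum_{x \in \Omega} \lambda(x)\, \lVert \nabla u(x) \rVert^2
  + \mathcal{R}(\lambda)
```

Without some constraint on $\lambda(x)$ the second objective would be minimized trivially by setting the weight map to zero, which is why a prior on the regularization strength is needed when the map is learned from data.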

[CV-72] Improving Sickle Cell Disease Classification: A Fusion of Conventional Classifiers Segmented Images and Convolutional Neural Networks

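[Quick Read]: This paper addresses the automated classification of sickle cell anemia, and in particular how to make diagnosis more efficient when computational resources are limited. Sickle cell anemia is characterized by abnormal erythrocyte morphology and can be detected from microscopic images. Convolutional Neural Networks (CNNs) perform well in medical image analysis, but training them usually requires substantial computational resources and time. The paper therefore proposes an approach that combines conventional classifiers, segmented images, and CNN features. Using segmented images and CNN features with a support vector machine (SVM) achieves an accuracy of 96.80%. The result is relevant for computationally constrained scenarios and suggests directions for further research in medical image analysis.

Link: https://arxiv.org/abs/2412.17975
Authors: Victor Júnio Alcântara Cardoso,Rodrigo Moreira,João Fernando Mari,Larissa Ferreira Rodrigues Moreira
Affiliations: unknown
Keywords: abnormal erythrocyte morphology, Sickle cell anemia, Convolutional Neural Networks, erythrocyte morphology, characterized by abnormal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages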

Abstract:Sickle cell anemia, which is characterized by abnormal erythrocyte morphology, can be detected using microscopic images. Computational techniques in medicine enhance the diagnosis and treatment efficiency. However, many computational techniques, particularly those based on Convolutional Neural Networks (CNNs), require high resources and time for training, highlighting the research opportunities in methods with low computational overhead. In this paper, we propose a novel approach combining conventional classifiers, segmented images, and CNNs for the automated classification of sickle cell disease. We evaluated the impact of segmented images on classification, providing insight into deep learning integration. Our results demonstrate that using segmented images and CNN features with an SVM achieves an accuracy of 96.80%. This finding is relevant for computationally efficient scenarios, paving the way for future research and advancements in medical-image analysis.

[CV-73] A Multimodal Fusion Framework for Bridge Defect Detection with Cross-Verification

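[Quick Read]: This paper addresses precise detection and analysis of bridge defects, in particular identifying defect-prone regions (such as delamination and debonding) in concrete structures. It proposes a multimodal fusion framework that combines Non-Destructive Evaluation (NDE) techniques with advanced image processing. Specifically, it integrates data from the Impact Echo (IE) and Ultrasonic Surface Waves (USW) methods and uses geospatial analysis with alpha shapes, fusion of defect points, and unified lane boundaries to improve defect localization and identify overlapping defect regions. Cross-verification with adaptive image processing, based on contour-based mapping and bounding box techniques, further validates the detected defects, improving detection accuracy and reducing false positives. The experimental results indicate the framework's potential for defect localization and detection accuracy and provide a foundation for future larger-scale validation.

Link: https://arxiv.org/abs/2412.17968
Authors: Ravi Datta Rachuri,Duoduo Liao,Samhita Sarikonda,Datha Vaishnavi Kondur
Affiliations: unknown
Keywords: integrating Non-Destructive Evaluation, pilot study introducing, Ultrasonic Surface Waves, Non-Destructive Evaluation, integrating Non-Destructive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Big Data 2024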

Abstract:This paper presents a pilot study introducing a multimodal fusion framework for the detection and analysis of bridge defects, integrating Non-Destructive Evaluation (NDE) techniques with advanced image processing to enable precise structural assessment. By combining data from Impact Echo (IE) and Ultrasonic Surface Waves (USW) methods, this preliminary investigation focuses on identifying defect-prone regions within concrete structures, emphasizing critical indicators such as delamination and debonding. Using geospatial analysis with alpha shapes, fusion of defect points, and unified lane boundaries, the proposed framework consolidates disparate data sources to enhance defect localization and facilitate the identification of overlapping defect regions. Cross-verification with adaptive image processing further validates detected defects by aligning their coordinates with visual data, utilizing advanced contour-based mapping and bounding box techniques for precise defect identification. The experimental results, with an F1 score of 0.83, demonstrate the potential efficacy of the approach in improving defect localization, reducing false positives, and enhancing detection accuracy, which provides a foundation for future research and larger-scale validation. This preliminary exploration establishes the framework as a promising tool for efficient bridge health assessment, with implications for proactive structural monitoring and maintenance.
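
The cross-verification step and the F1 metric can be sketched as follows. This is a simplified illustration, not the paper's pipeline: it assumes fused NDE defect points and image-derived bounding boxes already share one coordinate frame, and the type aliases, function names, and sample coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def inside(point: Point, box: Box) -> bool:
    """True if the point lies within the axis-aligned box (edges included)."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def cross_verify(points: List[Point], boxes: List[Box]) -> List[Point]:
    """Keep only NDE defect points confirmed by at least one visual box."""
    return [p for p in points if any(inside(p, b) for b in boxes)]

def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from detection counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical fused IE + USW defect points and image-derived boxes.
fused_points = [(1.0, 1.0), (4.5, 2.0), (9.0, 9.0)]
visual_boxes = [(0.0, 0.0, 2.0, 2.0), (4.0, 1.5, 5.0, 2.5)]
confirmed = cross_verify(fused_points, visual_boxes)
score = f1_score(tp=8, fp=2, fn=4)
```

Points that no visual box confirms are the candidates for false positives, which is how a check of this kind can trade a little recall for higher precision.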

[CV-74] ArchComplete: Autoregressive 3D Architectural Design Generation with Hierarchical Diffusion-Based Upsampling

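[Quick Read]: This paper tackles the high complexity of architectural geometries and topologies, assisting with ideation and geometric detailisation in the early design process. The authors propose ArchComplete, a two-stage dense voxel-based generative pipeline. In stage 1, a 3D Voxel VQGAN model is devised, and its composition is modelled with an autoregressive transformer to generate coarse models. In stage 2, Hierarchical Voxel Upsampling Networks, consisting of a set of 3D conditional denoising diffusion probabilistic models, augment the coarse shapes with fine geometric details. The first stage is trained on a dataset of house models with fully modelled exteriors and interiors, using a novel 2.5D perceptual loss to capture input complexity across multiple abstraction levels; the second stage is trained on randomly cropped local volumetric patches, which requires significantly less compute and memory. The pipeline refines models progressively from a resolution of 64^3 to 256^3 and supports several interaction modes, including interpolation, variation generation, unconditional synthesis, and conditional synthesis tasks such as shape completion and plan-drawing completion. The results show notable improvements over the state of the art on established metrics.

Link: https://arxiv.org/abs/2412.17957
Authors: S. Rasoulzadeh,M. Bank,M. Wimmer,I. Kovacic,K. Schinegger,S. Rutzinger
Affiliations: unknown
Keywords: two-stage dense voxel-based, early design process, generative pipeline developed, Voxel Upsampling Networks, Hierarchical Voxel Upsampling
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 14 pages, 12 figures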

Abstract:ArchComplete is a two-stage dense voxel-based 3D generative pipeline developed to tackle the high complexity in architectural geometries and topologies, assisting with ideation and geometric detailisation in the early design process. In stage 1, a 3D Voxel VQGAN model is devised, whose composition is then modelled with an autoregressive transformer for generating coarse models. Subsequently, in stage 2, Hierarchical Voxel Upsampling Networks consisting of a set of 3D conditional denoising diffusion probabilistic models are defined to augment the coarse shapes with fine geometric details. The first stage is trained on a dataset of house models with fully modelled exteriors and interiors with a novel 2.5D perceptual loss to capture input complexities across multiple abstraction levels, while the second stage trains on randomly cropped local volumetric patches, requiring significantly less compute and memory. For inference, the pipeline first autoregressively generates house models at a resolution of 64^3 and then progressively refines them to a resolution of 256^3 with voxel sizes as small as 18 cm. ArchComplete supports a range of interaction modes solving a variety of tasks, including interpolation, variation generation, unconditional synthesis, and two conditional synthesis tasks: shape completion and plan-drawing completion, as well as geometric detailisation. The results demonstrate notable improvements against state-of-the-art on established metrics.

[CV-75] Hyperbolic Chamfer Distance for Point Cloud Completion and Beyond

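[Quick Read]: This paper addresses the sensitivity of Chamfer Distance (CD) to outliers when it is used as a loss function for point cloud completion, which can cause models to converge to suboptimal solutions. Whereas existing work mainly tackles this problem in Euclidean space, the paper proposes a simple and effective metric, Hyperbolic Chamfer Distance (HyperCD), which computes Chamfer Distance in hyperbolic space. During backpropagation, HyperCD systematically assigns greater weight to matched point pairs with smaller Euclidean distances. This preserves accurate point matches while allowing suboptimal matches to be adjusted incrementally, which improves completion results. Beyond point cloud completion, the paper also explores HyperCD in other generative tasks, including single image reconstruction and upsampling. Experiments on the PCN, ShapeNet-55, and ShapeNet-34 benchmarks show state-of-the-art completion performance and visibly smoother surfaces, and results beyond the completion task are also reported.

Link: https://arxiv.org/abs/2412.17951
Authors: Fangzhou Lin,Songlin Hou,Haotian Liu,Shang Gao,Kazunori D Yamada,Haichong K. Zhang,Ziming Zhang
Affiliations: unknown
Keywords: Chamfer Distance, point cloud completion, Hyperbolic Chamfer Distance, point cloud, cloud completion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 6 figures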

Abstract:Chamfer Distance (CD) is widely used as a metric to quantify difference between two point clouds. In point cloud completion, Chamfer Distance (CD) is typically used as a loss function in deep learning frameworks. However, it is generally acknowledged within the field that Chamfer Distance (CD) is vulnerable to the presence of outliers, which can consequently lead to the convergence on suboptimal models. In divergence from the existing literature, which largely concentrates on resolving such concerns in the realm of Euclidean space, we put forth a notably uncomplicated yet potent metric specifically designed for point cloud completion tasks: Hyperbolic Chamfer Distance (HyperCD). This metric conducts Chamfer Distance computations within the parameters of hyperbolic space. During the backpropagation process, HyperCD systematically allocates greater weight to matched point pairs exhibiting reduced Euclidean distances. This mechanism facilitates the preservation of accurate point pair matches while permitting the incremental adjustment of suboptimal matches, thereby contributing to enhanced point cloud completion outcomes. Moreover, since measuring shape dissimilarity is useful beyond the point cloud completion task, we further explore its applications in other generation-related tasks, including single image reconstruction from point cloud, and upsampling. We demonstrate state-of-the-art performance on the point cloud completion benchmark datasets, PCN, ShapeNet-55, and ShapeNet-34, and show from visualization that HyperCD can significantly improve the surface smoothness. We also provide experimental results beyond the completion task.
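
The re-weighting effect described above can be checked numerically. The sketch below assumes HyperCD applies arcosh(1 + alpha * d) to each nearest-neighbor squared distance d before averaging; the abstract does not give the formula, so this form, the parameter name `alpha`, and the helper names are assumptions for illustration.

```python
import math
from typing import List, Sequence

Point = Sequence[float]

def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_sq_dists(src: List[Point], dst: List[Point]) -> List[float]:
    """For each source point, squared distance to its nearest point in dst."""
    return [min(sq_dist(p, q) for q in dst) for p in src]

def chamfer(xs: List[Point], ys: List[Point]) -> float:
    """Symmetric Chamfer Distance on squared Euclidean distances."""
    dx, dy = nearest_sq_dists(xs, ys), nearest_sq_dists(ys, xs)
    return sum(dx) / len(dx) + sum(dy) / len(dy)

def hyper_chamfer(xs: List[Point], ys: List[Point], alpha: float = 1.0) -> float:
    """Chamfer Distance with arcosh(1 + alpha * d) applied per matched pair."""
    dx, dy = nearest_sq_dists(xs, ys), nearest_sq_dists(ys, xs)
    hx = [math.acosh(1.0 + alpha * d) for d in dx]
    hy = [math.acosh(1.0 + alpha * d) for d in dy]
    return sum(hx) / len(hx) + sum(hy) / len(hy)

def hyper_weight(d: float, alpha: float = 1.0) -> float:
    """Derivative of arcosh(1 + alpha * d) with respect to d, for d > 0."""
    return alpha / math.sqrt((1.0 + alpha * d) ** 2 - 1.0)

# Two clean matches plus one far outlier in the second cloud.
xs = [(0.0, 0.0), (1.0, 0.0)]
ys = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
plain = chamfer(xs, ys)
hyper = hyper_chamfer(xs, ys)
```

Under this assumed form, `hyper_weight` shrinks as d grows, so close matches dominate the gradient while the outlier pair contributes far less than it does under the plain squared-distance loss.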

[CV-76] Causal Deep Learning

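Quick read: This paper addresses forward and inverse questions in causal inference. Forward causal questions are handled by a neural architecture composed of causal capsules and a tensor transformer: the causal capsules compute a set of invariant causal factor representations whose interactions are governed by a tensor transformation. Inverse causal questions are handled by a neural network that implements the multilinear projection algorithm; this architecture reverses the order of operations of the forward network and thereby estimates the causes of effects. To avoid aggressive bottleneck dimension reduction or regularized regression, which may mask an inherently underdetermined inverse problem, the paper models different aspects of the data formation mechanism with piecewise tensor models whose multilinear projections produce multiple candidate solutions. Although both kinds of question can be addressed with shallow architectures, the paper uses block algebra to derive a set of deep neural networks for computational scalability, yielding a doubly non-linear tensor factor model. The resulting causal neural networks are based on tensor factor analysis, are data agnostic, and are illustrated with facial images. Sequential, parallel, and asynchronous parallel computation strategies are also described.

Link: https://arxiv.org/abs/2301.00314
Authors: M. Alex O. Vasilescu
Affiliation: Unknown
Keywords: inverse causal inference, causal, neural networks, deep neural networks, framework that facilitates
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: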

Abstract:We derive a set of causal deep neural networks whose architectures are a consequence of tensor (multilinear) factor analysis, a framework that facilitates forward and inverse causal inference. Forward causal questions are addressed with a neural architecture composed of causal capsules and a tensor transformer. Causal capsules compute a set of invariant causal factor representations, whose interactions are governed by a tensor transformation. Inverse causal questions are addressed with a neural network that implements the multilinear projection algorithm. The architecture reverses the order of the operations of a forward neural network and estimates the causes of effects. As an alternative to aggressive bottleneck dimension reduction or regularized regression that may camouflage an inherently underdetermined inverse problem, we prescribe modeling different aspects of the mechanism of data formation with piecewise tensor models whose multilinear projections produce multiple candidate solutions. Our forward and inverse questions may be addressed with shallow architectures, but for computationally scalable solutions, we derive a set of deep neural networks by taking advantage of block algebra. An interleaved kernel hierarchy results in doubly non-linear tensor factor models. The causal neural networks that are a consequence of tensor factor analysis are data agnostic, but are illustrated with facial images. Sequential, parallel and asynchronous parallel computation strategies are described.

[CV-77] Text-Driven Tumor Synthesis

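Quick read: This paper addresses the lack of precise control over specific tumor characteristics (such as texture, heterogeneity, boundaries, and pathology type) in existing tumor synthesis methods. Without that control, generated tumors can be overly similar to, or duplicates of, existing training data and so fail to improve AI performance in tumor detection, segmentation, and classification. The paper proposes a new text-driven tumor synthesis approach, TextoMorph, which controls tumor characteristics with text extracted from radiology reports and thereby generates more diverse and better-targeted synthetic tumors. The key to TextoMorph is contrastive learning across texts and CT scans, which exploits a large corpus of 34,035 radiology reports to reduce dependence on scarce image-report pairs (only 141 pairs are used). Rigorous tests, namely a Text-Driven Visual Turing Test and Radiomics Pattern Analysis, verify the realism and diversity of the synthetic tumors. The approach improves AI performance in early tumor detection (Sensitivity +8.5%), tumor segmentation for precise radiotherapy (DSC +6.3%), and classification between benign and malignant tumors (Sensitivity +8.2%).

Link: https://arxiv.org/abs/2412.18589
Authors: Xinran Li,Yi Shuai,Chen Liu,Qi Chen,Qilong Wu,Pengfei Guo,Dong Yang,Can Zhao,Pedro R. A. S. Bassi,Daguang Xu,Kang Wang,Yang Yang,Alan Yuille,Zongwei Zhou
Affiliation: Unknown
Keywords: misses or over-detects, Tumor, tumors, synthetic tumors, Tumor synthesis
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: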

Abstract:Tumor synthesis can generate examples that AI often misses or over-detects, improving AI performance by training on these challenging cases. However, existing synthesis methods, which are typically unconditional (generating images from random variables) or conditioned only by tumor shapes, lack controllability over specific tumor characteristics such as texture, heterogeneity, boundaries, and pathology type. As a result, the generated tumors may be overly similar or duplicates of existing training data, failing to effectively address AI’s weaknesses. We propose a new text-driven tumor synthesis approach, termed TextoMorph, that provides textual control over tumor characteristics. This is particularly beneficial for examples that confuse the AI the most, such as early tumor detection (increasing Sensitivity by +8.5%), tumor segmentation for precise radiotherapy (increasing DSC by +6.3%), and classification between benign and malignant tumors (improving Sensitivity by +8.2%). By incorporating text mined from radiology reports into the synthesis process, we increase the variability and controllability of the synthetic tumors to target AI’s failure cases more precisely. Moreover, TextoMorph uses contrastive learning across different texts and CT scans, significantly reducing dependence on scarce image-report pairs (only 141 pairs used in this study) by leveraging a large corpus of 34,035 radiology reports. Finally, we have developed rigorous tests to evaluate synthetic tumors, including Text-Driven Visual Turing Test and Radiomics Pattern Analysis, showing that our synthetic tumors are realistic and diverse in texture, heterogeneity, boundaries, and pathology.

[CV-78] Advancing Deformable Medical Image Registration with Multi-axis Cross-covariance Attention

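Quick read: This paper addresses the difficulty that deep learning-based deformable image registration methods have in processing high-resolution image features. Because the computation and memory costs of conventional self-attention (SA) grow quadratically with spatial resolution, SA struggles to capture the subtle textural information in high-resolution images that is crucial for precise pixel-wise correspondence between anatomical structures. The paper builds on Cross-covariance Attention (XCA), whose complexity grows linearly with spatial resolution and which therefore makes it feasible to capture long-range dependency among high-resolution image features. Existing XCA-based transformers, however, capture only coarse global long-range dependency and do not meet the need of deformable registration for fine-grained local correspondence. The paper therefore designs a transformer block named Multi-Axis XCA (MAXCA), which applies regional and dilated XCA in parallel to capture both global and local long-range dependency among high-resolution image features. MAXCA is a general network block that can be embedded into various registration network architectures, and experiments on seven public medical datasets show state-of-the-art registration performance.

Link: https://arxiv.org/abs/2412.18545
Authors: Mingyuan Meng,Michael Fulham,Lei Bi,Jinman Kim
Affiliation: Unknown
Keywords: image, registration, fundamental requirement, high-resolution image features, long-range dependency
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review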

Abstract:Deformable image registration is a fundamental requirement for medical image analysis. Recently, transformers have been widely used in deep learning-based registration methods for their ability to capture long-range dependency via self-attention (SA). However, the high computation and memory loads of SA (growing quadratically with the spatial resolution) hinder transformers from processing subtle textural information in high-resolution image features, e.g., at the full and half image resolutions. This limits deformable registration as the high-resolution textural information is crucial for finding precise pixel-wise correspondence between subtle anatomical structures. Cross-covariance Attention (XCA), as a “transposed” version of SA that operates across feature channels, has complexity growing linearly with the spatial resolution, providing the feasibility of capturing long-range dependency among high-resolution image features. However, existing XCA-based transformers merely capture coarse global long-range dependency, which is unsuitable for deformable image registration relying primarily on fine-grained local correspondence. In this study, we propose to improve existing deep learning-based registration methods by embedding a new XCA mechanism. To this end, we design an XCA-based transformer block optimized for deformable medical image registration, named Multi-Axis XCA (MAXCA). Our MAXCA serves as a general network block that can be embedded into various registration network architectures. It can capture both global and local long-range dependency among high-resolution image features by applying regional and dilated XCA in parallel via a multi-axis design. Extensive experiments on two well-benchmarked inter-/intra-patient registration tasks with seven public medical datasets demonstrate that our MAXCA block enables state-of-the-art registration performance.
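To make the "transposed" attention idea concrete, here is a minimal single-head sketch of plain cross-covariance attention over channels. It is not the MAXCA block: the regional and dilated variants are omitted, and the column normalisation, the temperature `tau`, and the softmax axis are assumptions chosen for illustration.

```python
import math

def l2_normalize_columns(m):
    """Scale each channel (column) of an N x d matrix to unit L2 norm over tokens."""
    n, d = len(m), len(m[0])
    norms = [math.sqrt(sum(m[t][c] ** 2 for t in range(n))) or 1.0 for c in range(d)]
    return [[m[t][c] / norms[c] for c in range(d)] for t in range(n)]

def softmax(row):
    """Numerically stable softmax of a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def xca(q, k, v, tau=1.0):
    """Cross-covariance attention; q, k, v are N x d lists (N tokens, d channels)."""
    n, d = len(q), len(q[0])
    qn, kn = l2_normalize_columns(q), l2_normalize_columns(k)
    # d x d channel-to-channel map: cost O(N * d^2), linear in the token count N.
    attn = [softmax([sum(qn[t][i] * kn[t][j] for t in range(n)) / tau
                     for j in range(d)])
            for i in range(d)]
    # Every token mixes its own channels with the same d x d map.
    return [[sum(attn[i][j] * v[t][j] for j in range(d)) for i in range(d)]
            for t in range(n)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(xca(tokens, tokens, tokens))
```

The attention map here is d x d rather than N x N, which is why doubling the number of pixels (tokens) only doubles the cost instead of quadrupling it.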

[CV-79] Ultra-Low Complexity On-Orbit Compression for Remote Sensing Imagery via Block Modulated Imaging

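Quick read: This paper addresses the limited storage and transmission capacity of satellite platforms for remote sensing imagery. As the volume of imagery keeps growing, existing compression methods are often too computationally expensive to run efficiently on satellites. The key to the proposed solution is Block Modulated Imaging (BMI), which is based on compressed sensing theory. BMI requires only a single exposure, which significantly increases acquisition speed, and it removes the need for digital micromirror devices, overcoming limits on image resolution. The paper also designs a new decoding network that uses gated 3D convolutions and a Two-Way Cross-Attention module to reconstruct images compressed under the BMI framework. Experiments on multiple well-known remote sensing datasets show superior reconstruction performance, and tests with a prototype BMI-based camera further support its practical potential for on-orbit image compression.

Link: https://arxiv.org/abs/2412.18417
Authors: Zhibin Wang,Yanxin Cai,Jiayi Zhou,Yangming Zhang,Tianyu Li,Wei Li,Xun Liu,Guoqing Wang,Yang Yang
Affiliation: Unknown
Keywords: remote sensing faces, remote sensing, faces a challenge, growing field, ever-increasing size
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: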

Abstract:The growing field of remote sensing faces a challenge: the ever-increasing size and volume of imagery data are exceeding the storage and transmission capabilities of satellite platforms. Efficient compression of remote sensing imagery is a critical solution to alleviate these burdens on satellites. However, existing compression methods are often too computationally expensive for satellites. With the continued advancement of compressed sensing theory, single-pixel imaging emerges as a powerful tool that brings new possibilities for on-orbit image compression. However, it still suffers from prolonged imaging times and the inability to perform high-resolution imaging, hindering its practical application. This paper advances the study of compressed sensing in remote sensing image compression, proposing Block Modulated Imaging (BMI). By requiring only a single exposure, BMI significantly enhances imaging acquisition speeds. Additionally, BMI obviates the need for digital micromirror devices and surpasses limitations in image resolution. Furthermore, we propose a novel decoding network specifically designed to reconstruct images compressed under the BMI framework. Leveraging the gated 3D convolutions and promoting efficient information flow across stages through a Two-Way Cross-Attention module, our decoding network exhibits demonstrably superior reconstruction performance. Extensive experiments conducted on multiple renowned remote sensing datasets unequivocally demonstrate the efficacy of our proposed method. To further validate its practical applicability, we developed and tested a prototype of the BMI-based camera, which has shown promising potential for on-orbit image compression. The code is available at this https URL.

[CV-80] How accurate is mechanobiology?

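Quick read: This paper addresses the absence of statistical indicators such as error bars, confidence regions, and p-values when biomechanical forces are measured at the microscale. Because biological experiments are hard to access with physical probes, these forces are usually measured indirectly through inverse problems such as Traction Force Microscopy, and the resulting measurements lack the statistical validation expected of experimental data. As a first step towards a remedy, the paper proposes a general reconstruction framework that enables hypothesis testing, giving mechanobiological measurements a basis for statistical reliability and validation.

Link: https://arxiv.org/abs/2412.18406
Authors: Aleix Boquet-Pujadas
Affiliation: Unknown
Keywords: Mechanobiology is gaining, Traction Force Microscopy, function becomes clearer, fundamental role, Traction Force
Categories: Biological Physics (physics.bio-ph); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
Comments: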

Abstract:Mechanobiology is gaining more and more traction as the fundamental role of physical forces in biological function becomes clearer. Forces at the microscale are often measured indirectly using inverse problems such as Traction Force Microscopy because biological experiments are hard to access with physical probes. In contrast with the experimental nature of biology and physics, these measurements do not come with error bars, confidence regions, or p-values. The aim of this manuscript is to publicize this issue and to propose a first step towards a remedy in the form of a general reconstruction framework that enables hypothesis testing.

[CV-81] An Improved Fault Diagnosis Strategy for Induction Motors Using Weighted Probability Ensemble Deep Learning

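Quick read: This paper addresses early fault detection in induction motors (IMs), focusing on the common bearing, rotor, and stator faults. It proposes a Weighted Probability Ensemble Deep Learning (WPEDL) method designed for high-dimensional data extracted from vibration and current signals. The key is to extract features from both signal types with the Short-Time Fourier Transform (STFT) and to perform multi-class fault diagnosis with the WPEDL model. The reported accuracies are 99.05% for bearing faults (vibration signal), 99.10% and 99.50% for rotor faults (current and vibration signals), and 99.60% and 99.52% for stator faults (current and vibration signals). On a combined dataset of 52,000 STFT images covering all three fault types, the WPEDL model reaches an accuracy of 98.89% and outperforms the conventional deep learning models it is compared against. These findings support the effectiveness and reliability of WPEDL for early-stage fault diagnosis in induction motors and offer insights for improving industrial operational efficiency and reliability.

Link: https://arxiv.org/abs/2412.18249
Authors: Usman Ali,Waqas Ali,Umer Ramzan
Affiliation: Unknown
Keywords: ensuring uninterrupted operations, Early detection, Weighted Probability Ensemble, Probability Ensemble Deep, induction motors
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: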

Abstract:Early detection of faults in induction motors is crucial for ensuring uninterrupted operations in industrial settings. Among the various fault types encountered in induction motors, bearing, rotor, and stator faults are the most prevalent. This paper introduces a Weighted Probability Ensemble Deep Learning (WPEDL) methodology, tailored for effectively diagnosing induction motor faults using high-dimensional data extracted from vibration and current features. The Short-Time Fourier Transform (STFT) is employed to extract features from both vibration and current signals. The performance of the WPEDL fault diagnosis method is compared against conventional deep learning models, demonstrating the superior efficacy of the proposed system. The multi-class fault diagnosis system based on WPEDL achieves high accuracies across different fault types: 99.05% for bearing faults (vibration signal), 99.10% and 99.50% for rotor faults (current and vibration signals), and 99.60% and 99.52% for stator faults (current and vibration signals), respectively. To evaluate the robustness of our multi-class classification decisions, tests have been conducted on a combined dataset of 52,000 STFT images encompassing all three faults. Our proposed model outperforms other models, achieving an accuracy of 98.89%. The findings underscore the effectiveness and reliability of the WPEDL approach for early-stage fault diagnosis in IMs, offering promising insights for enhancing industrial operational efficiency and reliability.
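The abstract does not spell out how the ensemble combines its members, so the following is only a generic sketch of weighted probability averaging. The function names, the example weights, and the three-class layout (bearing, rotor, stator) are hypothetical; the paper's actual weighting rule may differ.

```python
def weighted_probability_ensemble(probabilities, weights):
    """Fuse per-model class probabilities with a normalised weighted average."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [sum(w * p[c] for w, p in zip(weights, probabilities)) / total
            for c in range(n_classes)]

def predict(probabilities, weights):
    """Return the index of the class with the highest fused probability."""
    fused = weighted_probability_ensemble(probabilities, weights)
    return max(range(len(fused)), key=fused.__getitem__)

# Hypothetical softmax outputs of three base models for one STFT image,
# over the classes (bearing, rotor, stator).
model_outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
# Hypothetical weights, e.g. proportional to each model's validation accuracy.
model_weights = [0.5, 0.25, 0.25]

print(weighted_probability_ensemble(model_outputs, model_weights))
print(predict(model_outputs, model_weights))
```

Averaging probabilities instead of hard votes lets a confident, highly weighted model outvote two weaker ones, which is the usual motivation for this kind of ensemble.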

[CV-82] Image Quality Assessment: Exploring Regional Heterogeneity via Response of Adaptive Multiple Quality Factors in Dictionary Space

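Quick read: This paper addresses regional heterogeneity in image quality assessment, which arises because the factors that influence quality vary with scene, content, and distortion type. The authors propose an adaptive multi-quality factor (AMqF) framework that represents image quality in a dictionary space so that quality features in non-uniformly distorted regions are captured precisely. The solution has two key parts. First, an adapter flexibly decomposes quality factors (such as brightness, structure, and contrast) according to the principles of human visual perception and quantifies them into discrete visual words. Second, a comprehensive and discriminative dictionary space with basis vectors lets the quality factors respond to the dictionary basis vectors, capturing non-uniform distortion patterns and improving the accuracy of visual similarity measurement. Experiments show that the method outperforms existing state-of-the-art approaches on images with various distortion types.

Link: https://arxiv.org/abs/2412.18160
Authors: Xuting Lan,Mingliang Zhou,Jielu Yan,Xuekai Wei,Yueting Huang,Zhaowei Shang,Huayan Pu
Affiliation: Unknown
Keywords: non-uniformly distorted regions, regional heterogeneity, enabling the precise, context of regional, features in non-uniformly
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: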

Abstract:Given that the factors influencing image quality vary significantly with scene, content, and distortion type, particularly in the context of regional heterogeneity, we propose an adaptive multi-quality factor (AMqF) framework to represent image quality in a dictionary space, enabling the precise capture of quality features in non-uniformly distorted regions. By designing an adapter, the framework can flexibly decompose quality factors (such as brightness, structure, contrast, etc.) that best align with human visual perception and quantify them into discrete visual words. These visual words respond to the constructed dictionary basis vector, and by obtaining the corresponding coordinate vectors, we can measure visual similarity. Our method offers two key contributions. First, an adaptive mechanism that extracts and decomposes quality factors according to human visual perception principles enhances their representation ability through reconstruction constraints. Second, the construction of a comprehensive and discriminative dictionary space and basis vector allows quality factors to respond effectively to the dictionary basis vector and capture non-uniform distortion patterns in images, significantly improving the accuracy of visual similarity measurement. The experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches in handling various types of distorted images. The source code is available at this https URL.

[CV-83] Analysis of Transferred Pre-Trained Deep Convolution Neural Networks in Breast Masses Recognition

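Quick read: This paper investigates how freezing different blocks of convolution layers affects performance in breast cancer detection based on a pre-trained convolutional neural network (CNN). Mammogram images are classified as benign or malignant to evaluate different freezing strategies for VGG19. Six scenarios are examined, each freezing a different number of convolution layer blocks. The key point is that freezing some blocks both improves the model's ability to detect breast cancer cases and reduces the training time of VGG19. The best recognition rate came from freezing the first block of VGG19, with a sensitivity of 95.64%, compared with 94.48% when the entire VGG19 was trained.

Link: https://arxiv.org/abs/2412.17959
Authors: Qusay Shihab Hamad,Hussein Samma,Shahrel Azmin Suandi
Affiliation: Unknown
Keywords: conventional computer-based systems, Breast cancer detection, convolution neural network, Breast cancer, pre-trained convolution neural
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: It's a conference paper; the full proceedings are available at this https URL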

Abstract:Breast cancer detection based on pre-trained convolution neural network (CNN) has gained much interest among other conventional computer-based systems. In the past few years, CNN technology has been the most promising way to find cancer in mammogram scans. In this paper, the effect of layer freezing in a pre-trained CNN is investigated for breast cancer detection by classifying mammogram images as benign or malignant. Different VGG19 scenarios have been examined based on the number of convolution layer blocks that have been frozen. There are a total of six scenarios in this study. The primary benefits of this research are twofold: it improves the model’s ability to detect breast cancer cases and it reduces the training time of VGG19 by freezing certain layers. To evaluate the performance of these scenarios, 1693 microbiological images of benign and malignant breast cancers were utilized. According to the reported results, the best recognition rate was obtained from a frozen first block of VGG19 with a sensitivity of 95.64%, while the training of the entire VGG19 yielded 94.48%.

[CV-84] Optimization of Convolutional Neural Network Hyperparameter for Medical Image Diagnosis using Metaheuristic Algorithms: A short Recent Review (2019-2022)

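Quick read: This paper addresses how to select the optimal architecture and hyperparameters efficiently when Convolutional Neural Networks (CNNs) are applied to medical diagnosis. Hyperparameter selection traditionally relies on the researcher's experience and manual tuning, which is time-consuming and computationally costly. The paper reviews recent work (2019-2022) that uses metaheuristic optimization algorithms to find high-performing CNN hyperparameter settings automatically. Replacing manual tuning with optimization algorithms gives researchers a more systematic and efficient way to set hyperparameters.

Link: https://arxiv.org/abs/2412.17956
Authors: Qusay Shihab Hamad,Hussein Samma,Shahrel Azmin Suandi
Affiliation: Unknown
Keywords: Convolutional Neural Networks, Convolutional Neural, Neural Networks, successfully utilized, medical diagnosis
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: It's a conference paper; the full proceedings are available online at this https URL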

Abstract:Convolutional Neural Networks (CNNs) have been successfully utilized in the medical diagnosis of many illnesses. Nevertheless, identifying the optimal architecture and hyperparameters among the available possibilities might be a substantial challenge. Typically, CNN hyperparameter selection is performed manually. Nonetheless, this is a computationally costly procedure, as numerous rounds of hyperparameter settings must be evaluated to determine which produces the best results. Choosing the proper hyperparameter settings has always been a crucial and challenging task, as it depends on the researcher’s knowledge and experience. This study will present work done in recent years on the usage of metaheuristic optimization algorithms in the CNN optimization process. It looks at a number of recent studies that focus on the use of optimization methods to optimize hyperparameters in order to find high-performing CNNs. This helps researchers figure out how to set hyperparameters efficiently.

[CV-85] Adaptive Signal Analysis for Automated Subsurface Defect Detection Using Impact Echo in Concrete Slabs

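Quick read: This paper addresses the automated detection and evaluation of subsurface defect-prone regions in concrete slabs. Conventional approaches usually depend on manual intervention and are hard to apply at scale. The paper proposes a new automated method based on Impact Echo (IE) signal analysis that integrates advanced signal processing, clustering, and visual analytics to identify subsurface anomalies. The key is an adaptive thresholding method that adjusts the frequency threshold to the material properties of each slab, enabling accurate defect identification. The method generates frequency maps, binary masks, and k-means cluster maps to classify defect and non-defect regions automatically, and it uses 3D surface plots, cluster maps, and contour plots to analyze spatial frequency distributions and highlight structural anomalies. It is evaluated on a labeled dataset constructed at the Federal Highway Administration (FHWA) Advanced Sensing Technology Nondestructive Evaluation Laboratory, where F1-scores and AUC-ROC reach up to 0.95 and 0.83, respectively. The adaptive frequency thresholding keeps the method flexible across slabs, and the method could be extended to other frequency-based signals and to multimodal sensor fusion.

Link: https://arxiv.org/abs/2412.17953
Authors: Deepthi Pavurala,Duoduo Liao,Chaithra Reddy Pasunuru
Affiliation: Unknown
Keywords: Impact Echo, pilot study presents, evaluating subsurface defect-prone, evaluating subsurface, signal analysis
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Big Data 2024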

Abstract:This pilot study presents a novel, automated, and scalable methodology for detecting and evaluating subsurface defect-prone regions in concrete slabs using Impact Echo (IE) signal analysis. The approach integrates advanced signal processing, clustering, and visual analytics to identify subsurface anomalies. A unique adaptive thresholding method tailors frequency-based defect identification to the distinct material properties of each slab. The methodology generates frequency maps, binary masks, and k-means cluster maps to automatically classify defect and non-defect regions. Key visualizations, including 3D surface plots, cluster maps, and contour plots, are employed to analyze spatial frequency distributions and highlight structural anomalies. The study utilizes a labeled dataset constructed at the Federal Highway Administration (FHWA) Advanced Sensing Technology Nondestructive Evaluation Laboratory. Evaluations involve ground-truth masking, comparing the generated defect maps with top-view binary masks derived from the information provided by the FHWA. The performance metrics, specifically F1-scores and AUC-ROC, achieve values of up to 0.95 and 0.83, respectively. The results demonstrate the robustness of the methodology, consistently identifying defect-prone areas with minimal false positives and few missed defects. Adaptive frequency thresholding ensures flexibility in addressing variations across slabs, providing a scalable framework for detecting structural anomalies. Additionally, the methodology is adaptable to other frequency-based signals due to its generalizable thresholding mechanism and holds potential for integrating multimodal sensor fusion. This automated and scalable pipeline minimizes manual intervention, ensuring accurate and efficient defect detection, further advancing Non-Destructive Evaluation (NDE) techniques.
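The abstract does not give the thresholding rule, so the sketch below only illustrates what a per-slab adaptive threshold could look like. It assumes each grid point has already been reduced to a peak frequency (in kHz), and it flags points that deviate from the slab's own median by more than `k` median absolute deviations; the rule, the parameter `k`, and the function name are all assumptions for illustration.

```python
import statistics

def adaptive_defect_mask(freq_map, k=3.0):
    """Binary mask: 1 where a peak frequency is far from the slab's own median."""
    values = [f for row in freq_map for f in row]
    centre = statistics.median(values)
    # Median absolute deviation; fall back to a tiny value for a constant map.
    mad = statistics.median(abs(f - centre) for f in values) or 1e-9
    return [[1 if abs(f - centre) > k * mad else 0 for f in row]
            for row in freq_map]

# Hypothetical peak-frequency maps (kHz) for two slabs with different baselines.
slab_a = [[10.0, 10.2, 9.9],
          [10.1, 3.0, 10.0],
          [9.8, 10.1, 10.2]]
slab_b = [[15.0, 15.1],
          [14.9, 8.0]]

print(adaptive_defect_mask(slab_a))
print(adaptive_defect_mask(slab_b))
```

Because the centre and spread are recomputed for every slab, the same call handles slabs whose sound-concrete frequency differs, which is the flexibility the abstract attributes to adaptive thresholding. A binary mask like this could then be compared with a ground-truth mask or passed to a clustering step.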

Artificial Intelligence

[AI-0] Decentralized Intelligence in GameFi: Embodied AI Agents and the Convergence of DeFi and Virtual Ecosystems

Link: https://arxiv.org/abs/2412.18601
Authors: Fernando Jia,Jade Zheng,Florence Li
Keywords: rapidly evolving landscape, rapidly evolving, exists a critical, GameFi, evolving landscape
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 11 pages, 4 figures

Abstract:In the rapidly evolving landscape of GameFi, a fusion of gaming and decentralized finance (DeFi), there exists a critical need to enhance player engagement and economic interaction within gaming ecosystems. Our GameFi ecosystem aims to fundamentally transform this landscape by integrating advanced embodied AI agents into GameFi platforms. These AI agents, developed using cutting-edge large language models (LLMs), such as GPT-4 and Claude AI, are capable of proactive, adaptive, and contextually rich interactions with players. By going beyond traditional scripted responses, these agents become integral participants in the game’s narrative and economic systems, directly influencing player strategies and in-game economies. We address the limitations of current GameFi platforms, which often lack immersive AI interactions and mechanisms for community engagement or creator monetization. Through the deep integration of AI agents with blockchain technology, we establish a consensus-driven, decentralized GameFi ecosystem. This ecosystem empowers creators to monetize their contributions and fosters democratic collaboration among players and creators. Furthermore, by embedding DeFi mechanisms into the gaming experience, we enhance economic participation and provide new opportunities for financial interactions within the game. Our approach enhances player immersion and retention and advances the GameFi ecosystem by bridging traditional gaming with Web3 technologies. By integrating sophisticated AI and DeFi elements, we contribute to the development of more engaging, economically robust, and community-centric gaming environments. This project represents a significant advancement in the state-of-the-art in GameFi, offering insights and methodologies that can be applied throughout the gaming industry.

[AI-1] A Paragraph is All It Takes: Rich Robot Behaviors from Interacting Trusted LLMs

Link: https://arxiv.org/abs/2412.18588
Authors: OpenMind,Shaohong Zhong,Adam Zhou,Boyuan Chen,Homin Luo,Jan Liphardt
Keywords: Large Language Models, Large Language, Language Models, compact representations, environment and animal
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 10 pages, 1 figure

Abstract:Large Language Models (LLMs) are compact representations of all public knowledge of our physical environment and animal and human behaviors. The application of LLMs to robotics may offer a path to highly capable robots that perform well across most human tasks with limited or even zero tuning. Aside from increasingly sophisticated reasoning and task planning, networks of (suitably designed) LLMs offer ease of upgrading capabilities and allow humans to directly observe the robot’s thinking. Here we explore the advantages, limitations, and particularities of using LLMs to control physical robots. The basic system consists of four LLMs communicating via a human language data bus implemented via web sockets and ROS2 message passing. Surprisingly, rich robot behaviors and good performance across different tasks could be achieved despite the robot’s data fusion cycle running at only 1Hz and the central data bus running at the extremely limited rates of the human brain, of around 40 bits/s. The use of natural language for inter-LLM communication allowed the robot’s reasoning and decision making to be directly observed by humans and made it trivial to bias the system’s behavior with sets of rules written in plain English. These rules were immutably written into Ethereum, a global, public, and censorship resistant Turing-complete computer. We suggest that by using natural language as the data bus among interacting AIs, and immutable public ledgers to store behavior constraints, it is possible to build robots that combine unexpectedly rich performance, upgradability, and durable alignment with humans.

[AI-2] An Overview and Discussion of the Suitability of Existing Speech Datasets to Train Machine Learning Models for Collective Problem Solving

Link: https://arxiv.org/abs/2412.18489
Authors: Gnaneswar Villuri,Alex Doboli
Keywords: Machine Learning models, Collaborative Problem Solving, decision making methods, Machine Learning, improve Collaborative Problem
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This report characterized the suitability of existing datasets for devising new Machine Learning models, decision making methods, and analysis algorithms to improve Collaborative Problem Solving and then enumerated requirements for future datasets to be devised. Problem solving was assumed to be performed in teams of about three to four members, who talked to each other. A dataset consists of the speech recordings of such teams. The characterization methodology was based on metrics that capture cognitive, social, and emotional activities and situations. The report presented the analysis of a large group of datasets developed for Spoken Language Understanding, a research area with some similarity to Collaborative Problem Solving.

[AI-3] MotifGPL: Motif-Enhanced Graph Prototype Learning for Deciphering Urban Social Segregation AAAI-25 AAAI

Link: https://arxiv.org/abs/2412.18464
Authors: Tengfei He,Xiao Zhou
Keywords: urban, spanning racial, Social segregation, income dimensions, diverse and severe
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Accepted by the 39th Annual AAAI Conference on Artificial Intelligence (AAAI-25); 10 pages, 8 figures, 3 tables; Includes the appendix

Abstract:Social segregation in cities, spanning racial, residential, and income dimensions, is becoming more diverse and severe. As urban spaces and social relations grow more complex, residents in metropolitan areas experience varying levels of social segregation. If left unaddressed, this could lead to increased crime rates, heightened social tensions, and other serious issues. Effectively quantifying and analyzing the structures within urban spaces and resident interactions is crucial for addressing segregation. Previous studies have mainly focused on surface-level indicators of urban segregation, lacking comprehensive analyses of urban structure and mobility. This limitation fails to capture the full complexity of segregation. To address this gap, we propose a framework named Motif-Enhanced Graph Prototype Learning (MotifGPL), which consists of three key modules: prototype-based graph structure extraction, motif distribution discovery, and urban graph structure reconstruction. Specifically, we use graph structure prototype learning to extract key prototypes from both the urban spatial graph and the origin-destination graph, incorporating key urban attributes such as points of interest, street view images, and flow indices. To enhance interpretability, the motif distribution discovery module matches each prototype with similar motifs, representing simpler graph structures reflecting local patterns. Finally, we use the motif distribution results to guide the reconstruction of the two graphs. This model enables a detailed exploration of urban spatial structures and resident mobility patterns, helping identify and analyze motif patterns that influence urban segregation, guiding the reconstruction of urban graph structures. Experimental results demonstrate that MotifGPL effectively reveals the key motifs affecting urban social segregation and offers robust guidance for mitigating this issue.

[AI-4] GeFL: Model-Agnostic Federated Learning with Generative Models

Link: https://arxiv.org/abs/2412.18460
Authors: Honggu Kang,Seohyeon Cha,Joonhyuk Kang
Keywords: Federated learning, promising paradigm, paradigm in distributed, distributed learning, Model-Aided Federated Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:Federated learning (FL) is a promising paradigm in distributed learning while preserving the privacy of users. However, the increasing size of recent models makes it unaffordable for a few users to encompass the model. It leads the users to adopt heterogeneous models based on their diverse computing capabilities and network bandwidth. Correspondingly, FL with heterogeneous models should be addressed, given that FL typically involves training a single global model. In this paper, we propose Generative Model-Aided Federated Learning (GeFL), incorporating a generative model that aggregates global knowledge across users of heterogeneous models. Our experiments on various classification tasks demonstrate notable performance improvements of GeFL compared to baselines, as well as limitations in terms of privacy and scalability. To tackle these concerns, we introduce a novel framework, GeFL-F. It trains target networks aided by feature-generative models. We empirically demonstrate the consistent performance gains of GeFL-F, while demonstrating better privacy preservation and robustness to a large number of clients. Codes are available at [1].

[AI-5] Multi-Agent Norm Perception and Induction in Distributed Healthcare

Link: https://arxiv.org/abs/2412.18454
Authors: Chao Li, Olga Petruchik, Elizaveta Grishanina, Sergey Kovalchuk
Keywords: Induction Learning Model, Perception and Induction, Induction Learning, Learning Model aimed, distributed healthcare environments
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 15 pages, 8 figures, 152 conferences, 3 tables

Abstract:This paper presents a Multi-Agent Norm Perception and Induction Learning Model aimed at facilitating the integration of autonomous agent systems into distributed healthcare environments through dynamic interaction processes. The nature of the medical norm system and its sharing channels necessitates distinct approaches for Multi-Agent Systems to learn two types of norms. Building on this foundation, the model enables agents to simultaneously learn descriptive norms, which capture collective tendencies, and prescriptive norms, which dictate ideal behaviors. Through parameterized mixed probability density models and practice-enhanced Markov games, the multi-agent system perceives descriptive norms in dynamic interactions and captures emergent prescriptive norms. We conducted experiments using a dataset from a neurological medical center spanning from 2016 to 2020.

[AI-6] SoK: On the Offensive Potential of AI

Link: https://arxiv.org/abs/2412.18442
Authors: Saskia Laura Schröer, Giovanni Apruzzese, Soheil Human, Pavel Laskov, Hyrum S. Anderson, Edward W. N. Bernroider, Aurore Fass, Ben Nassi, Vera Rimmer, Fabio Roli, Samer Salam, Ashley Shen, Ali Sunyaev, Tim Wadwha-Brown, Isabel Wagner, Gang Wang
Keywords: society increasingly benefits, increasingly benefits, Artificial Intelligence, society increasingly, offensive
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Systemization of Knowledge (SoK) paper

Abstract:Our society increasingly benefits from Artificial Intelligence (AI). Unfortunately, more and more evidence shows that AI is also used for offensive purposes. Prior works have revealed various examples of use cases in which the deployment of AI can lead to violation of security and privacy objectives. No extant work, however, has been able to draw a holistic picture of the offensive potential of AI. In this SoK paper we seek to lay the ground for a systematic analysis of the heterogeneous capabilities of offensive AI. In particular we (i) account for AI risks to both humans and systems while (ii) consolidating and distilling knowledge from academic literature, expert opinions, industrial venues, as well as laymen, all of which are valuable sources of information on offensive AI. To enable alignment of such diverse sources of knowledge, we devise a common set of criteria reflecting essential technological factors related to offensive AI. With the help of such criteria, we systematically analyze: 95 research papers; 38 InfoSec briefings (from, e.g., BlackHat); the responses of a user study (N=549) entailing individuals with diverse backgrounds and expertise; and the opinion of 12 experts. Our contributions not only reveal concerning ways (some of which were overlooked by prior work) in which AI can be offensively used today, but also represent a foothold to address this threat in the years to come.

[AI-7] GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent

Link: https://arxiv.org/abs/2412.18426
Authors: Kangjia Zhao, Jiahui Song, Leigang Sha, Haozhan Shen, Zhi Chen, Tiancheng Zhao, Xiubo Liang, Jianwei Yin
Keywords: automated GUI Testing, GUI Testing, GUI, Autonomous GUI Testing, hot topic
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Nowadays, research on GUI agents is a hot topic in the AI community. However, current research focuses on GUI task automation, limiting the scope of applications in various GUI scenarios. In this paper, we propose a formalized and comprehensive environment to evaluate the entire process of automated GUI Testing (GTArena), offering a fair, standardized environment for consistent operation of diverse multimodal large language models. We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection, and construct a benchmark dataset based on these to conduct a comprehensive evaluation. The benchmark evaluates the performance of different models using three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data, thoroughly assessing their capabilities in this relevant task. Additionally, we propose a method that helps researchers explore the correlation between the performance of multimodal large language models in specific scenarios and their general capabilities in standard benchmark tests. Experimental results indicate that even the most advanced models struggle to perform well across all subtasks of automated GUI Testing, highlighting a significant gap between the current capabilities of Autonomous GUI Testing and its practical, real-world applicability. This gap provides guidance for the future direction of GUI Agent development. Our code is available at this https URL.

[AI-8] Research on the Proximity Relationships of Psychosomatic Disease Knowledge Graph Modules Extracted by Large Language Models

Link: https://arxiv.org/abs/2412.18419
Authors: Zihan Zhou, Ziyi Zeng, Wenhao Jiang, Yihui Zhu, Jiaxin Mao, Yonggui Yuan, Min Xia, Shubin Zhao, Mengyu Yao, Yunqian Chen
Keywords: global health issues, social changes accelerate, significantly increased, major challenge, challenge in global
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As social changes accelerate, the incidence of psychosomatic disorders has significantly increased, becoming a major challenge in global health issues. This necessitates an innovative knowledge system and analytical methods to aid in diagnosis and treatment. Here, we establish the ontology model and entity types, using the BERT model and LoRA-tuned LLM for named entity recognition, constructing the knowledge graph with 9668 triples. Next, by analyzing the network distances between disease, symptom, and drug modules, it was found that closer network distances among diseases can predict greater similarities in their clinical manifestations, treatment approaches, and psychological mechanisms, and closer distances between symptoms indicate that they are more likely to co-occur. Lastly, by comparing the proximity d and proximity z score, it was shown that symptom-disease pairs in primary diagnostic relationships have a stronger association and are of higher referential value than those in diagnostic relationships. The research results revealed the potential connections between diseases, co-occurring symptoms, and similarities in treatment strategies, providing new perspectives for the diagnosis and treatment of psychosomatic disorders and valuable information for future mental health research and practice.

[AI-9] Exploring Flexible Scenario Generation in Godot Simulator

Link: https://arxiv.org/abs/2412.18408
Authors: Daniel Peraltai, Xin Qin
Keywords: Cyber-physical systems, physical components engineered, combine cyber, cyber and physical, physical components
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Cyber-physical systems (CPS) combine cyber and physical components engineered to make decisions and interact within dynamic environments. Ensuring the safety of CPS is of great importance, requiring extensive testing across diverse and complex scenarios. To generate as many testing scenarios as possible, previous efforts have focused on describing scenarios using formal languages to generate scenes. In this paper, we introduce an alternative approach: reconstructing scenes inside the open-source game engine, Godot. We have developed a pipeline that enables the reconstruction of testing scenes directly from provided images of scenarios. These reconstructed scenes can then be deployed within simulated environments to assess a CPS. This approach offers a scalable and flexible solution for testing CPS in realistic environments.

[AI-10] TPAoI: Ensuring Fresh Service Status at the Network Edge in Compute-First Networking

Link: https://arxiv.org/abs/2412.18391
Authors: Haosheng He, Jianpeng Qi, Chao Liu, Junyu Dong, Yanwei Yu
Keywords: accurate status information, compute-first networking, maintaining fresh, fresh and accurate, crucial for effective
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Abstract:In compute-first networking, maintaining fresh and accurate status information at the network edge is crucial for effective access to remote services. This process typically involves three phases: Status updating, user accessing, and user requesting. However, current studies on status effectiveness, such as Age of Information at Query (QAoI), do not comprehensively cover all these phases. Therefore, this paper introduces a novel metric, TPAoI, aimed at optimizing update decisions by measuring the freshness of service status. The stochastic nature of edge environments, characterized by unpredictable communication delays in updating, requesting, and user access times, poses a significant challenge when modeling. To address this, we model the problem as a Markov Decision Process (MDP) and employ a Dueling Double Deep Q-Network (D3QN) algorithm for optimization. Extensive experiments demonstrate that the proposed TPAoI metric effectively minimizes AoI, ensuring timely and reliable service updates in dynamic edge environments. Results indicate that TPAoI reduces AoI by an average of 47% compared to QAoI metrics and decreases update frequency by an average of 48% relative to conventional AoI metrics, showing significant improvement.
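
The abstract names a Dueling Double Deep Q-Network (D3QN) but gives no implementation details. As a rough illustration only, the two textbook ingredients behind that name can be sketched in plain Python: the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A), and the double-Q target, where the online network picks the next action and the target network scores it. The function names, argument layout, and default discount factor are hypothetical and are not taken from the paper.

```python
def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, .)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_q_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target for one transition.

    The online network's Q-values choose the next action; the target
    network's Q-values evaluate it. gamma=0.99 is an assumed default.
    """
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    bootstrap = 0.0 if done else gamma * q_target_next[best_action]
    return reward + bootstrap


# Hypothetical numbers, for illustration only.
q_values = dueling_q(1.0, [1.0, 3.0])
target = double_q_target(1.0, False, [0.0, 5.0], [10.0, 2.0], gamma=0.5)
```

Separating action selection from action evaluation is the usual way Double DQN reduces the overestimation bias of a plain max over noisy Q-values; how the paper combines this with its TPAoI reward is not stated in the abstract.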

[AI-11] Weak Scaling Capability in Token Space: An Observation from Large Vision Language Model

Link: https://arxiv.org/abs/2412.18387
Authors: Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao
Keywords: training data, widely validated, size of training, scaling capability, vision tokens
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The scaling capability has been widely validated with respect to the number of parameters and the size of training data. One important unexplored question is whether a similar scaling capability also exists with respect to the number of vision tokens. This study fills the gap by investigating the relationship between the number of vision tokens and the performance of vision-language models. Our theoretical analysis and empirical evaluations reveal that the model exhibits weak scaling capabilities on the length $N_l$, with performance approximately $S(N_l) \approx (c/N_l)^\alpha$, where $c, \alpha$ are hyperparameters. Interestingly, this scaling behavior remains largely unaffected by the inclusion or exclusion of the user's question in the input. Furthermore, fusing the user's question with the vision token can enhance model performance when the question is relevant to the task. To address the computational challenges associated with large-scale vision tokens, we propose a novel architecture that efficiently reduces the token count while integrating user question tokens into the representation. Our findings may offer insights for developing more efficient and effective vision-language models under specific task constraints.
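
The power-law form quoted in the abstract, S(N_l) ≈ (c/N_l)^α, becomes linear after taking logarithms: log S = α log c - α log N_l. The sketch below fits c and α by ordinary least squares in log-log space. It is a generic curve-fitting illustration, not the authors' procedure; the function name and the synthetic data are assumptions, and it requires positive inputs and a nonzero fitted α.

```python
import math


def fit_power_law(token_counts, scores):
    """Fit S(N) ~ (c / N) ** alpha by least squares on (log N, log S).

    In log space: log S = alpha * log c - alpha * log N,
    so slope = -alpha and intercept = alpha * log c.
    """
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(s) for s in scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope
    c = math.exp(intercept / alpha)
    return c, alpha


# Synthetic data generated from c = 100, alpha = 0.5 (made-up values).
counts = [16, 64, 256, 1024]
values = [(100.0 / n) ** 0.5 for n in counts]
c_hat, alpha_hat = fit_power_law(counts, values)
```

On noise-free synthetic data the fit recovers the generating parameters up to floating-point error; with real measurements the residuals in log space indicate how well the power-law form holds.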

[AI-12] A Many Objective Problem Where Crossover is Provably Indispensable AAAI2025

Link: https://arxiv.org/abs/2412.18375
Authors: Andre Opris
Keywords: paper addresses theory, evolutionary multiobjective optimisation, paper addresses, addresses theory, theory in evolutionary
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: To appear in the proceedings of AAAI 2025

Abstract:This paper addresses theory in evolutionary multiobjective optimisation (EMO) and focuses on the role of crossover operators in many-objective optimisation. The advantages of using crossover are hardly understood and rigorous runtime analyses with crossover are lagging far behind its use in practice, specifically in the case of more than two objectives. We present a many-objective problem class together with a theoretical runtime analysis of the widely used NSGA-III to demonstrate that crossover can yield an exponential speedup on the runtime. In particular, this algorithm can find the Pareto set in expected polynomial time when using crossover while without crossover it requires exponential time to even find a single Pareto-optimal point. To our knowledge, this is the first rigorous runtime analysis in many-objective optimisation demonstrating an exponential performance gap when using crossover for more than two objectives.

[AI-13] Unveiling the Threat of Fraud Gangs to Graph Neural Networks: Multi-Target Graph Injection Attacks against GNN-Based Fraud Detectors AAAI2025

Link: https://arxiv.org/abs/2412.18370
Authors: Jinhyeok Choi, Heehyeon Kim, Joyce Jiyoung Whang
Keywords: identifying fraudulent users, uncovering malicious behaviors, Graph neural networks, graph injection attack, neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 19 pages, 5 figures, 12 tables, The 39th AAAI Conference on Artificial Intelligence (AAAI 2025)

Abstract:Graph neural networks (GNNs) have emerged as an effective tool for fraud detection, identifying fraudulent users, and uncovering malicious behaviors. However, attacks against GNN-based fraud detectors and their risks have rarely been studied, thereby leaving potential threats unaddressed. Recent findings suggest that frauds are increasingly organized as gangs or groups. In this work, we design attack scenarios where fraud gangs aim to make their fraud nodes misclassified as benign by camouflaging their illicit activities in collusion. Based on these scenarios, we study adversarial attacks against GNN-based fraud detectors by simulating attacks of fraud gangs in three real-world fraud cases: spam reviews, fake news, and medical insurance frauds. We define these attacks as multi-target graph injection attacks and propose MonTi, a transformer-based Multi-target one-Time graph injection attack model. MonTi simultaneously generates attributes and edges of all attack nodes with a transformer encoder, capturing interdependencies between attributes and edges more effectively than most existing graph injection attack methods that generate these elements sequentially. Additionally, MonTi adaptively allocates the degree budget for each attack node to explore diverse injection structures involving target, candidate, and attack nodes, unlike existing methods that fix the degree budget across all attack nodes. Experiments show that MonTi outperforms the state-of-the-art graph injection attack methods on five real-world graphs.

[AI-14] Hypergraph Attacks via Injecting Homogeneous Nodes into Elite Hyperedges AAAI

Link: https://arxiv.org/abs/2412.18365
Authors: Meixia He, Peican Zhu, Keke Tang, Yangming Guo
Keywords: Hypergraph Neural Networks, Neural Networks, Recent studies, Hypergraph Neural, Elite Hyperedges
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, The 39th Annual AAAI Conference on Artificial Intelligence (2025)

Abstract:Recent studies have shown that Hypergraph Neural Networks (HGNNs) are vulnerable to adversarial attacks. Existing approaches focus on hypergraph modification attacks guided by gradients, overlooking node spanning in the hypergraph and the group identity of hyperedges, thereby resulting in limited attack performance and detectable attacks. In this manuscript, we present a novel framework, i.e., Hypergraph Attacks via Injecting Homogeneous Nodes into Elite Hyperedges (IE-Attack), to tackle these challenges. Initially, utilizing the node spanning in the hypergraph, we propose the elite hyperedges sampler to identify hyperedges to be injected. Subsequently, a node generator utilizing Kernel Density Estimation (KDE) is proposed to generate the homogeneous node with the group identity of hyperedges. Finally, by injecting the homogeneous node into elite hyperedges, IE-Attack improves the attack performance and enhances the imperceptibility of attacks. Extensive experiments are conducted on five authentic datasets to validate the effectiveness of IE-Attack and the corresponding superiority to state-of-the-art methods.
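
The abstract says the node generator uses Kernel Density Estimation (KDE) but does not describe how. One common way to draw a new sample from a Gaussian KDE is to pick one of the observed points uniformly at random and add Gaussian noise whose standard deviation is the bandwidth. The sketch below applies that idea to the feature vectors of nodes in a single hyperedge; the function name, the isotropic kernel, and the toy features are all assumptions for illustration and may differ from IE-Attack's actual generator.

```python
import random


def sample_from_gaussian_kde(points, bandwidth, rng):
    """Draw one sample from an isotropic Gaussian KDE over `points`.

    A Gaussian KDE is an equal-weight mixture of Gaussians centred on the
    observed points, so sampling means: choose a centre uniformly, then
    perturb each coordinate with N(0, bandwidth^2) noise.
    """
    center = rng.choice(points)
    return [x + rng.gauss(0.0, bandwidth) for x in center]


# Hypothetical 2-D features of the nodes inside one hyperedge.
hyperedge_features = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]]
rng = random.Random(0)
new_node_features = sample_from_gaussian_kde(hyperedge_features, 0.05, rng)
```

A small bandwidth keeps the generated features close to existing members of the hyperedge, which matches the abstract's stated goal of injecting nodes that share the hyperedge's group identity; the bandwidth value here is arbitrary.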

[AI-15] Point-DeepONet: A Deep Operator Network Integrating PointNet for Nonlinear Analysis of Non-Parametric 3D Geometries and Load Conditions

Link: https://arxiv.org/abs/2412.18362
Authors: Jangseop Park, Namwoo Kang
Keywords: uncertainty quantification, limiting their applicability, real-time control, finite element simulations, physics-informed neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 16 figures, and 5 tables

Abstract:Nonlinear structural analyses in engineering often require extensive finite element simulations, limiting their applicability in design optimization, uncertainty quantification, and real-time control. Conventional deep learning surrogates, such as convolutional neural networks (CNNs), physics-informed neural networks (PINNs), and Fourier neural operators (FNOs), face challenges with complex non-parametric three-dimensional (3D) geometries, directionally varying loads, and high-fidelity predictions on unstructured meshes. This work presents Point-DeepONet, an operator-learning-based surrogate that integrates PointNet into the DeepONet framework. By directly processing non-parametric point clouds and incorporating signed distance functions (SDF) for geometric context, Point-DeepONet accurately predicts three-dimensional displacement and von Mises stress fields without mesh parameterization or retraining. Trained using only about 5,000 nodes (2.5% of the original 200,000-node mesh), Point-DeepONet can still predict the entire mesh at high fidelity, achieving a coefficient of determination reaching 0.987 for displacement and 0.923 for von Mises stress under a horizontal load case. Compared to nonlinear finite element analyses that require about 19.32 minutes per case, Point-DeepONet provides predictions in mere seconds (approximately 400 times faster) while maintaining excellent scalability and accuracy with increasing dataset sizes. These findings highlight the potential of Point-DeepONet to enable rapid, high-fidelity structural analyses, ultimately supporting more effective design exploration and informed decision-making in complex engineering workflows.
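
The figures quoted in the abstract can be sanity-checked with a few lines of arithmetic, and the coefficient of determination it reports has a standard definition, R² = 1 - SS_res / SS_tot. The sketch below shows both. The `r_squared` helper is the textbook formula and may differ from how the authors aggregate it over a mesh; the variable names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Numbers taken from the abstract.
sampled_fraction = 5000 / 200000          # share of mesh nodes used in training
fem_seconds = 19.32 * 60                  # finite element time per case, in seconds
implied_surrogate_seconds = fem_seconds / 400  # time implied by a 400x speedup
```

The training subset is 2.5% of the mesh, and a 400-fold speedup over 19.32 minutes implies roughly 2.9 seconds per prediction, which is consistent with the abstract's "mere seconds".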

[AI-16] The Thousand Brains Project: A New Paradigm for Sensorimotor Intelligence

Link: https://arxiv.org/abs/2412.18354
Authors: Viviane Clay, Niels Leadholm, Jeff Hawkins
Keywords: Artificial intelligence, driven primarily, Thousand Brains Project, intelligence has advanced, advanced rapidly
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Artificial intelligence has advanced rapidly in the last decade, driven primarily by progress in the scale of deep-learning systems. Despite these advances, the creation of intelligent systems that can operate effectively in diverse, real-world environments remains a significant challenge. In this white paper, we outline the Thousand Brains Project, an ongoing research effort to develop an alternative, complementary form of AI, derived from the operating principles of the neocortex. We present an early version of a thousand-brains system, a sensorimotor agent that is uniquely suited to quickly learn a wide range of tasks and eventually implement any capabilities the human neocortex has. Core to its design is the use of a repeating computational unit, the learning module, modeled on the cortical columns found in mammalian brains. Each learning module operates as a semi-independent unit that can model entire objects, represents information through spatially structured reference frames, and both estimates and is able to effect movement in the world. Learning is a quick, associative process, similar to Hebbian learning in the brain, and leverages inductive biases around the spatial structure of the world to enable rapid and continual learning. Multiple learning modules can interact with one another both hierarchically and non-hierarchically via a “cortical messaging protocol” (CMP), creating more abstract representations and supporting multimodal integration. We outline the key principles motivating the design of thousand-brains systems and provide details about the implementation of Monty, our first instantiation of such a system. Code can be found at this https URL, along with more detailed documentation at this https URL.

[AI-17] Exploring Graph Mamba: A Comprehensive Survey on State-Space Models for Graph Learning

Link: https://arxiv.org/abs/2412.18322
Authors: Safa Ben Atitallah, Chaima Ben Rabah, Maha Driss, Wadii Boulila, Anis Koubaa
Keywords: Graph Mamba, graph embedding technique, powerful graph embedding, Graph, Mamba
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph Mamba, a powerful graph embedding technique, has emerged as a cornerstone in various domains, including bioinformatics, social networks, and recommendation systems. This survey represents the first comprehensive study devoted to Graph Mamba, to address the critical gaps in understanding its applications, challenges, and future potential. We start by offering a detailed explanation of the original Graph Mamba architecture, highlighting its key components and underlying mechanisms. Subsequently, we explore the most recent modifications and enhancements proposed to improve its performance and applicability. To demonstrate the versatility of Graph Mamba, we examine its applications across diverse domains. A comparative analysis of Graph Mamba and its variants is conducted to shed light on their unique characteristics and potential use cases. Furthermore, we identify potential areas where Graph Mamba can be applied in the future, highlighting its potential to revolutionize data analysis in these fields. Finally, we address the current limitations and open research questions associated with Graph Mamba. By acknowledging these challenges, we aim to stimulate further research and development in this promising area. This survey serves as a valuable resource for both newcomers and experienced researchers seeking to understand and leverage the power of Graph Mamba.

[AI-18] Data-Driven Self-Supervised Graph Representation Learning

Link: https://arxiv.org/abs/2412.18316
Authors: Ahmed E. Samy, Zekarias T. Kefatoa, Sarunas Girdzijauskasa
Keywords: avoid manual labeling, Self-supervised graph representation, Self-supervised graph, manual labeling, reduce or avoid
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Self-supervised graph representation learning (SSGRL) is a representation learning paradigm used to reduce or avoid manual labeling. An essential part of SSGRL is graph data augmentation. Existing methods usually rely on heuristics commonly identified through trial and error and are effective only within some application domains. Also, it is not clear why one heuristic is better than another. Moreover, recent studies have argued against some techniques (e.g., dropout, which can change the properties of molecular graphs or destroy relevant signals for graph-based document classification tasks). In this study, we propose a novel data-driven SSGRL approach that automatically learns a suitable graph augmentation from the signal encoded in the graph (i.e., the nodes' predictive feature and topological information). We propose two complementary approaches that produce learnable feature and topological augmentations. The former learns multi-view augmentation of node features, and the latter learns a high-order view of the topology. Moreover, the augmentations are jointly learned with the representation. Our approach is general in that it can be applied to homogeneous and heterogeneous graphs. We perform extensive experiments on node classification (using nine homogeneous and heterogeneous datasets) and graph property prediction (using another eight datasets). The results show that the proposed method matches or outperforms the SOTA SSGRL baselines and performs similarly to semi-supervised methods. The anonymised source code is available at this https URL

Journal reference: ECAI 2023. IOS Press, 2023. 629-636
Related DOI: https://doi.org/10.3233/FAIA230325

[AI-19] Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies

Link: https://arxiv.org/abs/2412.18296
Authors: Qi Liu, Wanjing Ma
Keywords: poses significant challenges, Data corruption, Data, corruption, poses significant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Data corruption, including missing and noisy data, poses significant challenges in real-world machine learning. This study investigates the effects of data corruption on model performance and explores strategies to mitigate these effects through two experimental setups: supervised learning with NLP tasks (NLP-SL) and deep reinforcement learning for traffic signal optimization (Signal-RL). We analyze the relationship between data corruption levels and model performance, evaluate the effectiveness of data imputation methods, and assess the utility of enlarging datasets to address data corruption. Our results show that model performance under data corruption follows a diminishing return curve, modeled by the exponential function. Missing data, while detrimental, is less harmful than noisy data, which causes severe performance degradation and training instability, particularly in sequential decision-making tasks like Signal-RL. Imputation strategies involve a trade-off: they recover missing information but may introduce noise. Their effectiveness depends on imputation accuracy and corruption ratio. We identify distinct regions in the imputation advantage heatmap, including an “imputation advantageous corner” and an “imputation disadvantageous edge” and classify tasks as “noise-sensitive” or “noise-insensitive” based on their decision boundaries. Furthermore, we find that increasing dataset size mitigates but cannot fully overcome the effects of data corruption. The marginal utility of additional data diminishes as corruption increases. An empirical rule emerges: approximately 30% of the data is critical for determining performance, while the remaining 70% has minimal impact. These findings provide actionable insights into data preprocessing, imputation strategies, and data collection practices, guiding the development of robust machine learning systems in noisy environments. 
Submission history: [v1] Tue, 24 Dec 2024 09:04:06 UTC (12,417 KB)

[AI-20] Pirates of the RAG: Adaptively Attacking LLMs to Leak Knowledge Bases

Link: https://arxiv.org/abs/2412.18295
Authors: Christian Di Maio, Cristian Cosci, Marco Maggini, Valentina Poggioni, Stefano Melacci
Keywords: real-world services triggers, services triggers severe, triggers severe concerns, Large Language Models, private knowledge base
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The growing ubiquity of Retrieval-Augmented Generation (RAG) systems in several real-world services triggers severe concerns about their security. A RAG system improves the generative capabilities of a Large Language Model (LLM) through a retrieval mechanism that operates on a private knowledge base, whose unintended exposure could lead to severe consequences, including breaches of private and sensitive information. This paper presents a black-box attack to force a RAG system to leak its private knowledge base which, unlike existing approaches, is adaptive and automatic. A relevance-based mechanism and an attacker-side open-source LLM favor the generation of effective queries to leak most of the (hidden) knowledge base. Extensive experimentation demonstrates the quality of the proposed algorithm in different RAG pipelines and domains, compared to very recent related approaches, which turn out to be either not fully black-box, not adaptive, or not based on open-source models. The findings from our study underscore the urgent need for more robust privacy safeguards in the design and deployment of RAG systems.

[AI-21] MineStudio: A Streamlined Package for Minecraft AI Agent Development

Link: https://arxiv.org/abs/2412.18293
Authors: Shaofei Cai, Zhancun Mu, Kaichen He, Bowei Zhang, Xinyue Zheng, Anji Liu, Yitao Liang
Keywords: sequential decision-making research, agents remains hindered, significant engineering challenges, decision-making research, valuable testbed
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Minecraft has emerged as a valuable testbed for embodied intelligence and sequential decision-making research, yet the development and validation of novel agents remains hindered by significant engineering challenges. This paper presents MineStudio, an open-source software package designed to streamline embodied policy development in Minecraft. MineStudio represents the first comprehensive integration of seven critical engineering components: simulator, data, model, offline pretraining, online finetuning, inference, and benchmark, thereby allowing users to concentrate their efforts on algorithm innovation. We provide a user-friendly API design accompanied by comprehensive documentation and tutorials. The complete codebase is publicly available at this https URL.

[AI-22] Semi-supervised Credit Card Fraud Detection via Attribute-Driven Graph Representation AAAI2023

Link: https://arxiv.org/abs/2412.18287
Authors: Sheng Xiang, Mingzhi Zhu, Dawei Cheng, Enxia Li, Ruihui Zhao, Yi Ouyang, Ling Chen, Yefeng Zheng
Keywords: Credit card fraud, Credit card, card fraud incurs, issuing banks, incurs a considerable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 9 pages, 5 figures, AAAI 2023, code: this https URL

Abstract:Credit card fraud incurs a considerable cost for both cardholders and issuing banks. Contemporary methods apply machine learning-based classifiers to detect fraudulent behavior from labeled transaction records. However, labeled data are usually a small proportion of billions of real transactions due to expensive labeling costs, which implies that these methods do not fully exploit the many natural features of unlabeled data. Therefore, we propose a semi-supervised graph neural network for fraud detection. Specifically, we leverage transaction records to construct a temporal transaction graph, which is composed of temporal transactions (nodes) and interactions (edges) among them. Then we pass messages among the nodes through a Gated Temporal Attention Network (GTAN) to learn the transaction representation. We further model the fraud patterns through risk propagation among transactions. Extensive experiments are conducted on a real-world transaction dataset and two publicly available fraud detection datasets. The results show that our proposed method, namely GTAN, outperforms other state-of-the-art baselines on three fraud detection datasets. Semi-supervised experiments demonstrate the excellent fraud detection performance of our model with only a tiny proportion of labeled data.

[AI-23] Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization

Link: https://arxiv.org/abs/2412.18279
Authors: Jiacai Liu,Chaojie Wang,Chris Yuhao Liu,Liang Zeng,Rui Yan,Yiwen Sun,Yang Liu,Yahui Zhou
Keywords: reinforcement learning, increasingly significant, role of reinforcement, large language models, enhancing the reasoning
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The role of reinforcement learning (RL) in enhancing the reasoning of large language models (LLMs) is becoming increasingly significant. Despite the success of RL in many scenarios, there are still many challenges in improving the reasoning of LLMs. One challenge is the sparse reward, which makes optimization difficult for RL and necessitates a large number of data samples. Another challenge stems from the inherent instability of RL, particularly when using Actor-Critic (AC) methods to derive optimal policies, which often leads to unstable training processes. To address these issues, we introduce Direct Advantage Policy Optimization (DAPO), a novel step-level offline RL algorithm. Unlike standard alignment methods that rely solely on outcome rewards to optimize policies (such as DPO), DAPO employs a critic function to predict the reasoning accuracy at each step, thereby generating dense signals to refine the generation strategy. Additionally, the Actor and Critic components in DAPO are trained independently, avoiding the co-training instability observed in standard AC algorithms like PPO. We train DAPO on mathematical and code query datasets and then evaluate its performance on multiple benchmarks. Our results show that DAPO can effectively enhance the mathematical and code capabilities of both SFT models and RL models, demonstrating the effectiveness of DAPO.

[AI-24] Annotating References to Mythological Entities in French Literature

Link: https://arxiv.org/abs/2412.18270
Authors: Thierry Poibeau(Lattice)
Keywords: contemporary French literature, Roman and Greek, Greek mythological entities, French literature, contemporary French
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we explore the relevance of large language models (LLMs) for annotating references to Roman and Greek mythological entities in modern and contemporary French literature. We present an annotation scheme and demonstrate that recent LLMs can be directly applied to follow this scheme effectively, although not without occasionally making significant analytical errors. Additionally, we show that LLMs (and, more specifically, ChatGPT) are capable of offering interpretative insights into the use of mythological references by literary authors. However, we also find that LLMs struggle to accurately identify relevant passages in novels (when used as an information retrieval engine), often hallucinating and generating fabricated examples, an issue that raises significant ethical concerns. Nonetheless, when used carefully, LLMs remain valuable tools for performing annotations with high accuracy, especially for tasks that would be difficult to annotate comprehensively on a large scale through manual methods alone.

[AI-25] Robust Semi-Supervised Learning in Open Environments

Link: https://arxiv.org/abs/2412.18256
Authors: Lan-Zhe Guo,Lin-Han Jia,Jie-Jing Shao,Yu-Feng Li
Keywords: unlabeled data, Semi-supervised learning, aims to improve, unlabeled, inconsistent unlabeled data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures

Abstract:Semi-supervised learning (SSL) aims to improve performance by exploiting unlabeled data when labels are scarce. Conventional SSL studies typically assume closed environments where important factors (e.g., label, feature, distribution) between labeled and unlabeled data are consistent. However, more practical tasks involve open environments where important factors between labeled and unlabeled data are inconsistent. It has been reported that exploiting inconsistent unlabeled data causes severe performance degradation, even worse than the simple supervised learning baseline. Manually verifying the quality of unlabeled data is not desirable; therefore, it is important to study robust SSL with inconsistent unlabeled data in open environments. This paper briefly introduces some advances in this line of research, focusing on techniques concerning label, feature, and data distribution inconsistency in SSL, and presents the evaluation benchmarks. Open research problems are also discussed for reference purposes.

[AI-26] Detection and Forecasting of Parkinson Disease Progression from Speech Signal Features Using MultiLayer Perceptron and LSTM

Link: https://arxiv.org/abs/2412.18248
Authors: Majid Ali,Hina Shakir,Asia Samreen,Sohaib Ahmed
Keywords: Accurate diagnosis, Parkinson disease, Parkinson disease detection, Long Short Term, Term Memory LSTM
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Accurate diagnosis of Parkinson disease, especially in its early stages, can be a challenging task. The application of machine learning techniques helps improve the diagnostic accuracy of Parkinson disease detection, but only a few studies have addressed the prediction of disease progression. In this work, a Long Short-Term Memory (LSTM) network was trained on diagnostic features from Parkinson patients' speech signals to predict disease progression, while a Multilayer Perceptron (MLP) was trained on the same diagnostic features to detect the disease. Diagnostic features selected using two well-known feature selection methods, Relief-F and Sequential Forward Selection, and applied to the LSTM and MLP were shown to accurately predict disease progression as stage 2 and 3 and to detect the presence of the disease, respectively.

[AI-27] An Automatic Graph Construction Framework based on Large Language Models for Recommendation

Link: https://arxiv.org/abs/2412.18241
Authors: Rong Shan,Jianghao Lin,Chenxu Zhu,Bo Chen,Menghui Zhu,Kangning Zhang,Jieming Zhu,Ruiming Tang,Yong Yu,Weinan Zhang
Keywords: Graph neural networks, graph construction, neural networks, learn from graph-structured, graph-structured data
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Graph neural networks (GNNs) have emerged as state-of-the-art methods to learn from graph-structured data for recommendation. However, most existing GNN-based recommendation methods focus on the optimization of model structures and learning strategies based on pre-defined graphs, neglecting the importance of the graph construction stage. Earlier works for graph construction usually rely on specific rules or crowdsourcing, which are either too simplistic or too labor-intensive. Recent works have started to utilize large language models (LLMs) to automate the graph construction, in view of their abundant open-world knowledge and remarkable reasoning capabilities. Nevertheless, they generally suffer from two limitations: (1) invisibility of global view (e.g., overlooking contextual information) and (2) construction inefficiency. To this end, we introduce AutoGraph, an automatic graph construction framework based on LLMs for recommendation. Specifically, we first use LLMs to infer the user preference and item knowledge, which is encoded as semantic vectors. Next, we employ vector quantization to extract the latent factors from the semantic vectors. The latent factors are then incorporated as extra nodes to link the user/item nodes, resulting in a graph with in-depth global-view semantics. We further design metapath-based message aggregation to effectively aggregate the semantic and collaborative information. The framework is model-agnostic and compatible with different backbone models. Extensive experiments on three real-world datasets demonstrate the efficacy and efficiency of AutoGraph compared to existing baseline methods. We have deployed AutoGraph in the Huawei advertising platform and gained a 2.69% improvement on RPM and a 7.31% improvement on eCPM in the online A/B test. Currently AutoGraph has been used as the main traffic model, serving hundreds of millions of people.
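
The quantization step mentioned in the abstract above can be pictured with a minimal sketch. It assumes a plain nearest-neighbour lookup against a fixed codebook, which the abstract does not confirm as AutoGraph's actual quantizer. The names `quantize` and `build_edges`, the `factor_N` node labels, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean)."""
    distances = [math.dist(vector, code) for code in codebook]
    return distances.index(min(distances))


def build_edges(semantic_vectors, codebook):
    """Link each user/item node to the latent-factor node picked by quantization."""
    edges = []
    for node_id, vector in semantic_vectors.items():
        factor = quantize(vector, codebook)
        edges.append((node_id, f"factor_{factor}"))
    return edges


# Toy data, made up for illustration (not from the paper).
codebook = [(0.0, 0.0), (1.0, 1.0)]
semantic_vectors = {
    "user_a": (0.1, 0.2),
    "user_b": (0.9, 0.8),
    "item_x": (0.2, 0.0),
}
edges = build_edges(semantic_vectors, codebook)
```

In this toy setup, nodes whose semantic vectors fall near the same codebook entry end up connected through a shared latent-factor node, which is the sense in which the extra nodes carry a global view.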

[AI-28] Sharper Error Bounds in Late Fusion Multi-view Clustering Using Eigenvalue Proportion

Link: https://arxiv.org/abs/2412.18207
Authors: Liang Du,Henghui Jiang,Xiaodong Li,Yiqing Guo,Yan Chen,Feijiang Li,Peng Zhou,Yuhua Qian
Keywords: integrate complementary information, Late Fusion Multi-View, Fusion Multi-View Clustering, aims to integrate, integrate complementary
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-view clustering (MVC) aims to integrate complementary information from multiple views to enhance clustering performance. Late Fusion Multi-View Clustering (LFMVC) has shown promise by synthesizing diverse clustering results into a unified consensus. However, current LFMVC methods struggle with noisy and redundant partitions and often fail to capture high-order correlations across views. To address these limitations, we present a novel theoretical framework for analyzing the generalization error bounds of multiple kernel $k$-means, leveraging local Rademacher complexity and principal eigenvalue proportions. Our analysis establishes a convergence rate of $\mathcal{O}(1/n)$, significantly improving upon the existing rate in the order of $\mathcal{O}(\sqrt{k/n})$. Building on this insight, we propose a low-pass graph filtering strategy within a multiple linear $k$-means framework to mitigate noise and redundancy, further refining the principal eigenvalue proportion and enhancing clustering accuracy. Experimental results on benchmark datasets confirm that our approach outperforms state-of-the-art methods in clustering performance and robustness. The related code is available at this https URL.

[AI-29] Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation

Link: https://arxiv.org/abs/2412.18176
Authors: Yucong Luo,Qitao Qin,Hao Zhang,Mingyue Cheng,Ruiran Yan,Kefan Wang,Jie Ouyang
Keywords: deep learning approaches, systems have evolved, past decade, collaborative filtering, Sequential recommendation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sequential recommendation (SR) systems have evolved significantly over the past decade, transitioning from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LLMs). While the adoption of LLMs has driven substantial advancements, these models inherently lack collaborative filtering information and rely primarily on textual content data, neglecting other modalities and thus failing to achieve optimal recommendation performance. To address this limitation, we propose Molar, a Multimodal large language sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. Molar employs an MLLM to generate unified item representations from both textual and non-textual data, facilitating comprehensive multimodal modeling and enriching item embeddings. Additionally, it incorporates collaborative filtering signals through a post-alignment mechanism, which aligns user representations from content-based and ID-based models, ensuring precise personalization and robust performance. By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy. Extensive experiments validate that Molar significantly outperforms traditional and LLM-based baselines, highlighting its strength in utilizing multimodal data and collaborative signals for sequential recommendation tasks. The source code is available at this https URL.

[AI-30] INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent

Link: https://arxiv.org/abs/2412.18174
Authors: Haohang Li,Yupeng Cao,Yangyang Yu,Shashidhar Reddy Javaji,Zhiyang Deng,Yueru He,Yuechen Jiang,Zining Zhu,Koduvayur Subbalakshmi,Guojun Xiong,Jimin Huang,Lingfei Qian,Xueqing Peng,Qianqian Xie,Jordan W. Suchow
Keywords: Recent advancements, large language model, advancements have underscored, underscored the potential, potential of large
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
Comments:

Abstract:Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities like stocks, cryptocurrencies and exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source, multi-modal datasets and developed a comprehensive suite of environments for financial decision-making. This establishes a highly accessible platform for evaluating financial agents' performance across various scenarios.

[AI-31] KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management

Link: https://arxiv.org/abs/2412.18169
Authors: Rongxin Cheng,Yifan Peng,Yuxin Lai,Xingda Wei,Rong Chen,Haibo Chen
Keywords: serving can easily throttle, easily throttle precious, load burst or long-generation, reasoning, causing latency spikes, latency spikes due
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:The stateful nature of large language model (LLM) serving can easily throttle precious GPU memory under load bursts or long-generation requests like chain-of-thought reasoning, causing latency spikes due to queuing incoming requests. However, state-of-the-art KVCache-centric approaches handle load spikes by dropping, migrating, or swapping KVCache, which faces an essential tradeoff between the performance of ongoing vs. incoming requests and thus still severely this http URL paper makes a key observation such that model parameters are independent of the requests and are replicated across GPUs, and thus proposes a parameter-centric approach by selectively dropping replicated parameters to leave precious memory for requests. However, LLM requires KVCache to be saved in bound with model parameters and thus dropping parameters can cause either huge computation waste or long network delay, affecting all ongoing requests. Based on the observation that attention operators can be decoupled from other operators, this paper further proposes a novel remote attention mechanism through pipeline parallelism so as to serve upcoming requests with the additional memory borrowed from parameters on remote GPUs. This paper further addresses several other challenges including lively exchanging KVCache with incomplete parameters, generating an appropriate plan that balances memory requirements with cooperative execution overhead, and seamlessly restoring parameters when the throttling has gone. Evaluations show that KunServe reduces the tail TTFT of requests under throttling by up to 27.3x compared to the state-of-the-art.

[AI-32] VISION: A Modular AI Assistant for Natural Human-Instrument Interaction at Scientific User Facilities

Link: https://arxiv.org/abs/2412.18161
Authors: Shray Mathur,Noah van der Vleuten,Kevin Yager,Esther Tsai
Keywords: Scientific user facilities, wide array, array of hardware, hardware and software, software tools
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Scientific user facilities, such as synchrotron beamlines, are equipped with a wide array of hardware and software tools that require a codebase for human-computer interaction. This often requires developers to be involved in establishing a connection between users/researchers and the complex instrumentation. The advent of generative AI presents an opportunity to bridge this knowledge gap, enabling seamless communication and efficient experimental workflows. Here we present a modular architecture for the Virtual Scientific Companion (VISION) by assembling multiple AI-enabled cognitive blocks that each scaffold large language models (LLMs) for a specialized task. With VISION, we performed LLM-based operation on the beamline workstation with low latency and demonstrated the first voice-controlled experiment at an X-ray scattering beamline. The modular and scalable architecture allows for easy adaptation to new instruments and capabilities. Development on natural language-based scientific experimentation is a building block for an impending future where a science exocortex – a synthetic extension to the cognition of scientists – may radically transform scientific practice and discovery.

[AI-33] Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance

Link: https://arxiv.org/abs/2412.18157
Authors: Yaoyun Zhang,Xuenan Xu,Mengyue Wu
Keywords: producing Foley sound, producing Foley, Foley sound, task has drawn, drawn attention
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The video-to-audio (V2A) generation task has drawn attention in the field of multimedia due to the practicality in producing Foley sound. Semantic and temporal conditions are fed to the generation model to indicate sound events and temporal occurrence. Recent studies on synthesizing immersive and synchronized audio are faced with challenges on videos with moving visual presence. The temporal condition is not accurate enough while low-resolution semantic condition exacerbates the problem. To tackle these challenges, we propose Smooth-Foley, a V2A generative model taking semantic guidance from the textual label across the generation to enhance both semantic and temporal alignment in audio. Two adapters are trained to leverage pre-trained text-to-audio generation models. A frame adapter integrates high-resolution frame-wise video features while a temporal adapter integrates temporal conditions obtained from similarities of visual frames and textual labels. The incorporation of semantic guidance from textual labels achieves precise audio-video alignment. We conduct extensive quantitative and qualitative experiments. Results show that Smooth-Foley performs better than existing models on both continuous sound scenarios and general scenarios. With semantic guidance, the audio generated by Smooth-Foley exhibits higher quality and better adherence to physical laws.

[AI-34] Exact Acceleration of Subgraph Graph Neural Networks by Eliminating Computation Redundancy

Link: https://arxiv.org/abs/2412.18125
Authors: Qian Tao,Xiyuan Wang,Muhan Zhang,Shuxian Hu,Wenyuan Yu,Jingren Zhou
Keywords: Graph neural networks, subgraph GNNs, neural networks, Graph, subgraph
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph neural networks (GNNs) have become a prevalent framework for graph tasks. Many recent studies have proposed the use of graph convolution methods over the numerous subgraphs of each graph, a concept known as subgraph graph neural networks (subgraph GNNs), to enhance GNNs' ability to distinguish non-isomorphic graphs. To maximize the expressiveness, subgraph GNNs often require each subgraph to have equal size to the original graph. Despite their impressive performance, subgraph GNNs face challenges due to the vast number and large size of subgraphs, which lead to a surge in training data, resulting in both storage and computational inefficiencies. In response to this problem, this paper introduces Ego-Nets-Fit-All (ENFA), a model that uniformly takes the smaller ego nets as subgraphs, thereby providing greater storage and computational efficiency, while at the same time guaranteeing outputs identical to those of the original subgraph GNNs, even those that take the whole graph as subgraphs. The key is to identify and eliminate the redundant computation among subgraphs. For example, a node v_i may appear in multiple subgraphs but is far away from all of their centers (the unsymmetric part between subgraphs). Therefore, its first few rounds of message passing within each subgraph can be computed once in the original graph instead of being computed multiple times within each subgraph. Such a strategy enables ENFA to accelerate subgraph GNNs in an exact way, unlike previous sampling approaches that often lose performance. Extensive experiments across various datasets reveal that, compared with conventional subgraph GNNs, ENFA can reduce storage space by 29.0% to 84.5% and improve training efficiency by up to 1.66x.

[AI-35] AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation

Link: https://arxiv.org/abs/2412.18116
Authors: Hao Wen,Shizuo Tian,Borislav Pavlov,Wenjie Du,Yixuan Li,Ge Chang,Shanhui Zhao,Jiacheng Liu,Yunxin Liu,Ya-Qin Zhang,Yuanchun Li
Keywords: long-standing research field, arbitrary natural language, complete arbitrary natural, Large language models, powerful large models
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures

Abstract:Large language models (LLMs) have brought exciting new advances to mobile UI agents, a long-standing research field that aims to complete arbitrary natural language tasks through mobile UI interactions. However, existing UI agents usually demand high reasoning capabilities of powerful large models that are difficult to be deployed locally on end-users’ devices, which raises huge concerns about user privacy and centralized serving cost. One way to reduce the required model size is to customize a smaller domain-specific model with high-quality training data, e.g. large-scale human demonstrations of diverse types of apps and tasks, while such datasets are extremely difficult to obtain. Inspired by the remarkable coding abilities of recent small language models (SLMs), we propose to convert the UI task automation problem to a code generation problem, which can be effectively solved by an on-device SLM and efficiently executed with an on-device code interpreter. Unlike normal coding tasks that can be extensively pretrained with public datasets, generating UI automation code is challenging due to the diversity, complexity, and variability of target apps. Therefore, we adopt a document-centered approach that automatically builds fine-grained API documentation for each app and generates diverse task samples based on this documentation. By guiding the agent with the synthetic documents and task samples, it learns to generate precise and efficient scripts to complete unseen tasks. Based on detailed comparisons with state-of-the-art mobile UI agents, our approach effectively improves the mobile task automation with significantly higher success rates and lower latency/token consumption. Code will be open-sourced.

[AI-36] AIGT: AI Generative Table Based on Prompt

Link: https://arxiv.org/abs/2412.18111
Authors: Mingming Zhang,Zhiqing Xiao,Guoshan Lu,Sai Wu,Weiqiang Wang,Xing Fu,Can Yi,Junbo Zhao
Keywords: enterprise data assets, Tabular data, data, data assets, Tabular
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Tabular data, which accounts for over 80% of enterprise data assets, is vital in various fields. With growing concerns about privacy protection and data-sharing restrictions, generating high-quality synthetic tabular data has become essential. Recent advancements show that large language models (LLMs) can effectively generate realistic tabular data by leveraging semantic information and overcoming the challenges of high-dimensional data that arise from one-hot encoding. However, current methods do not fully utilize the rich information available in tables. To address this, we introduce AI Generative Table (AIGT) based on prompt enhancement, a novel approach that utilizes metadata information, such as table descriptions and schemas, as prompts to generate ultra-high quality synthetic data. To overcome the token limit constraints of LLMs, we propose long-token partitioning algorithms that enable AIGT to model tables of any scale. AIGT achieves state-of-the-art performance on 14 out of 20 public datasets and two real industry datasets within the Alipay risk control system.

[AI-37] SlimGPT: Layer-wise Structured Pruning for Large Language Models

Link: https://arxiv.org/abs/2412.18110
Authors: Gui Ling,Ziyang Wang,Yuliang Yan,Qingwen Liu
Keywords: Large language models, garnered significant attention, vast parameter scales, Large language, parameter scales present
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have garnered significant attention for their remarkable capabilities across various domains, whose vast parameter scales present challenges for practical deployment. Structured pruning is an effective method to balance model performance with efficiency, but performance restoration under computational resource constraints is a principal challenge in pruning LLMs. Therefore, we present a low-cost and fast structured pruning method for LLMs named SlimGPT based on the Optimal Brain Surgeon framework. We propose Batched Greedy Pruning for rapid and near-optimal pruning, which enhances the accuracy of head-wise pruning error estimation through grouped Cholesky decomposition and improves the pruning efficiency of FFN via Dynamic Group Size, thereby achieving approximate local optimal pruning results within one hour. Besides, we explore the limitations of layer-wise pruning from the perspective of error accumulation and propose Incremental Pruning Ratio, a non-uniform pruning strategy to reduce performance degradation. Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results.

[AI-38] Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

Link: https://arxiv.org/abs/2412.18106
Authors: Mingcong Song,Xinru Tang,Fengfan Hou,Jing Li,Wei Wei,Yipeng Ma,Runqiu Xiao,Hongjie Si,Dingcheng Jiang,Shouyi Yin,Yang Hu,Guoping Long
Keywords: Meeting growing demands, production-grade large language, requires integrating advanced, large language model, advanced optimization techniques
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, dynamic and unpredictable input-output lengths of LLM, compounded by these optimizations, exacerbate the issues of workload variability, making it difficult to maintain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, we introduce XY-Serve, a versatile, Ascend native, end-to-end production LLM-serving system. The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into unified, hardware-friendly, fine-grained meta primitives. For attention, we propose a meta-kernel that computes the basic pattern of matmul-softmax-matmul with architectural-aware tile sizes. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes. XY-Serve sits harmoniously with vLLM. Experimental results show up to 89% end-to-end throughput improvement compared with current publicly available baselines on Ascend NPUs. Additionally, our approach outperforms existing GEMM (average 14.6% faster) and attention (average 21.5% faster) kernels relative to existing libraries. While the work is Ascend native, we believe the approach can be readily applicable to SIMT architectures as well.
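
The idea of serving dynamic shapes with a small set of fixed tile sizes, as the abstract above describes for GEMM, can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the tile sizes, the helper names `pick_tile`, `padded_rows`, and `matmul_padded`, and above all the fact that this version physically appends zero rows, whereas the "virtual" padding in XY-Serve presumably avoids materializing them. It is not the system's actual kernel logic.

```python
# Hypothetical fixed tile sizes; the abstract does not list the real ones.
TILE_SIZES = (16, 32, 64, 128)


def pick_tile(rows, tile_sizes=TILE_SIZES):
    """Smallest fixed tile that covers `rows`; fall back to the largest tile."""
    for size in tile_sizes:
        if rows <= size:
            return size
    return tile_sizes[-1]


def padded_rows(rows, tile_sizes=TILE_SIZES):
    """Row count after rounding up to a multiple of the chosen tile size."""
    tile = pick_tile(rows, tile_sizes)
    return -(-rows // tile) * tile  # ceiling division, then scale back up


def matmul_padded(a, b):
    """Multiply a (m x k) by b (k x n): pad m up to a tile multiple, then trim."""
    m, k, n = len(a), len(b), len(b[0])
    target = padded_rows(m)
    # Zero rows contribute nothing to the real outputs and are discarded below.
    a_pad = a + [[0.0] * k for _ in range(target - m)]
    out = [
        [sum(a_pad[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(target)
    ]
    return out[:m]
```

The point of the sketch is only that a kernel compiled for a handful of fixed shapes can still return the exact result for an arbitrary row count, because the padded rows never reach the caller.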

[AI-39] EvoPat: A Multi-LLM-based Patents Summarization and Analysis Agent

Link: https://arxiv.org/abs/2412.18100
Authors: Suyuan Wang,Xueqian Yin,Menghao Wang,Ruofeng Guo,Kai Nan
Keywords: patents filed annually, filed annually, rapid growth, techniques and knowledge, knowledge is reflected
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 2 figures, 8 tables

Abstract:The rapid growth of scientific techniques and knowledge is reflected in the exponential increase in new patents filed annually. While these patents drive innovation, they also present a significant burden for researchers and engineers, especially newcomers. To avoid the tedious work of navigating a vast and complex landscape to identify trends and breakthroughs, researchers urgently need efficient tools to summarize, evaluate, and contextualize patents, revealing their innovative contributions and underlying scientific this http URL address this need, we present EvoPat, a multi-LLM-based patent agent designed to assist users in analyzing patents through Retrieval-Augmented Generation (RAG) and advanced search strategies. EvoPat leverages multiple Large Language Models (LLMs), each performing specialized roles such as planning, identifying innovations, and conducting comparative evaluations. The system integrates data from local databases, including patents, literature, product catalogs, and company repositories, and online searches to provide up-to-date insights. The system can also automatically collect information that is not included in the original database. Through extensive testing in the natural language processing (NLP) domain, we demonstrate that EvoPat outperforms GPT-4 in tasks such as patent summarization, comparative analysis, and technical evaluation. EvoPat represents a significant step toward creating AI-powered tools that empower researchers and engineers to efficiently navigate the complexities of the patent landscape.

[AI-40] An Attention-based Framework with Multistation Information for Earthquake Early Warnings

Link: https://arxiv.org/abs/2412.18099
Authors: Yu-Ming Huang,Kuan-Yu Chen,Wen-Wei Lin,Da-Yi Chen
Keywords: play crucial roles, systems play crucial, warning systems play, early warning systems, Earthquake early warning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
Comments:

Abstract:Earthquake early warning systems play crucial roles in reducing the risk of seismic disasters. Previously, the dominant modeling approach was single-station models. Such models digest signal data received at a given station and predict earthquake parameters, such as the p-phase arrival time, intensity, and magnitude at that location. Various methods have demonstrated adequate performance. However, most of these methods have difficulty speeding up the alarm time, providing early warning for distant areas, and considering global information to enhance performance. Recently, deep learning has significantly impacted many fields, including seismology. Thus, this paper proposes a deep learning-based framework, called SENSE, for the intensity prediction task of earthquake early warning systems. To explicitly consider global information from a regional or national perspective, the input to SENSE comprises statistics from a set of stations in a given region or country. The SENSE model is designed to learn the relationships among the set of input stations and the locality-specific characteristics of each station. Thus, SENSE is not only expected to provide more reliable forecasts by considering multistation data but also has the ability to provide early warnings to distant areas that have not yet received signals. This study conducted extensive experiments on datasets from Taiwan and Japan. The results revealed that SENSE can deliver competitive or even better performance compared with other state-of-the-art methods.

[AI-41] Real-world Deployment and Evaluation of PErioperative AI CHatbot (PEACH) – a Large Language Model Chatbot for Perioperative Medicine

Link: https://arxiv.org/abs/2412.18096
Authors: Yu He Ke,Liyuan Jin,Kabilan Elangovan,Bryan Wen Xi Ong,Chin Yang Oh,Jacqueline Sim,Kenny Wei-Tsen Loh,Chai Rick Soh,Jonathan Ming Hua Cheng,Aaron Kwang Yang Lee,Daniel Shu Wei Ting,Nan Liu,Hairil Rizal Abdullah
Keywords: Large Language Models, Large Language, Language Models, domain-specific tasks, Technology Acceptance Model
Subjects: Artificial Intelligence (cs.AI)
Comments: 21 pages, 3 figures, 1 graphical abstract

Abstract:Large Language Models (LLMs) are emerging as powerful tools in healthcare, particularly for complex, domain-specific tasks. This study describes the development and evaluation of the PErioperative AI CHatbot (PEACH), a secure LLM-based system integrated with local perioperative guidelines to support preoperative clinical decision-making. PEACH was embedded with 35 institutional perioperative protocols in the secure Claude 3.5 Sonnet LLM framework within Pair Chat (developed by Singapore Government) and tested in a silent deployment with real-world data. Accuracy, safety, and usability were assessed. Deviations and hallucinations were categorized based on potential harm, and user feedback was evaluated using the Technology Acceptance Model (TAM). Updates were made after the initial silent deployment to amend one protocol. In 240 real-world clinical iterations, PEACH achieved a first-generation accuracy of 97.5% (78/80) and an overall accuracy of 96.7% (232/240) across three iterations. The updated PEACH demonstrated improved accuracy of 97.9% (235/240), with a statistically significant difference from the null hypothesis of 95% accuracy (p = 0.018, 95% CI: 0.952-0.991). Minimal hallucinations and deviations were observed (1/240 and 2/240, respectively). Clinicians reported that PEACH expedited decisions in 95% of cases, and inter-rater reliability ranged from kappa 0.772-0.893 within PEACH and 0.610-0.784 among attendings. PEACH is an accurate, adaptable tool that enhances consistency and efficiency in perioperative decision-making. Future research should explore its scalability across specialties and its impact on clinical outcomes.
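The reported accuracy figures can be checked with a short calculation. The sketch below assumes the significance test was a one-sided exact binomial test against a 95% accuracy null; the abstract does not name the test, so `binom_upper_tail` is an illustrative helper and not necessarily the authors' procedure.

```python
from math import comb


def binom_upper_tail(successes: int, n: int, p0: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


# Counts taken from the abstract.
first_generation = 78 / 80   # first-generation accuracy
overall = 232 / 240          # accuracy across three iterations
updated = 235 / 240          # accuracy after the protocol update

# Assumption: one-sided exact binomial test against a 95% accuracy null.
p_value = binom_upper_tail(235, 240, 0.95)
print(f"updated accuracy = {updated:.3f}, one-sided p = {p_value:.3f}")
```

Under that assumption the tail probability comes out close to the reported p = 0.018, which suggests (but does not confirm) that a test of this kind was used.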

[AI-42] BRIDGE: Bundle Recommendation via Instruction-Driven Generation

Link: https://arxiv.org/abs/2412.18092
Authors: Tuan-Nghia Bui,Huy-Son Nguyen,Cam-Van Nguyen Thi,Hoang-Quynh Le,Duc-Trong Le
Keywords: aims to suggest, suggest a set, set of interconnected, Bundle recommendation aims, Bundle recommendation
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Bundle recommendation aims to suggest a set of interconnected items to users. However, diverse interaction types and sparse interaction matrices often pose challenges for previous approaches in accurately predicting user-bundle adoptions. Inspired by the distant supervision strategy and generative paradigm, we propose BRIDGE, a novel framework for bundle recommendation. It consists of two main components, namely the correlation-based item clustering and the pseudo bundle generation modules. Inspired by the distant supervision approach, the former generates more auxiliary information, e.g., instructive item clusters, for training without using external data. This information is subsequently aggregated with collaborative signals from user historical interactions to create pseudo 'ideal' bundles. This capability allows BRIDGE to explore all aspects of bundles, rather than being limited to existing real-world bundles. It effectively bridges the gap between user imagination and predefined bundles, hence improving the bundle recommendation performance. Experimental results validate the superiority of our models over state-of-the-art ranking-based methods across five benchmark datasets.

[AI-43] AutoSculpt: A Pattern-based Model Auto-pruning Framework Using Reinforcement Learning and Graph Learning

Link: https://arxiv.org/abs/2412.18091
Authors: Lixian Jing,Jianpeng Qi,Junyu Dong,Yanwei Yu
Keywords: constrained computational resources, deep neural networks, neural networks, edge devices, resources is critical
Subjects: Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures

Abstract:As deep neural networks (DNNs) are increasingly deployed on edge devices, optimizing models for constrained computational resources is critical. Existing auto-pruning methods face challenges due to the diversity of DNN models, various operators (e.g., filters), and the difficulty in balancing pruning granularity with model accuracy. To address these limitations, we introduce AutoSculpt, a pattern-based automated pruning framework designed to enhance efficiency and accuracy by leveraging graph learning and deep reinforcement learning (DRL). AutoSculpt automatically identifies and prunes regular patterns within DNN architectures that can be recognized by existing inference engines, enabling runtime acceleration. The three key steps in AutoSculpt are: (1) constructing DNNs as graphs to encode their topology and parameter dependencies, (2) embedding computationally efficient pruning patterns, and (3) utilizing DRL to iteratively refine auto-pruning strategies until the optimal balance between compression and accuracy is achieved. Experimental results demonstrate the effectiveness of AutoSculpt across various architectures, including ResNet, MobileNet, VGG, and Vision Transformer, achieving pruning rates of up to 90% and nearly 18% improvement in FLOPs reduction, outperforming all baselines. The code is available at this https URL

[AI-44] Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models

Link: https://arxiv.org/abs/2412.18084
Authors: Xuan Lin,Long Chen,Yile Wang,Xiangxiang Zeng,Philip S. Yu
Keywords: Large language models, natural language processing, Large language, language processing tasks, natural language
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are widely applied in various natural language processing tasks such as question answering and machine translation. However, due to the lack of labeled data and the difficulty of manual annotation for biochemical properties, the performance for molecule generation tasks is still limited, especially for tasks involving multi-property constraints. In this work, we present a two-step framework PEIT (Property Enhanced Instruction Tuning) to improve LLMs for molecular-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN, by aligning multi-modal representations to synthesize instruction data. In the second step, we fine-tune existing open-source LLMs with the synthesized data; the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks. Experimental results show that our pre-trained PEIT-GEN outperforms MolT5 and BioT5 in molecule captioning, demonstrating that the modalities align well between textual descriptions, structures, and biochemical properties. Furthermore, PEIT-LLM shows promising improvements in multi-task molecule generation, proving the scalability of the PEIT framework for various molecular tasks. We release the code, constructed instruction data, and model checkpoints at this https URL.

[AI-45] Prompt Tuning for Item Cold-start Recommendation

Link: https://arxiv.org/abs/2412.18082
Authors: Yuezihan Jiang,Gaode Chen,Wenhan Zhang,Jingchi Wang,Yinjie Jiang,Qi Zhang,Jingjian Lin,Peng Jiang,Kaigui Bian
Keywords: cold-start phase determines, online recommender systems, positive feedback, recommender systems, crucial for online
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The item cold-start problem is crucial for online recommender systems, as the success of the cold-start phase determines whether items can transition into popular ones. Prompt learning, a powerful technique used in natural language processing (NLP) to address zero- or few-shot problems, has been adapted for recommender systems to tackle similar challenges. However, existing methods typically rely on content-based properties or text descriptions for prompting, which we argue may be suboptimal for cold-start recommendations due to 1) semantic gaps with recommender tasks, and 2) model bias caused by warm-up items contributing most of the positive feedback to the model, which is the core of the cold-start problem and hinders the recommender quality on cold-start items. We propose to leverage high-value positive feedback, termed pinnacle feedback, as prompt information to simultaneously resolve the above two problems. We experimentally show that, compared to the content description proposed in existing works, positive feedback is more suitable to serve as prompt information by bridging the semantic gaps. Besides, we propose item-wise personalized prompt networks to encode pinnacle feedback and relieve the model bias caused by the positive feedback dominance problem. Extensive experiments on four real-world datasets demonstrate the superiority of our model over state-of-the-art methods. Moreover, PROMO has been successfully deployed on a popular short-video sharing platform, a billion-user scale commercial short-video application, achieving remarkable performance gains across various commercial metrics within cold-start scenarios.

[AI-46] Understanding Artificial Neural Networks Behavior from Neuron Activation Perspective

Link: https://arxiv.org/abs/2412.18073
Authors: Yizhou Zhang,Yang Sui
Keywords: neuron activation dynamics, neuron activation, deep neural networks, paper explores, explores the intricate
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper explores the intricate behavior of deep neural networks (DNNs) through the lens of neuron activation dynamics. We propose a probabilistic framework that can analyze models' neuron activation patterns as a stochastic process, uncovering theoretical insights into neural scaling laws, such as over-parameterization and the power-law decay of loss with respect to dataset size. By deriving key mathematical relationships, we show that the number of activated neurons increases in the form of $N\left(1-\left(\frac{bN}{D+bN}\right)^{b}\right)$, and that the neuron activation should follow a power-law distribution. Based on these two mathematical results, we demonstrate how DNNs maintain generalization capabilities even under over-parameterization, and we elucidate the phase transition phenomenon observed in loss curves when dataset size is plotted on a log axis (i.e., the data magnitude increases linearly). Moreover, by combining the above two phenomena and the power-law distribution of neuron activation, we derive the power-law decay of a neural network's loss function as the data size scale increases. Furthermore, our analysis bridges the gap between empirical observations and theoretical underpinnings, offering experimentally testable predictions regarding parameter efficiency and model compressibility. These findings provide a foundation for understanding neural network scaling and present new directions for optimizing DNN performance.
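To get a feel for the activated-neuron expression, the sketch below evaluates it numerically, reading the formula as N(1 - (bN / (D + bN))^b). The abstract does not define the symbols; the interpretation of N as the neuron count, D as the dataset size, and b as a positive constant is an assumption, and `activated_neurons` is a hypothetical helper name.

```python
def activated_neurons(N: float, D: float, b: float) -> float:
    """Evaluate N * (1 - (b*N / (D + b*N))**b).

    Assumed meanings (not stated in the abstract): N = number of neurons,
    D = dataset size, b = a positive constant.
    """
    return N * (1.0 - (b * N / (D + b * N)) ** b)


# With no data the ratio is 1, so no neurons count as activated;
# as D grows the ratio shrinks toward 0 and the value approaches N.
for D in (0, 1_000, 1_000_000):
    print(D, activated_neurons(N=1_000, D=D, b=1.0))
```

Under this reading the expression rises monotonically from 0 toward N as D increases, which is consistent with the saturation behavior the abstract describes.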

[AI-47] Beyond Gradient Averaging in Parallel Optimization: Improved Robustness through Gradient Agreement Filtering

Link: https://arxiv.org/abs/2412.18052
Authors: Francois Chaubard,Duncan Eddy,Mykel J. Kochenderfer
Keywords: deep learning optimization, Gradient Agreement Filtering, distributed deep learning, introduce Gradient Agreement, Agreement Filtering
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce Gradient Agreement Filtering (GAF) to improve on gradient averaging in distributed deep learning optimization. Traditional distributed data-parallel stochastic gradient descent involves averaging gradients of microbatches to calculate a macrobatch gradient that is then used to update model parameters. We find that gradients across microbatches are often orthogonal or negatively correlated, especially in late stages of training, which leads to memorization of the training set, reducing generalization. In this paper, we introduce a simple, computationally effective way to reduce gradient variance by computing the cosine distance between micro-gradients during training and filtering out conflicting updates prior to averaging. We improve validation accuracy with significantly smaller microbatch sizes. We also show that this reduces the memorization of noisy labels. We demonstrate the effectiveness of this technique on standard image classification benchmarks including CIFAR-100 and CIFAR-100N-Fine. We show that this technique consistently improves validation accuracy, in some cases by up to 18.2% compared to traditional training approaches, while reducing the computation required by nearly an order of magnitude because we can now rely on smaller microbatch sizes without destabilizing training.
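The filtering idea can be illustrated with a small, dependency-free sketch. It is one plausible reading of the abstract, not the authors' implementation: micro-gradients are folded into a running average one at a time and skipped when their cosine distance to that average exceeds a threshold. The function names and the threshold value `0.97` are assumptions made for illustration, and gradients are assumed to be non-zero vectors.

```python
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity; 0 = same direction, 1 = orthogonal, 2 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def filtered_average(micro_grads, max_distance=0.97):
    """Average micro-gradients, skipping those that disagree with the running mean.

    Returns the averaged gradient and the number of micro-gradients kept.
    """
    running = list(micro_grads[0])
    kept = 1
    for grad in micro_grads[1:]:
        if cosine_distance(running, grad) <= max_distance:
            # Incremental mean update over the gradients accepted so far.
            running = [(r * kept + g) / (kept + 1) for r, g in zip(running, grad)]
            kept += 1
    return running, kept


grads = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]  # the last one points the opposite way
avg, kept = filtered_average(grads)
print(avg, kept)
```

With a threshold just below 1, both orthogonal and negatively correlated micro-gradients are dropped, matching the kinds of conflict the abstract singles out. Plain averaging of the three toy gradients would nearly cancel the first component.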

[AI-48] Fair Knowledge Tracing in Second Language Acquisition

Link: https://arxiv.org/abs/2412.18048
Authors: Weitao Tang,Guanliang Chen,Shuaishuai Zu,Jiangyi Luo
Keywords: attracting significant research, significant research attention, modeling aids educators, implementing diverse teaching, predictive modeling aids
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:In second-language acquisition, predictive modeling aids educators in implementing diverse teaching strategies, attracting significant research attention. However, while model accuracy is widely explored, model fairness remains under-examined. Model fairness ensures equitable treatment of groups, preventing unintentional biases based on attributes such as gender, ethnicity, or economic background. A fair model should produce impartial outcomes that do not systematically disadvantage any group. This study evaluates the fairness of two predictive models using the Duolingo dataset's en_es (English learners speaking Spanish), es_en (Spanish learners speaking English), and fr_en (French learners speaking English) tracks. We analyze (1) algorithmic fairness across platforms (iOS, Android, Web) and (2) algorithmic fairness between developed and developing countries. Key findings include: (1) deep learning outperforms machine learning in second-language knowledge tracing due to improved accuracy and fairness; (2) both models favor mobile users over non-mobile users; (3) machine learning exhibits stronger bias against developing countries compared to deep learning; and (4) deep learning strikes a better balance of fairness and accuracy in the en_es and es_en tracks, while machine learning is more suitable for fr_en. This study highlights the importance of addressing fairness in predictive models to ensure equitable educational strategies across platforms and regions.
Submission history: [v1] Mon, 23 Dec 2024 23:47:40 UTC (6,752 KB)
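The abstract does not say which fairness metric was used. As a minimal sketch of one common check, the code below compares prediction accuracy across groups (for example, client platforms) and reports the largest gap. The helper names and the toy records are hypothetical and are not drawn from the Duolingo data.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Largest difference in accuracy between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy, made-up records: (platform, true label, predicted label).
records = [
    ("ios", 1, 1), ("ios", 0, 0),
    ("android", 1, 1), ("android", 0, 1),
    ("web", 1, 0), ("web", 0, 1),
]
per_group = accuracy_by_group(records)
gap = max_accuracy_gap(per_group)
print(per_group, gap)
```

A gap near zero means the model is about equally accurate for every group. A large gap, as in this toy data, is the kind of disparity the study's platform and country comparisons look for.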

[AI-49] Uncertainty-Aware Critic Augmentation for Hierarchical Multi-Agent EV Charging Control

Link: https://arxiv.org/abs/2412.18047
Authors: Lo Pang-Yun Ting,Ali Şenol,Huan-Yang Wang,Hsu-Chao Lai,Kun-Ta Chuang,Huan Liu
Keywords: supporting grid stability, discharging technology, aimed at supporting, emergency operations, workplace applications
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Advanced bidirectional EV charging and discharging technology, aimed at supporting grid stability and emergency operations, has driven growing interest in workplace applications. It not only effectively reduces electricity expenses but also enhances resilience in handling practical issues, such as peak power limitation, fluctuating energy prices, and unpredictable EV departures. However, existing EV charging strategies have yet to fully consider these factors in a way that benefits both office buildings and EV users simultaneously. To address these issues, we propose HUCA, a novel real-time charging control for regulating energy demands for both the building and electric vehicles. HUCA employs hierarchical actor-critic networks to dynamically reduce electricity costs in buildings, accounting for the needs of EV charging in the dynamic pricing scenario. To tackle uncertain EV departures, a new critic augmentation is introduced to account for departure uncertainties in evaluating the charging decisions, while maintaining the robustness of the charging control. Experiments on real-world electricity datasets under both simulated certain and uncertain departure scenarios demonstrate that HUCA outperforms baselines in terms of total electricity costs while maintaining competitive performance in fulfilling EV charging requirements. A case study also shows that HUCA effectively balances energy supply between the building and EVs based on real-time information.

[AI-50] More than Chit-Chat: Developing Robots for Small-Talk Interactions

Link: https://arxiv.org/abs/2412.18023
Authors: Rebecca Ramnauth,Dražen Brščić,Brian Scassellati
Keywords: small talk plays, small talk, Large Language Models, mere formality, rapport and understanding
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Beyond mere formality, small talk plays a pivotal role in social dynamics, serving as a verbal handshake for building rapport and understanding. For conversational AI and social robots, the ability to engage in small talk enhances their perceived sociability, leading to more comfortable and natural user interactions. In this study, we evaluate the capacity of current Large Language Models (LLMs) to drive the small talk of a social robot and identify key areas for improvement. We introduce a novel method that autonomously generates feedback and ensures LLM-generated responses align with small talk conventions. Through several evaluations – involving chatbot interactions and human-robot interactions – we demonstrate the system’s effectiveness in guiding LLM-generated responses toward realistic, human-like, and natural small-talk exchanges.

[AI-51] Trustworthy and Efficient LLMs Meet Databases

Link: https://arxiv.org/abs/2412.18022
Authors: Kyoungmin Kim,Anastasia Ailamaki
Keywords: large language models, gained significant attention, language models, trustworthy and efficient, significant attention
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the rapidly evolving AI era with large language models (LLMs) at the core, making LLMs more trustworthy and efficient, especially in output generation (inference), has gained significant attention. The aim is to reduce plausible but faulty LLM outputs (a.k.a. hallucinations) and to meet greatly increased inference demands. This tutorial explores such efforts and makes them transparent to the database community. Understanding these efforts is essential in harnessing LLMs in database tasks and adapting database techniques to LLMs. Furthermore, we delve into the synergy between LLMs and databases, highlighting new opportunities and challenges at their intersection. This tutorial aims to share with database researchers and practitioners essential concepts and strategies around LLMs, reduce unfamiliarity with LLMs, and inspire work at the intersection between LLMs and databases.

[AI-52] Integrated Learning and Optimization for Congestion Management and Profit Maximization in Real-Time Electricity Market

Link: https://arxiv.org/abs/2412.18003
Authors: Imran Pervez,Ricardo Pinto Lima,Omar Knio
Keywords: optimal power flow, DCOPF, power, solve economic dispatch, PTDF
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:

Abstract:We develop novel integrated learning and optimization (ILO) methodologies to solve economic dispatch (ED) and DC optimal power flow (DCOPF) problems for better economic operation. The optimization problem for ED is formulated with load as an unknown parameter, while DCOPF has both the load and the power transfer distribution factor (PTDF) matrix as unknown parameters. PTDF represents the incremental variations of real power on transmission lines which occur due to real power transfers between two regions. These values represent a linearized approximation of power flows over the transmission lines. We develop novel ILO formulations to address post-hoc penalties in electricity markets and line congestion problems using ED and DCOPF optimization formulations. Our proposed methodologies capture the real-time electricity market and line congestion behavior to train the regret function, which in turn trains the unknown loads at different buses and the line PTDF matrix to achieve the aforementioned post-hoc goals. The proposed methodology is compared to sequential learning and optimization (SLO), which trains load and PTDF forecasts for accuracy rather than economic operation. Our experiments demonstrate the superiority of ILO in minimizing the post-hoc penalties in electricity markets and minimizing line congestion, thereby improving economic operation by a noticeable amount.
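The role of the PTDF matrix as a linearized flow model can be shown on a toy network. The sketch below is illustrative only: it uses a hypothetical 3-bus triangle with equal line reactances and bus 3 as the slack bus, for which the PTDF entries follow from simple current division. The injections and line limits are made-up numbers, not data from the paper.

```python
def line_flows(ptdf, injections):
    """Linearized DC flows: one row of PTDF factors per line, one injection per non-slack bus."""
    return [sum(f * p for f, p in zip(row, injections)) for row in ptdf]


def overloaded(flows, limits):
    """Flag lines whose absolute flow exceeds the thermal limit."""
    return [abs(flow) > limit for flow, limit in zip(flows, limits)]


# Hypothetical 3-bus triangle, equal reactances, slack at bus 3.
# Rows: lines 1-2, 1-3, 2-3. Columns: injections at bus 1 and bus 2.
PTDF = [
    [1 / 3, -1 / 3],
    [2 / 3, 1 / 3],
    [1 / 3, 2 / 3],
]

injections_mw = [90.0, 30.0]      # made-up generation at buses 1 and 2
limits_mw = [100.0, 60.0, 100.0]  # made-up line limits

flows = line_flows(PTDF, injections_mw)
print(flows, overloaded(flows, limits_mw))
```

Here line 1-3 carries about 70 MW against a 60 MW limit, so it is congested. In the paper's setting the PTDF entries and loads are unknown and learned, which is why errors in them translate directly into congestion and market penalties.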

[AI-53] WavePulse: Real-time Content Analytics of Radio Livestreams

Link: https://arxiv.org/abs/2412.17998
Authors: Govind Mittal,Sarthak Gupta,Shruti Wagle,Chirag Chopra,Anthony J DeMattee,Nasir Memon,Mustaque Ahamad,Chinmay Hegde
Keywords: smartphone-based social networking, reaching more Americans, mass information dissemination, live television, remains a pervasive
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 22 pages (10 main + 12 appendix), 24 figures. Code and dataset available at this https URL

Abstract:Radio remains a pervasive medium for mass information dissemination, with AM/FM stations reaching more Americans than either smartphone-based social networking or live television. Increasingly, radio broadcasts are also streamed online and accessed over the Internet. We present WavePulse, a framework that records, documents, and analyzes radio content in real-time. While our framework is generally applicable, we showcase the efficacy of WavePulse in a collaborative project with a team of political scientists focusing on the 2024 Presidential Elections. We use WavePulse to monitor livestreams of 396 news radio stations over a period of three months, processing close to 500,000 hours of audio streams. These streams were converted into time-stamped, diarized transcripts and analyzed to answer key political science questions at both the national and state levels. Our analysis revealed how local issues interacted with national trends, providing insights into information flow. Our results demonstrate WavePulse's efficacy in capturing and analyzing content from radio livestreams sourced from the Web. Code and dataset can be accessed at this https URL.

[AI-54] Multi-Agent Path Finding in Continuous Spaces with Projected Diffusion Models

Link: https://arxiv.org/abs/2412.17993
Authors: Jinhao Liang,Jacob K. Christopher,Sven Koenig,Ferdinando Fioretto
Keywords: Multi-Agent Path Finding, multiple agents moving, Path Finding, problem in robotics, requiring the computation
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics, requiring the computation of collision-free paths for multiple agents moving from their respective start to goal positions. Coordinating multiple agents in a shared environment poses significant challenges, especially in continuous spaces where traditional optimization algorithms struggle with scalability. Moreover, these algorithms often depend on discretized representations of the environment, which can be impractical in image-based or high-dimensional settings. Recently, diffusion models have shown promise in single-agent path planning, capturing complex trajectory distributions and generating smooth paths that navigate continuous, high-dimensional spaces. However, directly extending diffusion models to MAPF introduces new challenges since these models struggle to ensure constraint feasibility, such as inter-agent collision avoidance. To overcome this limitation, this work proposes a novel approach that integrates constrained optimization with diffusion models for MAPF in continuous spaces. This unique combination directly produces feasible multi-agent trajectories that respect collision avoidance and kinematic constraints. The effectiveness of our approach is demonstrated across various challenging simulated scenarios of varying dimensionality.

[AI-55] TNNGen: Automated Design of Neuromorphic Sensory Processing Units for Time-Series Clustering

Link: https://arxiv.org/abs/2412.17977
Authors: Prabhu Vellaisamy,Harideep Nair,Vamsikrishna Ratnakaram,Dhruv Gupta,John Paul Shen
Keywords: Temporal Neural Networks, spiking neural networks, Neural Networks, Temporal Neural, spiking neural
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Published in IEEE Transactions on Circuits and Systems II: Express Briefs, May 2024

Abstract:Temporal Neural Networks (TNNs), a special class of spiking neural networks, draw inspiration from the neocortex in utilizing spike-timings for information processing. Recent works proposed a microarchitecture framework and custom macro suite for designing highly energy-efficient application-specific TNNs. These recent works rely on manual hardware design, a labor-intensive and time-consuming process. Further, there is no open-source functional simulation framework for TNNs. This paper introduces TNNGen, a pioneering effort towards the automated design of TNNs from PyTorch software models to post-layout netlists. TNNGen comprises a novel PyTorch functional simulator (for TNN modeling and application exploration) coupled with a Python-based hardware generator (for PyTorch-to-RTL and RTL-to-Layout conversions). Seven representative TNN designs for time-series signal clustering across diverse sensory modalities are simulated and their post-layout hardware complexity and design runtimes are assessed to demonstrate the effectiveness of TNNGen. We also highlight TNNGen’s ability to accurately forecast silicon metrics without running hardware process flow.

[AI-56] Towards Cognitive Service Delivery on B5G through AIaaS Architecture

Link: https://arxiv.org/abs/2412.17967
Authors: Larissa F. Rodrigues Moreira,Rodrigo Moreira,Flávio de Oliveira Silva,André R. Backes
Keywords: Artificial Intelligence, mobile network systems, advancing mobile network, facilitating smart capabilities, Data Analytics Function
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:Artificial Intelligence (AI) is pivotal in advancing mobile network systems by facilitating smart capabilities and automation. The transition from 4G to 5G has substantial implications for AI in consolidating a network predominantly geared towards business verticals. In this context, 3GPP has specified and introduced the Network Data Analytics Function (NWDAF) entity at the network's core to provide insights based on AI algorithms to benefit network orchestration. This paper proposes a framework for evolving NWDAF that presents the interfaces necessary to further empower the core network with AI capabilities in B5G and 6G. In addition, we identify a set of research directions for realizing a distributed e-NWDAF.

[AI-57] tuGEMM: Area-Power-Efficient Temporal Unary GEMM Architecture for Low-Precision Edge AI ISCAS

Link: https://arxiv.org/abs/2412.17966
Authors: Harideep Nair,Prabhu Vellaisamy,Albert Chen,Joseph Finn,Anna Li,Manav Trivedi,John Paul Shen
Keywords: General matrix multiplication, including artificial intelligence, ubiquitous computing kernel, General matrix, GEMM architectures based
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 2023

Abstract:General matrix multiplication (GEMM) is a ubiquitous computing kernel/algorithm for data processing in diverse applications, including artificial intelligence (AI) and deep learning (DL). The recent shift towards edge computing has inspired GEMM architectures based on unary computing, which are predominantly stochastic and rate-coded systems. This paper proposes a novel GEMM architecture based on temporal coding, called tuGEMM, that performs exact computation. We introduce two variants of tuGEMM, serial and parallel, with distinct area/power-latency trade-offs. Post-synthesis Power-Performance-Area (PPA) results in 45 nm CMOS are reported for 2-bit, 4-bit, and 8-bit computations. The designs show significant advantages in area-power efficiency over state-of-the-art stochastic unary systems, especially at low precisions, e.g., incurring just 0.03 mm^2 and 9 mW for 4 bits, and 0.01 mm^2 and 4 mW for 2 bits. This makes tuGEMM ideal for power-constrained mobile and edge devices performing always-on real-time sensory processing.

[AI-58] LMV-RPA: Large Model Voting-based Robotic Process Automation

Link: https://arxiv.org/abs/2412.17965
Authors: Osama Abdellatif,Ahmed Ayman,Ali Hamdi
Keywords: high-volume unstructured data, Optical Character Recognition, Automating high-volume unstructured, OCR, unstructured data processing
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 10 pages, 1 figure, 1 algorithm

Abstract:Automating high-volume unstructured data processing is essential for operational efficiency. Optical Character Recognition (OCR) is critical but often struggles with accuracy and efficiency in complex layouts and ambiguous text. These challenges are especially pronounced in large-scale tasks requiring both speed and precision. This paper introduces LMV-RPA, a Large Model Voting-based Robotic Process Automation system to enhance OCR workflows. LMV-RPA integrates outputs from OCR engines such as Paddle OCR, Tesseract OCR, Easy OCR, and DocTR with Large Language Models (LLMs) like LLaMA 3 and Gemini-1.5-pro. Using a majority voting mechanism, it processes OCR outputs into structured JSON formats, improving accuracy, particularly in complex layouts. The multi-phase pipeline processes text extracted by OCR engines through LLMs, combining results to ensure the most accurate outputs. LMV-RPA achieves 99 percent accuracy in OCR tasks, surpassing baseline models at 94 percent, while reducing processing time by 80 percent. Benchmark evaluations confirm its scalability and demonstrate that LMV-RPA offers a faster, more reliable, and efficient solution for automating large-scale document processing tasks.
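A majority vote over structured outputs can be sketched in a few lines. This is a generic field-by-field vote under stated assumptions, not the LMV-RPA pipeline itself: each candidate is assumed to be a flat dictionary parsed from one engine's or model's JSON output, values are assumed to be hashable, and the field names in the example are invented.

```python
from collections import Counter


def majority_vote(candidates):
    """Merge flat dicts by taking the most common value for each field.

    Fields missing from a candidate simply do not vote. Ties are resolved
    in favor of the value seen first.
    """
    merged = {}
    fields = {key for candidate in candidates for key in candidate}
    for field in sorted(fields):
        votes = Counter(c[field] for c in candidates if field in c)
        merged[field] = votes.most_common(1)[0][0]
    return merged


# Made-up outputs for the same invoice from three hypothetical extractors.
candidates = [
    {"invoice_no": "INV-1042", "total": "118.50"},
    {"invoice_no": "INV-1042", "total": "113.50"},
    {"invoice_no": "lNV-1042", "total": "118.50"},
]
result = majority_vote(candidates)
print(result)
```

Voting per field rather than per document lets a candidate that misreads one field still contribute its correct readings elsewhere, which is one reason ensemble schemes of this kind can beat any single engine.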

[AI-59] Dynamic Multi-Agent Orchestration and Retrieval for Multi-Source Question-Answer Systems using Large Language Models ICSE2024

Link: https://arxiv.org/abs/2412.17964
Authors: Antony Seabra,Claudio Cavalcante,Joao Nepomuceno,Lucas Lago,Nicolaas Ruberg,Sergio Lifschitz
Keywords: Large Language Model, Language Model, Large Language, techniques in Large, development of robust
Subjects: Artificial Intelligence (cs.AI)
Comments: International Conference on NLP, AI, Computer Science Engineering (NLAICSE 2024)

Abstract:We propose a methodology that combines several advanced techniques in Large Language Model (LLM) retrieval to support the development of robust, multi-source question-answer systems. This methodology is designed to integrate information from diverse data sources, including unstructured documents (PDFs) and structured databases, through a coordinated multi-agent orchestration and dynamic retrieval approach. Our methodology leverages specialized agents, such as SQL agents, Retrieval-Augmented Generation (RAG) agents, and router agents, that dynamically select the most appropriate retrieval strategy based on the nature of each query. To further improve accuracy and contextual relevance, we employ dynamic prompt engineering, which adapts in real time to query-specific contexts. The methodology's effectiveness is demonstrated within the domain of Contract Management, where complex queries often require seamless interaction between unstructured and structured data. Our results indicate that this approach enhances response accuracy and relevance, offering a versatile and scalable framework for developing question-answer systems that can operate across various domains and data sources.

[AI-60] Study of the Proper NNUE Dataset

Link: https://arxiv.org/abs/2412.17948
Authors: Daniel Tan,Neftali Watkinson Medina
Keywords: Efficiently Updatable Neural, Updatable Neural Networks, Efficiently Updatable, Neural Networks, Updatable Neural
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 4 figures

Click to view abstract

Abstract:NNUE (Efficiently Updatable Neural Networks) has revolutionized chess engine development, with nearly all top engines adopting NNUE models to maintain competitive performance. A key challenge in NNUE training is the creation of high-quality datasets, particularly in complex domains like chess, where tactical and strategic evaluations are essential. However, methods for constructing effective datasets remain poorly understood and under-documented. In this paper, we propose an algorithm for generating and filtering datasets composed of “quiet” positions that are stable and free from tactical volatility. Our approach provides a clear methodology for dataset creation, which can be replicated and generalized across various evaluation functions. Testing demonstrates significant improvements in engine performance, confirming the effectiveness of our method.

[AI-61] Surveillance Capitalism Revealed: Tracing The Hidden World Of Web Data Collection

Link: https://arxiv.org/abs/2412.17944
Authors: Antony Seabra de Medeiros,Luiz Afonso Glatzl Junior,Sergio Lifschitz
Keywords: Surveillance Capitalism, mechanisms of Surveillance, personal data transfer, focusing on personal, navigation and searching
Categories: Artificial Intelligence (cs.AI)
Comments: SBBD 2024 - Simpósio Brasileiro de Banco de Dados

Click to view abstract

Abstract:This study investigates the mechanisms of Surveillance Capitalism, focusing on personal data transfer during web navigation and searching. Analyzing network traffic reveals how various entities track and harvest digital footprints. The research reveals specific data types exchanged between users and web services, emphasizing the sophisticated algorithms involved in these processes. We present concrete evidence of data harvesting practices and propose strategies for enhancing data protection and transparency. Our findings highlight the need for robust data protection frameworks and ethical data usage to address privacy concerns in the digital age.

[AI-62] Contrato360 2.0: A Document and Database-Driven Question-Answer System using Large Language Models and Agents

Link: https://arxiv.org/abs/2412.17942
Authors: Antony Seabra,Claudio Cavalcante,Joao Nepomuceno,Lucas Lago,Nicolaas Ruberg,Sergio Lifschitz
Keywords: contract management process, contract management systems, contract management, leveraging combined information, contract documents
Categories: Artificial Intelligence (cs.AI)
Comments: KDIR 2024 - Knowledge Discovery and Information Retrieval

Click to view abstract

Abstract:We present a question-and-answer (Q&A) application designed to support the contract management process by leveraging combined information from contract documents (PDFs) and data retrieved from contract management systems (database). This data is processed by a large language model (LLM) to provide precise and relevant answers. The accuracy of these responses is further enhanced through the use of Retrieval-Augmented Generation (RAG), text-to-SQL techniques, and agents that dynamically orchestrate the workflow. These techniques eliminate the need to retrain the language model. Additionally, we employed Prompt Engineering to fine-tune the focus of responses. Our findings demonstrate that this multi-agent orchestration and combination of techniques significantly improve the relevance and accuracy of the answers, offering a promising direction for future information systems.

[AI-63] Causal Composition Diffusion Model for Closed-loop Traffic Generation

Link: https://arxiv.org/abs/2412.17920
Authors: Haohong Lin,Xin Huang,Tung Phan-Minh,David S. Hayden,Huan Zhang,Ding Zhao,Siddhartha Srinivasa,Eric M. Wolff,Hongge Chen
Keywords: complex interactive behaviors, capturing complex interactive, autonomous driving, interactive behaviors, critical for safety
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Simulation is critical for safety evaluation in autonomous driving, particularly in capturing complex interactive behaviors. However, generating realistic and controllable traffic scenarios in long-tail situations remains a significant challenge. Existing generative models suffer from the conflicting objective between user-defined controllability and realism constraints, which is amplified in safety-critical contexts. In this work, we introduce the Causal Compositional Diffusion Model (CCDiff), a structure-guided diffusion framework to address these challenges. We first formulate the learning of controllable and realistic closed-loop simulation as a constrained optimization problem. Then, CCDiff maximizes controllability while adhering to realism by automatically identifying and injecting causal structures directly into the diffusion process, providing structured guidance to enhance both realism and controllability. Through rigorous evaluations on benchmark datasets and in a closed-loop simulator, CCDiff demonstrates substantial gains over state-of-the-art approaches in generating realistic and user-preferred trajectories. Our results show CCDiff’s effectiveness in extracting and leveraging causal structures, showing improved closed-loop performance based on key metrics such as collision rate, off-road rate, FDE, and comfort.

[AI-64] A Novel Approach to Balance Convenience and Nutrition in Meals With Long-Term Group Recommendations and Reasoning on Multimodal Recipes and its Implementation in BEACON

Link: https://arxiv.org/abs/2412.17910
Authors: Vansh Nagpal,Siva Likitha Valluru,Kausik Lakkaraju,Nitin Gupta,Zach Abdulrahman,Andrew Davison,Biplav Srivastava
Keywords: common decision made, side dishes, made by people, health conditions, comprising combinations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2406.13714

Click to view abstract

Abstract:A common decision made by people, whether healthy or with health conditions, is choosing meals like breakfast, lunch, and dinner, comprising combinations of foods for appetizer, main course, side dishes, desserts, and beverages. Often, this decision involves tradeoffs between nutritious choices (e.g., salt and sugar levels, nutrition content) and convenience (e.g., cost and accessibility, cuisine type, food source type). We present a data-driven solution for meal recommendations that considers customizable meal configurations and time horizons. This solution balances user preferences while accounting for food constituents and cooking processes. Our contributions include introducing goodness measures, a recipe conversion method from text to the recently introduced multimodal rich recipe representation (R3) format, learning methods using contextual bandits that show promising preliminary results, and the prototype, usage-inspired, BEACON system.

[AI-65] In Defence of Post-hoc Explainability NEURIPS2024

Link: https://arxiv.org/abs/2412.17883
Authors: Nick Oh
Keywords: introduce Computational Interpretabilism, widespread adoption, adoption of machine, machine learning, research has created
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented at the Interpretable AI: Past, Present, and Future Workshop at NeurIPS 2024 (non-archival)

Click to view abstract

Abstract:The widespread adoption of machine learning in scientific research has created a fundamental tension between model opacity and scientific understanding. Whilst some advocate for intrinsically interpretable models, we introduce Computational Interpretabilism (CI) as a philosophical framework for post-hoc interpretability in scientific AI. Drawing parallels with human expertise, where post-hoc rationalisation coexists with reliable performance, CI establishes that scientific knowledge emerges through structured model interpretation when properly bounded by empirical validation. Through mediated understanding and bounded factivity, we demonstrate how post-hoc methods achieve epistemically justified insights without requiring complete mechanical transparency, resolving tensions between model complexity and scientific comprehension.

[AI-66] The Unreasonable Effectiveness of Open Science in AI: A Replication Study AAAI2025

Link: https://arxiv.org/abs/2412.17859
Authors: Odd Erik Gundersen,Odd Cappelen,Martin Mølnå,Nicklas Grimstad Nilsen
Keywords: affects AI research, data, articles, code, replication
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: This paper has been accepted at AAAI 2025

Click to view abstract

Abstract:A reproducibility crisis has been reported in science, but the extent to which it affects AI research is not yet fully understood. Therefore, we performed a systematic replication study including 30 highly cited AI studies relying on original materials when available. In the end, eight articles were rejected because they required access to data or hardware that was practically impossible to acquire as part of the project. Six articles were successfully reproduced, while five were partially reproduced. In total, 50% of the articles included were reproduced to some extent. The availability of code and data correlates strongly with reproducibility, as 86% of articles that shared code and data were fully or partly reproduced, while this was true for 33% of articles that shared only data. The quality of the data documentation correlates with successful replication. Poorly documented or misspecified data will probably result in unsuccessful replication. Surprisingly, the quality of the code documentation does not correlate with successful replication. Whether the code is poorly documented, partially missing, or not versioned is not important for successful replication, as long as the code is shared. This study emphasizes the effectiveness of open science and the importance of properly documenting data work.
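
The 50% figure follows from the counts in the abstract if the denominator is read as the articles remaining after rejection. That reading is an assumption of this sketch, which simply redoes the arithmetic:

```python
# Counts taken from the abstract.
selected = 30             # highly cited AI studies considered
rejected = 8              # data or hardware impractical to obtain
fully_reproduced = 6
partially_reproduced = 5

# Assumption: "articles included" means those left after rejection.
included = selected - rejected
reproduced_to_some_extent = fully_reproduced + partially_reproduced
share = reproduced_to_some_extent / included
not_reproduced = included - reproduced_to_some_extent

print(f"{reproduced_to_some_extent}/{included} = {share:.0%}")
```

Under this reading, 22 articles were attempted, 11 were reproduced fully or partially, and the remaining 11 were not reproduced.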

[AI-67] Active Geospatial Search for Efficient Tenant Eviction Outreach AAAI2025

Link: https://arxiv.org/abs/2412.17854
Authors: Anindya Sarkar,Alex DiChristofano,Sanmay Das,Patrick J. Fowler,Nathan Jacobs,Yevgeniy Vorobeychik
Keywords: threaten housing stability, evictions threaten housing, Tenant evictions threaten, threaten housing, housing stability
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted to AAAI 2025 (AI for Social Impact Track)

Click to view abstract

Abstract:Tenant evictions threaten housing stability and are a major concern for many cities. An open question concerns whether data-driven methods enhance outreach programs that target at-risk tenants to mitigate their risk of eviction. We propose a novel active geospatial search (AGS) modeling framework for this problem. AGS integrates property-level information in a search policy that identifies a sequence of rental units to canvass to both determine their eviction risk and provide support if needed. We propose a hierarchical reinforcement learning approach to learn a search policy for AGS that scales to large urban areas containing thousands of parcels, balancing exploration and exploitation and accounting for travel costs and a budget constraint. Crucially, the search policy adapts online to newly discovered information about evictions. Evaluation using eviction data for a large urban area demonstrates that the proposed framework and algorithmic approach are considerably more effective at sequentially identifying eviction cases than baseline methods.

[AI-68] LaMI-GO: Latent Mixture Integration for Goal-Oriented Communications Achieving High Spectrum Efficiency

Link: https://arxiv.org/abs/2412.17839
Authors: Achintha Wijesinghe,Suchinthaka Wanninayaka,Weiwei Wang,Yu-Chieh Chao,Songyang Zhang,Zhi Ding
Keywords: remarkably efficient multimedia, multimedia information transmissions, semantic-style communications includes, recent rise, rise of semantic-style
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Under review

Click to view abstract

Abstract:The recent rise of semantic-style communications includes the development of goal-oriented communications (GO-COMs) for remarkably efficient multimedia information transmission. The concept of GO-COMs leverages advanced artificial intelligence (AI) tools to address the rising demand for bandwidth efficiency in applications such as edge computing and Internet-of-Things (IoT). Unlike traditional communication systems focusing on source data accuracy, GO-COMs provide intelligent message delivery catering to the special needs critical to accomplishing downstream tasks at the receiver. In this work, we present a novel GO-COM framework, namely LaMI-GO, that utilizes emerging generative AI for better quality-of-service (QoS) with ultra-high communication efficiency. Specifically, we design our LaMI-GO system backbone based on a latent diffusion model followed by a vector-quantized generative adversarial network (VQGAN) for efficient latent embedding and information representation. The system trains a common feature codebook at the receiver side. Our experimental results demonstrate substantial improvement in perceptual quality, accuracy of downstream tasks, and bandwidth consumption over the state-of-the-art GO-COM systems and establish the power of our proposed LaMI-GO communication framework.

[AI-69] Coordinated Power Smoothing Control for Wind Storage Integrated System with Physics-informed Deep Reinforcement Learning

Link: https://arxiv.org/abs/2412.17838
Authors: Shuyi Wang,Huan Zhao,Yuji Cao,Zibin Pan,Guolong Liu,Gaoqi Liang,Junhua Zhao
Keywords: Storage Integrated System, Wind Storage Integrated, Power Smoothing Control, Storage Integrated, Integrated System
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The Wind Storage Integrated System with Power Smoothing Control (PSC) has emerged as a promising solution to ensure both efficient and reliable wind energy generation. However, existing PSC strategies overlook the intricate interplay and distinct control frequencies between batteries and wind turbines, and lack consideration of wake effect and battery degradation cost. In this paper, a novel coordinated control framework with hierarchical levels is devised to address these challenges effectively, which integrates the wake model and battery degradation model. In addition, after reformulating the problem as a Markov decision process, the multi-agent reinforcement learning method is introduced to overcome the bi-level characteristic of the problem. Moreover, a Physics-informed Neural Network-assisted Multi-agent Deep Deterministic Policy Gradient (PAMA-DDPG) algorithm is proposed to incorporate the power fluctuation differential equation and expedite the learning process. The effectiveness of the proposed methodology is evaluated through simulations conducted in four distinct scenarios using WindFarmSimulator (WFSim). The results demonstrate that the proposed algorithm facilitates approximately an 11% increase in total profit and a 19% decrease in power fluctuation compared to the traditional methods, thereby addressing the dual objectives of economic efficiency and grid-connected energy reliability.

[AI-70] KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis AAAI-25

Link: https://arxiv.org/abs/2412.16833
Authors: Kaiwen Zuo,Yirui Jiang,Fan Mo,Pietro Lio
Keywords: Large Language Models, Integrating Large Language, Integrating Large, Language Models, Large Language
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, published to AAAI-25 Bridge Program

Click to view abstract

Abstract:Integrating Large Language Models (LLMs) in healthcare diagnosis demands systematic frameworks that can handle complex medical scenarios while maintaining specialized expertise. We present KG4Diagnosis, a novel hierarchical multi-agent framework that combines LLMs with automated knowledge graph construction, encompassing 362 common diseases across medical specialties. Our framework mirrors real-world medical systems through a two-tier architecture: a general practitioner (GP) agent for initial assessment and triage, coordinating with specialized agents for in-depth diagnosis in specific domains. The core innovation lies in our end-to-end knowledge graph generation methodology, incorporating: (1) semantic-driven entity and relation extraction optimized for medical terminology, (2) multi-dimensional decision relationship reconstruction from unstructured medical texts, and (3) human-guided reasoning for knowledge expansion. KG4Diagnosis serves as an extensible foundation for specialized medical diagnosis systems, with capabilities to incorporate new diseases and medical knowledge. The framework’s modular design enables seamless integration of domain-specific enhancements, making it valuable for developing targeted medical diagnosis systems. We provide architectural guidelines and protocols to facilitate adoption across medical contexts.

[AI-71] Self-supervised Spatio-Temporal Graph Mask-Passing Attention Network for Perceptual Importance Prediction of Multi-point Tactility

Link: https://arxiv.org/abs/2410.03434
Authors: Dazhong He,Qian Liu
Keywords: modern multimedia systems, visual and auditory, prevalent in modern, form of human, multimedia systems
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Published as a conference paper at Eurohaptics 2024

Click to view abstract

Abstract:While visual and auditory information are prevalent in modern multimedia systems, haptic interaction, e.g., tactile and kinesthetic interaction, provides a unique form of human perception. However, multimedia technology for contact interaction is less mature than non-contact multimedia technologies and requires further development. Specialized haptic media technologies, requiring low latency and bitrates, are essential to enable haptic interaction, necessitating haptic information compression. Existing vibrotactile signal compression methods, based on the perceptual model, do not consider the characteristics of fused tactile perception at multiple spatially distributed interaction points. In fact, differences in tactile perceptual importance are not limited to conventional frequency and time domains, but also encompass differences in the spatial locations on the skin unique to tactile perception. For the most frequently used tactile information, vibrotactile texture perception, we have developed a model to predict its perceptual importance at multiple points, based on self-supervised learning and Spatio-Temporal Graph Neural Network. Current experimental results indicate that this model can effectively predict the perceptual importance of various points in multi-point tactile perception scenarios.

[AI-72] GOPT: Generalizable Online 3D Bin Packing via Transformer-based Deep Reinforcement Learning

Link: https://arxiv.org/abs/2409.05344
Authors: Heng Xiong,Changrong Guo,Jian Peng,Kai Ding,Wenjie Chen,Xuchong Qiu,Long Bai,Jianfeng Xu
Keywords: Robotic object packing, Bin Packing Problem, Robotic object, Packing Problem, broad practical applications
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures. This paper has been accepted by IEEE Robotics and Automation Letters

Click to view abstract

Abstract:Robotic object packing has broad practical applications in the logistics and automation industry, often formulated by researchers as the online 3D Bin Packing Problem (3D-BPP). However, existing DRL-based methods primarily focus on enhancing performance in limited packing environments while neglecting the ability to generalize across multiple environments characterized by different bin dimensions. To this end, we propose GOPT, a generalizable online 3D Bin Packing approach via Transformer-based deep reinforcement learning (DRL). First, we design a Placement Generator module to yield finite subspaces as placement candidates and the representation of the bin. Second, we propose a Packing Transformer, which fuses the features of the items and bin, to identify the spatial correlation between the item to be packed and available sub-spaces within the bin. Coupling these two components enables GOPT’s ability to perform inference on bins of varying dimensions. We conduct extensive experiments and demonstrate that GOPT not only achieves superior performance against the baselines, but also exhibits excellent generalization capabilities. Furthermore, the deployment with a robot showcases the practical applicability of our method in the real world. The source code will be publicly available at this https URL.

[AI-73] Joint Adaptive OFDM and Reinforcement Learning Design for Autonomous Vehicles: Leveraging Age of Updates

Link: https://arxiv.org/abs/2412.18500
Authors: Mamady Delamou,Ahmed Naeem,Huseyin Arslan,El Mehdi Amhoud
Keywords: orthogonal frequency-division multiplexing, Millimeter wave, based orthogonal frequency-division, frequency-division multiplexing, orthogonal frequency-division
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 15 pages, 17 Figures

Click to view abstract

Abstract:Millimeter wave (mmWave)-based orthogonal frequency-division multiplexing (OFDM) stands out as a suitable alternative for high-resolution sensing and high-speed data transmission. To meet communication and sensing requirements, many works propose a static configuration where the wave’s hyperparameters, such as the number of symbols in a frame and the number of frames in a communication slot, are predefined. However, two facts oblige us to redefine the problem: (1) the environment is often dynamic and uncertain, and (2) mmWave is severely impacted by wireless environments. A striking example where this challenge is very prominent is the autonomous vehicle (AV). Such a system leverages integrated sensing and communication (ISAC) using mmWave to manage data transmission and the dynamism of the environment. In this work, we consider an autonomous vehicle network where an AV utilizes its queue state information (QSI) and channel state information (CSI) in conjunction with reinforcement learning techniques to manage communication and sensing. This enables the AV to achieve two primary objectives: establishing a stable communication link with other AVs and accurately estimating the velocities of surrounding objects with high resolution. The communication performance is therefore evaluated based on the queue state, the effective data rate, and the discarded packets rate. In contrast, the effectiveness of the sensing is assessed using the velocity resolution. In addition, we exploit adaptive OFDM techniques for dynamic modulation, and we suggest a reward function that leverages the age of updates to handle the communication buffer and improve sensing. The system is validated using advantage actor-critic (A2C) and proximal policy optimization (PPO). Furthermore, we compare our solution with the existing design and demonstrate its superior performance by computer simulations.
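
To see why the number of symbols per frame matters for sensing, the textbook velocity-resolution relation for OFDM radar may help: with M symbols of duration T at wavelength λ, Δv = λ / (2·M·T). The sketch below uses that standard relation, not the paper's exact system model, and the carrier frequency and symbol duration are hypothetical values chosen only for illustration.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def velocity_resolution(carrier_hz, n_symbols, symbol_duration_s):
    """Textbook OFDM-radar velocity resolution: wavelength / (2 * M * T)."""
    wavelength = SPEED_OF_LIGHT / carrier_hz
    return wavelength / (2.0 * n_symbols * symbol_duration_s)


# Hypothetical configuration: 77 GHz carrier, 10 microsecond symbols.
carrier = 77e9
symbol_duration = 10e-6

short_frame = velocity_resolution(carrier, 128, symbol_duration)
long_frame = velocity_resolution(carrier, 256, symbol_duration)
print(f"128 symbols: {short_frame:.3f} m/s, 256 symbols: {long_frame:.3f} m/s")
```

Doubling the symbols per frame halves Δv (finer resolution) but lengthens the observation window, which is the kind of trade-off an adaptive configuration has to balance against communication needs.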

[AI-74] A Statistical Framework for Ranking LLM-Based Chatbots

Link: https://arxiv.org/abs/2412.18407
Authors: Siavash Ameli,Siyuan Zhuang,Ion Stoica,Michael W. Mahoney
Keywords: Chatbot Arena providing, Arena providing pioneering, natural language processing, Large language models, transformed natural language
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have transformed natural language processing, with frameworks like Chatbot Arena providing pioneering platforms for evaluating these models. By facilitating millions of pairwise comparisons based on human judgments, Chatbot Arena has become a cornerstone in LLM evaluation, offering rich datasets for ranking models in open-ended conversational tasks. Building upon this foundation, we propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle ties – an integral aspect of human-judged comparisons – significantly improving the model’s fit to observed data. Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships and facilitating intuitive groupings into performance tiers. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints, ensuring stable and interpretable parameter estimation. Through rigorous evaluation and extensive experimentation, our framework demonstrates substantial improvements over existing methods in modeling pairwise comparison data. To support reproducibility and practical adoption, we release leaderbot, an open-source Python package implementing our models and analyses.
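
For readers unfamiliar with tie-aware pairwise models, the classical Davidson extension of Bradley-Terry gives a feel for the setting. It is shown only as background and is not the factored tie model proposed in the paper; the scores and tie parameter below are hypothetical. The sketch also demonstrates the parameter non-uniqueness mentioned in the abstract: adding a constant to every score leaves all probabilities unchanged.

```python
import math


def davidson_probs(score_i, score_j, nu):
    """Classical Davidson model: returns (P(i wins), P(j wins), P(tie)).

    Scores are on a log scale; nu >= 0 controls how common ties are.
    """
    strength_i = math.exp(score_i)
    strength_j = math.exp(score_j)
    tie_mass = nu * math.sqrt(strength_i * strength_j)
    total = strength_i + strength_j + tie_mass
    return strength_i / total, strength_j / total, tie_mass / total


# Hypothetical scores for two chatbots and a hypothetical tie parameter.
win, loss, tie = davidson_probs(1.2, 0.7, nu=0.5)

# Shifting both scores by the same constant changes nothing, so the scores
# are identifiable only up to an additive constant without a constraint.
shifted = davidson_probs(1.2 + 3.0, 0.7 + 3.0, nu=0.5)
```

With `nu = 0` the model reduces to plain Bradley-Terry, where the win probability is the logistic function of the score difference.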

[AI-75] The Value of AI-Generated Metadata for UGC Platforms: Evidence from a Large-scale Field Experiment

Link: https://arxiv.org/abs/2412.18337
Authors: Xinyi Zhang,Chenshuo Sun,Renyu Zhang,Khim-Yong Goh
Keywords: social media posts, AI-generated, product descriptions, advertisement copy, media posts
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:AI-generated content (AIGC), such as advertisement copy, product descriptions, and social media posts, is becoming ubiquitous in business practices. However, the value of AI-generated metadata, such as titles, remains unclear on user-generated content (UGC) platforms. To address this gap, we conducted a large-scale field experiment on a leading short-video platform in Asia to provide about 1 million users access to AI-generated titles for their uploaded videos. Our findings show that the provision of AI-generated titles significantly boosted content consumption, increasing valid watches by 1.6% and watch duration by 0.9%. When producers adopted these titles, these increases jumped to 7.1% and 4.1%, respectively. This viewership-boost effect was largely attributed to the use of this generative AI (GAI) tool increasing the likelihood of videos having a title by 41.4%. The effect was more pronounced for groups more affected by metadata sparsity. Mechanism analysis revealed that AI-generated metadata improved user-video matching accuracy in the platform’s recommender system. Interestingly, for a video for which the producer would have posted a title anyway, adopting the AI-generated title decreased its viewership on average, implying that AI-generated titles may be of lower quality than human-generated ones. However, when producers chose to co-create with GAI and significantly revised the AI-generated titles, the videos outperformed their counterparts with either fully AI-generated or human-generated titles, showcasing the benefits of human-AI co-creation. This study highlights the value of AI-generated metadata and human-AI metadata co-creation in enhancing user-content matching and content consumption for UGC platforms.

[AI-76] Frechet regression for multi-label feature selection with implicit regularization

Link: https://arxiv.org/abs/2412.18247
Authors: Dou El Kefel Mansouri,Seif-Eddine Benkabou,Khalid Benabdeslem
Keywords: Fréchet regression extends, Fréchet regression, Machine Learning
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Fréchet regression extends linear regression to model complex responses in metric spaces, making it particularly relevant for multi-label regression, where each instance can have multiple associated labels. However, variable selection within this framework remains underexplored. In this paper, we propose a novel variable selection method that employs implicit regularization instead of traditional explicit regularization approaches, which can introduce bias. Our method effectively captures nonlinear interactions between predictors and responses while promoting model sparsity. We provide theoretical results demonstrating selection consistency and illustrate the performance of our approach through numerical examples.

[AI-77] Text-Aware Adapter for Few-Shot Keyword Spotting ICASSP2025

Link: https://arxiv.org/abs/2412.18142
Authors: Youngmoon Jung,Jinyoung Lee,Seungjin Lee,Myunghun Jung,Yong-Hyeok Lee,Hoon-Young Cho
Keywords: Recent advances, flexible keyword spotting, users to personalize, pre-trained flexible KWS, flexible KWS model
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 5 pages, 3 figures, Accepted by ICASSP 2025

Click to view abstract

Abstract:Recent advances in flexible keyword spotting (KWS) with text enrollment allow users to personalize keywords without uttering them during enrollment. However, there is still room for improvement in target keyword performance. In this work, we propose a novel few-shot transfer learning method, called text-aware adapter (TA-adapter), designed to enhance a pre-trained flexible KWS model for specific keywords with limited speech samples. To adapt the acoustic encoder, we leverage a jointly pre-trained text encoder to generate a text embedding that acts as a representative vector for the keyword. By fine-tuning only a small portion of the network while keeping the core components’ weights intact, the TA-adapter proves highly efficient for few-shot KWS, enabling a seamless return to the original pre-trained model. In our experiments, the TA-adapter demonstrated significant performance improvements across 35 distinct keywords from the Google Speech Commands V2 dataset, with only a 0.14% increase in the total number of parameters.

[AI-78] SongGLM: Lyric-to-Melody Generation with 2D Alignment Encoding and Multi-Task Pre-Training AAAI2025

Link: https://arxiv.org/abs/2412.18107
Authors: Jiaxing Yu, Xinda Wu, Yunfei Xu, Tieyao Zhang, Songruoyao Wu, Le Ma, Kejun Zhang
Keywords: automatically create melodies, General Language Model, create melodies based, aims to automatically, automatically create
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Notes: Extended version of paper accepted to AAAI 2025

Abstract:Lyric-to-melody generation aims to automatically create melodies based on given lyrics, requiring the capture of complex and subtle correlations between them. However, previous works usually suffer from two main challenges: 1) lyric-melody alignment modeling, which is often simplified to one-syllable/word-to-one-note alignment, while others have the problem of low alignment accuracy; 2) lyric-melody harmony modeling, which usually relies heavily on intermediates or strict rules, limiting model’s capabilities and generative diversity. In this paper, we propose SongGLM, a lyric-to-melody generation system that leverages 2D alignment encoding and multi-task pre-training based on the General Language Model (GLM) to guarantee the alignment and harmony between lyrics and melodies. Specifically, 1) we introduce a unified symbolic song representation for lyrics and melodies with word-level and phrase-level (2D) alignment encoding to capture the lyric-melody alignment; 2) we design a multi-task pre-training framework with hierarchical blank infilling objectives (n-gram, phrase, and long span), and incorporate lyric-melody relationships into the extraction of harmonized n-grams to ensure the lyric-melody harmony. We also construct a large-scale lyric-melody paired dataset comprising over 200,000 English song pieces for pre-training and fine-tuning. The objective and subjective results indicate that SongGLM can generate melodies from lyrics with significant improvements in both alignment and harmony, outperforming all the previous baseline methods.

[AI-79] LangYa: Revolutionizing Cross-Spatiotemporal Ocean Forecasting

Link: https://arxiv.org/abs/2412.18097
Authors: Nan Yang, Chong Wang, Meihua Zhao, Zimeng Zhao, Huiling Zheng, Bin Zhang, Jianing Wang, Xiaofeng Li
Keywords: Ocean forecasting, Ocean, forecasting, ocean forecasting systems, societal benefits
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Notes: 18 pages, 5 figures

Abstract:Ocean forecasting is crucial for both scientific research and societal benefits. Currently, the most accurate forecasting systems are global ocean forecasting systems (GOFSs), which represent the ocean state variables (OSVs) as discrete grids and solve partial differential equations (PDEs) governing the transitions of oceanic state variables using numerical methods. However, GOFSs processes are computationally expensive and prone to cumulative errors. Recently, large artificial intelligence (AI)-based models significantly boosted forecasting speed and accuracy. Unfortunately, building a large AI ocean forecasting system that can be considered cross-spatiotemporal and air-sea coupled forecasts remains a significant challenge. Here, we introduce LangYa, a cross-spatiotemporal and air-sea coupled ocean forecasting system. Results demonstrate that the time embedding module in LangYa enables a single model to make forecasts with lead times ranging from 1 to 7 days. The air-sea coupled module effectively simulates air-sea interactions. The ocean self-attention module improves network stability and accelerates convergence during training, and the adaptive thermocline loss function improves the accuracy of thermocline forecasting. Compared to existing numerical and AI-based ocean forecasting systems, LangYa uses 27 years of global ocean data from the Global Ocean Reanalysis and Simulation version 12 (GLORYS12) for training and achieves more reliable deterministic forecasting results for OSVs. LangYa forecasting system provides global ocean researchers with access to a powerful software tool for accurate ocean forecasting and opens a new paradigm for ocean science.

[AI-80] Automated Materials Discovery Platform Realized: Scanning Probe Microscopy of Combinatorial Libraries

Link: https://arxiv.org/abs/2412.18067
Authors: Yu Liu, Rohit Pant, Ichiro Takeuchi, R. Jackson Spurling, Jon-Paul Maria, Maxim Ziatdinov, Sergei V. Kalinin
Keywords: multicomponent phase diagrams, Combinatorial libraries, phase diagrams, powerful approach, approach for exploring
Categories: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI)
Notes: 19 pages, 8 figures

Abstract:Combinatorial libraries are a powerful approach for exploring the evolution of physical properties across binary and ternary cross-sections in multicomponent phase diagrams. Although the synthesis of these libraries has been developed since the 1960s and expedited with advanced laboratory automation, the broader application of combinatorial libraries relies on fast, reliable measurements of concentration-dependent structures and functionalities. Scanning Probe Microscopies (SPM), including piezoresponse force microscopy (PFM), offer significant potential for quantitative, functionally relevant combi-library readouts. Here we demonstrate the implementation of fully automated SPM to explore the evolution of ferroelectric properties in combinatorial libraries, focusing on Sm-doped BiFeO3 and ZnxMg1-xO systems. We also present and compare Gaussian Process-based Bayesian Optimization models for fully automated exploration, emphasizing local reproducibility (effective noise) as an essential factor in optimal experiment workflows. Automated SPM, when coupled with upstream synthesis controls, plays a pivotal role in bridging materials synthesis and characterization.

[AI-81] Stability Bounds for the Unfolded Forward-Backward Algorithm

Link: https://arxiv.org/abs/2412.17888
Authors: Emilie Chouzenoux, Cecile Della Valle, Jean-Christophe Pesquet
Keywords: network architecture designed, solve inverse problems, Tikhonov-type regularization term, designed to solve, degradation operator
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Notes: arXiv admin note: substantial text overlap with arXiv:2105.15044

Abstract:We consider a neural network architecture designed to solve inverse problems where the degradation operator is linear and known. This architecture is constructed by unrolling a forward-backward algorithm derived from the minimization of an objective function that combines a data-fidelity term, a Tikhonov-type regularization term, and a potentially nonsmooth convex penalty. The robustness of this inversion method to input perturbations is analyzed theoretically. Ensuring robustness complies with the principles of inverse problem theory, as it ensures both the continuity of the inversion method and the resilience to small noise - a critical property given the known vulnerability of deep neural networks to adversarial perturbations. A key novelty of our work lies in examining the robustness of the proposed network to perturbations in its bias, which represents the observed data in the inverse problem. Additionally, we provide numerical illustrations of the analytical Lipschitz bounds derived in our analysis.
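The forward-backward iteration that such a network unrolls can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' architecture: the nonsmooth penalty is taken to be an ℓ1 norm (so its proximity operator is soft-thresholding), the step size and weights are fixed rather than learned per layer, and the names `forward_backward`, `lam`, `mu`, and `step` are hypothetical.

```python
# Sketch: K unrolled forward-backward steps for the (assumed) objective
#   0.5 * ||H x - y||^2 + 0.5 * lam * ||x||^2 + mu * ||x||_1
# All names and the choice of an l1 penalty are illustrative assumptions.

def soft_threshold(v, t):
    """Proximity operator of t * |.|, applied elementwise."""
    return [max(abs(a) - t, 0.0) * (1.0 if a > 0 else -1.0) for a in v]

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def forward_backward(H, y, lam, mu, step, n_layers):
    """Run n_layers forward-backward iterations starting from x = 0."""
    Ht = transpose(H)
    x = [0.0] * len(H[0])
    for _ in range(n_layers):
        # Forward (gradient) step on the smooth part: data fidelity + Tikhonov term.
        residual = [a - b for a, b in zip(matvec(H, x), y)]
        grad = [g + lam * xi for g, xi in zip(matvec(Ht, residual), x)]
        # Backward (proximal) step on the nonsmooth penalty.
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, grad)], step * mu)
    return x
```

Expanding the gradient step gives `x - step * ((H^T H + lam * I) x - H^T y)`, so the observed data `y` enters each layer only through the additive term `step * H^T y`, which is why it plays the role of a bias in the unrolled network.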

[AI-82] Transfer Learning with Active Sampling for Rapid Training and Calibration in BCI-P300 Across Health States and Multi-centre Data

Link: https://arxiv.org/abs/2412.17833
Authors: Christian Flores, Marcelo Contreras, Ichiro Macedo, Javier Andreu-Perez
Keywords: boosted Brain-Computer Interface, cultural differences affecting, Brain-Computer Interface, differences affecting neural, deep learning advancements
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Machine learning and deep learning advancements have boosted Brain-Computer Interface (BCI) performance, but their wide-scale applicability is limited due to factors like individual health, hardware variations, and cultural differences affecting neural data. Studies often focus on uniform single-site experiments in uniform settings, leading to high performance that may not translate well to real-world diversity. Deep learning models aim to enhance BCI classification accuracy, and transfer learning has been suggested to adapt models to individual neural patterns using a base model trained on others’ data. This approach promises better generalizability and reduced overfitting, yet challenges remain in handling diverse and imbalanced datasets from different equipment, subjects, multiple centres in different countries, and both healthy and patient populations for effective model transfer and tuning. In a setting characterized by maximal heterogeneity, we proposed P300 wave detection in BCIs employing a convolutional neural network fitted with adaptive transfer learning based on Poisson Disk Sampling (PDS), called Active Sampling (AS), which flexibly adjusts the transition from source data to the target domain. For subject-adaptive training with 40% adaptive fine-tuning, our results show that the averaged classification accuracy improved by 5.36% and the standard deviation was reduced by 12.22% using two distinct, internationally replicated datasets. These gains in classification accuracy, computational time, and training efficiency are mainly due to the proposed Active Sampling (AS) method for transfer learning.
Journal reference: IEEE Trans Neural Syst Rehabil Eng. 2024;32:3794-3803. Related DOI: https://doi.org/10.1109/TNSRE.2024.3420960

[AI-83] MANGO: Multimodal Acuity traNsformer for intelliGent ICU Outcomes

Link: https://arxiv.org/abs/2412.17832
Authors: Jiaqing Zhang, Miguel Contreras, Sabyasachi Bandyopadhyay, Andrea Davidson, Jessica Sena, Yuanfang Ren, Ziyuan Guan, Tezcan Ozrazgat-Baslanti, Tyler J. Loftus, Subhash Nerella, Azra Bihorac, Parisa Rashidi
Keywords: Intensive Care Unit, Care Unit, Intensive Care, Estimation of patient, acuity
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Estimation of patient acuity in the Intensive Care Unit (ICU) is vital to ensure timely and appropriate interventions. Advances in artificial intelligence (AI) technologies have significantly improved the accuracy of acuity predictions. However, prior studies using machine learning for acuity prediction have predominantly relied on electronic health records (EHR) data, often overlooking other critical aspects of ICU stay, such as patient mobility, environmental factors, and facial cues indicating pain or agitation. To address this gap, we present MANGO: the Multimodal Acuity traNsformer for intelliGent ICU Outcomes, designed to enhance the prediction of patient acuity states, transitions, and the need for life-sustaining therapy. We collected a multimodal dataset ICU-Multimodal, incorporating four key modalities, EHR data, wearable sensor data, video of patient’s facial cues, and ambient sensor data, which we utilized to train MANGO. The MANGO model employs a multimodal feature fusion network powered by Transformer masked self-attention method, enabling it to capture and learn complex interactions across these diverse data modalities even when some modalities are absent. Our results demonstrated that integrating multiple modalities significantly improved the model’s ability to predict acuity status, transitions, and the need for life-sustaining therapy. The best-performing models achieved an area under the receiver operating characteristic curve (AUROC) of 0.76 (95% CI: 0.72-0.79) for predicting transitions in acuity status and the need for life-sustaining therapy, while 0.82 (95% CI: 0.69-0.89) for acuity status prediction…

[AI-84] RUL forecasting for wind turbine predictive maintenance based on deep learning

Link: https://arxiv.org/abs/2412.17823
Authors: Syed Shazaib Shah, Tan Daoliang, Sah Chandan Kumar
Keywords: strategically scheduling maintenance, reduce wind farm, wind farm operation, Predictive maintenance, remaining useful life
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Notes: 19 pages, 16 figures, Journal Paper

Abstract:Predictive maintenance (PdM) is increasingly pursued to reduce wind farm operation and maintenance costs by accurately predicting the remaining useful life (RUL) and strategically scheduling maintenance. However, the remoteness of wind farms often renders current methodologies ineffective, as they fail to provide a sufficiently reliable advance time window for maintenance planning, limiting PdM’s practicality. This study introduces a novel deep learning (DL) methodology for future RUL forecasting. By employing a multi-parametric attention-based DL approach that bypasses feature engineering, thereby minimizing the risk of human error, two models: ForeNet-2d and ForeNet-3d are proposed. These models successfully forecast the RUL for seven multifaceted wind turbine (WT) failures with a 2-week forecast window. The most precise forecast deviated by only 10 minutes from the actual RUL, while the least accurate prediction deviated by 1.8 days, with most predictions being off by only a few hours. This methodology offers a substantial time frame to access remote WTs and perform necessary maintenance, thereby enabling the practical implementation of PdM.

Machine Learning

[LG-0] Structure Learning in Gaussian Graphical Models from Glauber Dynamics

Link: https://arxiv.org/abs/2412.18594
Authors: Vignesh Tirukkonda, Anirudh Rayas, Gautam Dasarathy
Keywords: biological network modeling, Gaussian graphical model, including biological network, graphical model selection, financial network modeling
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Gaussian graphical model selection is an important paradigm with numerous applications, including biological network modeling, financial network modeling, and social network analysis. Traditional approaches assume access to independent and identically distributed (i.i.d) samples, which is often impractical in real-world scenarios. In this paper, we address Gaussian graphical model selection under observations from a more realistic dependent stochastic process known as Glauber dynamics. Glauber dynamics, also called the Gibbs sampler, is a Markov chain that sequentially updates the variables of the underlying model based on the statistics of the remaining model. Such models, aside from frequently being employed to generate samples from complex multivariate distributions, naturally arise in various settings, such as opinion consensus in social networks and clearing/stock-price dynamics in financial networks. In contrast to the extensive body of existing work, we present the first algorithm for Gaussian graphical model selection when data are sampled according to the Glauber dynamics. We provide theoretical guarantees on the computational and statistical complexity of the proposed algorithm’s structure learning performance. Additionally, we provide information-theoretic lower bounds on the statistical complexity and show that our algorithm is nearly minimax optimal for a broad class of problems.
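For a zero-mean Gaussian with precision matrix Θ, one Glauber (Gibbs) update resamples a single coordinate from its conditional distribution given the others, which is Gaussian with mean −Σ_{j≠i} Θ_ij x_j / Θ_ii and variance 1/Θ_ii. The sketch below generates such a dependent trajectory. Uniformly random site selection, the function names, and the zero initial state are assumptions made for illustration; it does not implement the paper's structure learning algorithm.

```python
import random

def glauber_step(x, theta, rng):
    """Resample one uniformly chosen coordinate of x in place; return its index."""
    i = rng.randrange(len(x))
    # Conditional of x_i given the rest, for a zero-mean Gaussian with precision theta.
    cond_mean = -sum(theta[i][j] * x[j] for j in range(len(x)) if j != i) / theta[i][i]
    cond_std = (1.0 / theta[i][i]) ** 0.5
    x[i] = rng.gauss(cond_mean, cond_std)
    return i

def run_glauber(theta, n_steps, seed=0):
    """Return a list of (updated_index, state_copy) pairs, one per step."""
    rng = random.Random(seed)
    x = [0.0] * len(theta)
    trajectory = []
    for _ in range(n_steps):
        i = glauber_step(x, theta, rng)
        trajectory.append((i, list(x)))
    return trajectory
```

Consecutive states differ in at most one coordinate, so the samples are far from i.i.d.; this is the observation model the abstract contrasts with traditional approaches. A zero entry Θ_ij means coordinate j never influences the update of coordinate i, which is the graph structure a learner would try to recover.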

[LG-1] ReducedLUT: Table Decomposition with “Don’t Care” Conditions

Link: https://arxiv.org/abs/2412.18579
Authors: Oliver Cassidy, Marta Andronic, Samuel Coward, George A. Constantinides
Keywords: complex mathematical computations, efficiently store arrays, mathematical computations, efficiently store, complex mathematical
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:Lookup tables (LUTs) are frequently used to efficiently store arrays of precomputed values for complex mathematical computations. When used in the context of neural networks, these functions exhibit a lack of recognizable patterns which presents an unusual challenge for conventional logic synthesis techniques. Several approaches are known to break down a single large lookup table into multiple smaller ones that can be recombined. Traditional methods, such as plain tabulation, piecewise linear approximation, and multipartite table methods, often yield inefficient hardware solutions when applied to LUT-based NNs. This paper introduces ReducedLUT, a novel method to reduce the footprint of the LUTs by injecting don’t cares into the compression process. This additional freedom introduces more self-similarities which can be exploited using known decomposition techniques. We then demonstrate a particular application to machine learning; by replacing unobserved patterns within the training data of neural network models with don’t cares, we enable greater compression with minimal model accuracy degradation. In practice, we achieve up to 1.63× reduction in Physical LUT utilization, with a test accuracy drop of no more than 0.01 accuracy points.
Journal reference: Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '25), February 27–March 1, 2025, Monterey, CA, USA. Related DOI: https://doi.org/10.1145/3706628.3708823
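A toy example may help show why don't-care entries create extra self-similarity. The sketch below splits a table into fixed-width sub-tables and greedily merges those that agree wherever both are specified, with `None` standing for a don't care. The greedy merge, the function names, and the `None` encoding are assumptions chosen for illustration; this is not the decomposition algorithm used in ReducedLUT.

```python
def compatible(a, b):
    """Two sub-tables are compatible if they agree wherever both are specified."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))

def merge(a, b):
    """Combine two compatible sub-tables, filling don't cares from the other one."""
    return [y if x is None else x for x, y in zip(a, b)]

def count_unique_subtables(table, width):
    """Greedily count how many distinct sub-tables must be stored."""
    blocks = [table[i:i + width] for i in range(0, len(table), width)]
    representatives = []
    for block in blocks:
        for k, rep in enumerate(representatives):
            if compatible(rep, block):
                # Reuse an existing sub-table instead of storing a new one.
                representatives[k] = merge(rep, block)
                break
        else:
            representatives.append(list(block))
    return len(representatives)
```

With the table `[1, 2, 3, 4, 1, 2, 9, 4, 5, 6, 7, 8]` and width 4, three sub-tables must be stored. If the entry `9` corresponds to an input pattern never observed in the training data and is replaced by `None`, the second sub-table can share storage with the first and only two remain.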

[LG-2] Efficient Aircraft Design Optimization Using Multi-Fidelity Models and Multi-fidelity Physics Informed Neural Networks

Link: https://arxiv.org/abs/2412.18564
Authors: Apurba Sarker
Keywords: Finite Element Method, Finite Volume Method, Finite Element, Finite Volume, optimization traditionally relies
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Notes: 7 pages, 3 figures

Abstract:Aircraft design optimization traditionally relies on computationally expensive simulation techniques such as Finite Element Method (FEM) and Finite Volume Method (FVM), which, while accurate, can significantly slow down the design iteration process. The challenge lies in reducing the computational complexity while maintaining high accuracy for quick evaluations of multiple design alternatives. This research explores advanced methods, including surrogate models, reduced-order models (ROM), and multi-fidelity machine learning techniques, to achieve more efficient aircraft design evaluations. Specifically, the study investigates the application of Multi-fidelity Physics-Informed Neural Networks (MPINN) and autoencoders for manifold alignment, alongside the potential of Generative Adversarial Networks (GANs) for refining design geometries. Through a proof-of-concept task, the research demonstrates the ability to predict high-fidelity results from low-fidelity simulations, offering a path toward faster and more cost effective aircraft design iterations.

[LG-3] FedVCK: Non-IID Robust and Communication-Efficient Federated Learning via Valuable Condensed Knowledge for Medical Image Analysis AAAI2025

Link: https://arxiv.org/abs/2412.18557
Authors: Guochen Yan, Luyuan Xie, Xinyi Gao, Wentao Zhang, Qingni Shen, Yuejian Fang, Zhonghai Wu
Keywords: Federated learning, promising solution, solution for collaboration, federated learning methods
Categories: Machine Learning (cs.LG)
Notes: Accepted by AAAI 2025

Abstract:Federated learning has become a promising solution for collaboration among medical institutions. However, data owned by each institution would be highly heterogeneous and the distribution is always non-independent and identical distribution (non-IID), resulting in client drift and unsatisfactory performance. Despite existing federated learning methods attempting to solve the non-IID problems, they still show marginal advantages but rely on frequent communication which would incur high costs and privacy concerns. In this paper, we propose a novel federated learning method: Federated learning via Valuable Condensed Knowledge (FedVCK). We enhance the quality of condensed knowledge and select the most necessary knowledge guided by models, to tackle the non-IID problem within limited communication budgets effectively. Specifically, on the client side, we condense the knowledge of each client into a small dataset and further enhance the condensation procedure with latent distribution constraints, facilitating the effective capture of high-quality knowledge. During each round, we specifically target and condense knowledge that has not been assimilated by the current model, thereby preventing unnecessary repetition of homogeneous knowledge and minimizing the frequency of communications required. On the server side, we propose relational supervised contrastive learning to provide more supervision signals to aid the global model updating. Comprehensive experiments across various medical tasks show that FedVCK can outperform state-of-the-art methods, demonstrating that it’s non-IID robust and communication-efficient.

[LG-4] Graph Structure Learning for Spatial-Temporal Imputation: Adapting to Node and Feature Scales AAAI2025

Link: https://arxiv.org/abs/2412.18535
Authors: Xinyu Yang, Yu Sun, Xinyang Chen, Ying Zhang, Xiaojie Yuan
Keywords: Graph Structure Learning, posing challenges, Structure Learning, Graph Structure, spatial
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Notes: This paper has been accepted as a full paper at AAAI 2025

Abstract:Spatial-temporal data collected across different geographic locations often suffer from missing values, posing challenges to data analysis. Existing methods primarily leverage fixed spatial graphs to impute missing values, which implicitly assume that the spatial relationship is roughly the same for all features across different locations. However, they may overlook the different spatial relationships of diverse features recorded by sensors in different locations. To address this, we introduce the multi-scale Graph Structure Learning framework for spatial-temporal Imputation (GSLI) that dynamically adapts to the heterogeneous spatial correlations. Our framework encompasses node-scale graph structure learning to cater to the distinct global spatial correlations of different features, and feature-scale graph structure learning to unveil common spatial correlation across features within all stations. Integrated with prominence modeling, our framework emphasizes nodes and features with greater significance in the imputation process. Furthermore, GSLI incorporates cross-feature and cross-temporal representation learning to capture spatial-temporal dependencies. Evaluated on six real incomplete spatial-temporal datasets, GSLI showcases the improvement in data imputation.

[LG-5] GCN-ABFT: Low-Cost Online Error Checking for Graph Convolutional Networks

Link: https://arxiv.org/abs/2412.18534
Authors: Christodoulos Peltekis, Giorgos Dimitrakopoulos
Keywords: Graph convolutional networks, building machine-learning application, GCN, convolutional networks, graph-structured data
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Notes: Accepted for publication at IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)

Abstract:Graph convolutional networks (GCNs) are popular for building machine-learning applications for graph-structured data. This widespread adoption led to the development of specialized GCN hardware accelerators. In this work, we address a key architectural challenge for GCN accelerators: how to detect errors in GCN computations arising from random hardware faults with the least computation cost. Each GCN layer performs a graph convolution, mathematically equivalent to multiplying three matrices, computed through two separate matrix multiplications. Existing Algorithm-based Fault Tolerance (ABFT) techniques can check the results of individual matrix multiplications. However, for a GCN layer, this check should be performed twice. To avoid this overhead, this work introduces GCN-ABFT that directly calculates a checksum for the entire three-matrix product within a single GCN layer, providing a cost-effective approach for error detection in GCN accelerators. Experimental results demonstrate that GCN-ABFT reduces the number of operations needed for checksum computation by over 21% on average for representative GCN applications. These savings are achieved without sacrificing fault-detection accuracy, as evidenced by the presented fault-injection analysis.
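The idea of checking a three-matrix product with one checksum rests on a simple identity: the sum of all entries of A·X·W equals (1ᵀA)·X·(W1), which needs only vector operations. The sketch below illustrates that identity with integer matrices. The function names and the single scalar checksum are assumptions for illustration; the checksum layout actually used by GCN-ABFT may differ.

```python
def matmul(P, Q):
    """Plain matrix product for list-of-rows matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def fused_checksum(A, X, W):
    """Predict the sum of all entries of A @ X @ W without forming the product."""
    col_sums_A = [sum(col) for col in zip(*A)]   # 1^T A
    row_sums_W = [sum(row) for row in W]         # W 1
    # X (W 1): one matrix-vector product.
    XW1 = [sum(x * r for x, r in zip(row, row_sums_W)) for row in X]
    # (1^T A) . (X W 1): one dot product.
    return sum(c * v for c, v in zip(col_sums_A, XW1))

def output_checksum(Y):
    """Sum of all entries of the computed layer output."""
    return sum(sum(row) for row in Y)
```

A layer output `Y` computed as `matmul(A, matmul(X, W))` passes the check when `fused_checksum(A, X, W) == output_checksum(Y)`. A single corrupted entry changes the right-hand side and is detected, although a scalar checksum like this one can miss multiple faults that happen to cancel.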

[LG-6] Accelerating process control and optimization via machine learning: A review

Link: https://arxiv.org/abs/2412.18529
Authors: Ilias Mitrai, Prodromos Daoutidis
Keywords: chemical engineering applications, chemical engineering, solve decision-making problems, solve decision-making, Machine learning
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Process control and optimization have been widely used to solve decision-making problems in chemical engineering applications. However, identifying and tuning the best solution algorithm is challenging and time-consuming. Machine learning tools can be used to automate these steps by learning the behavior of a numerical solver from data. In this paper, we discuss recent advances in (i) the representation of decision-making problems for machine learning tasks, (ii) algorithm selection, and (iii) algorithm configuration for monolithic and decomposition-based algorithms. Finally, we discuss open problems related to the application of machine learning for accelerating process optimization and control.

[LG-7] Bayesian Optimization of Bilevel Problems

Link: https://arxiv.org/abs/2412.18518
Authors: Omer Ekmekcioglu, Nursen Aydin, Juergen Branke
Keywords: modeling complex decision-making, hierarchical mathematical framework, complex decision-making processes, Bilevel optimization, machine learning
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Bilevel optimization, a hierarchical mathematical framework where one optimization problem is nested within another, has emerged as a powerful tool for modeling complex decision-making processes in various fields such as economics, engineering, and machine learning. This paper focuses on bilevel optimization where both upper-level and lower-level functions are black boxes and expensive to evaluate. We propose a Bayesian Optimization framework that models the upper and lower-level functions as Gaussian processes over the combined space of upper and lower-level decisions, allowing us to exploit knowledge transfer between different sub-problems. Additionally, we propose a novel acquisition function for this model. Our experimental results demonstrate that the proposed algorithm is highly sample-efficient and outperforms existing methods in finding high-quality solutions.

[LG-8] FedGIG: Graph Inversion from Gradient in Federated Learning

Link: https://arxiv.org/abs/2412.18513
Authors: Tianzhe Xiao, Yichen Li, Yining Qi, Haozhao Wang, Ruixuan Li
Keywords: Gradient Inversion Attacks, recover private training, Recent studies, Inversion Attacks, Federated Graph Learning
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Recent studies have shown that Federated learning (FL) is vulnerable to Gradient Inversion Attacks (GIA), which can recover private training data from shared gradients. However, existing methods are designed for dense, continuous data such as images or vectorized texts, and cannot be directly applied to sparse and discrete graph data. This paper first explores GIA’s impact on Federated Graph Learning (FGL) and introduces Graph Inversion from Gradient in Federated Learning (FedGIG), a novel GIA method specifically designed for graph-structured data. FedGIG includes the adjacency matrix constraining module, which ensures the sparsity and discreteness of the reconstructed graph data, and the subgraph reconstruction module, which is designed to complete missing common subgraph structures. Extensive experiments on molecular datasets demonstrate FedGIG’s superior accuracy over existing GIA techniques.

[LG-9] An Empirical Analysis of Federated Learning Models Subject to Label-Flipping Adversarial Attack

Link: https://arxiv.org/abs/2412.18507
Authors: Kunal Bhatnagar, Sagana Chattanathan, Angela Dang, Bhargav Eranki, Ronnit Rana, Charan Sridhar, Siddharth Vedam, Angie Yao, Mark Stamp
Keywords: Convolution Neural Network, Recurrent Neural Network, Multinomial Logistic Regression, Neural Network, Support Vector Classifier
Categories: Machine Learning (cs.LG)

Abstract:In this paper, we empirically analyze adversarial attacks on selected federated learning models. The specific learning models considered are Multinomial Logistic Regression (MLR), Support Vector Classifier (SVC), Multilayer Perceptron (MLP), Convolution Neural Network (CNN), Random Forest, XGBoost, and Long Short-Term Memory (LSTM). For each model, we simulate label-flipping attacks, experimenting extensively with 10 federated clients and 100 federated clients. We vary the percentage of adversarial clients from 10% to 100% and, simultaneously, the percentage of labels flipped by each adversarial client is also varied from 10% to 100%. Among other results, we find that models differ in their inherent robustness to the two vectors in our label-flipping attack, i.e., the percentage of adversarial clients, and the percentage of labels flipped by each adversarial client. We discuss the potential practical implications of our results.
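The two attack parameters described above, the fraction of adversarial clients and the fraction of labels each one flips, can be simulated with a short helper. This is a sketch under assumptions: flipped labels are reassigned uniformly at random to a different class, adversarial clients are chosen at random, and the names `flip_labels` and `poison_clients` are hypothetical. The paper's exact flipping rule is not stated in the abstract.

```python
import random

def flip_labels(labels, n_classes, flip_fraction, rng):
    """Return a copy of labels with a given fraction reassigned to a different class."""
    flipped = list(labels)
    n_flip = round(flip_fraction * len(labels))
    for idx in rng.sample(range(len(labels)), n_flip):
        # Assumed rule: pick any class other than the true one, uniformly at random.
        choices = [c for c in range(n_classes) if c != labels[idx]]
        flipped[idx] = rng.choice(choices)
    return flipped

def poison_clients(client_labels, n_classes, adversarial_fraction, flip_fraction, seed=0):
    """Flip labels on a random subset of clients; return (poisoned labels, adversarial ids)."""
    rng = random.Random(seed)
    n_adv = round(adversarial_fraction * len(client_labels))
    adversarial = set(rng.sample(range(len(client_labels)), n_adv))
    poisoned = [
        flip_labels(labels, n_classes, flip_fraction, rng) if i in adversarial else list(labels)
        for i, labels in enumerate(client_labels)
    ]
    return poisoned, adversarial
```

Sweeping `adversarial_fraction` and `flip_fraction` over a grid from 0.1 to 1.0 and retraining the federated model at each point would reproduce the shape of the experiment, with the training loop itself left out here.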

[LG-10] MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning

Link: https://arxiv.org/abs/2412.18437
Authors: Abdelmadjid Chergui, Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier
Keywords: diverse data types, Choosing a suitable, suitable deep learning, data types, structures and characteristics
Categories: Machine Learning (cs.LG)

Abstract:Choosing a suitable deep learning architecture for multimodal data fusion is a challenging task, as it requires the effective integration and processing of diverse data types, each with distinct structures and characteristics. In this paper, we introduce MixMAS, a novel framework for sampling-based mixer architecture search tailored to multimodal learning. Our approach automatically selects the optimal MLP-based architecture for a given multimodal machine learning (MML) task. Specifically, MixMAS utilizes a sampling-based micro-benchmarking strategy to explore various combinations of modality-specific encoders, fusion functions, and fusion networks, systematically identifying the architecture that best meets the task’s performance metrics.

[LG-11] Learning to Play Against Unknown Opponents

Link: https://arxiv.org/abs/2412.18297
Authors: Eshwar Ram Arunachaleswaran, Natalie Collina, Jon Schneider
Keywords: general sum game, optimal learning algorithm, learning algorithm, learning, learning agent
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:We consider the problem of a learning agent who has to repeatedly play a general sum game against a strategic opponent who acts to maximize their own payoff by optimally responding against the learner’s algorithm. The learning agent knows their own payoff function, but is uncertain about the payoff of their opponent (knowing only that it is drawn from some distribution $\mathcal{D}$). What learning algorithm should the agent run in order to maximize their own total utility? We demonstrate how to construct an $\varepsilon$-optimal learning algorithm (obtaining average utility within $\varepsilon$ of the optimal utility) for this problem in time polynomial in the size of the input and $1/\varepsilon$ when either the size of the game or the support of $\mathcal{D}$ is constant. When the learning algorithm is further constrained to be a no-regret algorithm, we demonstrate how to efficiently construct an optimal learning algorithm (asymptotically achieving the optimal utility) in polynomial time, independent of any other assumptions. Both results make use of recently developed machinery that converts the analysis of learning algorithms to the study of the class of corresponding geometric objects known as menus.

[LG-12] On the Local Complexity of Linear Regions in Deep ReLU Networks

Link: https://arxiv.org/abs/2412.18283
Authors: Niket Patel, Guido Montufar
Keywords: continuous piecewise linear, piecewise linear activations, local complexity, lower local complexity, piecewise linear
Categories: Machine Learning (cs.LG)

Abstract:We define the local complexity of a neural network with continuous piecewise linear activations as a measure of the density of linear regions over an input data distribution. We show theoretically that ReLU networks that learn low-dimensional feature representations have a lower local complexity. This allows us to connect recent empirical observations on feature learning at the level of the weight matrices with concrete properties of the learned functions. In particular, we show that the local complexity serves as an upper bound on the total variation of the function over the input data distribution and thus that feature learning can be related to adversarial robustness. Lastly, we consider how optimization drives ReLU networks towards solutions with lower local complexity. Overall, this work contributes a theoretical framework towards relating geometric properties of ReLU networks to different aspects of learning such as feature learning and representation cost.

[LG-13] GDM4MMIMO: Generative Diffusion Models for Massive MIMO Communications

Link: https://arxiv.org/abs/2412.18281
Authors: Zhenzhou Jin, Li You, Huibin Zhou, Yuanshuo Wang, Xiaofeng Liu, Xinrui Gong, Xiqi Gao, Derrick Wing Kwan Ng, Xiang-Gen Xia
Keywords: offers significant advantages, burgeoning data demands, data demands anticipated, Massive multiple-input multiple-output, wireless communication systems
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 6 pages, 3 figures

Abstract:Massive multiple-input multiple-output (MIMO) offers significant advantages in spectral and energy efficiencies, positioning it as a cornerstone technology of fifth-generation (5G) wireless communication systems and a promising solution for the burgeoning data demands anticipated in sixth-generation (6G) networks. In recent years, with the continuous advancement of artificial intelligence (AI), a multitude of task-oriented generative foundation models (GFMs) have emerged, achieving remarkable performance in various fields such as computer vision (CV), natural language processing (NLP), and autonomous driving. As a pioneering force, these models are driving the paradigm shift in AI towards generative AI (GenAI). Among them, the generative diffusion model (GDM), as one of the state-of-the-art families of generative models, demonstrates an exceptional capability to learn implicit prior knowledge and robust generalization capabilities, thereby enhancing its versatility and effectiveness across diverse applications. In this paper, we delve into the potential applications of GDM in massive MIMO communications. Specifically, we first provide an overview of massive MIMO communication, the framework of GFMs, and the working mechanism of GDM. Following this, we discuss recent research advancements in the field and present a case study of near-field channel estimation based on GDM, demonstrating its promising potential for facilitating efficient ultra-dimensional channel state information (CSI) acquisition in the context of massive MIMO communications. Finally, we highlight several pressing challenges in future mobile communications and identify promising research directions surrounding GDM.

[LG-14] NoiseHGNN: Synthesized Similarity Graph-Based Neural Network For Noised Heterogeneous Graph Representation Learning AAAI2025

Link: https://arxiv.org/abs/2412.18267
Authors: Xiong Zhang, Cheng Xie, Haoran Duan, Beibei Yu
Keywords: data environments intrinsically, environments intrinsically exist, graph, graph data environments, intrinsically exist noise
Categories: Machine Learning (cs.LG)
Comments: AAAI2025

Abstract:Real-world graph data environments intrinsically contain noise (e.g., link and structure errors) that inevitably disturbs the effectiveness of graph representation and downstream learning tasks. For homogeneous graphs, the latest works use original node features to synthesize a similarity graph that can correct the structure of the noised graph. This idea is based on the homogeneity assumption, which states that similar nodes in the homogeneous graph tend to have direct links in the original graph. However, similar nodes in heterogeneous graphs usually do not have direct links, so this idea cannot be used to correct the original noised graph. This poses a significant challenge in noised heterogeneous graph learning. To this end, this paper proposes a novel synthesized similarity-based graph neural network compatible with noised heterogeneous graph learning. First, we calculate the original feature similarities of all nodes to synthesize a similarity-based high-order graph. Second, we propose a similarity-aware encoder to embed original and synthesized graphs with shared parameters. Then, instead of graph-to-graph supervising, we synchronously supervise the original and synthesized graph embeddings to predict the same labels. Meanwhile, a target-based graph extracted from the synthesized graph contrasts the structure of the metapath-based graph extracted from the original graph to learn the mutual information. Extensive experiments on numerous real-world datasets show that the proposed method achieves state-of-the-art results in noised heterogeneous graph learning tasks. In particular, improvements of 5–6% are observed on several noised datasets compared with previous SOTA methods. The code and datasets are available at this https URL.

[LG-15] Free the Design Space of Equivariant Graph Neural Networks: High-Rank Irreducible Cartesian Tensor Decomposition and Bases of Equivariant Spaces

Link: https://arxiv.org/abs/2412.18263
Authors: Shihao Shao, Yikang Li, Zhouchen Lin, Qinghua Cui
Keywords: Irreducible Cartesian tensors, graph neural networks, Irreducible Cartesian, equivariant graph neural, ICT decomposition
Categories: Machine Learning (cs.LG); Mathematical Physics (math-ph); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
Comments: 46 pages, 4 code snippets

Abstract:Irreducible Cartesian tensors (ICTs) play a crucial role in the design of equivariant graph neural networks, as well as in theoretical chemistry and chemical physics. Meanwhile, the design space of available linear operations on tensors that preserve symmetry presents a significant challenge. The ICT decomposition and a basis of this equivariant space are difficult to obtain for high-order tensors. After decades of research, we recently achieve an explicit ICT decomposition for $n=5$ \citep{bonvicini2024irreducible} with factorial time/space complexity. This work, for the first time, obtains decomposition matrices for ICTs up to rank $n=9$ with reduced and affordable complexity, by constructing what we call path matrices. The path matrices are obtained via performing chain-like contraction with Clebsch-Gordan matrices following the parentage scheme. We prove and leverage that the concatenation of path matrices is an orthonormal change-of-basis matrix between the Cartesian tensor product space and the spherical direct sum spaces. Furthermore, we identify a complete orthogonal basis for the equivariant space, rather than a spanning set \citep{pearce2023brauer}, through this path matrices technique. We further extend our result to the arbitrary tensor product and direct sum spaces, enabling free design between different spaces while keeping symmetry. The Python code is available in the appendix where the $n=6,\dots,9$ ICT decomposition matrices are obtained in 0.1s, 0.5s, 1s, 3s, 11s, and 4m32s, respectively.

[LG-16] Efficient Contrastive Explanations on Demand

Link: https://arxiv.org/abs/2412.18262
Authors: Yacine Izza, Joao Marques-Silva
Keywords: Recent work revealed, Recent work, work revealed, revealed a tight, restricted forms
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.08297

Abstract:Recent work revealed a tight connection between adversarial robustness and restricted forms of symbolic explanations, namely distance-based (formal) explanations. This connection is significant because it represents a first step towards making the computation of symbolic explanations as efficient as deciding the existence of adversarial examples, especially for highly complex machine learning (ML) models. However, a major performance bottleneck remains, because of the very large number of features that ML models may possess, in particular for deep neural networks. This paper proposes novel algorithms to compute the so-called contrastive explanations for ML models with a large number of features, by leveraging on adversarial robustness. Furthermore, the paper also proposes novel algorithms for listing explanations and finding smallest contrastive explanations. The experimental results demonstrate the performance gains achieved by the novel algorithms proposed in this paper.

[LG-17] Schrödinger Bridge Type Diffusion Models as an Extension of Variational Autoencoders

Link: https://arxiv.org/abs/2412.18237
Authors: Kentaro Kaba, Reo Shimizu, Masayuki Ohzeki, Yuki Sughiyama
Keywords: Generative diffusion models, stochastic differential equations, backward stochastic differential, Generative diffusion, diffusion models
Categories: Machine Learning (cs.LG)

Abstract:Generative diffusion models use time-forward and backward stochastic differential equations to connect the data and prior distributions. While conventional diffusion models (e.g., score-based models) only learn the backward process, more flexible frameworks have been proposed to also learn the forward process by employing the Schrödinger bridge (SB). However, due to the complexity of the mathematical structure behind SB-type models, it is hard to give an intuitive understanding of their objective function. In this work, we propose a unified framework to construct diffusion models by reinterpreting the SB-type models as an extension of variational autoencoders. In this context, the data processing inequality plays a crucial role. As a result, we find that the objective function consists of the prior loss and drift matching parts.

[LG-18] Conditional Deep Canonical Time Warping

Link: https://arxiv.org/abs/2412.18234
Authors: Afek Steinberg, Ran Eisenberg, Ofir Lindenbaum
Keywords: local time shifting, vision and bioinformatics, computer vision, Temporal, Conditional Deep Canonical
Categories: Machine Learning (cs.LG)

Abstract:Temporal alignment of sequences is a fundamental challenge in many applications, such as computer vision and bioinformatics, where local time shifting needs to be accounted for. Misalignment can lead to poor model generalization, especially in high-dimensional sequences. Existing methods often struggle with optimization when dealing with high-dimensional sparse data, falling into poor alignments. Feature selection is frequently used to enhance model performance for sparse data. However, a fixed set of selected features would not generally work for dynamically changing sequences and would need to be modified based on the state of the sequence. Therefore, modifying the selected features based on contextual input would result in better alignment. Our suggested method, Conditional Deep Canonical Temporal Time Warping (CDCTW), is designed for temporal alignment in sparse temporal data to address these challenges. CDCTW enhances alignment accuracy for high-dimensional time-dependent views by performing dynamic time warping on data embedded in a maximally correlated subspace, which handles sparsity with a novel feature selection method. We validate the effectiveness of CDCTW through extensive experiments on various datasets, demonstrating superior performance over previous techniques.
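
For readers unfamiliar with the dynamic time warping step the abstract mentions, the following is a minimal sketch of classic DTW on two one-dimensional sequences. It is not the CDCTW method: there is no learned embedding, no feature selection, and the function name `dtw_distance` and the absolute-difference cost are assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences.

    cost[i][j] holds the cheapest cumulative cost of aligning the first
    i elements of `a` with the first j elements of `b`.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local mismatch (illustrative choice)
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# A locally time-shifted copy aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The quadratic table is the simplest formulation; it shows why local time shifts are absorbed by the alignment rather than counted as error.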

[LG-19] Towards Macro-AUC oriented Imbalanced Multi-Label Continual Learning AAAI2025

Link: https://arxiv.org/abs/2412.18231
Authors: Yan Zhang, Guoqiang Wu, Bingzheng Wang, Teng Pang, Haoliang Sun, Yilong Yin
Keywords: Continual Learning, multi-class classification task, Multi-Label Learning, existing work primarily, work primarily focuses
Categories: Machine Learning (cs.LG)
Comments: 7 pages of main text, 11 pages of appendix, accepted to AAAI 2025

Abstract:In Continual Learning (CL), while existing work primarily focuses on the multi-class classification task, there has been limited research on Multi-Label Learning (MLL). In practice, MLL datasets are often class-imbalanced, making it inherently challenging, a problem that is even more acute in CL. Due to its sensitivity to imbalance, Macro-AUC is an appropriate and widely used measure in MLL. However, there is no research to optimize Macro-AUC in MLCL specifically. To fill this gap, in this paper, we propose a new memory replay-based method to tackle the imbalance issue for Macro-AUC-oriented MLCL. Specifically, inspired by recent theory work, we propose a new Reweighted Label-Distribution-Aware Margin (RLDAM) loss. Furthermore, to be compatible with the RLDAM loss, a new memory-updating strategy named Weight Retain Updating (WRU) is proposed to maintain the numbers of positive and negative instances of the original dataset in memory. Theoretically, we provide superior generalization analyses of the RLDAM-based algorithm in terms of Macro-AUC, separately in batch MLL and MLCL settings. This is the first work to offer theoretical generalization analyses in MLCL to our knowledge. Finally, a series of experimental results illustrate the effectiveness of our method over several baselines. Our codes are available at this https URL.
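
As a reminder of what the Macro-AUC measure computes, here is a small sketch in plain Python: the AUC is evaluated separately for each label and then averaged with equal weight, which is why rare labels count as much as frequent ones. The function names and the toy data are assumptions for illustration; this is the standard metric definition, not the paper's RLDAM loss or WRU strategy.

```python
def auc(scores, labels):
    """AUC for one label: share of (positive, negative) pairs ranked
    correctly, with ties counted as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both positives and negatives")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(score_matrix, label_matrix):
    """Unweighted mean of the per-label AUCs (one column per label)."""
    num_labels = len(label_matrix[0])
    per_label = [
        auc([row[j] for row in score_matrix], [row[j] for row in label_matrix])
        for j in range(num_labels)
    ]
    return sum(per_label) / num_labels


# Four instances, two labels (toy data).
scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6], [0.3, 0.1]]
labels = [[1, 0], [1, 1], [0, 0], [0, 1]]
print(macro_auc(scores, labels))  # label 0 has AUC 1.0, label 1 has AUC 0.5
```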

[LG-20] On the Effectiveness of Adversarial Training on Malware Classifiers

Link: https://arxiv.org/abs/2412.18218
Authors: Hamid Bostani, Jacopo Cortellazzi, Daniel Arp, Fabio Pierazzi, Veelasha Moonsamy, Lorenzo Cavallaro
Keywords: harden learning-based classifiers, adversarial evasive attacks, widely applied, applied to harden, harden learning-based
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Adversarial Training (AT) has been widely applied to harden learning-based classifiers against adversarial evasive attacks. However, its effectiveness in identifying and strengthening vulnerable areas of the model’s decision space while maintaining high performance on clean data of malware classifiers remains an under-explored area. In this context, the robustness that AT achieves has often been assessed against unrealistic or weak adversarial attacks, which negatively affect performance on clean data and are arguably no longer threats. Previous work seems to suggest robustness is a task-dependent property of AT. We instead argue it is a more complex problem that requires exploring AT and the intertwined roles played by certain factors within data, feature representations, classifiers, and robust optimization settings, as well as proper evaluation factors, such as the realism of evasion attacks, to gain a true sense of AT’s effectiveness. In our paper, we address this gap by systematically exploring the role such factors have in hardening malware classifiers through AT. Contrary to recent prior work, our research and extensive experiments confirm the hypothesis that all such factors influence the actual effectiveness of AT, as demonstrated by the varying degrees of success in our empirical analysis. We identify five evaluation pitfalls that affect state-of-the-art studies and summarize our insights in ten takeaways to draw promising research directions toward better understanding the factors’ settings under which adversarial training works best.

[LG-21] U-Mamba-Net: A highly efficient Mamba-based U-net style network for noisy and reverberant speech separation

Link: https://arxiv.org/abs/2412.18217
Authors: Shaoxiang Dang, Tetsuya Matsumoto, Yoshinori Takeuchi, Hiroaki Kudo
Keywords: multiple overlapping speakers, involves separating mixed, separation involves separating, separating mixed speech, overlapping speakers
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:The topic of speech separation involves separating mixed speech with multiple overlapping speakers into several streams, with each stream containing speech from only one speaker. Many highly effective models have emerged and proliferated rapidly over time. However, the size and computational load of these models have also increased accordingly. This is a disaster for the community, as researchers need more time and computational resources to reproduce and compare existing models. In this paper, we propose U-Mamba-Net: a lightweight Mamba-based U-style model for speech separation in complex environments. Mamba is a state space sequence model that incorporates feature selection capabilities. The U-style network is a fully convolutional neural network whose symmetric contracting and expansive paths are able to learn multi-resolution features. In our work, Mamba serves as a feature filter, alternating with U-Net. We test the proposed model on Libri2mix. The results show that U-Mamba-Net achieves improved performance with quite low computational cost.

[LG-22] Accelerating AIGC Services with Latent Action Diffusion Scheduling in Edge Networks

Link: https://arxiv.org/abs/2412.18212
Authors: Changfu Xu, Jianxiong Guo, Wanyu Lin, Haodong Zou, Wentao Fan, Tian Wang, Xiaowen Chu, Jiannong Cao
Keywords: Artificial Intelligence Generated, Intelligence Generated Content, Artificial Intelligence, Intelligence Generated, gained significant popularity
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Under review

Abstract:Artificial Intelligence Generated Content (AIGC) has gained significant popularity for creating diverse content. Current AIGC models primarily focus on content quality within a centralized framework, resulting in a high service delay and negative user experiences. However, not only does the workload of an AIGC task depend on the AIGC model’s complexity rather than the amount of data, but the large model and its multi-layer encoder structure also result in a huge demand for computational and memory resources. These unique characteristics pose new challenges in its modeling, deployment, and scheduling at edge networks. Thus, we model an offloading problem among edges for providing real AIGC services and propose LAD-TS, a novel Latent Action Diffusion-based Task Scheduling method that orchestrates multiple edge servers for expedited AIGC services. The LAD-TS generates a near-optimal offloading decision by leveraging the diffusion model’s conditional generation capability and the reinforcement learning’s environment interaction ability, thereby minimizing the service delays under multiple resource constraints. Meanwhile, a latent action diffusion strategy is designed to guide decision generation by utilizing historical action probability, enabling rapid achievement of near-optimal decisions. Furthermore, we develop DEdgeAI, a prototype edge system with a refined AIGC model deployment to implement and evaluate our LAD-TS method. DEdgeAI provides a real AIGC service for users, demonstrating up to 29.18% shorter service delays than the current five representative AIGC platforms. We release our open-source code at this https URL.

[LG-23] Developing Cryptocurrency Trading Strategy Based on Autoencoder-CNN-GANs Algorithms

Link: https://arxiv.org/abs/2412.18202
Authors: Zhuohuan Hu, Richard Yu, Zizhou Zhang, Haoran Zheng, Qianying Liu, Yining Zhou
Keywords: financial time series, paper leverages machine, analyze financial time, time series, paper leverages
Categories: Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Comments: The paper was accepted by the 2024 4th International Conference on Artificial Intelligence, Robotics, and Communication (ICAIRC 2024)

Abstract:This paper leverages machine learning algorithms to forecast and analyze financial time series. The process begins with a denoising autoencoder to filter out random noise fluctuations from the main contract price data. Then, one-dimensional convolution reduces the dimensionality of the filtered data and extracts key information. The filtered and dimensionality-reduced price data is fed into a GANs network, and its output serves as the input to a fully connected network. Through cross-validation, a model is trained to capture features that precede large price fluctuations. The model predicts the likelihood and direction of significant price changes in real-time price sequences, placing trades at moments of high prediction accuracy. Empirical results demonstrate that using autoencoders and convolution to filter and denoise financial data, combined with GANs, achieves a certain level of predictive performance, validating the capabilities of machine learning algorithms to discover underlying patterns in financial sequences. Keywords: CNN; GANs; Cryptocurrency; Prediction.

[LG-24] Learning Sign Language Representation using CNN LSTM 3DCNN CNN RNN LSTM and CCN TD

Link: https://arxiv.org/abs/2412.18187
Authors: Nikita Louison, Wayne Goodridge, Koffka Khan
Keywords: Learning applications focus, Language Learning applications, Existing Sign Language, Sign Language Learning, Sign Language
Categories: Machine Learning (cs.LG)
Comments: 10 pages

Abstract:Existing Sign Language Learning applications focus on the demonstration of the sign in the hope that the student will copy a sign correctly. In these cases, only a teacher can confirm that the sign was completed correctly, by reviewing a video captured manually. Sign Language Translation is a widely explored field in visual recognition. This paper seeks to explore the algorithms that will allow for real-time, video sign translation, and grading of sign language accuracy for new sign language users. This required algorithms capable of recognizing and processing spatial and temporal features. The aim of this paper is to evaluate and identify the best neural network algorithm that can facilitate a sign language tuition system of this nature. Modern popular algorithms including CNN and 3DCNN are compared on a dataset not yet explored, Trinidad and Tobago Sign Language as well as an American Sign Language dataset. The 3DCNN algorithm was found to be the best performing neural network algorithm from these systems with 91% accuracy in the TTSL dataset and 83% accuracy in the ASL dataset.

[LG-25] Unified Stochastic Framework for Neural Network Quantization and Pruning

Link: https://arxiv.org/abs/2412.18184
Authors: Haoyu Zhang, Rayan Saab
Keywords: compressing neural networks, limited theoretical analysis, theoretical analysis connecting, neural networks, treated independently
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
Comments: 14 pages

Abstract:Quantization and pruning are two essential techniques for compressing neural networks, yet they are often treated independently, with limited theoretical analysis connecting them. This paper introduces a unified framework for post-training quantization and pruning using stochastic path-following algorithms. Our approach builds on the Stochastic Path Following Quantization (SPFQ) method, extending its applicability to pruning and low-bit quantization, including challenging 1-bit regimes. By incorporating a scaling parameter and generalizing the stochastic operator, the proposed method achieves robust error correction and yields rigorous theoretical error bounds for both quantization and pruning as well as their combination.
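
To make the idea of a stochastic quantization operator concrete, the sketch below shows plain unbiased stochastic rounding to a uniform grid. Whether this matches the operator used in SPFQ or in this paper is an assumption; the function name `stochastic_round`, the grid spacing `step`, and the seeded generator are illustrative only, and the path-following error correction described in the abstract is not modeled.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, choosing the upper neighbour with
    probability equal to x's fractional position between the two neighbours,
    so that the expected output equals x."""
    lower = math.floor(x / step) * step
    frac = (x - lower) / step  # in [0, 1)
    return lower + step if rng.random() < frac else lower


rng = random.Random(0)  # seeded for repeatability
samples = [stochastic_round(0.3, 0.25, rng) for _ in range(20000)]
# Every sample is one of the two grid neighbours of 0.3; their mean is close to 0.3.
print(sorted(set(samples)), sum(samples) / len(samples))
```

Because the rounding error has zero mean, errors from many weights tend to cancel rather than accumulate, which is the usual motivation for stochastic over deterministic rounding.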

[LG-26] Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence

Link: https://arxiv.org/abs/2412.18164
Authors: Yinbin Han, Meisam Razaviyayn, Renyuan Xu
Keywords: demonstrating exceptional capability, capturing target data, target data distributions, generative modeling, demonstrating exceptional
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 28 pages

Abstract:Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback-Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularities throughout training, we prove that the control and value sequences generated by the algorithm maintain the regularity. Additionally, we explore extensions of our framework to parametric settings and continuous-time formulations.

[LG-27] Neural Conformal Control for Time Series Forecasting

Link: https://arxiv.org/abs/2412.18144
Authors: Ruipu Li, Alexander Rodríguez
Keywords: neural network conformal, network conformal prediction, non-stationary environments, neural network, time series
Categories: Machine Learning (cs.LG)

Abstract:We introduce a neural network conformal prediction method for time series that enhances adaptivity in non-stationary environments. Our approach acts as a neural controller designed to achieve desired target coverage, leveraging auxiliary multi-view data with neural network encoders in an end-to-end manner to further enhance adaptivity. Additionally, our model is designed to enhance the consistency of prediction intervals in different quantiles by integrating monotonicity constraints and leverages data from related tasks to boost few-shot learning performance. Using real-world datasets from epidemics, electric demand, weather, and others, we empirically demonstrate significant improvements in coverage and probabilistic accuracy, and find that our method is the only one that combines good calibration with consistency in prediction intervals.
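
For context on what "target coverage" means here, the following sketch shows the basic static split conformal construction: a symmetric interval around a point forecast, sized by a quantile of held-out absolute residuals. This is the textbook baseline, not the neural controller proposed in the paper, and it assumes exchangeable data, which is exactly the assumption that non-stationary time series break. The function names and the residual values are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Finite-sample quantile used by split conformal prediction:
    the ceil((n + 1) * (1 - alpha))-th smallest absolute residual."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this coverage level
    return sorted(residuals)[rank - 1]


def prediction_interval(point_forecast, residuals, alpha):
    """Symmetric interval targeting coverage of at least 1 - alpha."""
    q = conformal_quantile(residuals, alpha)
    return point_forecast - q, point_forecast + q


# Absolute residuals from a held-out calibration set (made-up numbers).
calibration = [0.5, 1.0, 0.2, 0.8, 0.3, 0.9, 0.1, 0.7, 0.4]
print(prediction_interval(10.0, calibration, alpha=0.5))
```

Adaptive methods, including the one in this abstract, can be read as replacing the fixed quantile with one that is updated over time so that empirical coverage tracks the target.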

[LG-28] An Instrumental Value for Data Production and its Application to Data Pricing

Link: https://arxiv.org/abs/2412.18140
Authors: Rui Ai, Boxiang Lyu, Zhaoran Wang, Zhuoran Yang, Haifeng Xu
Keywords: data, data production process, first-best revenue, production process, assist decision-making
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)

Abstract:How much value does a dataset or a data production process have to an agent who wishes to use the data to assist decision-making? This is a fundamental question towards understanding the value of data as well as further pricing of data. This paper develops an approach for capturing the instrumental value of data production processes, which takes two key factors into account: (a) the context of the agent’s decision-making problem; (b) prior data or information the agent already possesses. We “micro-found” our valuation concepts by showing how they connect to classic notions of information design and signals in information economics. When instantiated in the domain of Bayesian linear regression, our value naturally corresponds to information gain. Based on our designed data value, we then study a basic monopoly pricing setting with a buyer looking to purchase from a seller some labeled data of a certain feature direction in order to improve a Bayesian regression model. We show that when the seller has the ability to fully customize any data request, she can extract the first-best revenue (i.e., full surplus) from any population of buyers, i.e., achieving first-degree price discrimination. If the seller can only sell data that are derived from an existing data pool, this limits her ability to customize, and achieving first-best revenue becomes generally impossible. However, we design a mechanism that achieves seller revenue at most $\log(\kappa)$ less than the first-best revenue, where $\kappa$ is the condition number associated with the data matrix. A corollary of this result is that the seller can extract the first-best revenue in the multi-armed bandits special case.

[LG-29] Fundamental Limits in the Search for Less Discriminatory Algorithms – and How to Avoid Them NEURIPS

Link: https://arxiv.org/abs/2412.18138
Authors: Benjamin Laufer, Manisch Raghavan, Solon Barocas
Keywords: Disparate impact doctrine, data-driven algorithmic decisions, important legal apparatus, targeting unfair data-driven, unfair data-driven algorithmic
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 23 pages, 4 figures, 1 table. Prior versions appeared at NeurIPS Algorithmic Fairness Through the Lens of Metrics and Evaluation Workshop (AFME 2024) and Regulatable ML Workshop (RegML 2024). Forthcoming at ACM CSLaw 2025

Abstract:Disparate impact doctrine offers an important legal apparatus for targeting unfair data-driven algorithmic decisions. A recent body of work has focused on conceptualizing and operationalizing one particular construct from this doctrine – the less discriminatory alternative, an alternative policy that reduces disparities while meeting the same business needs of a status quo or baseline policy. This paper puts forward four fundamental results, which each represent limits to searching for and using less discriminatory algorithms (LDAs). (1) Statistically, although LDAs are almost always identifiable in retrospect on fixed populations, making conclusions about how alternative classifiers perform on an unobserved distribution is more difficult. (2) Mathematically, a classifier can only exhibit certain combinations of accuracy and selection rate disparity between groups, given the size of each group and the base rate of the property or outcome of interest in each group. (3) Computationally, a search for a lower-disparity classifier at some baseline level of utility is NP-hard. (4) From a modeling and consumer welfare perspective, defining an LDA only in terms of business needs can lead to LDAs that leave consumers strictly worse off, including members of the disadvantaged group. These findings, which may seem on their face to give firms strong defenses against discrimination claims, only tell part of the story. For each of our negative results limiting what is attainable in this setting, we offer positive results demonstrating that there exist low-cost strategies that are remarkably effective at identifying viable lower-disparity policies.

[LG-30] Learning Randomized Reductions and Program Properties

Link: https://arxiv.org/abs/2412.18134
Authors: Ferhat Erata, Orr Paradise, Timos Antonopoulos, ThanhVu Nguyen, Shafi Goldwasser, Ruzica Piskac
Keywords: traditional approaches relying, computer science, formal verification, correctness of computations, computations remains
Categories: Machine Learning (cs.LG); Computational Complexity (cs.CC); Programming Languages (cs.PL); Software Engineering (cs.SE)

Abstract:The correctness of computations remains a significant challenge in computer science, with traditional approaches relying on automated testing or formal verification. Self-testing/correcting programs introduce an alternative paradigm, allowing a program to verify and correct its own outputs via randomized reductions, a concept that previously required manual derivation. In this paper, we present Bitween, a method and tool for automated learning of randomized (self)-reductions and program properties in numerical programs. Bitween combines symbolic analysis and machine learning, with a surprising finding: polynomial-time linear regression, a basic optimization method, is not only sufficient but also highly effective for deriving complex randomized self-reductions and program invariants, often outperforming sophisticated mixed-integer linear programming solvers. We establish a theoretical framework for learning these reductions and introduce RSR-Bench, a benchmark suite for evaluating Bitween’s capabilities on scientific and machine learning functions. Our empirical results show that Bitween surpasses state-of-the-art tools in scalability, stability, and sample efficiency when evaluated on nonlinear invariant benchmarks like NLA-DigBench. Bitween is open-source as a Python package and accessible via a web interface that supports C language programs.

[LG-31] Age Optimal Sampling for Unreliable Channels under Unknown Channel Statistics

Link: https://arxiv.org/abs/2412.18119
Authors: Hongyi He,Haoyue Tang,Jiayu Pan,Jintao Wang,Jian Song,Leandros Tassiulas
Keywords: sensor forwards status, forwards status updates, sensor forwards, study a system, algorithm
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:In this paper, we study a system in which a sensor forwards status updates to a receiver through an error-prone channel, while the receiver sends the transmission results back to the sensor via a reliable channel. Both channels are subject to random delays. To evaluate the timeliness of the status information at the receiver, we use the Age of Information (AoI) metric. The objective is to design a sampling policy that minimizes the expected time-average AoI, even when the channel statistics (e.g., delay distributions) are unknown. We first review the threshold structure of the optimal offline policy under known channel statistics and then reformulate the design of the online algorithm as a stochastic approximation problem. We propose a Robbins-Monro algorithm to solve this problem and demonstrate that the optimal threshold can be approximated almost surely. Moreover, we prove that the cumulative AoI regret of the online algorithm grows at rate \mathcal{O}(\ln K), where K is the number of successful transmissions. In addition, our algorithm is shown to be minimax order optimal, in the sense that for any online learning algorithm, the cumulative AoI regret up to the K-th successful transmission grows at a rate of at least \Omega(\ln K) under the worst-case delay distribution. Finally, we improve the stability of the proposed online learning algorithm through a momentum-based stochastic gradient descent algorithm. Simulation results validate the performance of our proposed algorithm.
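
The abstract above casts threshold learning as a stochastic approximation problem solved with a Robbins-Monro algorithm. The sketch below is not the paper's AoI algorithm. It only illustrates the generic Robbins-Monro iteration on a toy root-finding problem. The function names, the toy objective g(theta) = theta - target, and the step size a_k = 1/k are all assumptions made for illustration.

```python
import random


def robbins_monro(noisy_g, theta0, num_steps, step_scale=1.0):
    """Generic Robbins-Monro iteration: theta <- theta - a_k * noisy_g(theta).

    Uses the classic step size a_k = step_scale / k, which satisfies
    sum(a_k) = infinity and sum(a_k^2) < infinity.
    """
    theta = theta0
    for k in range(1, num_steps + 1):
        theta -= (step_scale / k) * noisy_g(theta)
    return theta


def make_noisy_g(target, noise_std, rng):
    """Hypothetical toy problem: g(theta) = theta - target, seen through Gaussian noise."""
    def noisy_g(theta):
        return (theta - target) + rng.gauss(0.0, noise_std)
    return noisy_g


rng = random.Random(0)
estimate = robbins_monro(make_noisy_g(2.0, 1.0, rng), theta0=0.0, num_steps=20000)
print(f"estimated root: {estimate:.3f}")
```

With this particular step size the iterate reduces to a running average of the noisy observations, so the estimate concentrates around the root (2.0 here) as the number of steps grows.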

[LG-32] Diverse Concept Proposals for Concept Bottleneck Models ICML2022

Link: https://arxiv.org/abs/2412.18059
Authors: Katrina Brown,Marton Havasi,Finale Doshi-Velez
Keywords: key priority, concepts, Concept bottleneck models, interpretable predictive models, Concept bottleneck
Subjects: Machine Learning (cs.LG)
Comments: Accepted to the ICML 2022 Workshop on Human-Machine Collaboration and Teaming

Abstract:Concept bottleneck models are interpretable predictive models that are often used in domains where model trust is a key priority, such as healthcare. They identify a small number of human-interpretable concepts in the data, which they then use to make predictions. Learning relevant concepts from data proves to be a challenging task. The most predictive concepts may not align with expert intuition, which undermines interpretability and leaves no recourse. Our proposed approach identifies a number of predictive concepts that explain the data. By offering multiple alternative explanations, we allow the human expert to choose the one that best aligns with their expectation. To demonstrate our method, we show that it is able to discover all possible concept representations on a synthetic dataset. On EHR data, our model was able to identify 4 out of the 5 pre-defined concepts without supervision.

[LG-33] Multimodal Learning with Uncertainty Quantification based on Discounted Belief Fusion

Link: https://arxiv.org/abs/2412.18024
Authors: Grigor Bezirganyan,Sana Sellami,Laure Berti-Équille,Sébastien Fournier
Keywords: fields like healthcare, autonomous driving, information is drawn, multiple sources, evidence
Subjects: Machine Learning (cs.LG)

Abstract:Multimodal AI models are increasingly used in fields like healthcare, finance, and autonomous driving, where information is drawn from multiple sources or modalities such as images, texts, audios, videos. However, effectively managing uncertainty - arising from noise, insufficient evidence, or conflicts between modalities - is crucial for reliable decision-making. Current uncertainty-aware ML methods leveraging, for example, evidence averaging, or evidence accumulation underestimate uncertainties in high-conflict scenarios. Moreover, the state-of-the-art evidence averaging strategy struggles with non-associativity and fails to scale to multiple modalities. To address these challenges, we propose a novel multimodal learning method with order-invariant evidence fusion and introduce a conflict-based discounting mechanism that reallocates uncertain mass when unreliable modalities are detected. We provide both theoretical analysis and experimental validation, demonstrating that unlike the previous work, the proposed approach effectively distinguishes between conflicting and non-conflicting samples based on the provided uncertainty estimates, and outperforms the previous models in uncertainty-based conflict detection.

[LG-34] Data-driven Modeling of Parameterized Nonlinear Fluid Dynamical Systems with a Dynamics-embedded Conditional Generative Adversarial Network

Link: https://arxiv.org/abs/2412.17978
Authors: Abdolvahhab Rostamijavanani,Shanwu Li,Yongchao Yang
Keywords: dynamics-generator conditional GAN, modified conditional GAN, conditional GAN, fluid dynamical systems, nonlinear fluid dynamical
Subjects: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)

Abstract:This work presents a data-driven solution to accurately predict parameterized nonlinear fluid dynamical systems using a dynamics-generator conditional GAN (Dyn-cGAN) as a surrogate model. The Dyn-cGAN includes a dynamics block within a modified conditional GAN, enabling the simultaneous identification of temporal dynamics and their dependence on system parameters. The learned Dyn-cGAN model takes into account the system parameters to predict the flow fields of the system accurately. We evaluate the effectiveness and limitations of the developed Dyn-cGAN through numerical studies of various parameterized nonlinear fluid dynamical systems, including flow over a cylinder and a 2-D cavity problem, with different Reynolds numbers. Furthermore, we examine how Reynolds number affects the accuracy of the predictions for both case studies. Additionally, we investigate the impact of the number of time steps involved in the process of dynamics block training on the accuracy of predictions, and we find that an optimal value exists based on errors and mutual information relative to the ground truth.

[LG-35] Extending Graph Condensation to Multi-Label Datasets: A Benchmark Study

Link: https://arxiv.org/abs/2412.17961
Authors: Liangliang Zhang,Haoran Bao,Yao Ma
Keywords: including computational resource, grows increasingly complicate, presents significant challenges, computational resource constraints, training graph neural
Subjects: Machine Learning (cs.LG)

Abstract:As graph data grows increasingly complicated, training graph neural networks (GNNs) on large-scale datasets presents significant challenges, including computational resource constraints, data redundancy, and transmission inefficiencies. While existing graph condensation techniques have shown promise in addressing these issues, they are predominantly designed for single-label datasets, where each node is associated with a single class label. However, many real-world applications, such as social network analysis and bioinformatics, involve multi-label graph datasets, where one node can have various related labels. To deal with this problem, we extend traditional graph condensation approaches to accommodate multi-label datasets by introducing modifications to synthetic dataset initialization and condensing optimization. Through experiments on eight real-world multi-label graph datasets, we demonstrate the effectiveness of our method. In our experiments, the GCond framework, combined with K-Center initialization and binary cross-entropy loss (BCELoss), achieves the best performance in general. This benchmark for multi-label graph condensation not only enhances the scalability and efficiency of GNNs for multi-label graph data, but also offers substantial benefits for diverse real-world applications.

[LG-36] Trading Devil RL: Backdoor attack via Stock market Bayesian Optimization and Reinforcement Learning

Link: https://arxiv.org/abs/2412.17908
Authors: Orson Mengara
Keywords: made significant progress, everyday applications, rapid development, development of generative, number of sub-fields
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph); Physics and Society (physics.soc-ph)
Comments: End of data poisoning research!: Navier-stokes equations (3D); Reinforcement Learning (RL); HFT (High Frequency Trading); Limit Order Markets and backdoor attack detection

Abstract:With the rapid development of generative artificial intelligence, particularly large language models, a number of sub-fields of deep learning have made significant progress and are now very useful in everyday applications. For example, well-known financial institutions simulate a wide range of scenarios for various models created by their research teams using reinforcement learning, both before production and after regular operations. In this work, we propose a backdoor attack that focuses solely on data poisoning. This particular backdoor attack is classified as an attack without prior consideration or trigger, and we name it FinanceLLMsBackRL. Our aim is to examine the potential effects of large language models that use reinforcement learning systems for text production or speech recognition, finance, physics, or the ecosystem of contemporary artificial intelligence models.

[LG-37] Graph Structure Refinement with Energy-based Contrastive Learning AAAI2025

Link: https://arxiv.org/abs/2412.17856
Authors: Xianlin Zeng,Yufeng Wang,Yuqi Sun,Guodong Guo,Wenrui Ding,Baochang Zhang
Keywords: Graph Neural Networks, Neural Networks, analyzing graph-structured data, recently gained widespread, gained widespread attention
Subjects: Machine Learning (cs.LG)
Comments: Accepted to AAAI 2025

Abstract:Graph Neural Networks (GNNs) have recently gained widespread attention as a successful tool for analyzing graph-structured data. However, imperfect graph structure with noisy links lacks enough robustness and may damage graph representations, therefore limiting the GNNs’ performance in practical tasks. Moreover, existing generative architectures fail to fit discriminative graph-related tasks. To tackle these issues, we introduce an unsupervised method based on joint generative and discriminative training to learn graph structure and representation, aiming to improve the discriminative performance of generative models. We propose an Energy-based Contrastive Learning (ECL) guided Graph Structure Refinement (GSR) framework, denoted as ECL-GSR. To our knowledge, this is the first work to combine energy-based models with contrastive learning for GSR. Specifically, we leverage ECL to approximate the joint distribution of sample pairs, which increases the similarity between representations of positive pairs while reducing the similarity between negative ones. Refined structure is produced by augmenting and removing edges according to the similarity metrics among node representations. Extensive experiments demonstrate that ECL-GSR outperforms the state-of-the-art on eight benchmark datasets in node classification. ECL-GSR achieves faster training with fewer samples and less memory than the leading baseline, highlighting its simplicity and efficiency in downstream tasks.

[LG-38] Foxtsage vs. Adam: Revolution or Evolution in Optimization?

Link: https://arxiv.org/abs/2412.17855
Authors: Sirwan A. Aula,Tarik A. Rashid
Keywords: Stochastic Gradient Descent, convergence efficiency, techniques are pivotal, Foxtsage, Stochastic Gradient
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 41

Abstract:Optimization techniques are pivotal in neural network training, shaping both predictive performance and convergence efficiency. This study introduces Foxtsage, a novel hybrid optimisation approach that integrates the Hybrid FOX-TSA with Stochastic Gradient Descent for training Multi-Layer Perceptron models. The proposed Foxtsage method is benchmarked against the widely adopted Adam optimizer across multiple standard datasets, focusing on key performance metrics such as training loss, accuracy, precision, recall, F1-score, and computational time. Experimental results demonstrate that Foxtsage achieves a 42.03% reduction in loss mean (Foxtsage: 9.508, Adam: 16.402) and a 42.19% improvement in loss standard deviation (Foxtsage: 20.86, Adam: 36.085), reflecting enhanced consistency and robustness. Modest improvements in accuracy mean (0.78%), precision mean (0.91%), recall mean (1.02%), and F1-score mean (0.89%) further underscore its predictive performance. However, these gains are accompanied by an increased computational cost, with a 330.87% rise in time mean (Foxtsage: 39.541 seconds, Adam: 9.177 seconds). By effectively combining the global search capabilities of FOX-TSA with the stability and adaptability of SGD, Foxtsage presents itself as a robust and viable alternative for neural network optimization tasks.

[LG-39] Zero Shot Time Series Forecasting Using Kolmogorov Arnold Networks NEURIPS

Link: https://arxiv.org/abs/2412.17853
Authors: Abhiroop Bhattacharya,Nandinee Haq
Keywords: Accurate energy price, Accurate energy, decision-making processes, crucial for participants, significantly influences
Subjects: Machine Learning (cs.LG)
Comments: Published In: 2024 NeurIPS Workshop on Time Series in the Age of Large Models

Abstract:Accurate energy price forecasting is crucial for participants in day-ahead energy markets, as it significantly influences their decision-making processes. While machine learning-based approaches have shown promise in enhancing these forecasts, they often remain confined to the specific markets on which they are trained, thereby limiting their adaptability to new or unseen markets. In this paper, we introduce a cross-domain adaptation model designed to forecast energy prices by learning market-invariant representations across different markets during the training phase. We propose a doubly residual N-BEATS network with Kolmogorov Arnold networks at its core for time series forecasting. These networks, grounded in the Kolmogorov-Arnold representation theorem, offer a powerful way to approximate multivariate continuous functions. The cross domain adaptation model was generated with an adversarial framework. The model’s effectiveness was tested in predicting day-ahead electricity prices in a zero shot fashion. In comparison with baseline models, our proposed framework shows promising results. By leveraging the Kolmogorov-Arnold networks, our model can potentially enhance its ability to capture complex patterns in energy price data, thus improving forecast accuracy across diverse market conditions. This addition not only enriches the model’s representational capacity but also contributes to a more robust and flexible forecasting tool adaptable to various energy markets.

[LG-40] Single-Loop Federated Actor-Critic across Heterogeneous Environments AAAI’25

Link: https://arxiv.org/abs/2412.14555
Authors: Ye Zhu,Xiaowen Gong
Keywords: shared policy adaptable, Federated reinforcement learning, promising paradigm, enabling multiple agents, reinforcement learning
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: Extended version of paper accepted at AAAI’25

Abstract:Federated reinforcement learning (FRL) has emerged as a promising paradigm, enabling multiple agents to collaborate and learn a shared policy adaptable across heterogeneous environments. Among the various reinforcement learning (RL) algorithms, the actor-critic (AC) algorithm stands out for its low variance and high sample efficiency. However, little to nothing is known theoretically about AC in a federated manner, especially when each agent interacts with a potentially different environment. The lack of such results is attributed to various technical challenges: a two-level structure illustrating the coupling effect between the actor and the critic, heterogeneous environments, Markovian sampling and multiple local updates. In response, we study Single-loop Federated Actor Critic (SFAC) where agents perform actor-critic learning in a two-level federated manner while interacting with heterogeneous environments. We then provide bounds on the convergence error of SFAC. The results show that the convergence error asymptotically converges to a near-stationary point, with the extent proportional to environment heterogeneity. Moreover, the sample complexity exhibits a linear speed-up through the federation of agents. We evaluate the performance of SFAC through numerical experiments using common RL benchmarks, which demonstrate its effectiveness.

[LG-41] Scalable Quantum-Inspired Optimization through Dynamic Qubit Compression AAAI’25

Link: https://arxiv.org/abs/2412.18571
Authors: Co Tran,Quoc-Bao Tran,Hy Truong Son,Thang N Dinh
Keywords: Hard combinatorial optimization, Hard combinatorial, combinatorial optimization problems, promise potential solutions, limited qubit counts
Subjects: Quantum Physics (quant-ph); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Comments: Accepted to AAAI’25

Abstract:Hard combinatorial optimization problems, often mapped to Ising models, promise potential solutions with quantum advantage but are constrained by limited qubit counts in near-term devices. We present an innovative quantum-inspired framework that dynamically compresses large Ising models to fit available quantum hardware of different sizes. Thus, we aim to bridge the gap between large-scale optimization and current hardware capabilities. Our method leverages a physics-inspired GNN architecture to capture complex interactions in Ising models and accurately predict alignments among neighboring spins (aka qubits) at ground states. By progressively merging such aligned spins, we can reduce the model size while preserving the underlying optimization structure. It also provides a natural trade-off between the solution quality and size reduction, meeting different hardware constraints of quantum computing devices. Extensive numerical studies on Ising instances of diverse topologies show that our method can reduce instance size at multiple levels with virtually no losses in solution quality on the latest D-wave quantum annealers.

[LG-42] HNCI: High-Dimensional Network Causal Inference

Link: https://arxiv.org/abs/2412.18568
Authors: Wenqin Du,Rundong Ding,Yingying Fan,Jinchi Lv
Keywords: causal inference applications, network causal inference, high-dimensional network causal, causal inference, problem of evaluating
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 89 pages, 7 figures

Abstract:The problem of evaluating the effectiveness of a treatment or policy commonly appears in causal inference applications under network interference. In this paper, we suggest the new method of high-dimensional network causal inference (HNCI) that provides both valid confidence interval on the average direct treatment effect on the treated (ADET) and valid confidence set for the neighborhood size for interference effect. We exploit the model setting in Belloni et al. (2022) and allow certain type of heterogeneity in node interference neighborhood sizes. We propose a linear regression formulation of potential outcomes, where the regression coefficients correspond to the underlying true interference function values of nodes and exhibit a latent homogeneous structure. Such a formulation allows us to leverage existing literature from linear regression and homogeneity pursuit to conduct valid statistical inferences with theoretical guarantees. The resulting confidence intervals for the ADET are formally justified through asymptotic normalities with estimable variances. We further provide the confidence set for the neighborhood size with theoretical guarantees exploiting the repro samples approach. The practical utilities of the newly suggested methods are demonstrated through simulation and real data examples.

[LG-43] Convergence of Statistical Estimators via Mutual Information Bounds

Link: https://arxiv.org/abs/2412.18539
Authors: El Mahdi Khribch,Pierre Alquier
Keywords: Recent advances, revealed profound connections, Bayesian nonparametrics, revealed profound, Recent
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Recent advances in statistical learning theory have revealed profound connections between mutual information (MI) bounds, PAC-Bayesian theory, and Bayesian nonparametrics. This work introduces a novel mutual information bound for statistical models. The derived bound has wide-ranging applications in statistical inference. It yields improved contraction rates for fractional posteriors in Bayesian nonparametrics. It can also be used to study a wide range of estimation methods, such as variational inference or Maximum Likelihood Estimation (MLE). By bridging these diverse areas, this work advances our understanding of the fundamental limits of statistical inference and the role of information in learning from data. We hope that these results will not only clarify connections between statistical inference and information theory but also help to develop a new toolbox to study a wide range of estimators.

[LG-44] Subsampling aligning and averaging to find circular coordinates in recurrent time series

Link: https://arxiv.org/abs/2412.18515
Authors: Andrew J. Blumberg,Mathieu Carrière,Jun Hou Fung,Michael A. Mandell
Keywords: finding robust circular, exhibit recurrence, expected to exhibit, uneven sampling density, circular coordinates
Subjects: Machine Learning (stat.ML); Computational Geometry (cs.CG); Machine Learning (cs.LG); Algebraic Topology (math.AT)

Abstract:We introduce a new algorithm for finding robust circular coordinates on data that is expected to exhibit recurrence, such as that which appears in neuronal recordings of C. elegans. Techniques exist to create circular coordinates on a simplicial complex from a dimension 1 cohomology class, and these can be applied to the Rips complex of a dataset when it has a prominent class in its dimension 1 cohomology. However, it is known this approach is extremely sensitive to uneven sampling density. Our algorithm comes with a new method to correct for uneven sampling density, adapting our prior work on averaging coordinates in manifold learning. We use rejection sampling to correct for inhomogeneous sampling and then apply Procrustes matching to align and average the subsamples. In addition to providing a more robust coordinate than other approaches, this subsampling and averaging approach has better efficiency. We validate our technique on both synthetic data sets and neuronal activity recordings. Our results reveal a topological model of neuronal trajectories for C. elegans that is constructed from loops in which different regions of the brain state space can be mapped to specific and interpretable macroscopic behaviors in the worm.

[LG-45] Gaussian entropic optimal transport: Schrödinger bridges and the Sinkhorn algorithm

Link: https://arxiv.org/abs/2412.18432
Authors: O. Deniz Akyildiz,Pierre Del Moral,Joaquín Miguez
Keywords: optimal transport problems, optimal transport, Entropic optimal transport, regularized versions, Entropic optimal
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO)
Comments: 68 pages

Abstract:Entropic optimal transport problems are regularized versions of optimal transport problems. These models play an increasingly important role in machine learning and generative modelling. For finite spaces, these problems are commonly solved using the Sinkhorn algorithm (a.k.a. iterative proportional fitting procedure). However, in more general settings the Sinkhorn iterations are based on nonlinear conditional/conjugate transformations and exact finite-dimensional solutions cannot be computed. This article presents a finite-dimensional recursive formulation of the iterative proportional fitting procedure for general Gaussian multivariate models. As expected, this recursive formulation is closely related to the celebrated Kalman filter and related Riccati matrix difference equations, and it yields algorithms that can be implemented in practical settings without further approximations. We extend this filtering methodology to develop a refined and self-contained convergence analysis of Gaussian Sinkhorn algorithms, including closed form expressions of entropic transport maps and Schrödinger bridges.
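
For readers unfamiliar with the finite-space case mentioned in the abstract above, the following is a minimal sketch of the classic Sinkhorn (iterative proportional fitting) iteration for discrete entropic optimal transport. It does not implement the paper's Gaussian recursion. The function name, the 2x2 cost matrix, the marginals, and the value of epsilon are illustrative assumptions.

```python
import math


def sinkhorn(cost, mu, nu, epsilon, num_iters=500):
    """Discrete entropic OT: alternately rescale rows and columns of the Gibbs kernel.

    cost: n x m list of lists, mu: source marginal (length n),
    nu: target marginal (length m), epsilon: regularization strength.
    Returns the transport plan P[i][j] = u[i] * K[i][j] * v[j].
    """
    n, m = len(mu), len(nu)
    # Gibbs kernel K = exp(-cost / epsilon)
    K = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Match the row marginals, then the column marginals.
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical toy instance: two source points, two target points.
cost = [[0.0, 1.0], [1.0, 0.0]]
mu = [0.5, 0.5]
nu = [0.3, 0.7]
plan = sinkhorn(cost, mu, nu, epsilon=0.5)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(len(mu))) for j in range(len(nu))]
print(row_sums, col_sums)
```

At convergence the row sums of the plan match mu and the column sums match nu. Smaller values of epsilon bring the plan closer to the unregularized optimal transport solution, at the cost of slower convergence and possible numerical underflow in the kernel.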

[LG-46] Discovery of 2D Materials via Symmetry-Constrained Diffusion Model

Link: https://arxiv.org/abs/2412.18414
Authors: Shihang Xu,Shibing Chu,Rami Mrad,Zhejun Zhang,Zhelin Li,Runxian Jiao,Yuanping Chen
Keywords: shown significant promise, shown significant, significant promise, promise in accelerating, materials
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)

Abstract:Generative models for 2D materials have shown significant promise in accelerating the material discovery process. The stability and performance of these materials are strongly influenced by their underlying symmetry. However, existing generative models for 2D materials often neglect symmetry constraints, which limits both the diversity and quality of the generated structures. Here, we introduce a symmetry-constrained diffusion model (SCDM) that integrates space group symmetry into the generative process. By incorporating Wyckoff positions, the model ensures adherence to symmetry principles, leading to the generation of 2,000 candidate structures. DFT calculations were conducted to evaluate the convex hull energies of these structures after structural relaxation. From the generated samples, 843 materials that met the energy stability criteria (Ehull < 0.6 eV/atom) were identified. Among these, six candidates were selected for further stability analysis, including phonon band structure evaluations and electronic properties investigations, all of which exhibited phonon spectrum stability. To benchmark the performance of SCDM, a symmetry-unconstrained diffusion model was also evaluated via a crystal structure prediction model. The results highlight that incorporating symmetry constraints enhances the effectiveness of generated 2D materials, making a contribution to the discovery of 2D materials through generative modeling.

[LG-47] Predator Prey Scavenger Model using Holling's Functional Response of Type III and Physics-Informed Deep Neural Networks

Link: https://arxiv.org/abs/2412.18344
Authors: Aneesh Panchal,Kirti Beniwal,Vivek Kumar
Keywords: Nonlinear mathematical models, mathematical models introduce, Nonlinear mathematical, biological interactions present, present in nature
Subjects: Dynamical Systems (math.DS); Machine Learning (cs.LG)

Abstract:Nonlinear mathematical models describe the relations between various physical and biological interactions present in nature. One of the most famous models is the Lotka-Volterra model, which describes the interaction between predator and prey species. However, predator, scavenger, and prey populations coexist in a natural system where scavengers can additionally rely on the dead bodies of predators present in the system. Keeping this in mind, the formulation and simulation of the predator prey scavenger model is introduced in this paper. For the predation response, the respective prey species are assumed to have Holling’s functional response of type III. The proposed model is tested in various simulations and is found to show satisfactory results in different scenarios. After the simulations, the American forest dataset is used for parameter estimation, which imitates the real-world case. For parameter estimation, a physics-informed deep neural network is used with the Adam backpropagation method, which prevents the avalanche effect in trainable parameter updates. For the neural network, mean square error and physics-informed error are considered. After the neural network, the estimated parameters are fine-tuned using the Broyden-Fletcher-Goldfarb-Shanno algorithm. Finally, the parameters estimated from the natural dataset are tested for stability using Jacobian stability analysis. Future research work includes minimization of error induced by parameters, bifurcation analysis, and sensitivity analysis of the parameters.
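
To make the Holling type III functional response mentioned above concrete, here is a minimal sketch. It is not the paper's three-species predator-prey-scavenger model, whose equations are not given in the abstract. It assumes the common sigmoidal form f(x) = a x^2 / (h^2 + x^2) and plugs it into a hypothetical two-species system with logistic prey growth, integrated with forward Euler. All parameter names and values are illustrative assumptions.

```python
def holling_type_iii(prey, max_rate, half_saturation):
    """Sigmoidal functional response: max_rate * x^2 / (half_saturation^2 + x^2)."""
    return max_rate * prey**2 / (half_saturation**2 + prey**2)


def simulate(prey0, predator0, steps, dt=0.01,
             growth=1.0, capacity=10.0, max_rate=1.0,
             half_saturation=1.0, efficiency=0.5, mortality=0.2):
    """Forward-Euler integration of a toy predator-prey system (assumed form)."""
    prey, predator = prey0, predator0
    trajectory = [(prey, predator)]
    for _ in range(steps):
        # Prey consumed per unit time by the whole predator population.
        eaten = holling_type_iii(prey, max_rate, half_saturation) * predator
        d_prey = growth * prey * (1.0 - prey / capacity) - eaten
        d_predator = efficiency * eaten - mortality * predator
        prey += dt * d_prey
        predator += dt * d_predator
        trajectory.append((prey, predator))
    return trajectory


trajectory = simulate(prey0=5.0, predator0=0.5, steps=1000)
print(trajectory[-1])
```

The type III response is near zero at low prey density, reaches half of max_rate when the prey density equals half_saturation, and saturates at max_rate for abundant prey. This is what distinguishes it from the linear response of the classic Lotka-Volterra model.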

[LG-48] Dissipation alters modes of information encoding in small quantum reservoirs near criticality

Link: https://arxiv.org/abs/2412.18290
Authors: Krai Cheamsawat,Thiparat Chotibut
Keywords: machine learning tasks, tackle temporal machine, temporal machine learning, harnessing near-term quantum, near-term quantum devices
Subjects: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Comments: 30 pages, 12 figures

Abstract:Quantum reservoir computing (QRC) has emerged as a promising paradigm for harnessing near-term quantum devices to tackle temporal machine learning tasks. Yet identifying the mechanisms that underlie enhanced performance remains challenging, particularly in many-body open systems where nonlinear interactions and dissipation intertwine in complex ways. Here, we investigate a minimal model of a driven-dissipative quantum reservoir described by two coupled Kerr-nonlinear oscillators, an experimentally realizable platform that features controllable coupling, intrinsic nonlinearity, and tunable photon loss. Using Partial Information Decomposition (PID), we examine how different dynamical regimes encode input drive signals in terms of redundancy (information shared by each oscillator) and synergy (information accessible only through their joint observation). Our key results show that, near a critical point marking a dynamical bifurcation, the system transitions from predominantly redundant to synergistic encoding. We further demonstrate that synergy amplifies short-term responsiveness, thereby enhancing immediate memory retention, whereas strong dissipation leads to more redundant encoding that supports long-term memory retention. These findings elucidate how the interplay of instability and dissipation shapes information processing in small quantum systems, providing a fine-grained, information-theoretic perspective for analyzing and designing QRC platforms.

[LG-49] OMG-HD: A High-Resolution AI Weather Model for End-to-End Forecasts from Observations

Link: https://arxiv.org/abs/2412.18239
Authors: Pengcheng Zhao,Jiang Bian,Zekun Ni,Weixin Jin,Jonathan Weyn,Zuliang Fang,Siqi Xiang,Haiyu Dong,Bin Zhang,Hongyu Sun,Kit Thambiratnam,Qi Zhang
Keywords: Artificial Intelligence Weather, Artificial Intelligence, traditional Numerical Weather, Intelligence Weather Prediction, Numerical Weather Prediction
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)

Abstract:In recent years, Artificial Intelligence Weather Prediction (AIWP) models have achieved performance comparable to, or even surpassing, traditional Numerical Weather Prediction (NWP) models by leveraging reanalysis data. However, a less-explored approach involves training AIWP models directly on observational data, enhancing computational efficiency and improving forecast accuracy by reducing the uncertainties introduced through data assimilation processes. In this study, we propose OMG-HD, a novel AI-based regional high-resolution weather forecasting model designed to make predictions directly from observational data sources, including surface stations, radar, and satellite, thereby removing the need for operational data assimilation. Our evaluation shows that OMG-HD outperforms both the European Centre for Medium-Range Weather Forecasts (ECMWF)'s high-resolution operational forecasting system, IFS-HRES, and the High-Resolution Rapid Refresh (HRRR) model at lead times of up to 12 hours across the contiguous United States (CONUS) region. We achieve up to a 13% improvement on RMSE for 2-meter temperature, 17% on 10-meter wind speed, 48% on 2-meter specific humidity, and 32% on surface pressure compared to HRRR. Our method shows that it is possible to use AI-driven approaches for rapid weather predictions without relying on NWP-derived weather fields as model input. This is a promising step towards using observational data directly to make operational forecasts with AIWP models.

[LG-50] Leveraging Convolutional Neural Network-Transformer Synergy for Predictive Modeling in Risk-Based Applications

Link: https://arxiv.org/abs/2412.18222
Authors: Yuhan Wang, Zhen Xu, Yue Yao, Jinsong Liu, Jiating Lin
Keywords: credit default prediction, received increasing attention, credit default, default prediction, default prediction methods
Categories: Risk Management (q-fin.RM); Machine Learning (cs.LG)
Notes:

Abstract:With the development of the financial industry, credit default prediction, as an important task in financial risk management, has received increasing attention. Traditional credit default prediction methods mostly rely on machine learning models, such as decision trees and random forests, but these methods have certain limitations in processing complex data and capturing potential risk patterns. To this end, this paper proposes a deep learning model based on the combination of convolutional neural networks (CNN) and Transformer for credit user default prediction. The model combines the advantages of CNN in local feature extraction with the ability of Transformer in global dependency modeling, effectively improving the accuracy and robustness of credit default prediction. Through experiments on public credit default datasets, the results show that the CNN+Transformer model outperforms traditional machine learning models, such as random forests and XGBoost, in multiple evaluation indicators such as accuracy, AUC, and KS value, demonstrating its powerful ability in complex financial data modeling. Further experimental analysis shows that appropriate optimizer selection and learning rate adjustment play a vital role in improving model performance. In addition, the ablation experiment of the model verifies the advantages of the combination of CNN and Transformer and proves the complementarity of the two in credit default prediction. This study provides a new idea for credit default prediction and provides strong support for risk assessment and intelligent decision-making in the financial field. Future research can further improve the prediction effect and generalization ability by introducing more unstructured data and improving the model architecture.
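The abstract compares models on AUC and the KS value. As a rough illustration of what these two metrics measure, the sketch below computes both from scratch with the standard library; the scores and labels are hypothetical, and the paper may compute the metrics differently (for example with a library routine).

```python
def ks_statistic(scores, labels):
    """Largest gap between the empirical score CDFs of class 1 and class 0."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

def auc(scores, labels):
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted default probabilities and true labels (1 = default).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auc(scores, labels))           # 0.75
print(ks_statistic(scores, labels))  # 0.5
```

Both metrics depend only on how the scores rank the two classes, so they are unaffected by any monotone rescaling of the model output.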

[LG-51] Quantum framework for Reinforcement Learning: integrating Markov Decision Process, quantum arithmetic, and trajectory search

Link: https://arxiv.org/abs/2412.18208
Authors: Thet Htar Su, Shaswot Shresthamali, Masaaki Kondo
Keywords: Markov Decision Process, classical Markov Decision, Decision Process, Markov Decision, classical Markov
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Notes:

Abstract:This paper introduces a quantum framework for addressing reinforcement learning (RL) tasks, grounded in quantum principles and leveraging a fully quantum model of the classical Markov Decision Process (MDP). By employing quantum concepts and a quantum search algorithm, this work presents the implementation and optimization of the agent-environment interactions entirely within the quantum domain, eliminating reliance on classical computations. Key contributions include the quantum-based state transitions, return calculation, and trajectory search mechanism that utilize quantum principles to demonstrate the realization of RL processes through quantum phenomena. The implementation emphasizes the fundamental role of quantum superposition in enhancing computational efficiency for RL tasks. Experimental results demonstrate the capacity of a quantum model to achieve quantum advantage in RL, highlighting the potential of fully quantum implementations in decision-making tasks. This work not only underscores the applicability of quantum computing in machine learning but also contributes to the field of quantum reinforcement learning (QRL) by offering a robust framework for understanding and exploiting quantum computing in RL systems.

[LG-52] PCM Selector: Penalized Covariate-Mediator Selection Operator for Evaluating Linear Causal Effects

Link: https://arxiv.org/abs/2412.18180
Authors: Hisayoshi Nanmo, Manabu Kuroki
Keywords: high-dimensional data problems, structural equation model, linear structural equation, PCM Selector, standard statistical estimation
Categories: Methodology (stat.ME); Machine Learning (cs.LG)
Notes:

Abstract:For a data-generating process for random variables that can be described with a linear structural equation model, we consider a situation in which (i) a set of covariates satisfying the back-door criterion cannot be observed or (ii) such a set can be observed, but standard statistical estimation methods cannot be applied to estimate causal effects because of multicollinearity/high-dimensional data problems. We propose a novel two-stage penalized regression approach, the penalized covariate-mediator selection operator (PCM Selector), to estimate the causal effects in such scenarios. Unlike existing penalized regression analyses, when a set of intermediate variables is available, PCM Selector provides a consistent or less biased estimator of the causal effect. In addition, PCM Selector provides a variable selection procedure for intermediate variables to obtain better estimation accuracy of the causal effects than does the back-door criterion.

[LG-53] Heterogeneous transfer learning for high dimensional regression with feature mismatch

Link: https://arxiv.org/abs/2412.18081
Authors: Jae Ho Chang, Massimiliano Russo, Subhadeep Paul
Keywords: transferring knowledge, homogeneous transfer learning, target, target domain, transfer learning
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes:

Abstract:We consider the problem of transferring knowledge from a source, or proxy, domain to a new target domain for learning a high-dimensional regression model with possibly different features. Recently, the statistical properties of homogeneous transfer learning have been investigated. However, most homogeneous transfer and multi-task learning methods assume that the target and proxy domains have the same feature space, limiting their practical applicability. In applications, target and proxy feature spaces are frequently inherently different, for example, due to the inability to measure some variables in the target data-poor environments. Conversely, existing heterogeneous transfer learning methods do not provide statistical error guarantees, limiting their utility for scientific discovery. We propose a two-stage method that involves learning the relationship between the missing and observed features through a projection step in the proxy data and then solving a joint penalized regression optimization problem in the target data. We develop an upper bound on the method’s parameter estimation risk and prediction risk, assuming that the proxy and the target domain parameters are sparsely different. Our results elucidate how estimation and prediction error depend on the complexity of the model, sample size, the extent of overlap, and correlation between matched and mismatched features.

[LG-54] An information theoretic limit to data amplification

Link: https://arxiv.org/abs/2412.18041
Authors: S. J. Watts, L. Crow
Keywords: recent years generative, years generative artificial, generative artificial intelligence, Generative Adversarial Networks, Monte Carlo simulated
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Notes:

Abstract:In recent years generative artificial intelligence has been used to create data to support science analysis. For example, Generative Adversarial Networks (GANs) have been trained using Monte Carlo simulated input and then used to generate data for the same problem. This has the advantage that a GAN creates data in a significantly reduced computing time. N training events for a GAN can result in GN generated events with the gain factor, G, being more than one. This appears to violate the principle that one cannot get information for free. This is not the only way to amplify data so this process will be referred to as data amplification which is studied using information theoretic concepts. It is shown that a gain of greater than one is possible whilst keeping the information content of the data unchanged. This leads to a mathematical bound which only depends on the number of generated and training events. This study determines conditions on both the underlying and reconstructed probability distributions to ensure this bound. In particular, the resolution of variables in amplified data is not improved by the process but the increase in sample size can still improve statistical significance. The bound is confirmed using computer simulation and analysis of GAN generated data from the literature.

[LG-55] Combinatorial Regularity for Relatively Perfect Discrete Morse Gradient Vector Fields of ReLU Neural Networks

Link: https://arxiv.org/abs/2412.18005
Authors: Robyn Brooks, Marissa Masden
Keywords: ReLU neural networks, ReLU neural, piecewise linear Morse, neural networks, canonical polyhedral complex
Categories: Algebraic Topology (math.AT); Computational Geometry (cs.CG); Machine Learning (cs.LG)
Notes: 32 pages, 9 figures

Abstract:One common function class in machine learning is the class of ReLU neural networks. ReLU neural networks induce a piecewise linear decomposition of their input space called the canonical polyhedral complex. It has previously been established that it is decidable whether a ReLU neural network is piecewise linear Morse. In order to expand computational tools for analyzing the topological properties of ReLU neural networks, and to harness the strengths of discrete Morse theory, we introduce a schematic for translating between a given piecewise linear Morse function (e.g. parameters of a ReLU neural network) on a canonical polyhedral complex and a compatible ("relatively perfect") discrete Morse function on the same complex. Our approach is constructive, producing an algorithm that can be used to determine if a given vertex in a canonical polyhedral complex corresponds to a piecewise linear Morse critical point. Furthermore we provide an algorithm for constructing a consistent discrete Morse pairing on cells in the canonical polyhedral complex which contain this vertex. We additionally provide some new realizability results with respect to sublevel set topology in the case of shallow ReLU neural networks.

[LG-56] Shifted Composition III: Local Error Framework for KL Divergence

Link: https://arxiv.org/abs/2412.17997
Authors: Jason M. Altschuler, Sinho Chewi
Keywords: adapt coupling arguments, Coupling arguments, central tool, tool for bounding, bounding the deviation
Categories: Statistics Theory (math.ST); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Notes:

Abstract:Coupling arguments are a central tool for bounding the deviation between two stochastic processes, but traditionally have been limited to Wasserstein metrics. In this paper, we apply the shifted composition rule, an information-theoretic principle introduced in our earlier work, in order to adapt coupling arguments to the Kullback-Leibler (KL) divergence. Our framework combines the strengths of two previously disparate approaches: local error analysis and Girsanov’s theorem. Akin to the former, it yields tight bounds by incorporating the so-called weak error, and is user-friendly in that it only requires easily verified local assumptions; and akin to the latter, it yields KL divergence guarantees and applies beyond Wasserstein contractivity. We apply this framework to the problem of sampling from a target distribution \pi. Here, the two stochastic processes are the Langevin diffusion and an algorithmic discretization thereof. Our framework provides a unified analysis when \pi is assumed to be strongly log-concave (SLC), weakly log-concave (WLC), or to satisfy a log-Sobolev inequality (LSI). Among other results, this yields KL guarantees for the randomized midpoint discretization of the Langevin diffusion. Notably, our result: (1) yields the optimal \tilde O(\sqrt d/\epsilon) rate in the SLC and LSI settings; (2) is the first result to hold beyond the 2-Wasserstein metric in the SLC setting; and (3) is the first result to hold in \emph{any} metric in the WLC and LSI settings.
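To make "the Langevin diffusion and an algorithmic discretization thereof" concrete, here is a minimal sketch of the simplest such discretization, the Euler-Maruyama scheme (often called the unadjusted Langevin algorithm). This is not the randomized midpoint method analyzed in the paper. The one-dimensional standard Gaussian target, the step size, and all names are assumptions chosen for illustration.

```python
import math
import random

def ula_samples(grad_potential, x0, step, n_steps, rng):
    """Euler-Maruyama discretization of dX = -grad V(X) dt + sqrt(2) dB."""
    x = x0
    out = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_potential(x) + math.sqrt(2.0 * step) * noise
        out.append(x)
    return out

# Assumed target: standard Gaussian, V(x) = x^2 / 2, so grad V(x) = x.
rng = random.Random(0)
samples = ula_samples(lambda x: x, x0=3.0, step=0.1, n_steps=50000, rng=rng)

burned = samples[2000:]  # discard the initial transient
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(mean, var)
```

For this target the discretized chain is a linear recursion whose stationary variance is 1/(1 - h/2) for step size h, rather than the true value 1. That gap is a small instance of the discretization error that analyses like the one in this paper aim to bound.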

[LG-57] Data-Driven Priors in the Maximum Entropy on the Mean Method for Linear Inverse Problems

Link: https://arxiv.org/abs/2412.17916
Authors: Matthew King-Roskamp, Rustum Choksi, Tim Hoheisel
Keywords: linear inverse problems, method for linear, setting of approximate, theoretical framework, framework for implementing
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes: 25 pages, 13 figures

Abstract:We establish the theoretical framework for implementing the maximum entropy on the mean (MEM) method for linear inverse problems in the setting of approximate (data-driven) priors. We prove a.s. convergence for empirical means and further develop general estimates for the difference between the MEM solutions with different priors \mu and \nu based upon the epigraphical distance between their respective log-moment generating functions. These estimates allow us to establish a rate of convergence in expectation for empirical means. We illustrate our results with denoising on MNIST and Fashion-MNIST data sets.

[LG-58] EnhancePPG: Improving PPG-based Heart Rate Estimation with Self-Supervision and Augmentation

Link: https://arxiv.org/abs/2412.17860
Authors: Luca Benfenati, Sofia Belloni, Alessio Burrello, Panagiotis Kasnesis, Xiaying Wang, Luca Benini, Massimo Poncino, Enrico Macii, Daniele Jahier Pagliari
Keywords: modern wearable devices, Heart rate, wellness monitoring, modern wearable, wearable devices
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes: 5 pages, 3 figures, 2 tables

Abstract:Heart rate (HR) estimation from photoplethysmography (PPG) signals is a key feature of modern wearable devices for health and wellness monitoring. While deep learning models show promise, their performance relies on the availability of large datasets. We present EnhancePPG, a method that enhances state-of-the-art models by integrating self-supervised learning with data augmentation (DA). Our approach combines self-supervised pre-training with DA, allowing the model to learn more generalizable features, without needing more labelled data. Inspired by a U-Net-like autoencoder architecture, we utilize unsupervised PPG signal reconstruction, taking advantage of large amounts of unlabeled data during the pre-training phase combined with data augmentation, to improve state-of-the-art models’ performance. Thanks to our approach and minimal modification to the state-of-the-art model, we improve the best HR estimation by 12.2%, lowering the error on PPG-DaLiA from 4.03 beats per minute (BPM) to 3.54 BPM. Importantly, our EnhancePPG approach focuses exclusively on the training of the selected deep learning model, without significantly increasing its inference latency.
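The 12.2% figure is the relative reduction between the two error values quoted in the abstract. The short check below reproduces it; the variable names are ours.

```python
baseline_error_bpm = 4.03  # previous best error on PPG-DaLiA, from the abstract
enhanced_error_bpm = 3.54  # error reported with EnhancePPG, from the abstract

relative_reduction = (baseline_error_bpm - enhanced_error_bpm) / baseline_error_bpm
print(f"{relative_reduction:.1%}")  # prints "12.2%"
```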

[LG-59] Compact Neural Network Algorithm for Electrocardiogram Classification

Link: https://arxiv.org/abs/2412.17852
Authors: Mateo Frausto-Avila, Jose Pablo Manriquez-Amavizca, Alfred U’Ren, Mario A. Quiroz-Juarez
Keywords: integrating machine learning, machine learning approaches, robust cardiac diagnostics, present a high-performance, integrating machine
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)
Notes:

Abstract:In this paper, we present a high-performance, compact electrocardiogram (ECG)-based system for automatic classification of arrhythmias, integrating machine learning approaches to achieve robust cardiac diagnostics. Our method combines a compact artificial neural network with feature enhancement techniques, including mathematical transformations, signal analysis and data extraction algorithms, to capture both morphological and time-frequency features from ECG signals. A novel aspect of this work is the addition of 17 newly engineered features, which complement the algorithm’s capability to extract significant data and physiological patterns from the ECG signal. This combination enables the classifier to detect multiple arrhythmia types, such as atrial fibrillation, sinus tachycardia, ventricular flutter, and other common arrhythmic disorders. The system achieves an accuracy of 97.36% on the MIT-BIH arrhythmia database, using a lower complexity compared to state-of-the-art models. This compact tool shows potential for clinical deployment, as well as adaptation for portable devices in long-term cardiac health monitoring applications.

[LG-60] Cross-Species and Cross-Modality Epileptic Seizure Detection via Multi-Space Alignment

Link: https://arxiv.org/abs/2412.17842
Authors: Z. Wang, S. Li, Dongrui Wu
Keywords: million people worldwide, impacts global health, Epilepsy significantly impacts, significantly impacts global, global health
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes:

Abstract:Epilepsy significantly impacts global health, affecting about 65 million people worldwide, along with various animal species. The diagnostic processes of epilepsy are often hindered by the transient and unpredictable nature of seizures. Here we propose a multi-space alignment approach based on cross-species and cross-modality electroencephalogram (EEG) data to enhance the detection capabilities and understanding of epileptic seizures. By employing deep learning techniques, including domain adaptation and knowledge distillation, our framework aligns cross-species and cross-modality EEG signals to enhance the detection capability beyond traditional within-species and within-modality models. Experiments on multiple surface and intracranial EEG datasets of humans and canines demonstrated substantial improvements in the detection accuracy, achieving over 90% AUC scores for cross-species and cross-modality seizure detection with extremely limited labeled data from the target species/modality. To our knowledge, this is the first study that demonstrates the effectiveness of integrating heterogeneous data from different species and modalities to improve EEG-based seizure detection performance. The approach may also be generalizable to different brain-computer interface paradigms, and suggests the possibility to combine data from different species/modalities to increase the amount of training data for large EEG models.

[LG-61] SCFNet: A Transferable IIIC EEG Classification Network

Link: https://arxiv.org/abs/2412.17835
Authors: Weijin Xu
Keywords: harmful brain activities, common harmful brain, Epilepsy and epileptiform, unified EEG signal, EEG signal acquisition
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes:

Abstract:Epilepsy and epileptiform discharges are common harmful brain activities, and electroencephalogram (EEG) signals are widely used to monitor the onset status of patients. However, due to the lack of unified EEG signal acquisition standards, there are many obstacles in practical applications, especially the difficulty in transferring and using models trained on different numbers of channels. To address this issue, we propose a neural network architecture with a single-channel feature extraction (Single Channel Feature) model backend fusion (SCFNet). The feature extractor of the model is an RCNN network with single-channel input, which does not depend on other channels, thereby enabling easier migration to data with different numbers of channels. Experimental results show that on the IIIC-Seizure dataset, the accuracy of EEG-SCFNet has improved by 4% compared to the baseline model and also increased by 1.3% compared to the original RCNN neural network model. Even with only fine-tuning the classification head, its performance can still maintain a level comparable to the baseline. In addition, in terms of cross-dataset transfer, EEG-SCFNet can still maintain certain performance even if the channel leads are different.

[LG-62] EEG-GMACN: Interpretable EEG Graph Mutual Attention Convolutional Network

Link: https://arxiv.org/abs/2412.17834
Authors: Haili Ye, Stephan Goerttler, Fei He
Keywords: record brain electrical, brain electrical activity, Graph Signal Processing, Analyzing EEG signals, EEG
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes:

Abstract:Electroencephalogram (EEG) is a valuable technique to record brain electrical activity through electrodes placed on the scalp. Analyzing EEG signals contributes to the understanding of neurological conditions and to the development of brain-computer interfaces. Graph Signal Processing (GSP) has emerged as a promising method for EEG spatial-temporal analysis, by further considering the topological relationships between electrodes. However, existing GSP studies lack interpretability of electrode importance and the credibility of prediction confidence. This work proposes an EEG Graph Mutual Attention Convolutional Network (EEG-GMACN), by introducing an ‘Inverse Graph Weight Module’ to output interpretable electrode graph weights, enhancing the clinical credibility and interpretability of EEG classification results. Additionally, we incorporate a mutual attention mechanism module into the model to improve its capability to distinguish critical electrodes and introduce credibility calibration to assess the uncertainty of prediction results. This study enhances the transparency and effectiveness of EEG analysis, paving the way for its widespread use in clinical and neuroscience research.

[LG-63] A Beginner's Guide to Power and Energy Measurement and Estimation for Computing and Machine Learning

Link: https://arxiv.org/abs/2412.17830
Authors: Akshaya Jagannadharao, Nicole Beckage, Sovan Biswas, Hilary Egan, Jamil Gafur, Thijs Metsch, Dawn Nafus, Giuseppe Raffa, Charles Tripp
Keywords: learning are increasing, machine learning, environmental footprint, Concerns, energy
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes: Released as NREL Tech Report

Abstract:Concerns about the environmental footprint of machine learning are increasing. While studies of energy use and emissions of ML models are a growing subfield, most ML researchers and developers still do not incorporate energy measurement as part of their work practices. While measuring energy is a crucial step towards reducing carbon footprint, it is also not straightforward. This paper introduces the main considerations necessary for making sound use of energy measurement tools and interpreting energy estimates, including the use of at-the-wall versus on-device measurements, sampling strategies and best practices, common sources of error, and proxy measures. It also contains practical tips and real-world scenarios that illustrate how these considerations come into play. It concludes with a call to action for improving the state of the art of measurement methods and standards for facilitating robust comparisons between diverse hardware and software environments.
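One reason sampling strategy matters for energy measurement is that most tools report instantaneous power, so energy has to be obtained by integrating samples over time. The sketch below shows a trapezoidal integration of timestamped power readings. It is a generic illustration with made-up readings and a hypothetical function name, not a procedure taken from the report.

```python
def energy_joules(timestamps_s, power_w):
    """Trapezoidal integration of power samples (watts) over time (seconds)."""
    if len(timestamps_s) != len(power_w) or len(power_w) < 2:
        raise ValueError("need at least two paired samples")
    total = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total

# Hypothetical readings with uneven sampling intervals.
t = [0.0, 1.0, 2.0, 4.0]
p = [100.0, 120.0, 110.0, 90.0]

e = energy_joules(t, p)
print(e)          # 425.0 joules
print(e / 3.6e6)  # the same energy in kilowatt-hours
```

Using the actual timestamps rather than assuming a fixed interval keeps the estimate sensible when samples arrive irregularly, though power spikes between samples are still missed.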

[LG-64] Decoding individual words from non-invasive brain recordings across 723 participants

Link: https://arxiv.org/abs/2412.17829
Authors: Stéphane d’Ascoli, Corentin Bel, Jérémy Rapin, Hubert Banville, Yohann Benchetrit, Christophe Pallier, Jean-Rémi King
Keywords: electrodes implanted inside, recently enabled, neural activity, electrodes implanted, implanted inside
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes:

Abstract:Deep learning has recently enabled the decoding of language from the neural activity of a few participants with electrodes implanted inside their brain. However, reliably decoding words from non-invasive recordings remains an open challenge. To tackle this issue, we introduce a novel deep learning pipeline to decode individual words from non-invasive electro- (EEG) and magneto-encephalography (MEG) signals. We train and evaluate our approach on an unprecedentedly large number of participants (723) exposed to five million words either written or spoken in English, French or Dutch. Our model outperforms existing methods consistently across participants, devices, languages, and tasks, and can decode words absent from the training set. Our analyses highlight the importance of the recording device and experimental protocol: MEG and reading are easier to decode than EEG and listening, respectively, and it is preferable to collect a large amount of data per participant than to repeat stimuli across a large number of participants. Furthermore, decoding performance consistently increases with the amount of (i) data used for training and (ii) data used for averaging during testing. Finally, single-word predictions show that our model effectively relies on word semantics but also captures syntactic and surface properties such as part-of-speech, word length and even individual letters, especially in the reading condition. Overall, our findings delineate the path and remaining challenges towards building non-invasive brain decoders for natural language.

[LG-65] CPFI-EIT: A CNN-PINN Framework for Full-Inverse Electrical Impedance Tomography on Non-Smooth Conductivity Distributions

Link: https://arxiv.org/abs/2412.17827
Authors: Yang Xuanxuan, Zhang Yangming, Chen Haofeng, Ma Gang, Wang Xiaojie
Keywords: combines convolutional neural, electrical impedance tomography, convolutional neural networks, physics-informed neural networks, hybrid learning framework
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes: 11 pages, 11 figures

Abstract:This paper introduces a hybrid learning framework that combines convolutional neural networks (CNNs) and physics-informed neural networks (PINNs) to address the challenging problem of full-inverse electrical impedance tomography (EIT). EIT is a noninvasive imaging technique that reconstructs the spatial distribution of internal conductivity based on boundary voltage measurements from injected currents. This method has applications across medical imaging, multiphase flow detection, and tactile sensing. However, solving EIT involves a nonlinear partial differential equation (PDE) derived from Maxwell’s equations, posing significant computational challenges as an ill-posed inverse problem. Existing PINN approaches primarily address semi-inverse EIT, assuming full access to internal potential data, which limits practical applications in realistic, full-inverse scenarios. Our framework employs a forward CNN-based supervised network to map differential boundary voltage measurements to a discrete potential distribution under fixed Neumann boundary conditions, while an inverse PINN-based unsupervised network enforces PDE constraints for conductivity reconstruction. Instead of traditional automatic differentiation, we introduce discrete numerical differentiation to bridge the forward and inverse networks, effectively decoupling them, enhancing modularity, and reducing computational demands. We validate our framework under realistic conditions, using a 16-electrode setup and rigorous testing on complex conductivity distributions with sharp boundaries, without Gaussian smoothing. This approach demonstrates robust flexibility and improved applicability in full-inverse EIT, establishing a practical solution for real-world imaging challenges.
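The abstract contrasts automatic differentiation with "discrete numerical differentiation" applied to a discrete potential distribution. As a rough sketch of the general idea, the code below takes central finite differences on a uniform 2D grid. The grid layout, spacing, and test function are assumptions for illustration; the paper's actual stencil and mesh may differ.

```python
def central_gradient(u, h):
    """Central-difference gradient of grid values u[i][j] at interior nodes.

    Rows (index i) are taken to run along y and columns (index j) along x,
    with uniform spacing h in both directions.
    """
    rows, cols = len(u), len(u[0])
    du_dx = [[(u[i][j + 1] - u[i][j - 1]) / (2 * h) for j in range(1, cols - 1)]
             for i in range(1, rows - 1)]
    du_dy = [[(u[i + 1][j] - u[i - 1][j]) / (2 * h) for j in range(1, cols - 1)]
             for i in range(1, rows - 1)]
    return du_dx, du_dy

# Hypothetical potential u(x, y) = x^2 + 3y sampled at x = j*h, y = i*h.
h = 0.5
n = 5
u = [[(j * h) ** 2 + 3 * (i * h) for j in range(n)] for i in range(n)]

du_dx, du_dy = central_gradient(u, h)
print(du_dx[0])  # [1.0, 2.0, 3.0], matching 2x at x = 0.5, 1.0, 1.5
print(du_dy[0])  # [3.0, 3.0, 3.0]
```

Because the derivative is computed from grid values alone, the network that produces the potential does not need to be differentiated with respect to its spatial inputs, which is consistent with the decoupling the abstract describes.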

[LG-66] EEG-Based Mental Imagery Task Adaptation via Ensemble of Weight-Decomposed Low-Rank Adapters

Link: https://arxiv.org/abs/2412.17818
Authors: Taveena Lotey, Aman Verma, Partha Pratim Roy
Keywords: Brain Computer Interfaces, Computer Interfaces, Brain Computer, widely researched, Electroencephalography
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes:

Abstract:Electroencephalography (EEG) is widely researched for neural decoding in Brain Computer Interfaces (BCIs) as it is non-invasive, portable, and economical. However, EEG signals suffer from inter- and intra-subject variability, leading to poor performance. Recent technological advancements have led to deep learning (DL) models that have achieved high performance in various fields. However, such large models are compute- and resource-intensive and are a bottleneck for real-time neural decoding. Data distribution shift can be handled with the help of domain adaptation techniques of transfer learning (fine-tuning) and adversarial training that requires model parameter updates according to the target domain. One such recent technique is Parameter-efficient fine-tuning (PEFT), which requires only a small fraction of the total trainable parameters compared to fine-tuning the whole model. Therefore, we explored PEFT methods for adapting EEG-based mental imagery tasks. We considered two mental imagery tasks: speech imagery and motor imagery, as both of these tasks are instrumental in post-stroke neuro-rehabilitation. We proposed a novel ensemble of weight-decomposed low-rank adaptation methods, EDoRA, for parameter-efficient mental imagery task adaptation through EEG signal classification. The performance of the proposed PEFT method is validated on two publicly available datasets, one speech imagery, and the other motor imagery dataset. In extensive experiments and analysis, the proposed method has performed better than full fine-tuning and state-of-the-art PEFT methods for mental imagery EEG classification.
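For readers unfamiliar with weight-decomposed low-rank adaptation, the form usually given for a single adapter (DoRA) is shown below. The symbols are ours and this is background only: the abstract does not say how EDoRA combines several such adapters into an ensemble.

```latex
% W_0 \in \mathbb{R}^{d \times k}: frozen pretrained weight
% B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k}: trainable low-rank factors, r \ll \min(d, k)
% m \in \mathbb{R}^{1 \times k}: trainable per-column magnitude vector
% \lVert \cdot \rVert_c: vector of column-wise norms; the division is applied column by column
W' = m \,\frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}
```

Only m, B, and A are trained, which is why the number of trainable parameters stays a small fraction of the full model.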

Information Retrieval

[IR-0] Contrastive Representation for Interactive Recommendation AAAI-2025

Link: https://arxiv.org/abs/2412.18396
Authors: Jingyu Li, Zhiyong Feng, Dongxiao He, Hongqi Chen, Qinghang Gao, Guoli Wu
Keywords: long term objectives, gained significant attention, significant attention recently, capture dynamic interest, quickly capture dynamic
Categories: Information Retrieval (cs.IR)
Notes: AAAI-2025 Accepted paper

Abstract:Interactive Recommendation (IR) has gained significant attention recently for its capability to quickly capture dynamic interest and optimize both short and long term objectives. IR agents are typically implemented through Deep Reinforcement Learning (DRL), because DRL is inherently compatible with the dynamic nature of IR. However, DRL is currently not perfect for IR. Due to the large action space and sample inefficiency problem, training DRL recommender agents is challenging. The key point is that useful features cannot be extracted as high-quality representations for the recommender agent to optimize its policy. To tackle this problem, we propose Contrastive Representation for Interactive Recommendation (CRIR). CRIR efficiently extracts latent, high-level preference ranking features from explicit interaction, and leverages the features to enhance users’ representation. Specifically, the CRIR provides representation through one representation network, and refines it through our proposed Preference Ranking Contrastive Learning (PRCL). The key insight of PRCL is that it can perform contrastive learning without relying on computations involving high-level representations or large potential action sets. Furthermore, we also propose a data exploiting mechanism and an agent training mechanism to better adapt CRIR to the DRL backbone. Extensive experiments have been carried out to show our method’s superior improvement on the sample efficiency while training a DRL-based IR agent.

[IR-1] RaSeRec: Retrieval-Augmented Sequential Recommendation

Link: https://arxiv.org/abs/2412.18378
Authors: Xinping Zhao, Baotian Hu, Yan Zhong, Shouzheng Huang, Zihao Zheng, Meng Wang, Haofen Wang, Min Zhang
Keywords: dominate parametric learning, neural network architectures, recall long tails, achieved improved performance, powerful neural network
Categories: Information Retrieval (cs.IR)
Notes: 20 pages, 8 figures, 8 tables

Abstract:Although prevailing supervised and self-supervised learning (SSL)-augmented sequential recommendation (SeRec) models have achieved improved performance with powerful neural network architectures, we argue that they still suffer from two limitations: (1) Preference Drift, where models trained on past data can hardly accommodate evolving user preference; and (2) Implicit Memory, where head patterns dominate parametric learning, making it harder to recall long tails. In this work, we explore retrieval augmentation in SeRec, to address these limitations. To this end, we propose a Retrieval-Augmented Sequential Recommendation framework, named RaSeRec, the main idea of which is to maintain a dynamic memory bank to accommodate preference drifts and retrieve relevant memories to augment user modeling explicitly. It consists of two stages: (i) collaborative-based pre-training, which learns to recommend and retrieve; (ii) retrieval-augmented fine-tuning, which learns to leverage retrieved memories. Extensive experiments on three datasets fully demonstrate the superiority and effectiveness of RaSeRec.
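
The retrieve-then-augment idea in the abstract can be illustrated with a small sketch. This is not the RaSeRec implementation: the `MemoryBank` class, the `augment` function, the cosine-similarity retrieval, and the blending weight `alpha` are all hypothetical choices made here for illustration, since the abstract does not specify them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryBank:
    """Hypothetical dynamic memory bank of (key, value) embedding pairs."""

    def __init__(self):
        self.entries = []

    def add(self, key, value):
        # New memories can be appended at any time, so the bank can
        # follow preference drift without retraining model parameters.
        self.entries.append((key, value))

    def retrieve(self, query, k=2):
        # Rank stored memories by similarity of their key to the query.
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)
        return [value for _, value in ranked[:k]]


def augment(user_vec, memories, alpha=0.5):
    """Blend the user vector with the mean of the retrieved memories.

    The linear blend is an assumption, not the paper's fusion method.
    """
    if not memories:
        return list(user_vec)
    dim = len(user_vec)
    mean = [sum(m[i] for m in memories) / len(memories) for i in range(dim)]
    return [alpha * u + (1 - alpha) * m for u, m in zip(user_vec, mean)]
```

Keeping the memory outside the model parameters is what makes the recall explicit: a long-tail pattern only needs to be present in the bank to be retrievable, rather than having to be memorized during training.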

[IR-2] Efficient Long Context Language Model Retrieval with Compression

Link: https://arxiv.org/abs/2412.18232
Authors: Minju Seo,Jinheon Baek,Seongyun Lee,Sung Ju Hwang
Keywords: Long Context Language, perform Information Retrieval, Context Language Models, surpass traditional sparse, dense retrieval methods
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Long Context Language Models (LCLMs) have emerged as a new paradigm to perform Information Retrieval (IR), which enables the direct ingestion and retrieval of information by processing an entire corpus in their single context, showcasing the potential to surpass traditional sparse and dense retrieval methods. However, processing a large number of passages in context for retrieval is computationally expensive, and handling their representations during inference further increases the processing time; thus, we aim to make LCLM retrieval more efficient and potentially more effective with passage compression. Specifically, we propose a new compression approach tailored for LCLM retrieval, which is trained to maximize the retrieval performance while minimizing the length of the compressed passages. To accomplish this, we generate synthetic data, where compressed passages are automatically created and labeled as chosen or rejected according to their retrieval success for a given query, and we train the proposed Compression model for Long context Retrieval (CoLoR) with this data via preference optimization while adding a length regularization loss on top of it to enforce brevity. Through extensive experiments on 9 datasets, we show that CoLoR improves the retrieval performance by 6% while compressing the in-context size by a factor of 1.91.
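
To make the training objective described above more concrete, here is a minimal sketch of a preference loss combined with a length penalty. The abstract does not say which preference-optimization objective CoLoR uses, so a DPO-style term is swapped in here purely for illustration. The function names, the form of the length penalty, and the weights `beta` and `lam` are assumptions, not the paper's definitions.

```python
import math


def preference_loss(logp_chosen, logp_rejected,
                    ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss: -log(sigmoid(beta * margin)).

    The margin compares how much more the trained model prefers the
    chosen compression over the rejected one, relative to a reference
    model. This stands in for the unspecified objective in the abstract.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


def length_penalty(compressed_len, original_len):
    """Hypothetical brevity term: fraction of the original length kept."""
    return compressed_len / original_len


def total_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
               compressed_len, original_len, beta=0.1, lam=0.5):
    """Preference term plus a weighted length regularizer (assumed form)."""
    pref = preference_loss(logp_chosen, logp_rejected,
                           ref_chosen, ref_rejected, beta)
    return pref + lam * length_penalty(compressed_len, original_len)
```

With the preference term held fixed, a shorter compressed passage yields a lower total loss, which is the trade-off between retrieval success and brevity that the abstract describes.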

[IR-3] Unlocking the Hidden Treasures: Enhancing Recommendations with Unlabeled Data

Link: https://arxiv.org/abs/2412.18170
Authors: Yuhan Zhao,Rui Chen,Qilong Han,Hongtao Song,Li Chen
Keywords: Collaborative filtering, massive unlabeled data, unlabeled data presents, unlabeled data, recommender systems
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Collaborative filtering (CF) stands as a cornerstone in recommender systems, yet effectively leveraging the massive unlabeled data presents a significant challenge. Current research focuses on addressing the challenge of unlabeled data by extracting a subset that closely approximates negative samples. Regrettably, the remaining data are overlooked, failing to fully integrate this valuable information into the construction of user preferences. To address this gap, we introduce a novel positive-neutral-negative (PNN) learning paradigm. PNN introduces a neutral class, encompassing intricate items that are challenging to categorize directly as positive or negative samples. By training a model based on this triple-wise partial ranking, PNN offers a promising solution to learning complex user preferences. Through theoretical analysis, we connect PNN to one-way partial AUC (OPAUC) to validate its efficacy. Implementing the PNN paradigm is, however, technically challenging because: (1) it is difficult to classify unlabeled data into neutral or negative in the absence of supervised signals; (2) there does not exist any loss function that can handle set-level triple-wise ranking relationships. To address these challenges, we propose a semi-supervised learning method coupled with a user-aware attention model for knowledge acquisition and classification refinement. Additionally, a novel loss function with a two-step centroid ranking approach enables handling set-level rankings. Extensive experiments on four real-world datasets demonstrate that, when combined with PNN, a wide range of representative CF models can consistently and significantly boost their performance. Even with a simple matrix factorization, PNN can achieve comparable performance to sophisticated graph neural networks.
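
One way to picture a set-level, triple-wise ranking loss is sketched below. The abstract only names a "two-step centroid ranking approach" without giving its form, so everything here is an assumption: the function `triple_centroid_loss`, the use of mean scores as centroids, and the logistic pairwise terms are illustrative stand-ins rather than the paper's loss.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def centroid(scores):
    """Mean predicted score of a set of items."""
    return sum(scores) / len(scores)


def triple_centroid_loss(pos_scores, neu_scores, neg_scores):
    """Encourage centroid(positive) > centroid(neutral) > centroid(negative).

    Step 1 ranks the positive centroid above the neutral one.
    Step 2 ranks the neutral centroid above the negative one.
    Each step uses -log(sigmoid(higher - lower)) = softplus(lower - higher).
    """
    c_pos = centroid(pos_scores)
    c_neu = centroid(neu_scores)
    c_neg = centroid(neg_scores)
    return softplus(c_neu - c_pos) + softplus(c_neg - c_neu)
```

Comparing centroids rather than every item triple keeps the cost linear in the set sizes, which is one plausible motivation for a centroid-based formulation.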

[IR-4] From Pairwise to Ranking: Climbing the Ladder to Ideal Collaborative Filtering with Pseudo-Ranking

Link: https://arxiv.org/abs/2412.18168
Authors: Yuhan Zhao,Rui Chen,Li Chen,Shuang Zhang,Qilong Han,Hongtao Song
Keywords: optimal top-K recommendations, make optimal top-K, ideal collaborative filtering, users’ full rankings, collaborative filtering
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Intuitively, an ideal collaborative filtering (CF) model should learn from users’ full rankings over all items to make optimal top-K recommendations. Due to the absence of such full rankings in practice, most CF models rely on pairwise loss functions to approximate full rankings, resulting in an immense performance gap. In this paper, we provide a novel analysis using the multiple ordinal classification concept to reveal the inevitable gap between a pairwise approximation and the ideal case. However, bridging the gap in practice encounters two formidable challenges: (1) none of the real-world datasets contains full ranking information; (2) there does not exist a loss function that is capable of consuming ranking information. To overcome these challenges, we propose a pseudo-ranking paradigm (PRP) that addresses the lack of ranking information by introducing pseudo-rankings supervised by an original noise injection mechanism. Additionally, we put forward a new ranking loss function designed to handle ranking information effectively. To ensure our method’s robustness against potential inaccuracies in pseudo-rankings, we equip the ranking loss function with a gradient-based confidence mechanism to detect and mitigate abnormal gradients. Extensive experiments on four real-world datasets demonstrate that PRP significantly outperforms state-of-the-art methods.
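
The gap between a pairwise approximation and a full ranking can be seen in a small example. The sketch below uses the standard BPR (Bayesian Personalized Ranking) loss as a representative pairwise loss. The helper names and the toy scores are hypothetical, and this is not the ranking loss the paper proposes.

```python
import math


def bpr_loss(score_pos, score_neg):
    """BPR pairwise loss: -log(sigmoid(score_pos - score_neg)).

    Written in a numerically stable softplus form.
    """
    diff = score_pos - score_neg
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def pairwise_loss(scores, positives, negatives):
    """Sum the BPR loss over every (positive, negative) item pair."""
    return sum(bpr_loss(scores[p], scores[n])
               for p in positives
               for n in negatives)


# Two models that disagree on the order of the two positive items.
model_a = {"item_a": 2.0, "item_b": 1.0, "item_c": 0.0}
model_b = {"item_a": 1.0, "item_b": 2.0, "item_c": 0.0}

loss_a = pairwise_loss(model_a, ["item_a", "item_b"], ["item_c"])
loss_b = pairwise_loss(model_b, ["item_a", "item_b"], ["item_c"])
```

Both models receive the same loss even though they rank `item_a` and `item_b` in opposite orders. A pairwise loss only compares positives against negatives, so it carries no signal about the order within the positive set, which is information a full ranking would supply.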

[IR-5] Time-Probability Dependent Knowledge Extraction in IoT-enabled Smart Building

Link: https://arxiv.org/abs/2412.18042
Authors: Hangli Ge,Hirotsugu Seike,Noboru Koshizuka
Keywords: Internet of Things, emerging Internet, human comfort, applications for comprehensive, energy efficiency
Categories: Information Retrieval (cs.IR); Computational Engineering, Finance, and Science (cs.CE)
Comments:

Abstract:Smart buildings incorporate various emerging Internet of Things (IoT) applications for comprehensive management of energy efficiency, human comfort, automation, and security. However, the development of a knowledge extraction framework is fundamental. Currently, there is a lack of a unified and practical framework for modeling heterogeneous sensor data within buildings. In this paper, we propose a practical inference framework for extracting status-to-event knowledge within smart buildings. Our proposal includes IoT-based API integration, ontology model design, and time-probability dependent knowledge extraction methods. The Building Topology Ontology (BOT) was leveraged to construct spatial relations among sensors and spaces within the building. We utilized Apache Jena Fuseki’s SPARQL server for storing and querying the RDF triple data. Two types of knowledge could be extracted: timestamp-based probability for abnormal event detection and time interval-based probability for conjunction of multiple events. We conducted experiments (over a 78-day period) in a real smart building environment. Light and elevator state data were collected for evaluation. The evaluation revealed several inferred events, such as room occupancy, elevator trajectory tracking, and the conjunction of both events. The numerical values of detected event counts and probability demonstrate the potential for automatic control in the smart building.
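
A timestamp-based probability for abnormal event detection could look roughly like the sketch below. The abstract does not describe the actual estimator, so this assumes a simple per-hour frequency estimate. The function names, the `(hour, is_on)` data layout, and the `threshold` value are hypothetical.

```python
from collections import defaultdict


def hourly_on_probability(observations):
    """Estimate P(sensor is on | hour of day) from logged observations.

    `observations` is an iterable of (hour, is_on) pairs, for example
    light-state readings collected over many days.
    """
    counts = defaultdict(lambda: [0, 0])  # hour -> [on_count, total_count]
    for hour, is_on in observations:
        counts[hour][0] += int(is_on)
        counts[hour][1] += 1
    return {hour: on / total for hour, (on, total) in counts.items()}


def is_abnormal(prob_by_hour, hour, is_on, threshold=0.1):
    """Flag a reading whose state was historically rare at that hour."""
    p_on = prob_by_hour.get(hour)
    if p_on is None:
        return False  # no history for this hour, so make no judgment
    p_observed = p_on if is_on else 1.0 - p_on
    return p_observed < threshold
```

For example, if a light was on in only 1 of 20 readings taken at 03:00, a new "on" reading at that hour has an estimated probability of 0.05 and would be flagged under the default threshold.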

Attachment Download

Click to download the full list of today's papers
