arXiv Papers Today | 2024-12-25

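This post lists the latest papers fetched from arXiv.org on 2024-12-25. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically every day at around 12:00.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.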

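[Quick read]: This paper addresses an unnecessary constraint that Generative Pre-Trained Transformer (GPT) models impose when processing the input prompt. Conventional GPT models apply the causal mask to all input tokens step by step during the "prefill" phase, which limits the model's flexibility when it builds its internal representations. The proposed solution is a Segment-by-Segment scheme: during prefill, attention is masked according to the known block structure, so that the first tokens of each block can access the subsequent tokens in a non-causal manner, and the conventional autoregressive process resumes once output tokens are generated. The scheme adds no computational overhead and achieves state-of-the-art performance when integrated into models such as Llama and Qwen.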

Link: https://arxiv.org/abs/2412.18487
Authors: Shahar Katz, Liran Ringel, Yaniv Romano, Lior Wolf
Affiliations: Blavatnik School of Computer Science, Tel Aviv University; Department of Computer Science, Technion – Israel Institute of Technology; Department of Electrical and Computer Engineering, Technion – Israel Institute of Technology
Keywords: Generative Pre-Trained Transformer, Modern Language Models, Modern Language, Pre-Trained Transformer, backbone of Generative
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Modern Language Models (LMs) owe much of their success to masked causal attention, the backbone of Generative Pre-Trained Transformer (GPT) models. Although GPTs can process the entire user prompt at once, the causal masking is applied to all input tokens step-by-step, mimicking the generation process. This imposes an unnecessary constraint during the initial “prefill” phase when the model processes the input prompt and generates the internal representations before producing any output tokens. In this work, attention is masked based on the known block structure at the prefill phase, followed by the conventional token-by-token autoregressive process after that. For example, in a typical chat prompt, the system prompt is treated as one block, and the user prompt as the next one. Each of these is treated as a unit for the purpose of masking, such that the first tokens in each block can access the subsequent tokens in a non-causal manner. Then, the model answer is generated in the conventional causal manner. This Segment-by-Segment scheme entails no additional computational overhead. When integrating it into models such as Llama and Qwen, state-of-the-art performance is consistently achieved.
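
The block-structured masking described in the abstract can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `segment_prefill_mask`, the boolean list-of-lists representation, and the rule that a block may also attend to every earlier block are assumptions made for the example.

```python
def segment_prefill_mask(block_lengths, num_generated=0):
    """Return mask[i][j] == True when query position i may attend to key position j.

    block_lengths: token count of each prefill block (e.g. system prompt, user prompt).
    num_generated: number of output tokens produced after the prefill phase.
    """
    # Map every prefill position to the index of the block that contains it.
    block_of = [b for b, length in enumerate(block_lengths) for _ in range(length)]
    prefill_len = len(block_of)
    total = prefill_len + num_generated

    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < prefill_len and j < prefill_len:
                # Prefill: full attention inside a block, causal across blocks (assumed).
                mask[i][j] = block_of[j] <= block_of[i]
            else:
                # Any pair involving a generated token: ordinary causal attention.
                mask[i][j] = j <= i
    return mask
```

For a two-token system prompt followed by a three-token user prompt, `segment_prefill_mask([2, 3], num_generated=1)` lets position 0 attend to position 1 (same block) but not to position 2 (a later block), while the generated token at position 5 attends to every earlier position.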

[NLP-1] Is Large Language Model Good at Triple Set Prediction? An Empirical Study

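[Quick read]: This paper addresses Triple Set Prediction (TSP), a knowledge graph completion task that predicts all elements of unknown triples from known triples. The key to the solution is a new framework based on large language models (LLMs), consisting of LLM-based rule mining and LLM-based triple set prediction. First, the rich semantic information embedded in the knowledge graph is used to generate rules, a step that is efficient and independent of statistical information. Next, within each subgraph, the rules are applied together with the relevant triples to guide the LLM in predicting the missing triples. Finally, the predictions from all subgraphs are consolidated into the complete set of predicted triples. Experiments show that, although LLMs do well at generating rules, they tend to hallucinate when they must adhere to a large amount of factual knowledge, which leads to a noticeable decline in performance.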

Link: https://arxiv.org/abs/2412.18443
Authors: Yuan Yuan, Yajing Xu, Wen Zhang
Affiliations: Zhejiang University
Keywords: Knowledge Graph Completion, graph completion task, KGC tasks, Common KGC tasks, Graph Completion
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The core of the Knowledge Graph Completion (KGC) task is to predict and complete the missing relations or nodes in a KG. Common KGC tasks are mostly about inferring unknown elements with one or two elements being known in a triple. In comparison, the Triple Set Prediction (TSP) task is a more realistic knowledge graph completion task. It aims to predict all elements of unknown triples based on the information from known triples. In recent years, large language models (LLMs) have exhibited significant advancements in language comprehension, demonstrating considerable potential for KGC tasks. However, the potential of LLMs on the TSP task has not yet been investigated. Thus, in this paper, we propose a new framework to explore the strengths and limitations of LLMs in the TSP task. Specifically, the framework consists of LLM-based rule mining and LLM-based triple set prediction. The relation list of the KG, which embeds rich semantic information, is first leveraged to prompt the LLM to generate rules. This process is both efficient and independent of statistical information, making it easier to mine effective and realistic rules. For each subgraph, the specified rule is applied in conjunction with the relevant triples within that subgraph to guide the LLM in predicting the missing triples. Subsequently, the predictions from all subgraphs are consolidated to derive the complete set of predicted triples on the KG. Finally, the method is evaluated on the relatively complete CFamily dataset. The experimental results indicate that when LLMs are required to adhere to a large amount of factual knowledge to predict missing triples, significant hallucinations occur, leading to a noticeable decline in performance. To further explore the causes of this phenomenon, this paper presents a comprehensive analysis supported by a detailed case study.

[NLP-2] Unlocking the Potential of Multiple BERT Models for Bangla Question Answering in NCTB Textbooks

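[Quick read]: This paper addresses automatic evaluation of text comprehension in educational settings, specifically passage-based question answering in Bangla. The key to the solution is to use language models (RoBERTa Base, Bangla-BERT, and BERT Base) for automatic assessment and to tune hyperparameters such as batch size, inclusion of stop words, and learning rate. The results show that Bangla-BERT performs best on the F1 and Exact Match (EM) metrics, which underscores the importance of hyperparameter tuning and lays the groundwork for automated evaluation systems in educational institutions.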

Link: https://arxiv.org/abs/2412.18440
Authors: Abdullah Khondoker, Enam Ahmed Taufik, Md Iftekhar Islam Tashik, S M Ishtiak mahmud, Antara Firoz Parsa
Affiliations: Unknown
Keywords: improving curricular effectiveness, Bangla passage-based question-answering, understanding student performance, BERT Base-in automatically, Evaluating text comprehension
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Evaluating text comprehension in educational settings is critical for understanding student performance and improving curricular effectiveness. This study investigates the capability of state-of-the-art language models (RoBERTa Base, Bangla-BERT, and BERT Base) in automatically assessing Bangla passage-based question-answering from the National Curriculum and Textbook Board (NCTB) textbooks for classes 6-10. A dataset of approximately 3,000 Bangla passage-based question-answering instances was compiled, and the models were evaluated using F1 Score and Exact Match (EM) metrics across various hyperparameter configurations. Our findings revealed that Bangla-BERT consistently outperformed the other models, achieving the highest F1 (0.75) and EM (0.53) scores, particularly with smaller batch sizes, the inclusion of stop words, and a moderate learning rate. In contrast, RoBERTa Base demonstrated the weakest performance, with the lowest F1 (0.19) and EM (0.27) scores under certain configurations. The results underscore the importance of fine-tuning hyperparameters for optimizing model performance and highlight the potential of machine learning models in evaluating text comprehension in educational contexts. However, limitations such as dataset size, spelling inconsistencies, and computational constraints emphasize the need for further research to enhance the robustness and applicability of these models. This study lays the groundwork for the future development of automated evaluation systems in educational institutions, providing critical insights into model performance in the context of Bangla text comprehension.
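
The abstract reports F1 and Exact Match (EM) but does not say how they were computed. The sketch below shows one common convention for extractive question answering (token-overlap F1 and string-equality EM, in the style of the SQuAD evaluation), assuming whitespace tokenization; the function names are hypothetical and the paper's own scoring may differ.

```python
from collections import Counter


def exact_match(prediction, reference):
    """Return 1.0 when the two answers are identical after trimming whitespace."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction, reference):
    """Return the harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because the functions only split on whitespace, they work unchanged on Bangla text. Corpus-level scores are typically the mean of the per-question values.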

[NLP-3] GeAR: Graph-enhanced Agent for Retrieval-augmented Generation

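[Quick read]: This paper addresses the performance bottleneck of conventional sparse or dense retrievers in multi-hop retrieval scenarios. The key to the solution is GeAR, a system that improves Retrieval-augmented Generation (RAG) through two innovations: (i) graph expansion, which enhances conventional base retrievers such as BM25; and (ii) an agent framework that incorporates graph expansion. Experiments show superior retrieval performance on three multi-hop question answering datasets, with improvements exceeding 10% on MuSiQue while requiring fewer tokens and iterations.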

Link: https://arxiv.org/abs/2412.18431
Authors: Zhili Shen, Chenxin Diao, Pavlos Vougiouklis, Pascual Merita, Shriram Piramanayagam, Damien Graux, Dandan Tu, Zeren Jiang, Ruofei Lai, Yang Ren, Jeff Z. Pan
Affiliations: Unknown
Keywords: Retrieval-augmented generation systems, Retrieval-augmented generation, document retrieval capabilities, effective document retrieval, generation systems rely
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-augmented generation systems rely on effective document retrieval capabilities. By design, conventional sparse or dense retrievers face challenges in multi-hop retrieval scenarios. In this paper, we present GeAR, which advances RAG performance through two key innovations: (i) graph expansion, which enhances any conventional base retriever, such as BM25, and (ii) an agent framework that incorporates graph expansion. Our evaluation demonstrates GeAR’s superior retrieval performance on three multi-hop question answering datasets. Additionally, our system achieves state-of-the-art results with improvements exceeding 10% on the challenging MuSiQue dataset, while requiring fewer tokens and iterations compared to other multi-step retrieval systems.

[NLP-4] Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent

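[Quick read]: This paper addresses multi-modal data exploration in which natural language queries span database systems and other unstructured modalities such as images. The key to the solution is XMODE, a system that uses an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis, enabling explainable multi-modal data exploration. Experiments show that XMODE outperforms existing multi-modal exploration systems in accuracy, query latency, API costs, planning efficiency, and explanation quality, owing to its more effective use of the reasoning capabilities of LLMs.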

Link: https://arxiv.org/abs/2412.18428
Authors: Farhad Nooralahzadeh, Yi Zhang, Jonathan Furst, Kurt Stockinger
Affiliations: Zurich University of Applied Sciences
Keywords: hospitals collect large, collect large amounts, International enterprises, text documents, natural language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:International enterprises, organizations, or hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying database systems combined with other unstructured modalities such as images in natural language is widely unexplored. In this paper, we propose XMODE - a system that enables explainable, multi-modal data exploration in natural language. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) XMODE leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis. (3) Experimental results on multi-modal datasets over relational data and images demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling not only in accuracy but also in various performance metrics such as query latency, API costs, planning efficiency, and explanation quality, thanks to the more effective utilization of the reasoning capabilities of LLMs.

[NLP-5] LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

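[Quick read]: This paper addresses the limitations of existing document understanding benchmarks, which handle only a small number of pages and do not provide a comprehensive analysis of locating layout elements, even though current models can deal with complex document elements, longer contexts, and a wider range of tasks. The key to the solution is LongDocURL, a comprehensive benchmark that integrates three primary tasks (Long Document Understanding, numerical Reasoning, and cross-element Locating) and comprises 20 sub-tasks categorized by primary task and answer evidence. Using a semi-automated construction pipeline, the authors collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, a scale well beyond existing benchmarks. A comprehensive evaluation of open-source and closed-source models across 26 configurations reveals critical performance gaps in this field.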

Link: https://arxiv.org/abs/2412.18424
Authors: Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu
Affiliations: Institute of Automation of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Alibaba Group
Keywords: Large vision language, understanding capabilities remarkably, Large vision, document understanding capabilities, complex document elements
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large vision language models (LVLMs) have improved the document understanding capabilities remarkably, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to handling only a small number of pages and fail to provide a comprehensive analysis of layout elements locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating above three primary tasks and comprising 20 sub-tasks categorized based on different primary tasks and answer evidences. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly outperforming existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.

[NLP-6] Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi and English AAAI2025

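[Quick read]: This paper aims to address the weakness of large language models (LLMs) in mathematical reasoning, particularly in non-English languages such as Hindi. The key to the solution is a combination of methods: curriculum learning, which trains models on progressively harder problems; a Decomposition Strategy, which simplifies complex arithmetic operations; and a Structured Solution Design, which divides solutions into phases. Models are evaluated with zero-shot and few-shot chain-of-thought (CoT) methods and with supervised fine-tuning. Combined with bilingual training on English and Hindi samples, the approach notably improves the mathematical reasoning of open-source LLMs such as WizardMath 7B on both English and Hindi datasets.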

Link: https://arxiv.org/abs/2412.18415
Authors: Avinash Anand, Kritarth Prasad, Chhavi Kirtani, Ashwin R Nair, Manvendra Kumar Nema, Raj Jaiswal, Rajiv Ratn Shah
Affiliations: Indian Institute of Technology Delhi
Keywords: Large Language Models, Large Language, excel in linguistic, linguistic tasks, tasks but struggle
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2025

Abstract:Large Language Models (LLMs) excel in linguistic tasks but struggle with mathematical reasoning, particularly in non English languages like Hindi. This research aims to enhance the mathematical reasoning skills of smaller, resource efficient open-source LLMs in both Hindi and English. We evaluate models like OpenHathi 7B, LLaMA-2 7B, WizardMath 7B, Mistral 7B, LLeMMa 7B, MAmmoTH 7B, Gemini Pro, and GPT-4 using zero-shot, few-shot chain-of-thought (CoT) methods, and supervised fine-tuning. Our approach incorporates curriculum learning, progressively training models on increasingly difficult problems, a novel Decomposition Strategy to simplify complex arithmetic operations, and a Structured Solution Design that divides solutions into phases. Our experiments result in notable performance enhancements. WizardMath 7B exceeds Gemini’s accuracy on English datasets by +6% and matches Gemini’s performance on Hindi datasets. Adopting a bilingual approach that combines English and Hindi samples achieves results comparable to individual language models, demonstrating the capability to learn mathematical reasoning in both languages. This research highlights the potential for improving mathematical reasoning in open-source LLMs.
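
The curriculum learning step mentioned in the abstract (training on increasingly difficult problems) can be illustrated with a small scheduling helper. This is only a sketch under stated assumptions: the abstract does not say how difficulty is measured or how stages are formed, so the `difficulty` callback, the cumulative stages, and the name `curriculum_stages` are all hypothetical.

```python
def curriculum_stages(problems, difficulty, num_stages):
    """Split problems into cumulative training stages ordered from easy to hard.

    difficulty: function mapping a problem to a sortable score (assumed to exist).
    Each stage contains all problems of the earlier stages plus harder ones.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(problems, key=difficulty)
    # Ceiling division so that the final stage always covers every problem.
    stage_size = -(-len(ordered) // num_stages)
    return [ordered[: stage_size * (stage + 1)] for stage in range(num_stages)]
```

With expression length standing in for difficulty, `curriculum_stages(["1+1", "12*34+5", "3*4"], len, 3)` yields three stages that start with `"1+1"` alone and end with all three problems.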

[NLP-7] ChaI-TeA: A Benchmark for Evaluating Autocompletion of Interactions with LLM-based Chatbots

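[Quick read]: This paper addresses the time and effort users spend phrasing natural language messages when interacting with chatbots based on large language models (LLMs). The key to the solution is to introduce the task of chatbot interaction autocomplete and to present ChaI-TeA, an autocomplete evaluation framework. The framework includes a formal definition of the task together with suitable datasets and metrics. Testing 9 models shows that current models still leave much room for improvement, mainly in ranking the generated suggestions. The paper offers insights for practitioners, opens new research directions, and releases the framework as a foundation for future research.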

Link: https://arxiv.org/abs/2412.18377
Authors: Shani Goren, Oren Kalinsky, Tomer Stav, Yuri Rapoport, Yaron Fairstein, Ram Yazdy, Nachshon Cohen, Alexander Libov, Guy Kushilevitz
Affiliations: Amazon Research; Technion - Israel Institute of Technology
Keywords: rise of LLMs, LLMs has deflected, deflected a growing, growing portion, portion of human-computer
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The rise of LLMs has deflected a growing portion of human-computer interactions towards LLM-based chatbots. The remarkable abilities of these models allow users to interact using long, diverse natural language text covering a wide range of topics and styles. Phrasing these messages is a time and effort consuming task, calling for an autocomplete solution to assist users. We introduce the task of chatbot interaction autocomplete. We present ChaI-TeA: CHat InTEraction Autocomplete; an autocomplete evaluation framework for LLM-based chatbot interactions. The framework includes a formal definition of the task, coupled with suitable datasets and metrics. We use the framework to test 9 models on the defined autocompletion task, finding that while current off-the-shelf models perform fairly, there is still much room for improvement, mainly in ranking of the generated suggestions. We provide insights for practitioners working on this task and open new research directions for researchers in the field. We release our framework to serve as a foundation for future research.

[NLP-8] Bidirectional Topic Matching: Quantifying Thematic Overlap Between Corpora Through Topic Modelling

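[Quick read]: This paper addresses how to quantify thematic overlap and divergence in cross-corpus topic modeling and proposes Bidirectional Topic Matching (BTM). The key to the solution is a dual-model framework that trains a separate topic model for each corpus and applies the models reciprocally to enable comprehensive cross-corpus comparison. BTM is flexible in that it can incorporate various topic modeling approaches, such as BERTopic, Top2Vec, and LDA, and validation against cosine similarity-based methods shows distinct advantages in handling outlier topics. The method identifies both shared themes and topics unique to each corpus, giving a more nuanced analysis of thematic relationships.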

Link: https://arxiv.org/abs/2412.18376
Authors: Raven Adam, Marie Lisa Kogler
Affiliations: University of Graz; Department of Environmental Systems Sciences
Keywords: Bidirectional Topic Matching, introduces Bidirectional Topic, study introduces Bidirectional, Latent Dirichlet Allocation, introduces Bidirectional
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 4 figures

Abstract:This study introduces Bidirectional Topic Matching (BTM), a novel method for cross-corpus topic modeling that quantifies thematic overlap and divergence between corpora. BTM is a flexible framework that can incorporate various topic modeling approaches, including BERTopic, Top2Vec, and Latent Dirichlet Allocation (LDA). BTM employs a dual-model approach, training separate topic models for each corpus and applying them reciprocally to enable comprehensive cross-corpus comparisons. This methodology facilitates the identification of shared themes and unique topics, providing nuanced insights into thematic relationships. Validation against cosine similarity-based methods demonstrates the robustness of BTM, with strong agreement metrics and distinct advantages in handling outlier topics. A case study on climate news articles showcases BTM’s utility, revealing significant thematic overlaps and distinctions between corpora focused on climate change and climate action. BTM’s flexibility and precision make it a valuable tool for diverse applications, from political discourse analysis to interdisciplinary studies. By integrating shared and unique topic analyses, BTM offers a comprehensive framework for exploring thematic relationships, with potential extensions to multilingual and dynamic datasets. This work highlights BTM’s methodological contributions and its capacity to advance discourse analysis across various domains.

[NLP-9] Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset

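[Quick read]: This paper addresses the challenge of translating domain-specific terminology in machine translation, particularly terminology in artificial intelligence (AI). The key to the solution is GIST, a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers published from 2000 to 2023 and translated into Arabic, Chinese, French, Japanese, and Russian. The dataset was built with a hybrid framework that combines LLMs for extraction with human expertise for translation, and crowdsourced evaluation showed its translation accuracy to be superior to existing resources. GIST is integrated into translation workflows through post-translation refinement methods that require no retraining, where LLM prompting consistently improved BLEU and COMET scores. A web demonstration on the ACL Anthology platform shows its practical use in improving accessibility for non-English speakers, supporting global inclusivity and collaboration in AI research.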

Link: https://arxiv.org/abs/2412.18367
Authors: Jiarui Liu, Iman Ouzzani, Wenkai Li, Lechen Zhang, Tianyue Ou, Houda Bouamor, Zhijing Jin, Mona Diab
Affiliations: Carnegie Mellon University; University of Michigan; University of Toronto
Keywords: achieved significant advancements, remains challenging, significant advancements, field of machine, achieved significant
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The field of machine translation has achieved significant advancements, yet domain-specific terminology translation, particularly in AI, remains challenging. We introduced GIST, a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation. The dataset’s quality was benchmarked against existing resources, demonstrating superior translation accuracy through crowdsourced evaluation. GIST was integrated into translation workflows using post-translation refinement methods that required no retraining, where LLM prompting consistently improved BLEU and COMET scores. A web demonstration on the ACL Anthology platform highlights its practical application, showcasing improved accessibility for non-English speakers. This work aims to address critical gaps in AI terminology resources and fosters global inclusivity and collaboration in AI research.

Computer Vision

[CV-0] A region-wide multi-year set of crop field boundary labels for Africa

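[Quick read]: This paper addresses the lack of annual crop field maps amid the rapid transformation of African agriculture. Such maps must be produced by advanced machine learning models trained on high-resolution remote sensing imagery, so the study supplies the training labels: using a custom labeling platform, field boundaries were delineated in 33,746 Planet images captured between 2017 and 2023, yielding 42,403 labels, and label uncertainty was evaluated with a Bayesian risk metric. In 3-5 m resolution imagery, label quality was moderately high for total field extent (0.75) but low for the number of individual fields delineated (0.33) and the position of field edges (0.05). Even so, such labels can train effective field mapping models, and the sample itself offers insight into regional agricultural characteristics such as median field size and density. The imagery, vectorized labels, and quality information are publicly available for download.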

Link: https://arxiv.org/abs/2412.18483
Authors: L.D. Estes, A. Wussah, M. Asipunu, M. Gathigi, P. Kovačič, J. Muhando, B.V. Yeboah, F.K. Addai, E.S. Akakpo, M.K. Allotey, P. Amkoya, E. Amponsem, K.D. Donkoh, N. Ha, E. Heltzel, C. Juma, R. Mdawida, A. Miroyo, J. Mucha, J. Mugami, F. Mwawaza, D.A. Nyarko, P. Oduor, K.N. Ohemeng, S.I.D. Segbefia, T. Tumbula, F. Wambua, G.H. Xeflide, S. Ye, F. Yeboah
Affiliations: Unknown
Keywords: undergoing rapid transformation, African agriculture, Class, agriculture is undergoing, undergoing rapid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 8 figures

Abstract:African agriculture is undergoing rapid transformation. Annual maps of crop fields are key to understanding the nature of this transformation, but such maps are currently lacking and must be developed using advanced machine learning models trained on high resolution remote sensing imagery. To enable the development of such models, we delineated field boundaries in 33,746 Planet images captured between 2017 and 2023 across the continent using a custom labeling platform with built-in procedures for assessing and mitigating label error. We collected 42,403 labels, including 7,204 labels arising from tasks dedicated to assessing label quality (Class 1 labels), 32,167 from sites mapped once by a single labeller (Class 2) and 3,032 labels from sites where 3 or more labellers were tasked to map the same location (Class 4). Class 1 labels were used to calculate labeller-specific quality scores, while Class 1 and 4 sites mapped by at least 3 labellers were used to further evaluate label uncertainty using a Bayesian risk metric. Quality metrics showed that label quality was moderately high (0.75) for measures of total field extent, but low regarding the number of individual fields delineated (0.33), and the position of field edges (0.05). These values are expected when delineating small-scale fields in 3-5 m resolution imagery, which can be too coarse to reliably distinguish smaller fields, particularly in dense croplands, and therefore requires substantial labeller judgement. Nevertheless, previous work shows that such labels can train effective field mapping models. Furthermore, this large, probabilistic sample on its own provides valuable insight into regional agricultural characteristics, highlighting variations in the median field size and density. The imagery and vectorized labels along with quality information is available for download from two public repositories.

[CV-1] Underwater Image Restoration via Polymorphic Large Kernel CNNs ICASSP2025

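[Quick read]: This paper addresses Underwater Image Restoration (UIR), a challenging computer vision task because images degrade in complex ways in underwater environments. The key to the solution is UIR-PolyKernel, a method built on Polymorphic Large Kernel CNNs that combines large kernel convolutions of diverse sizes and shapes to capture long-range dependencies in underwater imagery. A Hybrid Domain Attention module integrates frequency-domain and spatial-domain attention to enhance feature importance, using the frequency domain to capture hidden features that humans may not perceive but that are crucial for identifying patterns. The approach improves the model's generalization and robustness and achieves state-of-the-art performance on benchmark datasets, showing that a pure CNN architecture can balance performance and computational efficiency.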

Link: https://arxiv.org/abs/2412.18459
Authors: Xiaojiao Guo, Yihang Dong, Xuhang Chen, Weiwen Chen, Zimeng Li, FuChen Zheng, Chi-Man Pun
Affiliations: University of Macau; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Huizhou University; Shenzhen Polytechnic University; The Hong Kong University of Science and Technology (Guangzhou); Baoshan University
Keywords: computer vision due, Underwater Image Restoration, Image Restoration, Large Kernel CNNs, Polymorphic Large Kernel
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by ICASSP2025

Abstract:Underwater Image Restoration (UIR) remains a challenging task in computer vision due to the complex degradation of images in underwater environments. While recent approaches have leveraged various deep learning techniques, including Transformers and complex, parameter-heavy models to achieve significant improvements in restoration effects, we demonstrate that pure CNN architectures with lightweight parameters can achieve comparable results. In this paper, we introduce UIR-PolyKernel, a novel method for underwater image restoration that leverages Polymorphic Large Kernel CNNs. Our approach uniquely combines large kernel convolutions of diverse sizes and shapes to effectively capture long-range dependencies within underwater imagery. Additionally, we introduce a Hybrid Domain Attention module that integrates frequency and spatial domain attention mechanisms to enhance feature importance. By leveraging the frequency domain, we can capture hidden features that may not be perceptible to humans but are crucial for identifying patterns in both underwater and on-air images. This approach enhances the generalization and robustness of our UIR model. Extensive experiments on benchmark datasets demonstrate that UIR-PolyKernel achieves state-of-the-art performance in underwater image restoration tasks, both quantitatively and qualitatively. Our results show that well-designed pure CNN architectures can effectively compete with more complex models, offering a balance between performance and computational efficiency. This work provides new insights into the potential of CNN-based approaches for challenging image restoration tasks in underwater environments. The code is available at this https URL.

[CV-2] 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

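[Quick read]: This paper addresses a limitation of existing methods for learnable 3D scene representations: they do not explicitly use the semantic relationships between objects and rely only on object coordinates. The key to the solution is 3DGraphLLM, which constructs a learnable representation of a 3D scene graph so that information about objects and their semantic relationships is fed to large language models (LLMs), improving their performance on 3D vision-language tasks. Experiments on several benchmark datasets show an advantage over baseline methods that do not use semantic relationships.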

Link: https://arxiv.org/abs/2412.18450
Authors: Tatiana Zemskova, Dmitry Yudin
Affiliations: Artificial Intelligence Research Institute; Moscow Institute of Physics and Technology
Keywords: compact scene model, Large Language Models, represents a compact, promising for robotic, scene graph represents
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, making its use promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLMs responses by adapting to the 3D world. However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose a method 3DGraphLLM for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at this https URL.

[CV-3] Fashionability-Enhancing Outfit Image Editing with Conditional Diffusion Models

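[Quick read]: This paper addresses the lack of attention, in fashion image generation, to improving the fashionability of the output images. The key to the solution is a new diffusion model-based approach with three core components: 1) fashionability enhancement, which ensures that generated images are more fashionable than the input; 2) preservation of body characteristics, which encourages generated images to keep the input's shape and proportions; and 3) automatic fashion optimization, which needs no manual input or external prompts. To collect training data for guidance and evaluation, outfit images are rated with fashionability scores annotated by multiple fashion experts through two pairwise comparison methods, one OpenSkill-based and one based on five critical aspects. Experiments show that the approach outperforms the baseline Fashion++ in generating images with higher fashionability.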

Link: https://arxiv.org/abs/2412.18421
Authors: Qice Qin, Yuki Hirakawa, Ryotaro Shimizu, Takuya Furusawa, Edgar Simo-Serra
Affiliations: Waseda University; ZOZO Research
Keywords: preserving body characteristics, images, domain has predominantly, predominantly focused, focused on preserving
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures

Abstract:Image generation in the fashion domain has predominantly focused on preserving body characteristics or following input prompts, but little attention has been paid to improving the inherent fashionability of the output images. This paper presents a novel diffusion model-based approach that generates fashion images with improved fashionability while maintaining control over key attributes. Key components of our method include: 1) fashionability enhancement, which ensures that the generated images are more fashionable than the input; 2) preservation of body characteristics, encouraging the generated images to maintain the original shape and proportions of the input; and 3) automatic fashion optimization, which does not rely on manual input or external prompts. We also employ two methods to collect training data for guidance while generating and evaluating the images. In particular, we rate outfit images using fashionability scores annotated by multiple fashion experts through OpenSkill-based and five critical aspect-based pairwise comparisons. These methods provide complementary perspectives for assessing and improving the fashionability of the generated images. The experimental results show that our approach outperforms the baseline Fashion++ in generating images with superior fashionability, demonstrating its effectiveness in producing more stylish and appealing fashion images.

[CV-4] Re-assessing ImageNet: How aligned is its single-label assumption with its multi-label nature?

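[Quick read]: This paper addresses the limitations of evaluating ImageNet with single-label classification, arguing that this practice does not fully capture the complex semantics within the images and may hinder deep neural network (DNN) models from learning them. The key to the solution is to advocate multi-label benchmarking, which assesses DNN capabilities more comprehensively. Analyzing pre-trained state-of-the-art DNNs on ImageNet and its variant ImageNetV2, the authors find that the 11% to 14% accuracy drops reported in the literature are largely attributable to the proportion of images with multiple labels. Once this is taken into account, there is no substantial degradation in effectiveness on ImageNetV2, and ImageNet pre-trained models turn out to capture the multi-label nature of the data to some extent. The paper therefore proposes a new evaluation approach to augment existing ones and stresses that considering ImageNet's multi-label nature is important for assessing DNN effectiveness correctly.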

Link: https://arxiv.org/abs/2412.18409
Authors: Esla Timothy Anzaku, Seyed Amir Mousavi, Arnout Van Messem, Wesley De Neve
Affiliations: Ghent University, Belgium; Ghent University Global Campus, South Korea; University of Liège, Belgium
Keywords: computer vision, traditionally evaluated, single concept, ImageNet, single-label classification
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures

Abstract:ImageNet, an influential dataset in computer vision, is traditionally evaluated using single-label classification, which assumes that an image can be adequately described by a single concept or label. However, this approach may not fully capture the complex semantics within the images available in ImageNet, potentially hindering the development of models that effectively learn these intricacies. This study critically examines the prevalent single-label benchmarking approach and advocates for a shift to multi-label benchmarking for ImageNet. This shift would enable a more comprehensive assessment of the capabilities of deep neural network (DNN) models. We analyze the effectiveness of pre-trained state-of-the-art DNNs on ImageNet and one of its variants, ImageNetV2. Studies in the literature have reported unexpected accuracy drops of 11% to 14% on ImageNetV2. Our findings show that these reported declines are largely attributable to a characteristic of the dataset that has not received sufficient attention – the proportion of images with multiple labels. Taking this characteristic into account, the results of our experiments provide evidence that there is no substantial degradation in effectiveness on ImageNetV2. Furthermore, we acknowledge that ImageNet pre-trained models exhibit some capability at capturing the multi-label nature of the dataset even though they were trained under the single-label assumption. Consequently, we propose a new evaluation approach to augment existing approaches that assess this capability. Our findings highlight the importance of considering the multi-label nature of the ImageNet dataset during benchmarking. Failing to do so could lead to incorrect conclusions regarding the effectiveness of DNNs and divert research efforts from addressing other substantial challenges related to the reliability and robustness of these models.
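
The gap between the two evaluation protocols can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and not taken from the paper; the sketch only assumes that each image comes with a set of acceptable labels in addition to its single annotated label.

```python
def single_label_accuracy(predictions, primary_labels):
    """Top-1 accuracy: a prediction is correct only if it equals the one annotated label."""
    correct = sum(p == y for p, y in zip(predictions, primary_labels))
    return correct / len(predictions)


def multi_label_accuracy(predictions, label_sets):
    """A prediction counts as correct if it matches any label that is valid for the image."""
    correct = sum(p in labels for p, labels in zip(predictions, label_sets))
    return correct / len(predictions)


# Toy data (hypothetical): the second image shows both a desk and a laptop.
predictions = ["dog", "laptop", "cat"]
primary_labels = ["dog", "desk", "tiger"]
label_sets = [{"dog"}, {"desk", "laptop"}, {"tiger"}]

single = single_label_accuracy(predictions, primary_labels)
multi = multi_label_accuracy(predictions, label_sets)
```

On this toy data the "laptop" prediction is penalized under the single-label protocol but accepted under the multi-label one, which is the kind of discrepancy the abstract attributes the reported accuracy drops to.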

[CV-5] Extract Free Dense Misalignment from CLIP AAAI2025

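Quick read: This paper addresses the problem that vision-language foundation models frequently produce outputs misaligned with their inputs, such as object hallucination in image captioning and prompt misalignment in text-to-image generation. The key to the solution is a new method named CLIP4DM, which detects dense misalignments from a pre-trained CLIP model and in particular pinpoints the misaligned words between image and text. The core contributions are a revamped gradient-based attribution computation in which the negative gradient of an individual text token indicates misalignment, and F-CLIPScore, which aggregates misaligned attributions with a global alignment score to assess alignment quality. The method achieves state-of-the-art performance among zero-shot models and is more efficient than fine-tuned models.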

Link: https://arxiv.org/abs/2412.18404
Authors: JeongYeon Nam, Jinbae Im, Wonjae Kim, Taeho Kil
Affiliations: NAVER Corp.; Korea Advanced Institute of Science and Technology (KAIST)
Keywords: frequently produce outputs, Recent vision-language foundation, produce outputs misaligned, Recent vision-language, vision-language foundation models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 14 figures, AAAI 2025

Abstract:Recent vision-language foundation models still frequently produce outputs misaligned with their inputs, evidenced by object hallucination in captioning and prompt misalignment in the text-to-image generation model. Recent studies have explored methods for identifying misaligned elements, aiming not only to enhance interpretability but also to improve model performance. However, current approaches primarily rely on large foundation models in a zero-shot manner or fine-tuned models with human annotations, which limits scalability due to significant computational costs. This work proposes a novel approach, dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP, specifically focusing on pinpointing misaligned words between image and text. We carefully revamp the gradient-based attribution computation method, enabling negative gradient of individual text tokens to indicate misalignment. We also propose F-CLIPScore, which aggregates misaligned attributions with a global alignment score. We evaluate our method on various dense misalignment detection benchmarks, covering various image and text domains and misalignment types. Our method demonstrates state-of-the-art performance among zero-shot models and competitive performance with fine-tuned models while maintaining superior efficiency. Our qualitative examples show that our method has a unique strength to detect entity-level objects, intangible objects, and attributes that can not be easily detected for existing works. We conduct ablation studies and analyses to highlight the strengths and limitations of our approach. Our code is publicly available at this https URL.

[CV-6] RDPM: Solve Diffusion Probabilistic Models via Recurrent Token Prediction

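Quick read: This paper addresses the handling of continuous signals in high-fidelity image synthesis and proposes a new generative framework, the Recurrent Diffusion Probabilistic Model (RDPM). The key to the solution is to enhance the diffusion process with a recurrent token prediction mechanism, thereby pioneering discrete diffusion. RDPM progressively introduces Gaussian noise into the latent representations of images and encodes them into vector-quantized tokens, which yields a diffusion process on a discrete-value domain. The process iteratively predicts the token codes for subsequent timesteps, transforming the initial standard Gaussian noise into the source data distribution, and its loss function is aligned with that of GPT-style models. RDPM not only uses the diffusion process to ensure high-quality generation but also converts continuous signals into a series of high-fidelity discrete tokens, so that they share a unified optimization strategy with other discrete tokens such as text. This offers a new direction for the unified development of multimodal generative models.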

Link: https://arxiv.org/abs/2412.18390
Authors: Wu Xiaoping, Hu Jie, Wei Xiaoming
Affiliations: Meituan; Institute of Software, Chinese Academy of Sciences
Keywords: Large Language Models, Large Language, Recurrent Diffusion Probabilistic, operating diffusion processes, Diffusion Probabilistic Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 8 pages

Abstract:Diffusion Probabilistic Models (DPMs) have emerged as the de facto approach for high-fidelity image synthesis, operating diffusion processes on continuous VAE latent, which significantly differ from the text generation methods employed by Large Language Models (LLMs). In this paper, we introduce a novel generative framework, the Recurrent Diffusion Probabilistic Model (RDPM), which enhances the diffusion process through a recurrent token prediction mechanism, thereby pioneering the field of Discrete Diffusion. By progressively introducing Gaussian noise into the latent representations of images and encoding them into vector-quantized tokens in a recurrent manner, RDPM facilitates a unique diffusion process on discrete-value domains. This process iteratively predicts the token codes for subsequent timesteps, transforming the initial standard Gaussian noise into the source data distribution, aligning with GPT-style models in terms of the loss function. RDPM demonstrates superior performance while benefiting from the speed advantage of requiring only a few inference steps. This model not only leverages the diffusion process to ensure high-quality generation but also converts continuous signals into a series of high-fidelity discrete tokens, thereby maintaining a unified optimization strategy with other discrete tokens, such as text. We anticipate that this work will contribute to the development of a unified model for multimodal generation, specifically by integrating continuous signal domains such as images, videos, and audio with text. We will release the code and model weights to the open-source community.
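
The idea of turning progressively noised latents into discrete tokens can be sketched with plain nearest-neighbour vector quantization. This is only an illustration under stated assumptions: the abstract does not describe the actual encoder, noise schedule, or recurrent predictor, so the codebook, the noise levels, and the helper names below are all hypothetical.

```python
import random


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared Euclidean distance)."""
    tokens = []
    for v in vectors:
        distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def noisy_token_sequence(latents, codebook, noise_levels, rng):
    """Quantize the latents once per noise level, adding fresh Gaussian noise each time."""
    sequence = []
    for sigma in noise_levels:
        noisy = [[x + rng.gauss(0.0, sigma) for x in v] for v in latents]
        sequence.append(quantize(noisy, codebook))
    return sequence


# Hypothetical 2-D codebook and latents, chosen only to keep the example small.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, 0.0], [0.9, 1.1]]

clean_tokens = quantize(latents, codebook)
steps = noisy_token_sequence(
    latents, codebook, noise_levels=[0.0, 0.5, 1.0], rng=random.Random(0)
)
```

Each entry of `steps` is a token sequence for one noise level. A model trained to predict the tokens of one step from those of a noisier step could then use a cross-entropy loss, which is the sense in which the abstract describes alignment with GPT-style models.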

[CV-7] Switch-a-View: Few-Shot View Selection Learned from Edited Videos

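Quick read: This paper addresses automatic viewpoint selection when creating how-to videos and proposes a model named Switch-a-View. The key to the solution is how to train such a model from unlabeled but human-edited video samples. Specifically, a pretext task pseudo-labels segments of the training videos with their primary viewpoint (egocentric or exocentric) and then discovers the patterns that link view-switch moments to the visual and spoken content of the video. Based on these patterns, the model takes an unseen multi-view video as input and decides which viewpoint to display when. The paper also introduces a few-shot training setting that lets the model adapt to a new data domain.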

Link: https://arxiv.org/abs/2412.18386
Authors: Sagnik Majumder, Tushar Nagarajan, Ziad Al-Halah, Kristen Grauman
Affiliations: UT Austin; University of Utah
Keywords: learns to automatically, automatically select, timepoint when creating, how-to video, video
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce Switch-a-View, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled (but human-edited) video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between those view-switch moments on the one hand and the visual and spoken content in the how-to video on the other hand. Armed with this predictor, our model then takes an unseen multi-view video as input and orchestrates which viewpoint should be displayed when. We further introduce a few-shot training setting that permits steering the model towards a new data domain. We demonstrate our idea on a variety of real-world videos from HowTo100M and Ego-Exo4D and rigorously validate its advantages.

[CV-8] RSGaussian: 3D Gaussian Splatting with LiDAR for Aerial Remote Sensing Novel View Synthesis

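Quick read: This paper addresses novel view synthesis (NVS) for aerial remote sensing scenes, in particular the geometric alignment between LiDAR point clouds and 2D images and the accuracy of depth estimation. The key to the solution is RSGaussian, which introduces LiDAR point clouds as constraints into the 3D Gaussian Splatting method so that Gaussians grow and split along geometric benchmarks, which addresses the overgrowth and floater issues. In addition, coordinate transformations with distortion parameters achieve pixel-level alignment between LiDAR point clouds and 2D images and facilitate heterogeneous data fusion. Depth and plane consistency losses are added to the loss function to guide the Gaussians towards real depth and plane representations, which significantly improves depth estimation accuracy.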

Link: https://arxiv.org/abs/2412.18380
Authors: Yiling Yao, Wenjuan Zhang, Bing Zhang, Bocheng Li, Yaning Wang, Bowen Wang
Affiliations: unknown
Keywords: Gaussian Splatting method, study presents RSGaussian, floaters issues occurs, aerial remote sensing, LiDAR point cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:This study presents RSGaussian, an innovative novel view synthesis (NVS) method for aerial remote sensing scenes that incorporates LiDAR point clouds as constraints into the 3D Gaussian Splatting method. This ensures that Gaussians grow and split along geometric benchmarks, addressing the overgrowth and floater issues that otherwise occur. Additionally, the approach introduces coordinate transformations with distortion parameters for camera models to achieve pixel-level alignment between LiDAR point clouds and 2D images, facilitating heterogeneous data fusion and achieving the high-precision geo-alignment required in aerial remote sensing. Depth and plane consistency losses are incorporated into the loss function to guide Gaussians towards real depth and plane representations, significantly improving depth estimation accuracy. Experimental results indicate that our approach achieves novel view synthesis that balances photo-realistic visual quality and high-precision geometric estimation on aerial remote sensing datasets. Finally, we have also established and open-sourced a dense LiDAR point cloud dataset along with its corresponding aerial multi-view images, AIR-LONGYAN.
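
Pixel-level alignment of a LiDAR point with an image can be sketched as a pinhole projection followed by a lens distortion term. The abstract does not say which distortion model RSGaussian uses, so the two-coefficient radial model (`k1`, `k2`) and all parameter values below are assumptions made for illustration.

```python
def project_point(point_cam, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Project a 3D point in the camera frame to pixel coordinates.

    Uses a pinhole model with a two-coefficient radial distortion (an assumed model).
    Returns (u, v, depth), or None if the point lies behind the camera.
    """
    x, y, z = point_cam
    if z <= 0:
        return None
    # Normalized image coordinates.
    xn, yn = x / z, y / z
    # Radial distortion factor based on the squared distance from the optical axis.
    r2 = xn * xn + yn * yn
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    u = fx * xn * scale + cx
    v = fy * yn * scale + cy
    return u, v, z


# Hypothetical intrinsics: focal length 1000 px, principal point (500, 400).
on_axis = project_point((0.0, 0.0, 10.0), 1000.0, 1000.0, 500.0, 400.0)
off_axis = project_point((1.0, 0.0, 10.0), 1000.0, 1000.0, 500.0, 400.0, k1=-0.1)
behind = project_point((0.0, 0.0, -1.0), 1000.0, 1000.0, 500.0, 400.0)
```

The returned depth is what a depth consistency loss could compare against the depth rendered from the Gaussians at pixel `(u, v)`.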

Artificial Intelligence

[AI-0] MotifGPL: Motif-Enhanced Graph Prototype Learning for Deciphering Urban Social Segregation AAAI-25

Link: https://arxiv.org/abs/2412.18464
Authors: Tengfei He, Xiao Zhou
Keywords: urban, spanning racial, Social segregation, income dimensions, diverse and severe
Subjects: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Accepted by the 39th Annual AAAI Conference on Artificial Intelligence (AAAI-25); 10 pages, 8 figures, 3 tables; Includes the appendix

Abstract:Social segregation in cities, spanning racial, residential, and income dimensions, is becoming more diverse and severe. As urban spaces and social relations grow more complex, residents in metropolitan areas experience varying levels of social segregation. If left unaddressed, this could lead to increased crime rates, heightened social tensions, and other serious issues. Effectively quantifying and analyzing the structures within urban spaces and resident interactions is crucial for addressing segregation. Previous studies have mainly focused on surface-level indicators of urban segregation, lacking comprehensive analyses of urban structure and mobility. This limitation fails to capture the full complexity of segregation. To address this gap, we propose a framework named Motif-Enhanced Graph Prototype Learning (MotifGPL), which consists of three key modules: prototype-based graph structure extraction, motif distribution discovery, and urban graph structure reconstruction. Specifically, we use graph structure prototype learning to extract key prototypes from both the urban spatial graph and the origin-destination graph, incorporating key urban attributes such as points of interest, street view images, and flow indices. To enhance interpretability, the motif distribution discovery module matches each prototype with similar motifs, representing simpler graph structures reflecting local patterns. Finally, we use the motif distribution results to guide the reconstruction of the two graphs. This model enables a detailed exploration of urban spatial structures and resident mobility patterns, helping identify and analyze motif patterns that influence urban segregation, guiding the reconstruction of urban graph structures. Experimental results demonstrate that MotifGPL effectively reveals the key motifs affecting urban social segregation and offers robust guidance for mitigating this issue.

[AI-1] GeFL: Model-Agnostic Federated Learning with Generative Models

Link: https://arxiv.org/abs/2412.18460
Authors: Honggu Kang, Seohyeon Cha, Joonhyuk Kang
Keywords: Federated learning, promising paradigm, paradigm in distributed, distributed learning, Model-Aided Federated Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages

Abstract:Federated learning (FL) is a promising paradigm in distributed learning while preserving the privacy of users. However, the increasing size of recent models makes it unaffordable for a few users to encompass the model. It leads the users to adopt heterogeneous models based on their diverse computing capabilities and network bandwidth. Correspondingly, FL with heterogeneous models should be addressed, given that FL typically involves training a single global model. In this paper, we propose Generative Model-Aided Federated Learning (GeFL), incorporating a generative model that aggregates global knowledge across users of heterogeneous models. Our experiments on various classification tasks demonstrate notable performance improvements of GeFL compared to baselines, as well as limitations in terms of privacy and scalability. To tackle these concerns, we introduce a novel framework, GeFL-F. It trains target networks aided by feature-generative models. We empirically demonstrate the consistent performance gains of GeFL-F, while demonstrating better privacy preservation and robustness to a large number of clients. Codes are available at [1].

[AI-2] Multi-Agent Norm Perception and Induction in Distributed Healthcare

Link: https://arxiv.org/abs/2412.18454
Authors: Chao Li, Olga Petruchik, Elizaveta Grishanina, Sergey Kovalchuk
Keywords: Induction Learning Model, Perception and Induction, Induction Learning, Learning Model aimed, distributed healthcare environments
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 15 pages,8 figures,152 conferences,3 tables

Abstract:This paper presents a Multi-Agent Norm Perception and Induction Learning Model aimed at facilitating the integration of autonomous agent systems into distributed healthcare environments through dynamic interaction processes. The nature of the medical norm system and its sharing channels necessitates distinct approaches for Multi-Agent Systems to learn two types of norms. Building on this foundation, the model enables agents to simultaneously learn descriptive norms, which capture collective tendencies, and prescriptive norms, which dictate ideal behaviors. Through parameterized mixed probability density models and practice-enhanced Markov games, the multi-agent system perceives descriptive norms in dynamic interactions and captures emergent prescriptive norms. We conducted experiments using a dataset from a neurological medical center spanning from 2016 to 2020.

[AI-3] SoK: On the Offensive Potential of AI

Link: https://arxiv.org/abs/2412.18442
Authors: Saskia Laura Schröer, Giovanni Apruzzese, Soheil Human, Pavel Laskov, Hyrum S. Anderson, Edward W. N. Bernroider, Aurore Fass, Ben Nassi, Vera Rimmer, Fabio Roli, Samer Salam, Ashley Shen, Ali Sunyaev, Tim Wadwha-Brown, Isabel Wagner, Gang Wang
Keywords: society increasingly benefits, increasingly benefits, Artificial Intelligence, society increasingly, offensive
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Our society increasingly benefits from Artificial Intelligence (AI). Unfortunately, more and more evidence shows that AI is also used for offensive purposes. Prior works have revealed various examples of use cases in which the deployment of AI can lead to violation of security and privacy objectives. No extant work, however, has been able to draw a holistic picture of the offensive potential of AI. In this SoK paper we seek to lay the ground for a systematic analysis of the heterogeneous capabilities of offensive AI. In particular we (i) account for AI risks to both humans and systems while (ii) consolidating and distilling knowledge from academic literature, expert opinions, industrial venues, as well as laymen, all of which are valuable sources of information on offensive AI. To enable alignment of such diverse sources of knowledge, we devise a common set of criteria reflecting essential technological factors related to offensive AI. With the help of such criteria, we systematically analyze: 95 research papers; 38 InfoSec briefings (from, e.g., BlackHat); the responses of a user study (N=549) entailing individuals with diverse backgrounds and expertise; and the opinion of 12 experts. Our contributions not only reveal concerning ways (some of which overlooked by prior work) in which AI can be offensively used today, but also represent a foothold to address this threat in the years to come.

[AI-4] GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent

Link: https://arxiv.org/abs/2412.18426
Authors: Kangjia Zhao, Jiahui Song, Leigang Sha, Haozhan Shen, Zhi Chen, Tiancheng Zhao, Xiubo Liang, Jianwei Yin
Keywords: automated GUI Testing, GUI Testing, GUI, Autonomous GUI Testing, hot topic
Subjects: Artificial Intelligence (cs.AI)

Abstract:Nowadays, research on GUI agents is a hot topic in the AI community. However, current research focuses on GUI task automation, limiting the scope of applications in various GUI scenarios. In this paper, we propose a formalized and comprehensive environment to evaluate the entire process of automated GUI Testing (GTArena), offering a fair, standardized environment for consistent operation of diverse multimodal large language models. We divide the testing process into three key subtasks: test intention generation, test task execution, and GUI defect detection, and construct a benchmark dataset based on these to conduct a comprehensive evaluation. It evaluates the performance of different models using three data types: real mobile applications, mobile applications with artificially injected defects, and synthetic data, thoroughly assessing their capabilities in this relevant task. Additionally, we propose a method that helps researchers explore the correlation between the performance of multimodal large language models in specific scenarios and their general capabilities in standard benchmark tests. Experimental results indicate that even the most advanced models struggle to perform well across all subtasks of automated GUI Testing, highlighting a significant gap between the current capabilities of Autonomous GUI Testing and its practical, real-world applicability. This gap provides guidance for the future direction of GUI Agent development. Our code is available at this https URL.

[AI-5] Research on the Proximity Relationships of Psychosomatic Disease Knowledge Graph Modules Extracted by Large Language Models

Link: https://arxiv.org/abs/2412.18419
Authors: Zihan Zhou, Ziyi Zeng, Wenhao Jiang, Yihui Zhu, Jiaxin Mao, Yonggui Yuan, Min Xia, Shubin Zhao, Mengyu Yao, Yunqian Chen
Keywords: global health issues, social changes accelerate, significantly increased, major challenge, challenge in global
Subjects: Artificial Intelligence (cs.AI)

Abstract:As social changes accelerate, the incidence of psychosomatic disorders has significantly increased, becoming a major challenge in global health issues. This necessitates an innovative knowledge system and analytical methods to aid in diagnosis and treatment. Here, we establish the ontology model and entity types, using the BERT model and LoRA-tuned LLM for named entity recognition, constructing the knowledge graph with 9668 triples. Next, by analyzing the network distances between disease, symptom, and drug modules, it was found that closer network distances among diseases can predict greater similarities in their clinical manifestations, treatment approaches, and psychological mechanisms, and closer distances between symptoms indicate that they are more likely to co-occur. Lastly, by comparing the proximity d and proximity z score, it was shown that symptom-disease pairs in primary diagnostic relationships have a stronger association and are of higher referential value than those in diagnostic relationships. The research results revealed the potential connections between diseases, co-occurring symptoms, and similarities in treatment strategies, providing new perspectives for the diagnosis and treatment of psychosomatic disorders and valuable information for future mental health research and practice.
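
The network distance between two modules of a knowledge graph can be sketched as follows. This uses one common definition of network proximity, the average over nodes in one module of the shortest-path distance to the closest node in the other module; the abstract does not state which variant the authors use, and the tiny graph with abstract node names is invented for illustration.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search returning hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def closest_distance(graph, module_a, module_b):
    """Average, over nodes in module_a, of the distance to the nearest node in module_b."""
    total = 0.0
    for a in module_a:
        dist = shortest_path_lengths(graph, a)
        total += min(dist.get(b, float("inf")) for b in module_b)
    return total / len(module_a)


# Toy undirected graph (hypothetical): s* are symptoms, d* diseases, m1 a drug.
graph = {
    "s1": ["d1"],
    "s2": ["d1", "d2"],
    "d1": ["s1", "s2", "m1"],
    "d2": ["s2"],
    "m1": ["d1"],
}

disease_to_drug = closest_distance(graph, ["d1", "d2"], ["m1"])
```

A z-score variant would compare this value against the distribution obtained from randomly sampled node sets of the same size, which is omitted here.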

[AI-6] Exploring Flexible Scenario Generation in Godot Simulator

Link: https://arxiv.org/abs/2412.18408
Authors: Daniel Peraltai, Xin Qin
Keywords: Cyber-physical systems, physical components engineered, combine cyber, cyber and physical, physical components
Subjects: Artificial Intelligence (cs.AI)

Abstract:Cyber-physical systems (CPS) combine cyber and physical components engineered to make decisions and interact within dynamic environments. Ensuring the safety of CPS is of great importance, requiring extensive testing across diverse and complex scenarios. To generate as many testing scenarios as possible, previous efforts have focused on describing scenarios using formal languages to generate scenes. In this paper, we introduce an alternative approach: reconstructing scenes inside the open-source game engine, Godot. We have developed a pipeline that enables the reconstruction of testing scenes directly from provided images of scenarios. These reconstructed scenes can then be deployed within simulated environments to assess a CPS. This approach offers a scalable and flexible solution for testing CPS in realistic environments.

[AI-7] TPAoI: Ensuring Fresh Service Status at the Network Edge in Compute-First Networking

Link: https://arxiv.org/abs/2412.18391
Authors: Haosheng He, Jianpeng Qi, Chao Liu, Junyu Dong, Yanwei Yu
Keywords: accurate status information, compute-first networking, maintaining fresh, fresh and accurate, crucial for effective
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)

Abstract:In compute-first networking, maintaining fresh and accurate status information at the network edge is crucial for effective access to remote services. This process typically involves three phases: Status updating, user accessing, and user requesting. However, current studies on status effectiveness, such as Age of Information at Query (QAoI), do not comprehensively cover all these phases. Therefore, this paper introduces a novel metric, TPAoI, aimed at optimizing update decisions by measuring the freshness of service status. The stochastic nature of edge environments, characterized by unpredictable communication delays in updating, requesting, and user access times, poses a significant challenge when modeling. To address this, we model the problem as a Markov Decision Process (MDP) and employ a Dueling Double Deep Q-Network (D3QN) algorithm for optimization. Extensive experiments demonstrate that the proposed TPAoI metric effectively minimizes AoI, ensuring timely and reliable service updates in dynamic edge environments. Results indicate that TPAoI reduces AoI by an average of 47% compared to QAoI metrics and decreases update frequency by an average of 48% relative to conventional AoI metrics, showing significant improvement.
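
For readers unfamiliar with Age of Information, the baseline quantity can be sketched as below: the age at a query time is the time elapsed since the freshest update that has already been delivered was generated. This is the standard AoI notion evaluated at query instants, not the TPAoI metric itself, whose definition is not given in the abstract; the function name and the update timestamps are hypothetical.

```python
def age_at(query_time, updates):
    """Age of Information at query_time.

    updates is a list of (generated_at, delivered_at) pairs.
    Returns None if no update has been delivered by query_time.
    """
    generated = [g for g, d in updates if d <= query_time]
    if not generated:
        return None
    # Age is measured from the generation time of the freshest delivered update.
    return query_time - max(generated)


# Hypothetical status updates: (generation time, delivery time) in seconds.
updates = [(0.0, 1.0), (4.0, 6.0), (7.0, 7.5)]
ages = [age_at(t, updates) for t in (0.5, 3.0, 6.0, 8.0)]
```

An update policy such as the D3QN agent mentioned in the abstract would trade off how often updates are sent against how large these ages are when users actually query.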

[AI-8] Weak Scaling Capability in Token Space: An Observation from Large Vision Language Model

Link: https://arxiv.org/abs/2412.18387
Authors: Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao
Keywords: training data, widely validated, size of training, scaling capability, vision tokens
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The scaling capability has been widely validated with respect to the number of parameters and the size of training data. One important question that remains unexplored is whether a similar scaling capability also exists with respect to the number of vision tokens. This study fills the gap by investigating the relationship between the number of vision tokens and the performance of vision-language models. Our theoretical analysis and empirical evaluations reveal that the model exhibits weak scaling capabilities on the length \(N_l\), with performance approximately \(S(N_l) \approx (c/N_l)^\alpha\), where \(c, \alpha\) are hyperparameters. Interestingly, this scaling behavior remains largely unaffected by the inclusion or exclusion of the user’s question in the input. Furthermore, fusing the user’s question with the vision token can enhance model performance when the question is relevant to the task. To address the computational challenges associated with large-scale vision tokens, we propose a novel architecture that efficiently reduces the token count while integrating user question tokens into the representation. Our findings may offer insights for developing more efficient and effective vision-language models under specific task constraints.
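
A relation of the form S(N_l) ≈ (c/N_l)^α becomes linear after taking logarithms: log S = α·log c − α·log N_l. The sketch below fits c and α by ordinary least squares in log-log space. It is an illustration only: the function name is hypothetical, the data are synthetic and generated from the formula itself, and nothing here reproduces the paper's measurements or fitting procedure.

```python
import math


def fit_power_law(token_counts, scores):
    """Fit S(N) = (c / N) ** alpha by linear regression of log S on log N."""
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(s) for s in scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope                      # slope of the log-log line is -alpha
    c = math.exp(intercept / alpha)     # intercept equals alpha * log(c)
    return c, alpha


# Synthetic data generated from the formula with assumed values c=50, alpha=0.3.
true_c, true_alpha = 50.0, 0.3
token_counts = [16, 64, 256, 1024]
scores = [(true_c / n) ** true_alpha for n in token_counts]

c_hat, alpha_hat = fit_power_law(token_counts, scores)
```

A small fitted α means S changes slowly as the number of vision tokens grows, which matches the "weak scaling" wording of the abstract.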

[AI-9] A Many Objective Problem Where Crossover is Provably Indispensable AAAI2025

Link: https://arxiv.org/abs/2412.18375
Authors: Andre Opris
Keywords: paper addresses theory, evolutionary multiobjective optimisation, paper addresses, addresses theory, theory in evolutionary
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: To appear in the proceedings of AAAI 2025

Abstract:This paper addresses theory in evolutionary multiobjective optimisation (EMO) and focuses on the role of crossover operators in many-objective optimisation. The advantages of using crossover are hardly understood and rigorous runtime analyses with crossover are lagging far behind its use in practice, specifically in the case of more than two objectives. We present a many-objective problem class together with a theoretical runtime analysis of the widely used NSGA-III to demonstrate that crossover can yield an exponential speedup on the runtime. In particular, this algorithm can find the Pareto set in expected polynomial time when using crossover while without crossover it requires exponential time to even find a single Pareto-optimal point. To our knowledge, this is the first rigorous runtime analysis in many-objective optimisation demonstrating an exponential performance gap when using crossover for more than two objectives.
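
The two variation operators contrasted in such runtime analyses can be sketched on bit strings. Uniform crossover and standard bit mutation are used here purely as familiar examples; the abstract does not specify which crossover operator the analysis uses, and this sketch implements neither NSGA-III nor the paper's problem class.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Build a child by taking each bit from either parent with probability 1/2."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def bit_flip_mutation(bits, rng):
    """Flip each bit independently with probability 1/n (standard bit mutation)."""
    n = len(bits)
    return [1 - b if rng.random() < 1.0 / n else b for b in bits]


rng = random.Random(0)
parent_a = [1, 1, 1, 1, 0, 0, 0, 0]
parent_b = [0, 0, 0, 0, 1, 1, 1, 1]

child = uniform_crossover(parent_a, parent_b, rng)
mutant = bit_flip_mutation(parent_a, rng)
```

Crossover can combine complementary parents into an offspring far from both in a single step, whereas mutation alone flips one bit per step in expectation, which gives some intuition for why the gap between the two can be large.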

[AI-10] Unveiling the Threat of Fraud Gangs to Graph Neural Networks: Multi-Target Graph Injection Attacks against GNN-Based Fraud Detectors AAAI2025

Link: https://arxiv.org/abs/2412.18370
Authors: Jinhyeok Choi, Heehyeon Kim, Joyce Jiyoung Whang
Keywords: identifying fraudulent users, uncovering malicious behaviors, Graph neural networks, graph injection attack, neural networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 19 pages, 5 figures, 12 tables, The 39th AAAI Conference on Artificial Intelligence (AAAI 2025)

Abstract:Graph neural networks (GNNs) have emerged as an effective tool for fraud detection, identifying fraudulent users, and uncovering malicious behaviors. However, attacks against GNN-based fraud detectors and their risks have rarely been studied, thereby leaving potential threats unaddressed. Recent findings suggest that frauds are increasingly organized as gangs or groups. In this work, we design attack scenarios where fraud gangs aim to make their fraud nodes misclassified as benign by camouflaging their illicit activities in collusion. Based on these scenarios, we study adversarial attacks against GNN-based fraud detectors by simulating attacks of fraud gangs in three real-world fraud cases: spam reviews, fake news, and medical insurance frauds. We define these attacks as multi-target graph injection attacks and propose MonTi, a transformer-based Multi-target one-Time graph injection attack model. MonTi simultaneously generates attributes and edges of all attack nodes with a transformer encoder, capturing interdependencies between attributes and edges more effectively than most existing graph injection attack methods that generate these elements sequentially. Additionally, MonTi adaptively allocates the degree budget for each attack node to explore diverse injection structures involving target, candidate, and attack nodes, unlike existing methods that fix the degree budget across all attack nodes. Experiments show that MonTi outperforms the state-of-the-art graph injection attack methods on five real-world graphs.

Machine Learning

[LG-0] MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning

Link: https://arxiv.org/abs/2412.18437
Authors: Abdelmadjid Chergui, Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier
Keywords: diverse data types, Choosing a suitable, suitable deep learning, data types, structures and characteristics
Subjects: Machine Learning (cs.LG)

Abstract:Choosing a suitable deep learning architecture for multimodal data fusion is a challenging task, as it requires the effective integration and processing of diverse data types, each with distinct structures and characteristics. In this paper, we introduce MixMAS, a novel framework for sampling-based mixer architecture search tailored to multimodal learning. Our approach automatically selects the optimal MLP-based architecture for a given multimodal machine learning (MML) task. Specifically, MixMAS utilizes a sampling-based micro-benchmarking strategy to explore various combinations of modality-specific encoders, fusion functions, and fusion networks, systematically identifying the architecture that best meets the task’s performance metrics.

Information Retrieval

[IR-0] Contrastive Representation for Interactive Recommendation AAAI-2025

Link: https://arxiv.org/abs/2412.18396
Authors: Jingyu Li, Zhiyong Feng, Dongxiao He, Hongqi Chen, Qinghang Gao, Guoli Wu
Keywords: long term objectives, gained significant attention, significant attention recently, capture dynamic interest, quickly capture dynamic
Subjects: Information Retrieval (cs.IR)
Comments: AAAI-2025 Accepted paper

Abstract:Interactive Recommendation (IR) has gained significant attention recently for its capability to quickly capture dynamic interest and optimize both short- and long-term objectives. IR agents are typically implemented through Deep Reinforcement Learning (DRL), because DRL is inherently compatible with the dynamic nature of IR. However, DRL is currently not perfect for IR. Due to the large action space and sample inefficiency problem, training DRL recommender agents is challenging. The key point is that useful features cannot be extracted as high-quality representations for the recommender agent to optimize its policy. To tackle this problem, we propose Contrastive Representation for Interactive Recommendation (CRIR). CRIR efficiently extracts latent, high-level preference ranking features from explicit interaction, and leverages the features to enhance users’ representation. Specifically, CRIR provides representation through one representation network, and refines it through our proposed Preference Ranking Contrastive Learning (PRCL). The key insight of PRCL is that it can perform contrastive learning without relying on computations involving high-level representations or large potential action sets. Furthermore, we also propose a data exploiting mechanism and an agent training mechanism to better adapt CRIR to the DRL backbone. Extensive experiments have been carried out to show our method’s superior improvement in sample efficiency while training a DRL-based IR agent.

[IR-1] RaSeRec: Retrieval-Augmented Sequential Recommendation

Link: https://arxiv.org/abs/2412.18378
Authors: Xinping Zhao, Baotian Hu, Yan Zhong, Shouzheng Huang, Zihao Zheng, Meng Wang, Haofen Wang, Min zhang
Keywords: dominate parametric learning, neural network architectures, recall long tails, achieved improved performance, powerful neural network
Subjects: Information Retrieval (cs.IR)
Comments: 20 pages, 8 figures, 8 tables

Abstract:Although prevailing supervised and self-supervised learning (SSL)-augmented sequential recommendation (SeRec) models have achieved improved performance with powerful neural network architectures, we argue that they still suffer from two limitations: (1) Preference Drift, where models trained on past data can hardly accommodate evolving user preference; and (2) Implicit Memory, where head patterns dominate parametric learning, making it harder to recall long tails. In this work, we explore retrieval augmentation in SeRec, to address these limitations. To this end, we propose a Retrieval-Augmented Sequential Recommendation framework, named RaSeRec, the main idea of which is to maintain a dynamic memory bank to accommodate preference drifts and retrieve relevant memories to augment user modeling explicitly. It consists of two stages: (i) collaborative-based pre-training, which learns to recommend and retrieve; (ii) retrieval-augmented fine-tuning, which learns to leverage retrieved memories. Extensive experiments on three datasets fully demonstrate the superiority and effectiveness of RaSeRec.
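
The retrieve-then-augment idea can be sketched with a small in-memory bank of vectors. Everything below is an assumption made for illustration: cosine similarity as the retrieval score, mean pooling of the retrieved memories, and a fixed blending weight are not taken from the paper, which does not describe its retrieval or fusion mechanism in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, memory_bank, k):
    """Return the k memory vectors most similar to the query."""
    ranked = sorted(memory_bank, key=lambda m: cosine(query, m), reverse=True)
    return ranked[:k]


def augment(query, memories, weight=0.5):
    """Blend the query with the mean of the retrieved memories (a hypothetical fusion rule)."""
    if not memories:
        return list(query)
    mean = [sum(col) / len(memories) for col in zip(*memories)]
    return [(1 - weight) * q + weight * m for q, m in zip(query, mean)]


# Toy user representation and memory bank (hypothetical 2-D embeddings).
memory_bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.2]

top = retrieve(query, memory_bank, k=2)
augmented = augment(query, top)
```

Because the memory bank is a plain list, new interaction embeddings can be appended at any time without retraining, which is the property the abstract relies on to accommodate preference drift.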
