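This post lists the latest papers retrieved from arXiv.org on 2025-01-10. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.
Note: the daily paper data is retrieved from arXiv.org and updated automatically every morning at around 12:00.
Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.
Contents
Overview (2025-01-10)
345 papers were updated today, including:
- Natural Language Processing: 38 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 85 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 87 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 102 papers (Machine Learning (cs.LG))
Natural Language Processing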
[NLP-0] ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
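[Quick read]: This paper addresses the lack of multihop selective attention in multimodal large language models (multimodal LLMs) on structured image understanding tasks such as interpreting tables and charts. Current models perform poorly when they must reason across different structures and text within an image. To address this, the paper proposes ReFocus, a framework that generates Python code to visually edit the input image, producing "visual thoughts" that progressively shift and refine the visual focus. Specifically, ReFocus lets the model call tools through code to sequentially draw boxes, highlight sections, and mask out areas, which strengthens the visual reasoning process. Experiments show that ReFocus improves performance over GPT-4o without visual editing, with average gains of 11.0% on table tasks and 6.8% on chart tasks. In addition, the visual chain-of-thought with intermediate information that ReFocus produces offers better supervision than standard visual question answering (VQA) data, further improving model performance.
Link: https://arxiv.org/abs/2501.05452
Authors: Xingyu Fu,Minqian Liu,Zhengyuan Yang,John Corring,Yijuan Lu,Jianwei Yang,Dan Roth,Dinei Florencio,Cha Zhang
Institutions: University of Pennsylvania; Virginia Tech; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Project link: this https URL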
Abstract:Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate “visual thoughts” by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python codes to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment upon a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve the performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers a better supervision than standard VQA data, reaching a 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.
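The kind of visual edit described above (drawing a box, masking out an area) can be illustrated with a toy sketch. This is not the ReFocus tool set: the function names `mask_region` and `draw_box` are hypothetical, and the image is assumed to be a grayscale grid stored as a list of lists so the example needs no imaging library.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels: 0 = black, 255 = white


def mask_region(img: Image, top: int, left: int, bottom: int, right: int,
                fill: int = 255) -> Image:
    """Return a copy of img with the rectangle [top, bottom) x [left, right) filled."""
    out = [row[:] for row in img]
    for r in range(max(top, 0), min(bottom, len(out))):
        for c in range(max(left, 0), min(right, len(out[r]))):
            out[r][c] = fill
    return out


def draw_box(img: Image, top: int, left: int, bottom: int, right: int,
             color: int = 0) -> Image:
    """Return a copy of img with a one-pixel outline around the rectangle."""
    out = [row[:] for row in img]
    for r in range(max(top, 0), min(bottom, len(out))):
        for c in range(max(left, 0), min(right, len(out[r]))):
            # Only pixels on the border of the rectangle are recolored.
            if r in (top, bottom - 1) or c in (left, right - 1):
                out[r][c] = color
    return out


# Hypothetical usage: hide the two rightmost columns of a 4x4 "table image",
# then outline the region the model should attend to next.
image = [[128] * 4 for _ in range(4)]
masked = mask_region(image, 0, 2, 4, 4)
focused = draw_box(masked, 0, 0, 3, 2)
```

Both helpers return copies, so each edit in a sequence can be kept as a separate intermediate image.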
[NLP-1] A survey of textual cyber abuse detection using cutting-edge language models and large language models
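[Quick read]: This paper examines the forms of online abuse found on social media platforms, including hate speech, cyberbullying, emotional abuse, grooming, and sexting. Its central focus is how emerging technologies, particularly Language Models (LMs) and Large Language Models (LLMs), are reshaping both the detection and the generation of abusive content. The paper explores the mechanisms through which social media abuse is perpetuated and its psychological and social impact. It highlights the potential of advanced language models to strengthen automated detection systems while also noting their capacity to generate harmful content. The key takeaway is to use these technologies to improve abuse detection while staying alert to their negative effects, contributing new insights to the ongoing discussion of online safety and ethics.
Link: https://arxiv.org/abs/2501.05443
Authors: Jose A. Diaz-Garcia,Joao Paulo Carvalho
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 37 pages, under review in WIREs Data Mining and Knowledge Discovery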
Abstract:The success of social media platforms has facilitated the emergence of various forms of online abuse within digital communities. This abuse manifests in multiple ways, including hate speech, cyberbullying, emotional abuse, grooming, and sexting. In this paper, we present a comprehensive analysis of the different forms of abuse prevalent in social media, with a particular focus on how emerging technologies, such as Language Models (LMs) and Large Language Models (LLMs), are reshaping both the detection and generation of abusive content within these networks. We delve into the mechanisms through which social media abuse is perpetuated, exploring the psychological and social impact. Additionally, we examine the dual role of advanced language models-highlighting their potential to enhance automated detection systems for abusive behavior while also acknowledging their capacity to generate harmful content. This paper aims to contribute to the ongoing discourse on online safety and ethics, offering insights into the evolving landscape of cyberabuse and the technological innovations that both mitigate and exacerbate it.
[NLP-2] LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
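[Quick read]: This paper addresses the fact that existing benchmarks for long-context language models (LCLMs) focus mainly on long-context recall, where a model only has to produce a short answer from a few critical snippets while processing large amounts of irrelevant text. The authors propose a new benchmark, LongProc (Long Procedural Generation), which requires both integrating highly dispersed information and generating long-form output. LongProc contains six diverse procedural generation tasks, such as extracting structured information from HTML pages into TSV format and executing complex search procedures to create travel plans. These tasks test a model's ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured long-form outputs (up to 8K tokens). Because the tasks follow deterministic procedures and yield structured outputs, they also support reliable rule-based evaluation. The authors evaluate 17 LCLMs at three difficulty levels (maximum output lengths of 500, 2K, and 8K tokens). Although all tested models claim a context window above 32K tokens, open-weight models typically falter on the 2K-token tasks, and closed-source models such as GPT-4o degrade significantly on the 8K-token tasks. Further analysis shows that LCLMs struggle to maintain long-range coherence in long-form generation. These findings expose key limitations of current LCLMs and point to substantial room for improvement.
Link: https://arxiv.org/abs/2501.05414
Authors: Xi Ye,Fangcong Yin,Yinghui He,Joie Zhang,Howard Yen,Tianyu Gao,Greg Durrett,Danqi Chen
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: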
Abstract:Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluate 17 LCLMs on LongProc across three difficulty levels, with maximum numbers of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: this https URL
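To make "rule-based evaluation" of structured output concrete, here is a minimal sketch that scores a generated TSV against a reference by exact row match. The row-level F1 metric and the function names are assumptions chosen for illustration; the abstract does not say which metrics LongProc actually uses.

```python
def parse_tsv(text: str) -> set:
    """Parse TSV text into a set of row tuples, skipping blank lines."""
    rows = set()
    for line in text.splitlines():
        if line.strip():
            rows.add(tuple(cell.strip() for cell in line.split("\t")))
    return rows


def row_f1(predicted: str, gold: str) -> float:
    """Row-level F1: a predicted row counts only if it matches a gold row exactly."""
    pred_rows, gold_rows = parse_tsv(predicted), parse_tsv(gold)
    correct = len(pred_rows & gold_rows)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_rows)
    recall = correct / len(gold_rows)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: the model gets one of two rows right.
gold = "Alice\t2021\nBob\t2019\n"
predicted = "Alice\t2021\nBob\t2020\n"
score = row_f1(predicted, gold)
```

Because the check is deterministic, the same output always receives the same score, which is the property the abstract attributes to tasks with structured outputs.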
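[NLP-3] FairCode: Evaluating Social Bias of LLMs in Code Generation
[Quick read]: This paper addresses bias in code generated by large language models (LLMs). Although LLMs show significant capability in code generation and the quality and safety of their outputs are drawing increasing attention, research on bias in code generation remains limited. Existing studies typically assess bias by applying malicious prompts or by reusing tasks and datasets built for discriminative models, and these approaches are not well suited to code generation. The paper therefore proposes FairCode, a benchmark designed specifically to evaluate bias in code generation models. FairCode comprises two tasks, function implementation and test case generation, each evaluating social bias through diverse scenarios. The paper also proposes a new metric, FairScore, to measure model performance on the benchmark. Experiments show that all tested LLMs exhibit bias.
Link: https://arxiv.org/abs/2501.05396
Authors: Yongkang Du,Jen-tse Huang,Jieyu Zhao,Lu Lin
Institutions: Pennsylvania State University; University of Southern California
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: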
Abstract:Large language models (LLMs) have demonstrated significant capability in code generation, drawing increasing attention to the evaluation of the quality and safety of their outputs. However, research on bias in code generation remains limited. Existing studies typically assess bias by applying malicious prompts or reapply tasks and dataset for discriminative models. Given that LLMs are often aligned with human values and that prior datasets are not fully optimized for code-related tasks, there is a pressing need for benchmarks specifically designed for evaluating code models. In this study, we introduce FairCode, a novel benchmark for evaluating bias in code generation. FairCode comprises two tasks: function implementation and test case generation, each evaluating social bias through diverse scenarios. Additionally, we propose a new metric, FairScore, to assess model performance on this benchmark. We conduct experiments on widely used LLMs and provide a comprehensive analysis of the results. The findings reveal that all tested LLMs exhibit bias. The code is available at this https URL.
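[NLP-4] Search-o1: Agentic Search-Enhanced Large Reasoning Models
[Quick read]: This paper addresses the frequent uncertainties and potential errors that large reasoning models (LRMs) suffer in complex reasoning tasks because of insufficient knowledge. To overcome this limitation, it proposes the Search-o1 framework, which has two key components. First, an agentic retrieval-augmented generation (RAG) mechanism dynamically retrieves external knowledge during reasoning whenever the LRM encounters an uncertain knowledge point. Second, a separate Reason-in-Documents module analyzes the retrieved documents in depth before their content is injected into the reasoning chain, which reduces noise and keeps the reasoning flow coherent. In this way, Search-o1 improves the trustworthiness and applicability of LRMs in complex reasoning tasks.
Link: https://arxiv.org/abs/2501.05366
Authors: Xiaoxi Li,Guanting Dong,Jiajie Jin,Yuyao Zhang,Yujia Zhou,Yutao Zhu,Peitian Zhang,Zhicheng Dou
Institutions: Renmin University of China; Tsinghua University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: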
Abstract:Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce Search-o1, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks, paving the way for more reliable and versatile intelligent systems. The code is available at this https URL.
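The control flow described in the abstract (reason, search when uncertain, condense the retrieved documents, continue) can be sketched as a loop over stubbed components. Everything here is hypothetical scaffolding: `reason_step`, `search`, and `refine` stand in for the reasoning model, the retriever, and the Reason-in-Documents module, and the `("search" | "think" | "answer", text)` protocol is an assumption, not the interface of the released code.

```python
def reason_with_search(question, reason_step, search, refine, max_steps=5):
    """Run a reasoning loop that retrieves and condenses knowledge on demand.

    reason_step(question, chain) returns a (kind, text) pair, where kind is
    "search" (text is a query), "think" (text is a reasoning step), or
    "answer" (text is the final answer).
    """
    chain = []
    for _ in range(max_steps):
        kind, text = reason_step(question, chain)
        if kind == "answer":
            return text, chain
        if kind == "search":
            documents = search(text)
            # Inject only the condensed finding, not the verbose documents.
            chain.append(refine(text, documents))
        else:
            chain.append(text)
    return None, chain  # step budget exhausted without an answer


# Hypothetical stubs standing in for the model, retriever, and refiner.
def fake_step(question, chain):
    if not chain:
        return ("search", "boiling point of water at sea level")
    return ("answer", chain[-1])


answer, chain = reason_with_search(
    "At what temperature does water boil at sea level?",
    fake_step,
    search=lambda query: ["A long page ... water boils at 100 degrees Celsius ..."],
    refine=lambda query, docs: "100 degrees Celsius",
)
```

Keeping `refine` as a separate step mirrors the stated motivation: retrieved documents are verbose, so only a short finding enters the chain.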
[NLP-5] Stream Aligner: Efficient Sentence-Level Alignment via Distribution Induction AAAI
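[Quick read]: This paper addresses the alignment of large language models (LLMs) with human values and intentions during generation. Existing alignment strategies, such as adaptive training and inference-time methods, have shown potential, but they still struggle to balance deployment complexity and capability across tasks of varying difficulty. The paper proposes a new alignment paradigm, the Streaming Distribution Induce Aligner (Stream Aligner). Its key idea is to use a small model that learns preferences over the suffix sentence, iteratively corrects the suffix sentence produced by the upstream model, and then uses the corrected sentence in place of the original in subsequent generation. Compared with Aligner, this reduces reliance on the capabilities of additional models, improves the reasoning ability of LLMs, and lowers latency during user interaction. Experiments show improvements in helpfulness and harmlessness on Llama2-70B-chat and in math ability on Llama3-70B-Instruct.
Link: https://arxiv.org/abs/2501.05336
Authors: Hantao Lou,Jiaming Ji,Kaile Wang,Yaodong Yang
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI Alignment Track 2025 Poster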
Abstract:The rapid advancement of large language models (LLMs) has led to significant improvements in their capabilities, but also to increased concerns about their alignment with human values and intentions. Current alignment strategies, including adaptive training and inference-time methods, have demonstrated potential in this area. However, these approaches still struggle to balance deployment complexity and capability across various tasks and difficulties. In this work, we introduce the Streaming Distribution Induce Aligner (Stream Aligner), a novel alignment paradigm that combines efficiency with enhanced performance in various tasks throughout the generation process. Stream Aligner achieves dynamic sentence-level correction by using a small model to learn the preferences of the suffix sentence, iteratively correcting the suffix sentence output by the upstream model, and then using the corrected sentence to replace the suffix sentence in subsequent generations. Compared to Aligner, our experiments demonstrate that Stream Aligner reduces reliance on the capabilities of additional models, enhances the reasoning abilities of LLMs, and decreases latency during user interaction. Specifically, Stream Aligner-2B model has achieved an improvement of 76.1% in helpfulness, 36.0% in harmlessness on the tested Llama2-70B-chat model, and Stream Aligner-8B has achieved an improvement of 3.5% on the math ability of the tested Llama3-70B-Instruct model.
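The sentence-level correction loop in the abstract can be sketched with two stub callables. This is an assumed reading of the procedure, not the authors' implementation: `generate_sentence` stands in for the upstream model, `correct_sentence` for the small aligner, and both signatures are hypothetical.

```python
def stream_align(prompt, generate_sentence, correct_sentence, max_sentences=8):
    """Generate a response sentence by sentence, correcting each one in turn.

    The upstream model always continues from the corrected prefix, so an
    early fix influences every later sentence.
    """
    accepted = []
    for _ in range(max_sentences):
        draft = generate_sentence(prompt, accepted)
        if draft is None:  # upstream model signals the end of the response
            break
        accepted.append(correct_sentence(prompt, accepted, draft))
    return accepted


# Hypothetical stubs: the "upstream model" replays fixed drafts and the
# "aligner" applies a trivial rewrite to each one.
drafts = ["first draft sentence.", "second draft sentence."]


def upstream(prompt, prefix):
    return drafts[len(prefix)] if len(prefix) < len(drafts) else None


def aligner(prompt, prefix, draft):
    return draft.replace("draft", "corrected")


response = stream_align("Explain the method.", upstream, aligner)
```

Correcting one sentence at a time is what lets the user see output before the full response is finished, which is the latency benefit the abstract claims.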
[NLP-6] Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing ACL COLING2025
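[Quick read]: This paper addresses the design of robust plagiarism detection systems for low-resource languages such as Marathi, one of India's regional languages. As the volume of data in regional languages grows, existing plagiarism detection systems often fall short in these languages, particularly in capturing semantic and syntactic features. The key to the proposed solution is to combine BERT (Bidirectional Encoder Representations from Transformers) sentence embeddings with TF-IDF (Term Frequency-Inverse Document Frequency) feature representations in a weighted voting ensemble of machine learning models, capturing statistical, semantic, and syntactic features of the text and improving the accuracy of plagiarism detection for Marathi.
Link: https://arxiv.org/abs/2501.05260
Authors: Atharva Mutsaddi,Aditya Choudhary
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted into LoResLM: The First Workshop on Language Models for Low-Resource Languages, colocated with COLING 2025 and set to be published into ACL Anthology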
Abstract:Plagiarism involves using another person’s work or concepts without proper attribution, presenting them as original creations. With the growing amount of data communicated in regional languages such as Marathi – one of India’s regional languages – it is crucial to design robust plagiarism detection systems tailored for low-resource languages. Language models like Bidirectional Encoder Representations from Transformers (BERT) have demonstrated exceptional capability in text representation and feature extraction, making them essential tools for semantic analysis and plagiarism detection. However, the application of BERT for low-resource languages remains under-explored, particularly in the context of plagiarism detection. This paper presents a method to enhance the accuracy of plagiarism detection for Marathi texts using BERT sentence embeddings in conjunction with Term Frequency-Inverse Document Frequency (TF-IDF) feature representation. This approach effectively captures statistical, semantic, and syntactic aspects of text features through a weighted voting ensemble of machine learning models.
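A weighted voting ensemble can be illustrated with a few lines of arithmetic. The sketch below assumes soft voting, where each model outputs a plagiarism probability and the ensemble takes a weighted average; the abstract does not specify the voting scheme, the weights, or the threshold, so all three are assumptions.

```python
def weighted_vote(probabilities, weights):
    """Weighted average of per-model plagiarism probabilities."""
    if not weights or len(probabilities) != len(weights):
        raise ValueError("need one positive-sum weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total


def is_plagiarised(probabilities, weights, threshold=0.5):
    """Flag a text pair when the weighted score reaches the threshold."""
    return weighted_vote(probabilities, weights) >= threshold


# Hypothetical example: a BERT-embedding model says 0.9, a TF-IDF model
# says 0.2, and the BERT-based model is trusted three times as much.
score = weighted_vote([0.9, 0.2], [3, 1])  # (0.9*3 + 0.2*1) / 4 = 0.725
```

The weights let a stronger model dominate without discarding the signal from the weaker one.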
[NLP-7] CallNavi: A Study and Challenge on Function Calling Routing and Invocation in Large Language Models
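[Quick read]: This paper addresses the challenges of generating API calls in chatbot systems, especially in complex, multi-step tasks that require accurate API selection and execution. The contribution has three parts. First, it introduces a new dataset for assessing models on API function selection, parameter generation, and nested API calls. Second, it benchmarks state-of-the-art language models on tasks of varying complexity to evaluate API function generation and parameter accuracy. Third, it proposes an enhanced API routing method that combines general-purpose large language models for API selection with fine-tuned models for parameter generation, along with some prompt engineering. These approaches substantially improve the handling of complex API tasks and offer practical advances for real-world API-driven chatbot systems.
Link: https://arxiv.org/abs/2501.05255
Authors: Yewei Song,Cedric Lothritz,Xunzhu Tang,Saad Ezzini,Jacques Klein,Tegawendé F. Bissyandé,Andrey Boytsov,Ulrick Ble,Anne Goujon
Institutions: University of Luxembourg; Luxembourg Institute of Science and Technology; Lancaster University; BGL BNP PARIBAS
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: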
Abstract:Interacting with a software system via a chatbot can be challenging, especially when the chatbot needs to generate API calls, in the right order and with the right parameters, to communicate with the system. API calling in chatbot systems poses significant challenges, particularly in complex, multi-step tasks requiring accurate API selection and execution. We contribute to this domain in three ways: first, by introducing a novel dataset designed to assess models on API function selection, parameter generation, and nested API calls; second, by benchmarking state-of-the-art language models across varying levels of complexity to evaluate their performance in API function generation and parameter accuracy; and third, by proposing an enhanced API routing method that combines general-purpose large language models for API selection with fine-tuned models for parameter generation and some prompt engineering approach. These approaches lead to substantial improvements in handling complex API tasks, offering practical advancements for real-world API-driven chatbot systems.
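[NLP-8] Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs
[Quick read]: This paper addresses the generation of high-quality, same-language subtitles for Estonian TV content. The key to the solution is fine-tuning the Whisper model on human-generated Estonian subtitles and enhancing it with iterative pseudo-labeling and post-editing based on a large language model (LLM). Experiments show a notable improvement in subtitle quality from pseudo-labeling with an unlabeled dataset. Applying LLM-based editing at test time further improves subtitle accuracy, whereas using it during training yields no additional gain. The approach shows promise for producing subtitles close to human standard and could be extended to real-time applications.
Link: https://arxiv.org/abs/2501.05234
Authors: Artem Fedorchenko,Tanel Alumäe
Institutions: Tallinn University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: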
Abstract:This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We fine-tune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate notable subtitle quality improvement through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for creating subtitle quality close to human standard and could be extended to real-time applications.
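[NLP-9] Leveraging Large Language Models for Zero-shot Lay Summarisation in Biomedicine and Beyond
[Quick read]: This paper explores the application of Large Language Models (LLMs) to zero-shot Lay Summarisation. It proposes a two-stage framework for Lay Summarisation based on real-life processes and finds that human judges increasingly prefer summaries generated with this method as model size grows. The paper also assesses LLMs as judges and finds that they can replicate the preferences of human judges. Finally, it takes initial steps towards Lay Summarisation for Natural Language Processing (NLP) articles, finding that LLMs generalise to this new domain, and an in-depth human evaluation further shows the greater utility of summaries produced by the proposed approach. The key to the solution is the two-stage framework, which improves LLM lay summarisation in the zero-shot setting.
Link: https://arxiv.org/abs/2501.05224
Authors: Tomas Goldsack,Carolina Scarton,Chenghua Lin
Institutions: University of Sheffield; University of Manchester
Categories: Computation and Language (cs.CL)
Comments: Preprint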
Abstract:In this work, we explore the application of Large Language Models to zero-shot Lay Summarisation. We propose a novel two-stage framework for Lay Summarisation based on real-life processes, and find that summaries generated with this method are increasingly preferred by human judges for larger models. To help establish best practices for employing LLMs in zero-shot settings, we also assess the ability of LLMs as judges, finding that they are able to replicate the preferences of human judges. Finally, we take the initial steps towards Lay Summarisation for Natural Language Processing (NLP) articles, finding that LLMs are able to generalise to this new domain, and further highlighting the greater utility of summaries generated by our proposed approach via an in-depth human evaluation.
[NLP-10] ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction COLING2025
[Quick read]: This paper addresses the limitations of automated revision in scientific writing. Existing approaches focus mainly on sentence-level revision and ignore the broader context, which limits their effectiveness. The paper proposes extending the task from the sentence level to the paragraph level in order to capture richer context and enable more meaningful changes. The key to the solution is ParaRev, a dataset of revised scientific paragraphs with an evaluation subset manually annotated with detailed revision instructions. Experiments show that using detailed instructions significantly improves the quality of automated revision compared with general instructions, regardless of the model or evaluation metric.
Link: https://arxiv.org/abs/2501.05222
Authors: Léane Jourdan,Nicolas Hernandez,Richard Dufour,Florian Boudin,Akiko Aizawa
Institutions: Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France; JFLI, CNRS, Nantes University, France; National Institute of Informatics, Japan
Categories: Computation and Language (cs.CL)
Comments: Accepted at the WRAICogs 1 workshop (co-located with Coling 2025)
Abstract:Revision is a crucial step in scientific writing, where authors refine their work to improve clarity, structure, and academic quality. Existing approaches to automated writing assistance often focus on sentence-level revisions, which fail to capture the broader context needed for effective modification. In this paper, we explore the impact of shifting from sentence-level to paragraph-level scope for the task of scientific text revision. The paragraph level definition of the task allows for more meaningful changes, and is guided by detailed revision instructions rather than general ones. To support this task, we introduce ParaRev, the first dataset of revised scientific paragraphs with an evaluation subset manually annotated with revision instructions. Our experiments demonstrate that using detailed instructions significantly improves the quality of automated revisions compared to general approaches, no matter the model or the metric considered.
[NLP-11] A Novel Approach to Scalable and Automatic Topic-Controlled Question Generation in Education
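[Quick read]: This paper addresses Automatic Question Generation (QG) in education, in particular how to generate questions that are relevant to a given topic and semantically aligned with it, so as to reduce the workload teachers face when creating educational content. The proposed Topic-Controlled Question Generation (T-CQG) method fine-tunes a pre-trained T5-small model on datasets created specifically for educational needs, improving the topical relevance and effectiveness of the generated questions. Key points: 1) the paper studies how pre-training strategies, quantisation, and data augmentation affect model performance; 2) it tackles semantic alignment with paragraph-level contexts to improve the topic specificity of generated questions; 3) it introduces new evaluation methods for measuring the topical relatedness of generated questions. Results show that the method generates high-quality, topic-focused questions, with the potential to reduce teacher workload and support personalised tutoring systems. Because the model has relatively few parameters, it is also scalable and inexpensive to run.
Link: https://arxiv.org/abs/2501.05220
Authors: Ziqing Li,Mutlu Cukurova,Sahan Bulathwela
Institutions: Department of Computer Science, University College London; UCL Knowledge Lab, University College London; Centre for Artificial Intelligence, University College London
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: To be published at ACM Conf. on Learning Analytics and Knowledge (LAK’25)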
Abstract:The development of Automatic Question Generation (QG) models has the potential to significantly improve educational practices by reducing the teacher workload associated with creating educational content. This paper introduces a novel approach to educational question generation that controls the topical focus of questions. The proposed Topic-Controlled Question Generation (T-CQG) method enhances the relevance and effectiveness of the generated content for educational purposes. Our approach uses fine-tuning on a pre-trained T5-small model, employing specially created datasets tailored to educational needs. The research further explores the impacts of pre-training strategies, quantisation, and data augmentation on the model’s performance. We specifically address the challenge of generating semantically aligned questions with paragraph-level contexts, thereby improving the topic specificity of the generated questions. In addition, we introduce and explore novel evaluation methods to assess the topical relatedness of the generated questions. Our results, validated through rigorous offline and human-backed evaluations, demonstrate that the proposed models effectively generate high-quality, topic-focused questions. These models have the potential to reduce teacher workload and support personalised tutoring systems by serving as bespoke question generators. With its relatively small number of parameters, the proposals not only advance the capabilities of question generation models for handling specific educational topics but also offer a scalable solution that reduces infrastructure costs. This scalability makes them feasible for widespread use in education without reliance on proprietary large language models like ChatGPT.
[NLP-12] GLaM-Sign: Greek Language Multimodal Lip Reading with Integrated Sign Language Accessibility
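[Quick read]: This paper addresses the communication barriers that Deaf and Hard-of-Hearing (DHH) individuals face in Greek tourism, education, healthcare, and public services. The key to the solution is a resource named Greek Language Multimodal Lip Reading with Integrated Sign Language Accessibility (GLaM-Sign). It integrates high-resolution audio, video, textual transcriptions, and Greek Sign Language translations, supporting applications such as real-time sign language translation and enhanced subtitle synchronization. As a multimodal dataset, GLaM-Sign promotes more inclusive communication and lays the groundwork for future improvements in word-level precision and extension to additional languages, while aiming to set a benchmark for ethical AI and inclusive technologies.
Link: https://arxiv.org/abs/2501.05213
Authors: Dimitris Kouremenos,Klimis Ntalianis
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures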
Abstract:The Greek Language Multimodal Lip Reading with Integrated Sign Language Accessibility (GLaM-Sign) [1] is a groundbreaking resource in accessibility and multimodal AI, designed to support Deaf and Hard-of-Hearing (DHH) individuals. Developed from the FEELIT project [2], it integrates high-resolution audio, video, textual transcriptions, and Greek Sign Language translations for applications like real-time sign language translation and enhanced subtitle synchronization. While its primary focus is on promoting inclusivity in the Greek tourism sector, its adaptability extends to education, healthcare, and public services. Future advancements will enhance word-level precision and scalability to additional languages, supported by advanced AI methodologies and collaborations with diverse stakeholders. This dataset underscores the transformative potential of multimodal resources in bridging communication gaps, fostering innovation, and setting a benchmark for ethical AI and inclusive technologies.
[NLP-13] Bringing Order Amidst Chaos: On the Role of Artificial Intelligence in Secure Software Engineering
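[Quick read]: This thesis addresses the complexity and chaos that an ever-evolving technological landscape brings to Secure Software Engineering (SSE), and in particular how artificial intelligence (AI) can predict vulnerabilities and defects more accurately. Its central aim is to bring order to SSE by addressing the domain-specific differences that affect AI accuracy.
The key to the solution is a mix of empirical strategies: evaluating effort-aware metrics, analyzing Static Application Security Testing Tools (SASTTs), conducting method-level analysis, and using evidence-based techniques such as systematic dataset reviews to characterize vulnerability prediction datasets. With these methods the thesis identifies limitations of static analysis tools in finding vulnerabilities, gaps in SASTT coverage of vulnerability types, weak relationships among vulnerability severity scores, and improved defect prediction accuracy from just-in-time modeling. It concludes by stressing the importance of contextual knowledge in improving AI-driven vulnerability and defect prediction, and its analysis advances effective prediction models for researchers and practitioners.
Link: https://arxiv.org/abs/2501.05165
Authors: Matteo Esposito
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET)
Comments: PhD thesis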
Abstract:Context. Developing secure and reliable software remains a key challenge in software engineering (SE). The ever-evolving technological landscape offers both opportunities and threats, creating a dynamic space where chaos and order compete. Secure software engineering (SSE) must continuously address vulnerabilities that endanger software systems and carry broader socio-economic risks, such as compromising critical national infrastructure and causing significant financial losses. Researchers and practitioners have explored methodologies like Static Application Security Testing Tools (SASTTs) and artificial intelligence (AI) approaches, including machine learning (ML) and large language models (LLMs), to detect and mitigate these vulnerabilities. Each method has unique strengths and limitations. Aim. This thesis seeks to bring order to the chaos in SSE by addressing domain-specific differences that impact AI accuracy. Methodology. The research employs a mix of empirical strategies, such as evaluating effort-aware metrics, analyzing SASTTs, conducting method-level analysis, and leveraging evidence-based techniques like systematic dataset reviews. These approaches help characterize vulnerability prediction datasets. Results. Key findings include limitations in static analysis tools for identifying vulnerabilities, gaps in SASTT coverage of vulnerability types, weak relationships among vulnerability severity scores, improved defect prediction accuracy using just-in-time modeling, and threats posed by untouched methods. Conclusions. This thesis highlights the complexity of SSE and the importance of contextual knowledge in improving AI-driven vulnerability and defect prediction. The comprehensive analysis advances effective prediction models, benefiting both researchers and practitioners. 
[NLP-14] Biomedical Relation Extraction via Adaptive Document-Relation Cross-Mapping and Concept Unique Identifier
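[Quick read]: This paper addresses three main problems in document-level biomedical relation extraction (Bio-RE): the difficulty of cross-sentence inference, the incompleteness of documents, and the lack of external knowledge integration. The scarcity of annotated data further limits model training. The paper proposes a solution based on large language models (LLMs) with three key parts. First, the Iteration-of-REsummary (IoRs) prompt generates task-specific synthetic data to ease data scarcity. Second, Adaptive Document-Relation Cross-Mapping (ADRCM) fine-tuning establishes mappings across documents and relations, strengthening the model's contextual understanding and cross-sentence inference. Third, at inference time, a retrieval-augmented generation (RAG) approach based on Concept Unique Identifiers (CUIs) uses CUIs as indexes for entities, narrowing the retrieval scope and enriching the relevant document context. Experiments show state-of-the-art performance on the GDA, CDR, and BioRED datasets.
Link: https://arxiv.org/abs/2501.05155
Authors: Yufei Shang,Yanrong Guo,Shijie Hao,Richang Hong
Institutions: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures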
Abstract:Document-Level Biomedical Relation Extraction (Bio-RE) aims to identify relations between biomedical entities within extensive texts, serving as a crucial subfield of biomedical text mining. Existing Bio-RE methods struggle with cross-sentence inference, which is essential for capturing relations spanning multiple sentences. Moreover, previous methods often overlook the incompleteness of documents and lack the integration of external knowledge, limiting contextual richness. Besides, the scarcity of annotated data further hampers model training. Recent advancements in large language models (LLMs) have inspired us to explore all the above issues for document-level Bio-RE. Specifically, we propose a document-level Bio-RE framework via LLM Adaptive Document-Relation Cross-Mapping (ADRCM) Fine-Tuning and Concept Unique Identifier (CUI) Retrieval-Augmented Generation (RAG). First, we introduce the Iteration-of-REsummary (IoRs) prompt for solving the data scarcity issue. In this way, Bio-RE task-specific synthetic data can be generated by guiding ChatGPT to focus on entity relations and iteratively refining synthetic data. Next, we propose ADRCM fine-tuning, a novel fine-tuning recipe that establishes mappings across different documents and relations, enhancing the model’s contextual understanding and cross-sentence inference capabilities. Finally, during the inference, a biomedical-specific RAG approach, named CUI RAG, is designed to leverage CUIs as indexes for entities, narrowing the retrieval scope and enriching the relevant document contexts. Experiments conducted on three Bio-RE datasets (GDA, CDR, and BioRED) demonstrate the state-of-the-art performance of our proposed method by comparing it with other related works.
[NLP-15] Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
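[Quick read]: This paper addresses the limited ability of current Large Vision-Language Models (LVLMs) to understand non-English input and to generate output in the target language. Existing approaches mainly add multilingual training data but lack insight into how different training mixes affect different language groups. Through a series of multi-stage experiments, the paper systematically studies: (1) how many training languages can be included without degrading English performance; (2) the optimal language distribution of pre-training and instruction-tuning data; and (3) how to improve multilingual text-in-image understanding, for which it introduces a new benchmark. The study finds that including as many as 100 training languages at once, with only 25-50% non-English data, greatly improves multilingual performance while retaining strong English performance. Including non-English OCR data in pre-training and instruction-tuning is also essential for multilingual text-in-image understanding. Based on these findings, the authors train Centurio, which performs strongly in an evaluation covering 14 tasks and 56 languages.
Link: https://arxiv.org/abs/2501.05122
Authors: Gregor Geigle,Florian Schneider,Carolin Holtermann,Chris Biemann,Radu Timofte,Anne Lauscher,Goran Glavaš
Affiliations: WüNLP, University of Würzburg; Computer Vision Lab, CAIDAS, University of Würzburg; Language Technology Group, University of Hamburg; Data Science Group, University of Hamburg
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: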
Abstract:Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
[NLP-16] Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents
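[Quick read]: This paper addresses insufficient metadata in scientific documents, a problem that is especially pronounced for documents from small and mid-sized publishers and that hurts accessibility and adherence to the FAIR principles (Findability, Accessibility, Interoperability, and Reusability). The key to the solution is evaluating and comparing several feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance. By reporting comprehensive experimental results on the accuracy and efficiency of these methods, the paper aims to improve the accessibility of scientific documents, promote their wider use, and offer insights for future research.
Link: https://arxiv.org/abs/2501.05082
Authors: Zeyd Boukhers,Cong Yang
Affiliations: Fraunhofer Institute for Applied Information Technology FIT; University Hospital of Cologne; University of Koblenz; Soochow University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments: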
Abstract:The availability of metadata for scientific documents is pivotal in propelling scientific knowledge forward and for adhering to the FAIR principles (i.e. Findability, Accessibility, Interoperability, and Reusability) of research findings. However, the lack of sufficient metadata in published documents, particularly those from smaller and mid-sized publishers, hinders their accessibility. This issue is widespread in some disciplines, such as the German Social Sciences, where publications often employ diverse templates. To address this challenge, our study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance. We aim to improve the accessibility of scientific documents and facilitate their wider use. To support our comparison of these methods, we provide comprehensive experimental results, analyzing their accuracy and efficiency in extracting metadata. Additionally, we provide valuable insights into the strengths and weaknesses of various feature learning and prediction methods, which can guide future research in this field.
[NLP-17] SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
[Quick read]: This paper addresses the limited reproducibility, accessibility, and transparency that result from relying on proprietary models when using large language models (LLMs) to resolve user-reported GitHub issues. The authors propose SWE-Fixer, an open-source LLM designed to resolve GitHub issues effectively and efficiently. It consists of two core modules: a code file retrieval module and a code editing module. The retrieval module combines BM25 with a lightweight LLM for coarse-to-fine file retrieval; the editing module uses a second LLM to generate patches for the retrieved files. To offset the lack of public datasets, the authors also compile a dataset of 110K GitHub issues with their corresponding patches and train the two modules separately. On the SWE-Bench Lite and Verified benchmarks, SWE-Fixer reaches state-of-the-art performance among open-source models, scoring 23.3% and 30.2% respectively.
Link: https://arxiv.org/abs/2501.05040
Authors: Chengxing Xie,Bowen Li,Chang Gao,He Du,Wai Lam,Difan Zou,Kai Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Our code, data, and model will be released at this https URL
Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixing code based on the issues reported by the users. However, many current approaches rely on proprietary LLMs, which limits reproducibility, accessibility, and transparency. The critical components of LLMs for addressing software engineering issues and how their capabilities can be effectively enhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a novel open-source LLM designed to effectively and efficiently resolve GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval module and a code editing module. The retrieval module employs BM25 along with a lightweight LLM model to achieve coarse-to-fine file retrieval. Subsequently, the code editing module utilizes the other LLM model to generate patches for the identified files. Then, to mitigate the lack of publicly available datasets, we compile an extensive dataset that includes 110K GitHub issues along with their corresponding patches, and train the two modules of SWE-Fixer separately. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models with scores of 23.3% and 30.2%, respectively. These outcomes highlight the efficacy of our approach. We will make our model, dataset, and code publicly available at this https URL.
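The coarse stage of the retrieval module described above can be illustrated with a plain BM25 ranker. This is only a sketch of the standard BM25 scoring formula, not SWE-Fixer's implementation: the file names, file contents, whitespace tokenizer, and the `k1` and `b` defaults are all assumptions made for the example, and the fine stage (reranking with a lightweight LLM) is not shown.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank documents against a query with BM25.

    docs maps a (hypothetical) file name to its text.
    Returns (name, score) pairs sorted by descending score.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many files does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical repository contents and issue text.
files = {
    "parser.py": "def parse input tokens raise error on empty input parser",
    "utils.py": "helper functions for logging and timing",
    "cli.py": "command line entry point reads input file",
}
ranked = bm25_rank("parser crashes on empty input", files)
candidates = [name for name, score in ranked if score > 0]
```

In a coarse-to-fine setup, `candidates` would be the short list handed to the second-stage model; files with a zero score share no term with the issue text and are dropped.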
[NLP-18] Enhancing Human-Like Responses in Large Language Models
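[Quick read]: This paper explores how to make large language models (LLMs) more human-like, focusing on techniques that improve natural language understanding, conversational coherence, and emotional intelligence. The approaches evaluated include fine-tuning on diverse datasets, incorporating psychological principles, and designing models that better mimic human reasoning patterns. According to the authors, these enhancements improve user interactions and open new possibilities for AI applications across domains. Future work will address the ethical implications and potential biases that these human-like attributes may introduce.
Link: https://arxiv.org/abs/2501.05032
Authors: Ethem Yağız Çalık,Talha Rüzgar Akkuş
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: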
Abstract:This paper explores the advancements in making large language models (LLMs) more human-like. We focus on techniques that enhance natural language understanding, conversational coherence, and emotional intelligence in AI systems. The study evaluates various approaches, including fine-tuning with diverse datasets, incorporating psychological principles, and designing models that better mimic human reasoning patterns. Our findings demonstrate that these enhancements not only improve user interactions but also open new possibilities for AI applications across different domains. Future work will address the ethical implications and potential biases introduced by these human-like attributes.
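[NLP-19] TreeKV: Smooth Key-Value Cache Compression with Tree Structures
[Quick read]: This paper addresses key-value (KV) cache compression for transformer-based large language models (LLMs) on long sequences and in resource-limited settings. Existing methods usually evict tokens from the cache by position or by importance score, and both have limits: position-based strategies can miss crucial information outside predefined regions, while global importance scores tend to introduce regional bias, which limits the cache's overall context retention and can hurt LLM performance on complex tasks. Using wavelet analysis, the authors find that as tokens approach the end of the sequence, their contribution to generation gradually increases and they diverge more from neighboring tokens, indicating a smooth transition of growing complexity and variability from distant to nearby context. Based on this observation, the paper proposes TreeKV, a training-free cache compression method built on a tree structure. TreeKV keeps the cache size fixed, lets LLMs produce high-quality output in long-text scenarios, and applies to both the generation and prefilling stages. In experiments, TreeKV outperforms all baselines on PG19 and OpenWebText2 language modeling and achieves the best performance on the Longbench benchmark with only 6% of the cache budget.
Link: https://arxiv.org/abs/2501.04987
Authors: Ziwei He,Jian Yuan,Haoli Bai,Jingwen Leng,Bo Jiang
Affiliations: Shanghai Jiao Tong University; Huawei Noah's Ark Lab
Categories: Computation and Language (cs.CL)
Comments: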
Abstract:Efficient key-value (KV) cache compression is critical for scaling transformer-based Large Language Models (LLMs) in long sequences and resource-limited settings. Existing methods evict tokens based on their positions or importance scores, but position-based strategies can miss crucial information outside predefined regions, while those relying on global importance scores result in strong regional biases, limiting the KV cache's overall context retention and potentially impairing the performance of LLMs on complex tasks. Our wavelet analysis reveals that as tokens approach the end of the sequence, their contributions to generation gradually increase and tend to diverge more from neighboring tokens, indicating a smooth transition with increasing complexity and variability from distant to nearby context. Motivated by this observation, we propose TreeKV, an intuitive, training-free method that employs a tree structure for smooth cache compression. TreeKV maintains a fixed cache size, allowing LLMs to deliver high-quality output even in long text scenarios. Unlike most compression methods, TreeKV is applicable to both the generation and prefilling stages. It consistently surpasses all baseline models in language modeling tasks on PG19 and OpenWebText2, allowing LLMs trained with a short context window to generalize to a longer window with a 16x cache reduction. On the Longbench benchmark, TreeKV achieves the best performance with only 6% of the budget at optimal efficiency.
[NLP-20] SensorQA: A Question Answering Benchmark for Daily-Life Monitoring
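[Quick read]: This paper asks how to extract useful information from long-term time-series sensor data and present it in a human-understandable way. Existing research focuses mainly on learning classification models and pays less attention to how end users can actively extract useful insights from sensor data, often because no suitable dataset exists. To close this gap, the paper introduces SensorQA, the first human-created question-answering (QA) dataset for long-term time-series sensor data in daily-life monitoring. The dataset contains 5.6K diverse and practical queries that reflect genuine human interests, paired with accurate answers derived from sensor data. The paper also benchmarks state-of-the-art AI models on the dataset and evaluates their performance on typical edge devices. The results reveal a gap between current models and optimal QA performance and efficiency, highlighting the need for new contributions.
Link: https://arxiv.org/abs/2501.04974
Authors: Benjamin Reichman,Xiaofan Yu,Lanxiang Hu,Jack Truxal,Atishay Jain,Rushil Chandrupatla,Tajana Šimunić Rosing,Larry Heck
Affiliations: Georgia Institute of Technology; University of California San Diego
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: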
Abstract:With the rapid growth in sensor data, effectively interpreting and interfacing with these data in a human-understandable way has become crucial. While existing research primarily focuses on learning classification models, fewer studies have explored how end users can actively extract useful insights from sensor data, often hindered by the lack of a proper dataset. To address this gap, we introduce SensorQA, the first human-created question-answering (QA) dataset for long-term time-series sensor data for daily life monitoring. SensorQA is created by human workers and includes 5.6K diverse and practical queries that reflect genuine human interests, paired with accurate answers derived from sensor data. We further establish benchmarks for state-of-the-art AI models on this dataset and evaluate their performance on typical edge devices. Our results reveal a gap between current models and optimal QA performance and efficiency, highlighting the need for new contributions. The dataset and code are available at: this https URL.
[NLP-21] VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models
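[Quick read]: This paper addresses how to evaluate the ability of end-to-end Spoken Language Models (SLMs) to understand and apply broad world knowledge. As demand for speech-based interaction models grows, existing AudioQA benchmarks rely mainly on text-format questions and answers and cannot fully assess a model in purely spoken interaction. The paper therefore proposes VoxEval, a new speech question-answering benchmark designed to assess SLMs' knowledge understanding through purely speech-based interaction. VoxEval keeps both questions and answers in speech format, evaluates robustness across audio conditions (timbre, audio quality, and speaking style), and is the first to evaluate challenging domains such as mathematical problem solving in spoken form. A comprehensive evaluation of recent SLMs with VoxEval reveals significant performance limitations in current models and points to key directions for improvement.
Link: https://arxiv.org/abs/2501.04962
Authors: Wenqian Cui,Xiaoqi Jiao,Ziqiao Meng,Irwin King
Affiliations: The Chinese University of Hong Kong; LightSpeed Studios, Tencent; National University of Singapore
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: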
Abstract:With the growing demand for developing speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. When engaging in conversations with humans, it is essential for these models to comprehend a wide range of world knowledge. In this paper, we introduce VoxEval, a novel speech question-answering benchmark specifically designed to assess SLMs’ knowledge understanding through purely speech-based interactions. Unlike existing AudioQA benchmarks, VoxEval maintains speech format for both questions and answers, evaluates model robustness across diverse audio conditions (varying timbres, audio qualities, and speaking styles), and pioneers the assessment of challenging domains like mathematical problem-solving in spoken format. Our comprehensive evaluation of recent SLMs using VoxEval reveals significant performance limitations in current models, highlighting crucial areas for future improvements.
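[NLP-22] Demystifying Domain-adaptive Post-training for Financial LLMs
[Quick read]: This paper addresses the challenges of domain-adaptive post-training of large language models (LLMs) for finance, in particular identifying the best adaptation criteria and training strategies. The authors propose FINDAP, a systematic and fine-grained study that identifies the core capabilities the target domain requires and designs a comprehensive evaluation suite to match. Key elements include analyzing the effectiveness of the main post-training stages (continual pretraining, instruction tuning, and preference alignment) and proposing a new preference data distillation method based on process signals from a generative reward model. With this training recipe, the authors develop Llama-Fin, which achieves state-of-the-art performance across a wide range of financial tasks. The study also shows how each post-training stage contributes to specific capabilities and identifies concrete challenges and effective solutions in domain adaptation.
Link: https://arxiv.org/abs/2501.04961
Authors: Zixuan Ke,Yifei Ming,Xuan-Phi Nguyen,Caiming Xiong,Shafiq Joty
Affiliations: Salesforce AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: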
Abstract:Domain-adaptive post-training of large language models (LLMs) has emerged as a promising approach for specialized domains such as medicine and finance. However, significant challenges remain in identifying optimal adaptation criteria and training strategies across varying data and model configurations. To address these challenges, we introduce FINDAP, a systematic and fine-grained investigation into domain-adaptive post-training of LLMs for the finance domain. Our approach begins by identifying the core capabilities required for the target domain and designing a comprehensive evaluation suite aligned with these needs. We then analyze the effectiveness of key post-training stages, including continual pretraining, instruction tuning, and preference alignment. Building on these insights, we propose an effective training recipe centered on a novel preference data distillation method, which leverages process signals from a generative reward model. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks. Our analysis also highlights how each post-training stage contributes to distinct capabilities, uncovering specific challenges and effective solutions, providing valuable insights for domain adaptation of LLMs. Project page: this https URL
[NLP-23] Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language Models
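[Quick read]: This paper addresses the difficulty large language models (LLMs) have with instructions that involve multiple soft constraints. Soft constraints are semantically related and hard to verify automatically, which makes them a significant bottleneck for LLMs. The paper proposes a pipeline that automatically obtains high-quality outputs and introduces a training paradigm based on curriculum learning to make full use of the resulting data. Experiments show that the method improves LLMs' ability to follow soft constraints, and the paper analyzes the factors behind the improvement.
Link: https://arxiv.org/abs/2501.04945
Authors: Qingyu Ren,Jie Zeng,Qianyu He,Jiaqing Liang,Yanghua Xiao,Weikang Zhou,Zeye Sun,Fei Yu
Affiliations: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; School of Data Science, Fudan University; Ant Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: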
Abstract:It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. However, soft constraints are semantically related and difficult to verify through automated methods. These constraints remain a significant challenge for LLMs. To enhance the ability of LLMs to follow soft constraints, we initially design a pipeline to obtain high-quality outputs automatically. Additionally, to fully utilize the acquired data, we introduce a training paradigm based on curriculum learning. We experimentally evaluate the effectiveness of our methods in improving LLMs’ soft constraint following ability and analyze the factors driving the improvements. The datasets and code are publicly available at this https URL.
[NLP-24] Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
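[Quick read]: This paper studies vulnerabilities in the safety mechanisms of Multimodal Large Language Models (MLLMs), in particular the low attack success rate of existing jailbreak methods against commercial closed-source MLLMs. The authors find a "Shuffle Inconsistency" between MLLMs' comprehension ability and safety ability: the models understand shuffled harmful instructions well, yet their safety mechanisms are easily bypassed by those same shuffled instructions, which leads to harmful responses. Based on this finding, the paper proposes a text-image jailbreak attack called SI-Attack. Its key idea is to exploit the Shuffle Inconsistency and use query-based black-box optimization to select the most harmful shuffled inputs, which markedly raises the attack success rate on commercial MLLMs such as GPT-4o and Claude-3.5-Sonnet. Experiments show that SI-Attack improves attack performance on multiple benchmarks.
Link: https://arxiv.org/abs/2501.04931
Authors: Shiji Zhao,Ranjie Duan,Fengxiang Wang,Chi Chen,Caixin Kang,Jialing Tao,YueFeng Chen,Hui Xue,Xingxing Wei
Affiliations: Institute of Artificial Intelligence, Beihang University; Alibaba Group
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: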
Abstract:Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs’ potential risks. Existing MLLMs’ jailbreak methods often bypass the model’s safety mechanism through complex optimization methods or carefully designed image and text prompts. Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs’ comprehension ability and safety ability for the shuffled harmful instruction. That is, from the perspective of comprehension ability, MLLMs can understand the shuffled harmful text-image instructions well. However, they can be easily bypassed by the shuffled harmful instructions from the perspective of safety ability, leading to harmful responses. Then we innovatively propose a text-image jailbreak attack named SI-Attack. Specifically, to fully utilize the Shuffle Inconsistency and overcome the shuffle randomness, we apply a query-based black-box optimization method to select the most harmful shuffled inputs based on the feedback of the toxic judge model. A series of experiments show that SI-Attack can improve the attack’s performance on three benchmarks. In particular, SI-Attack can obviously improve the attack success rate for commercial MLLMs such as GPT-4o or Claude-3.5-Sonnet.
[NLP-25] Investigating Numerical Translation with Large Language Models ICASSP2025
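[Quick read]: This paper examines how reliably large language models (LLMs) handle numerical data in machine translation. Inaccurate number translation can cause serious security issues such as financial losses or medical errors. Although LLMs have advanced machine translation considerably, their ability to translate numbers has not been thoroughly studied. The authors build a Chinese-English numerical translation dataset from real business data and systematically evaluate current open-source LLMs on ten types of numerical translation. The experiments show that numerical translation errors are common and that most open-source LLMs perform poorly on the test scenarios. For number types that involve large units such as "million", "billion", and "yi" (亿), even the recent llama3.1 8b model has an error rate as high as 20%. The paper closes with three potential strategies for mitigating mistranslation of large-unit numbers.
Link: https://arxiv.org/abs/2501.04927
Authors: Wei Tang,Jiawei Yu,Yuang Li,Yanqing Zhao,Weidong Zhang,Wei Feng,Min Zhang,Hao Yang
Affiliations: Huawei Translation Services Center, China
Categories: Computation and Language (cs.CL)
Comments: Accepted by ICASSP 2025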
Abstract:The inaccurate translation of numbers can lead to significant security issues, ranging from financial setbacks to medical inaccuracies. While large language models (LLMs) have made significant advancements in machine translation, their capacity for translating numbers has not been thoroughly explored. This study focuses on evaluating the reliability of LLM-based machine translation systems when handling numerical data. In order to systematically test the numerical translation capabilities of current open-source LLMs, we have constructed a numerical translation dataset between Chinese and English based on real business data, encompassing ten types of numerical translation. Experiments on the dataset indicate that errors in numerical translation are a common issue, with most open-source LLMs faltering when faced with our test scenarios. Especially when it comes to numerical types involving large units like "million", "billion", and "yi", even the latest llama3.1 8b model can have error rates as high as 20%. Finally, we introduce three potential strategies to mitigate the numerical mistranslations for large units.
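Large units are error-prone because Chinese groups digits by 10^4 (万) and 10^8 (亿, "yi"), while English groups by 10^3, so 3.5亿 must become 350 million rather than 3.5 billion. The sketch below shows one way such a mismatch could be detected automatically. It is an illustration only, not the paper's evaluation method: the regular expressions, the unit tables, and the example sentences are assumptions, and compound units such as 万亿 are not handled.

```python
import math
import re

# Multipliers for the large units discussed above.
ZH_UNITS = {"万": 10**4, "亿": 10**8}
EN_UNITS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def zh_values(text):
    """Extract '<number><unit>' expressions from Chinese text as plain values."""
    pairs = re.findall(r"(\d+(?:\.\d+)?)\s*(万|亿)", text)
    return [float(num) * ZH_UNITS[unit] for num, unit in pairs]


def en_values(text):
    """Extract '<number> <unit>' expressions from English text as plain values."""
    pairs = re.findall(
        r"(\d+(?:\.\d+)?)\s*(thousand|million|billion)", text, flags=re.IGNORECASE
    )
    return [float(num) * EN_UNITS[unit.lower()] for num, unit in pairs]


def numbers_match(source, translation, rel_tol=1e-9):
    """True if both sides contain the same multiset of large-unit values."""
    src, hyp = sorted(zh_values(source)), sorted(en_values(translation))
    return len(src) == len(hyp) and all(
        math.isclose(a, b, rel_tol=rel_tol) for a, b in zip(src, hyp)
    )


# Hypothetical source sentence with a correct and an incorrect translation.
source = "营收为3.5亿元"
good = numbers_match(source, "Revenue was 350 million yuan")
bad = numbers_match(source, "Revenue was 3.5 billion yuan")
```

Here `good` compares 3.5 × 10^8 against 350 × 10^6, which are equal, while `bad` compares it against 3.5 × 10^9, which is off by a factor of ten.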
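[NLP-26] JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis ICASSP2025
[Quick read]: This paper addresses how to generate more natural speech in Conversational Speech Synthesis (CSS), in particular how to produce emotionally appropriate speech by taking the conversational context into account. The main challenges are the scarcity of emotional conversational speech datasets and the difficulty of combining emotion recognition with context reasoning in the speech generation process.
The key to the solution is JELLY, a new CSS framework that fine-tunes a large language model (LLM) with multiple partial LoRA modules. JELLY introduces an Emotion-aware Q-former encoder that lets the LLM perceive emotions in speech; the encoder is trained on emotional speech datasets to align speech emotions with text. The whole model is then fine-tuned on conversational speech data to infer emotional context and generate emotionally appropriate conversational speech. Experiments show that JELLY excels at emotional context modeling and synthesizes speech that aligns naturally with the conversation, while mitigating the scarcity of emotional conversational speech datasets.
Link: https://arxiv.org/abs/2501.04904
Authors: Jun-Hyeok Cha,Seung-Bin Kim,Hyung-Seok Oh,Seong-Whan Lee
Affiliations: Korea University
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by ICASSP 2025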
Abstract:Recently, there has been a growing demand for conversational speech synthesis (CSS) that generates more natural speech by considering the conversational context. To address this, we introduce JELLY, a novel CSS framework that integrates emotion recognition and context reasoning for generating appropriate speech in conversation by fine-tuning a large language model (LLM) with multiple partial LoRA modules. We propose an Emotion-aware Q-former encoder, which enables the LLM to perceive emotions in speech. The encoder is trained to align speech emotions with text, utilizing datasets of emotional speech. The entire model is then fine-tuned with conversational speech data to infer emotional context for generating emotionally appropriate speech in conversation. Our experimental results demonstrate that JELLY excels in emotional context modeling, synthesizing speech that naturally aligns with conversation, while mitigating the scarcity of emotional conversational speech datasets.
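[NLP-27] SUGAR: Leveraging Contextual Confidence for Smarter Retrieval ICASSP2025
[Quick read]: This paper addresses hallucinations that large language models (LLMs) produce because of their limited parametric knowledge. Retrieval-augmented generation (RAG) mitigates the problem to some extent by supplying external knowledge, but retrieving supporting context uniformly makes generation inefficient: the retriever does not always need to be triggered, and noisy retrieved content can distract the model into producing unhelpful answers. The paper proposes Semantic Uncertainty Guided Adaptive Retrieval (SUGAR), which uses context-based entropy to decide whether to retrieve at all and, if so, whether to use single-step or multi-step retrieval. Experiments show that selective retrieval guided by semantic uncertainty estimation improves performance on a range of question answering tasks and makes inference more efficient.
Link: https://arxiv.org/abs/2501.04899
Authors: Hanna Zubkova,Ji-Hoon Park,Seong-Whan Lee
Affiliations: Korea University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICASSP2025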
Abstract:Bearing in mind the limited parametric knowledge of Large Language Models (LLMs), retrieval-augmented generation (RAG) which supplies them with the relevant external knowledge has served as an approach to mitigate the issue of hallucinations to a certain extent. However, uniformly retrieving supporting context makes response generation source-inefficient, as triggering the retriever is not always necessary, or even inaccurate, when a model gets distracted by noisy retrieved content and produces an unhelpful answer. Motivated by these issues, we introduce Semantic Uncertainty Guided Adaptive Retrieval (SUGAR), where we leverage context-based entropy to actively decide whether to retrieve and to further determine between single-step and multi-step retrieval. Our empirical results show that selective retrieval guided by semantic uncertainty estimation improves the performance across diverse question answering tasks, as well as achieves a more efficient inference.
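The decision rule described in the abstract, where uncertainty chooses between no retrieval, single-step retrieval, and multi-step retrieval, can be sketched roughly as follows. This is a simplified stand-in rather than SUGAR itself: it approximates uncertainty with Shannon entropy over sampled answers grouped by normalized string, whereas the paper uses a context-based semantic entropy, and the thresholds `low` and `high` are hypothetical values chosen for the example.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in nats) over answers grouped by normalized string.

    Assumption: exact-match grouping stands in for grouping by meaning.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(sample.strip().lower() for sample in samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def retrieval_mode(samples, low=0.5, high=1.0):
    """Map uncertainty to a retrieval strategy using two hypothetical thresholds."""
    entropy = answer_entropy(samples)
    if entropy < low:
        return "none"  # the model is consistent, answer from parametric knowledge
    if entropy < high:
        return "single-step"  # moderately uncertain, retrieve once
    return "multi-step"  # highly uncertain, retrieve iteratively


# Hypothetical sets of answers sampled from a model for three questions.
confident = retrieval_mode(["Paris", "paris", "Paris "])
unsure = retrieval_mode(["1912", "1912", "1913", "1913"])
lost = retrieval_mode(["red", "blue", "green", "yellow"])
```

With these inputs the entropies are 0, ln 2, and ln 4, which fall below, between, and above the two thresholds respectively.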
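[NLP-28] Leveraging Log Probabilities in Language Models to Forecast Future Events
[Quick read]: This paper addresses accurate prediction of future events in data-driven decision making, in particular how to use Large Language Models (LLMs) for forecasting in strategic planning. It proposes an LLM-based method for AI-driven foresight that uses data on current trends and their trajectories to generate forecasts on 15 different topics, then estimates their probabilities with a multi-step approach based on log probabilities. Its novelty lies in using LLMs' capacity to process large amounts of text data. The method achieves a Brier score of 0.186, a 26% improvement over random chance and a 19% improvement over widely available AI systems.
Link: https://arxiv.org/abs/2501.04880
Authors: Tommaso Soru,Jim Marshall
Affiliations: Serendipity AI Ltd.
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures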
Abstract:In the constantly changing field of data-driven decision making, accurately predicting future events is crucial for strategic planning in various sectors. The emergence of Large Language Models (LLMs) marks a significant advancement in this area, offering advanced tools that utilise extensive text data for prediction. In this industry paper, we introduce a novel method for AI-driven foresight using LLMs. Building on top of previous research, we employ data on current trends and their trajectories for generating forecasts on 15 different topics. Subsequently, we estimate their probabilities via a multi-step approach based on log probabilities. We show we achieve a Brier score of 0.186, meaning a +26% improvement over random chance and a +19% improvement over widely-available AI systems.
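To make the reported numbers concrete, the Brier score is the mean squared difference between forecast probabilities and binary outcomes (lower is better). The abstract does not define its "random chance" baseline; the sketch below assumes it is a constant forecast of 0.5, which always scores 0.25, and the outcome list is made up for the example. Under that assumption the reported 0.186 corresponds to a relative improvement of about 26%, in line with the abstract.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def relative_improvement(score, baseline):
    """Fractional reduction of the Brier score relative to a baseline."""
    return (baseline - score) / baseline


# Assumed baseline: always forecasting 0.5 gives 0.25 for any outcomes.
chance = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

# Brier score reported in the abstract, compared with the assumed baseline.
gain = relative_improvement(0.186, chance)
```

Because every term of a constant 0.5 forecast is (0.5)^2, `chance` does not depend on which events actually occurred.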
[NLP-29] Real-Time Textless Dialogue Generation
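[Quick read]: This paper addresses the lack of naturalness in current spoken dialogue systems, especially the robotic interactions caused by the traditional cascaded design and the reliance on text as an intermediate representation. Symptoms include slow response times, overly generic or cautious replies, and a lack of natural rhythm and fluid turn-taking. The paper proposes a real-time, textless spoken dialogue generation model (RTTL-DG) that processes streaming spoken conversation directly, which reduces latency and enables fluid turn-taking. The model also incorporates backchannels, fillers, laughter, and other paralinguistic signals to make interactions more natural and human-like.
Link: https://arxiv.org/abs/2501.04877
Authors: Long Mai,Julie Carson-Berndsen
Affiliations: ML-Labs, School of Computer Science, University College Dublin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: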
Abstract:Recent advancements in large language models (LLMs) have led to significant progress in text-based dialogue systems. These systems can now generate high-quality responses that are accurate and coherent across a wide range of topics and tasks. However, spoken dialogue systems still lag behind in terms of naturalness. They tend to produce robotic interactions, with issues such as slow response times, overly generic or cautious replies, and a lack of natural rhythm and fluid turn-taking. This shortcoming is largely due to the over-reliance on the traditional cascaded design, which involves separate, sequential components, as well as the use of text as an intermediate representation. This paper proposes a real-time, textless spoken dialogue generation model (RTTL-DG) that aims to overcome these challenges. Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly. Additionally, our model incorporates backchannels, fillers, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems, to create more natural and human-like interactions. The implementations and generated samples are available in our repository: this https URL
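[NLP-30] Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization
[Quick read]: This paper addresses the specific obstacles to building Retrieval-Augmented Generation (RAG) systems for low-resource languages such as Persian, in particular the challenges posed by Persian's complex morphology and flexible syntax. The key to the solution is a pair of Persian-specific models, MatinaRoberta (a masked language model) and MatinaSRoberta (a fine-tuned Sentence-BERT), together with a comprehensive benchmarking framework. The models were trained on a varied corpus of 73.11 billion Persian tokens, with pretraining, fine-tuning with tailored loss functions, and systematic evaluation using both traditional metrics and the Retrieval-Augmented Generation Assessment framework. The results show that MatinaSRoberta outperforms previous embedding models in contextual relevance and retrieval accuracy; temperature adjustment, chunk size changes, and document summary indexing were explored to further tune RAG setups. Larger models such as Llama-3.1 (70B) gave the highest generation accuracy, while smaller models struggled with domain-specific and formal contexts. The findings underline the potential of customized embeddings and retrieval-generation settings for Persian RAG systems and for NLP applications such as search engines and legal document analysis in low-resource languages.
Link: https://arxiv.org/abs/2501.04858
Authors: Sara Bourbour Hosseinbeigi,Sina Asghari,Mohammad Ali Seif Kashani,Mohammad Hossein Shalchian,Mohammad Amin Abbasi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: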
Abstract:This paper examines the specific obstacles of constructing Retrieval-Augmented Generation (RAG) systems in low-resource languages, with a focus on Persian's complicated morphology and versatile syntax. The research aims to improve retrieval and generation accuracy by introducing Persian-specific models, namely MatinaRoberta (a masked language model) and MatinaSRoberta (a fine-tuned Sentence-BERT), along with a comprehensive benchmarking framework. Three datasets (general knowledge (PQuad), scientifically specialized texts, and organizational reports) were used to assess these models after they were trained on a varied corpus of 73.11 billion Persian tokens. The methodology involved extensive pretraining, fine-tuning with tailored loss functions, and systematic evaluations using both traditional metrics and the Retrieval-Augmented Generation Assessment framework. The results show that MatinaSRoberta outperformed previous embeddings, achieving superior contextual relevance and retrieval accuracy across datasets. Temperature tweaking, chunk size modifications, and document summary indexing were explored to enhance RAG setups. Larger models like Llama-3.1 (70B) consistently demonstrated the highest generation accuracy, while smaller models faced challenges with domain-specific and formal contexts. The findings underscore the potential for developing RAG systems in Persian through customized embeddings and retrieval-generation settings and highlight the enhancement of NLP applications such as search engines and legal document analysis in low-resource languages.
[NLP-31] Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
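[Quick read]: This paper addresses the lack of natural language processing (NLP) research on historical Turkish. Specifically, it presents HisTR, the first named entity recognition (NER) dataset for historical Turkish, and OTA-BOUN, the first Universal Dependencies treebank, and trains Transformer models on these datasets for named entity recognition, dependency parsing, and part-of-speech tagging. It also introduces the Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts spanning multiple historical periods. With these resources and models, the paper substantially improves the computational analysis of historical Turkish and reports promising results on tasks that require understanding historical linguistic structures. The key to the solution is the provision of high-quality datasets and trained models that lay a foundation for future NLP research on historical Turkish.
Link: https://arxiv.org/abs/2501.04828
Authors: Şaziye Betül Özateş, Tarık Emre Tıraş, Ece Elif Adak, Berat Doğan, Fatih Burak Karagöz, Efe Eren Genç, Esma F. Bilgin Taşdemir
Affiliations: Boğaziçi University; Medeniyet University
Categories: Computation and Language (cs.CL)
Comments: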
Abstract:This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR, and the first Universal Dependencies treebank, OTA-BOUN, for a historical form of the Turkish language, along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Additionally, we introduce Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results show significant improvements in the computational analysis of historical Turkish, achieving promising results in tasks that require understanding of historical linguistic structures. They also highlight existing challenges, such as domain adaptation and language variations across time periods. All of the presented resources and models are made available at this https URL to serve as a benchmark for future progress in historical Turkish NLP.
[NLP-32] Unifying the Extremes: Developing a Unified Model for Detecting and Predicting Extremist Traits and Radicalization
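[Quick read]: This paper tackles the problem of accurately identifying and quantifying extremism in online community forums on social media platforms. Prior work has mostly studied radicalization within specific ideological contexts, whereas this paper proposes a more general method for extracting and analyzing extremist discourse across ideologies. The key to the solution is a framework based on verbal behavioral signatures that quantifies extremism at both the user and the community level. The framework identifies 11 key factors, termed "The Extremist Eleven", as a generalized psychosocial model that can be applied to online communities of different ideologies. By analyzing the user histories of incel community members, the framework predicts which users will join the community up to 10 months before their actual entry with an AUC of 0.6, rising steadily to about 0.9 three to four months before the event. The method moves beyond the dependence of traditional models on a specific ideology and offers a more holistic, cross-ideological approach to studying extremism.
Link: https://arxiv.org/abs/2501.04820
Authors: Allison Lahnala, Vasudha Varadarajan, Lucie Flek, H. Andrew Schwartz, Ryan L. Boyd
Affiliations: Bonn-Aachen International Center for Information Technology (b-it), University of Bonn; Department of Computer Science, Stony Brook University; Department of Psychology, University of Texas at Dallas
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 17 pages, 7 figures, 4 tables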
Abstract:The proliferation of ideological movements into extremist factions via social media has become a global concern. While radicalization has been studied extensively within the context of specific ideologies, our ability to accurately characterize extremism in more generalizable terms remains underdeveloped. In this paper, we propose a novel method for extracting and analyzing extremist discourse across a range of online community forums. By focusing on verbal behavioral signatures of extremist traits, we develop a framework for quantifying extremism at both user and community levels. Our research identifies 11 distinct factors, which we term “The Extremist Eleven,” as a generalized psychosocial model of extremism. Applying our method to various online communities, we demonstrate an ability to characterize ideologically diverse communities across the 11 extremist traits. We demonstrate the power of this method by analyzing user histories from members of the incel community. We find that our framework accurately predicts which users join the incel community up to 10 months before their actual entry with an AUC of 0.6, steadily increasing to AUC ~0.9 three to four months before the event. Further, we find that upon entry into an extremist forum, the users tend to maintain their level of extremism within the community, while still remaining distinguishable from the general online discourse. Our findings contribute to the study of extremism by introducing a more holistic, cross-ideological approach that transcends traditional, trait-specific models.
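For readers unfamiliar with the AUC figures quoted above, the metric can be read as the probability that a randomly chosen positive example (here, a user who later joins) receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores are made up and have nothing to do with the paper's data.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions: 1 = user later joins, 0 = user does not.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance-level ranking, so 0.6 is a weak but non-trivial signal and 0.9 a strong one.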
[NLP-33] Reproducing HotFlip for Corpus Poisoning Attacks in Dense Retrieval ECIR2025
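[Quick read]: This paper addresses two main problems of the HotFlip method when it is used to attack retrieval systems: low computational efficiency and the assumption of access to user queries. First, when generating adversarial text, HotFlip accumulates gradients for every query-passage pair, which makes it too slow to produce an adequate number of adversarial passages in a reasonable amount of time. Second, the method assumes that the attacker has access to a set of user queries, which does not match realistic attack scenarios. The paper substantially improves HotFlip's efficiency, cutting the adversarial generation time from 4 hours per document to 15 minutes on the same hardware. It further studies two additional attack tasks, transfer-based black-box attacks and query-agnostic attacks, and compares the improved version with the original method wherever possible. The experiments show that HotFlip can effectively attack a variety of dense retrieval models, but that its attack performance diminishes against more advanced retrieval methods.
Link: https://arxiv.org/abs/2501.04802
Authors: Yongkang Li, Panagiotis Eustratiadis, Evangelos Kanoulas
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: This paper has been accepted for oral presentation in the reproducibility track at ECIR 2025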
Abstract:HotFlip is a topical gradient-based word substitution method for attacking language models. Recently, this method has been further applied to attack retrieval systems by generating malicious passages that are injected into a corpus, i.e., corpus poisoning. However, HotFlip is known to be computationally inefficient, with the majority of time being spent on gradient accumulation for each query-passage pair during the adversarial token generation phase, making it impossible to generate an adequate number of adversarial passages in a reasonable amount of time. Moreover, the attack method itself assumes access to a set of user queries, a strong assumption that does not correspond to how real-world adversarial attacks are usually performed. In this paper, we first significantly boost the efficiency of HotFlip, reducing the adversarial generation process from 4 hours per document to only 15 minutes, using the same hardware. We further contribute experiments and analysis on two additional tasks: (1) transfer-based black-box attacks, and (2) query-agnostic attacks. Whenever possible, we provide comparisons between the original method and our improved version. Our experiments demonstrate that HotFlip can effectively attack a variety of dense retrievers, with an observed trend that its attack performance diminishes against more advanced and recent methods. Interestingly, we observe that while HotFlip performs poorly in a black-box setting, indicating limited capacity for generalization, in query-agnostic scenarios its performance is correlated to the volume of injected adversarial passages.
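The core of a gradient-based word substitution step can be sketched as follows: for one token position, the gain from swapping the current token for a candidate is estimated to first order as the dot product of the embedding difference with the gradient of the attack objective. This is a minimal sketch of that general idea under stated assumptions; the vocabulary, embeddings, and gradient are made-up values, and the function name `hotflip_candidates` is hypothetical rather than taken from the paper's code.

```python
import numpy as np

def hotflip_candidates(embedding_matrix, grad_at_position, current_token_id, top_k=3):
    """Rank replacement tokens by the first-order estimate (e_new - e_cur) . grad."""
    current = embedding_matrix[current_token_id]
    gains = (embedding_matrix - current) @ grad_at_position
    gains[current_token_id] = -np.inf  # exclude the "no change" option
    order = np.argsort(-gains)[:top_k]
    return order, gains[order]

# Hypothetical 5-token vocabulary with 3-dimensional embeddings.
E = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [-1.0, 0.0, 2.0]])
grad = np.array([0.5, -1.0, 2.0])  # made-up gradient of the attack objective
ids, gains = hotflip_candidates(E, grad, current_token_id=0)
print(ids, gains)
```

In practice the top candidates are usually re-scored with a real forward pass, since the linear estimate is only an approximation; the gradient computation itself is the expensive part that the abstract identifies as the bottleneck.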
[NLP-34] Cued Speech Generation Leveraging a Pre-trained Audiovisual Text-to-Speech Model
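[Quick read]: This paper addresses the automatic generation of Cued Speech (CS), a visual communication system that helps people with hearing impairment better understand spoken language. It proposes a transfer learning strategy based on a pre-trained audiovisual autoregressive text-to-speech model (AVTacotron2), which is reprogrammed to infer Cued Speech hand and lip movements from text input. Experiments are conducted on two publicly available datasets, one of which was recorded specifically for this study. Performance is assessed with an automatic Cued Speech recognition system, and the results demonstrate the effectiveness of the approach, with decoding accuracy at the phonetic level reaching approximately 77%. The key to the solution is exploiting the transfer learning capacity of a pre-trained model by repurposing it for Cued Speech generation.
Link: https://arxiv.org/abs/2501.04799
Authors: Sanjana Sankar, Martin Lenglet, Gerard Bailly, Denis Beautemps, Thomas Hueber
Affiliations: Univ. Grenoble Alpes; CNRS; Grenoble INP; GIPSA-lab
Categories: Computation and Language (cs.CL)
Comments: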
Abstract:This paper presents a novel approach for the automatic generation of Cued Speech (ACSG), a visual communication system used by people with hearing impairment to better elicit the spoken language. We explore transfer learning strategies by leveraging a pre-trained audiovisual autoregressive text-to-speech model (AVTacotron2). This model is reprogrammed to infer Cued Speech (CS) hand and lip movements from text input. Experiments are conducted on two publicly available datasets, including one recorded specifically for this study. Performance is assessed using an automatic CS recognition system. With a decoding accuracy at the phonetic level reaching approximately 77%, the results demonstrate the effectiveness of our approach.
[NLP-35] Developing a Modular Compiler for a Subset of a C-like Language
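[Quick read]: This paper addresses the challenges of building a compiler for a high-level language, in particular how to make the compiler modular and memory-efficient while keeping it capable and efficient. The key to the proposed solution is a modular design that lets developers add or remove subsets of the language as required, resulting in a minimal and memory-efficient compiler. Development proceeds in small, incremental steps, each of which yields a fully functioning compiler for a progressively larger subset of the language. By following the industry best practices of modular design, code reusability, and documentation, the resulting compiler is functional, maintainable, and extensible. The compiler was also tested on a resource-constrained single-board computer, further confirming its efficiency and suitability for memory-limited devices.
Link: https://arxiv.org/abs/2501.04503
Authors: Debasish Dutta, Neeharika Sonowal, Irani Hazarika
Affiliations: unknown
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: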
Abstract:The paper introduces the development of a modular compiler for a subset of a C-like language, which addresses the challenges in constructing a compiler for high-level languages. This modular approach will allow developers to modify a language by adding or removing subsets as required, resulting in a minimal and memory-efficient compiler. The development process is divided into small, incremental steps, where each step yields a fully functioning compiler for an expanding subset of the language. The paper outlines the iterative developmental phase of the compiler, emphasizing progressive enhancements in capabilities and functionality. Adherence to industry best practices of modular design, code reusability, and documentation has enabled the resulting compiler’s functional efficiency, maintainability, and extensibility. The compiler proved to be effective not only in managing the language structure but also in developing optimized code, which demonstrates its practical usability. This was also further assessed using the compiler on a tiny memory-deficient single-board computer, again showing the compiler’s efficiency and suitability for resource-constrained devices.
[NLP-36] FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching ICASSP2025
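[Quick read]: This paper addresses the challenges of audio super-resolution, in particular the difficulty caused by the ill-posed nature of the problem. Although diffusion models perform well on audio super-resolution, their main limitation is the large number of sampling steps they require, which significantly increases latency when generating high-quality audio samples. The proposed solution, FLowHigh, brings flow matching, a highly efficient generative model, to audio super-resolution and combines it with probability paths specially tailored to the task, which effectively capture the distribution of high-resolution audio and thereby improve reconstruction quality. FLowHigh's key innovation is generating high-fidelity, high-resolution audio with a single-step sampling process, which greatly reduces computational cost while achieving state-of-the-art performance on the VCTK benchmark dataset.
Link: https://arxiv.org/abs/2501.04926
Authors: Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee
Affiliations: Korea University
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted by ICASSP 2025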
Abstract:Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which causes significantly increased latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions, thereby enhancing reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. The experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL while maintaining computational efficiency with only a single-step sampling process.
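To make "single-step sampling" concrete: a flow-matching model defines a velocity field, and sampling integrates dx/dt = v(x, t) from t = 0 to t = 1. With one Euler step this collapses to x1 = x0 + v(x0, 0). The sketch below is a toy illustration under heavy assumptions: `oracle_velocity` is a hand-written field pointing straight at a made-up target, standing in for a trained network, and it is not the probability path used in FLowHigh.

```python
import numpy as np

def euler_sample(x0, velocity, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x = np.asarray(x0, dtype=float).copy()
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k * h
        x = x + h * velocity(x, t)
    return x

# Hypothetical target signal and an oracle velocity field that points straight
# at it; a trained network would replace this function.
target = np.array([0.2, -0.5, 0.9])

def oracle_velocity(x, t):
    return (target - x) / (1.0 - t)

x0 = np.zeros(3)                                 # starting point of the path
one_step = euler_sample(x0, oracle_velocity, 1)  # single-step sampling
four_step = euler_sample(x0, oracle_velocity, 4)
print(one_step, four_step)
```

When the learned paths are close to straight lines, as in this toy field, one step lands in the same place as many steps, which is why the number of network evaluations can be cut so sharply.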
[NLP-37] Enhancing Listened Speech Decoding from EEG via Parallel Phoneme Sequence Prediction ICASSP2025
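[Quick read]: This paper addresses the decoding of listened speech from electroencephalography (EEG) signals, with a view to helping people whose speech perception is impaired by neurological disorders. It proposes a novel approach that uses an auxiliary phoneme predictor to decode textual phoneme sequences at the same time, thereby enhancing listened speech decoding from EEG. The key to the solution lies in the three main parts of the model architecture: the EEG module, the speech module, and the phoneme predictor. The EEG module converts EEG signals into EEG embeddings, the speech module generates speech waveforms from these embeddings, and the phoneme predictor outputs the decoded textual phoneme sequence. The approach lets users obtain both speech waveforms and textual phoneme sequences from EEG signals simultaneously, avoiding the need to build a separate concatenated sequential pipeline for each modality. Experimental results show that the method outperforms previous methods in both modalities.
Link: https://arxiv.org/abs/2501.04844
Authors: Jihwan Lee, Tiantian Feng, Aditya Kommineni, Sudarsana Reddy Kadiri, Shrikanth Narayanan
Affiliations: Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
Comments: ICASSP 2025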
Abstract:Brain-computer interfaces (BCI) offer numerous human-centered application possibilities, particularly affecting people with neurological disorders. Text or speech decoding from brain activities is a relevant domain that could augment the quality of life for people with impaired speech perception. We propose a novel approach to enhance listened speech decoding from electroencephalography (EEG) signals by utilizing an auxiliary phoneme predictor that simultaneously decodes textual phoneme sequences. The proposed model architecture consists of three main parts: EEG module, speech module, and phoneme predictor. The EEG module learns to properly represent EEG signals into EEG embeddings. The speech module generates speech waveforms from the EEG embeddings. The phoneme predictor outputs the decoded phoneme sequences in text modality. Our proposed approach allows users to obtain decoded listened speech from EEG signals in both modalities (speech waveforms and textual phoneme sequences) simultaneously, eliminating the need for a concatenated sequential pipeline for each modality. The proposed approach also outperforms previous methods in both modalities. The source code and speech samples are publicly available.
Computer Vision
[CV-0] An Empirical Study of Autoregressive Pre-training from Videos
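[Quick read]: This paper studies autoregressive pre-training from videos and how an autoregressive model can learn effective visual representations from video data. The authors build a family of autoregressive video models called Toto, treat videos as sequences of visual tokens, and train Transformer models to predict future visual tokens autoregressively. The key elements are: 1) pre-training on a diverse dataset of videos and images comprising over 1 trillion visual tokens; 2) exploring different architectural, training, and inference design choices; 3) evaluating the learned visual representations on downstream tasks including image recognition, video classification, object tracking, and robotics. The results show that, despite minimal inductive biases, autoregressive pre-training is competitive across all benchmarks. In addition, the scaling curves of the video models resemble those of language models, although with a different rate.
Link: https://arxiv.org/abs/2501.05453
Authors: Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: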
Abstract:We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at this https URL
[CV-1] Decentralized Diffusion Models
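[Quick read]: This paper addresses the heavy network burden, high infrastructure cost, and strain on power systems that arise because large-scale AI model training depends on centralized, high-bandwidth networking. The key to the solution is Decentralized Diffusion Models, a framework that distributes diffusion model training across independent clusters or datacenters and removes the dependence on a centralized network. Specifically, the method trains a set of expert diffusion models over different partitions of the dataset, each fully isolated from the others during training. At inference time, the experts are ensembled through a lightweight router, so that the ensemble collectively optimizes the same objective as a single model trained on the whole dataset. The training burden can therefore be divided among several "compute islands", lowering infrastructure costs and improving resilience to localized GPU failures. Experiments show that decentralized diffusion models outperform standard diffusion models FLOP-for-FLOP, and that a high-quality diffusion model can be trained in less than a week with just eight individual GPU nodes.
Link: https://arxiv.org/abs/2501.05450
Authors: David McAllister, Matthew Tancik, Jiaming Song, Angjoo Kanazawa
Affiliations: University of California, Berkeley; Luma AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Project webpage: this https URL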
Abstract:Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework for distributing diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from one another. At inference time, the experts ensemble through a lightweight router. We show that the ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of “compute islands,” lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models empower researchers to take advantage of smaller, more cost-effective and more readily available compute like on-demand GPU nodes rather than central integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models FLOP-for-FLOP outperform standard diffusion models. We finally scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can now be trained with just eight individual GPU nodes in less than a week.
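The inference-time ensembling described above can be pictured as a router-weighted combination of expert outputs. The sketch below is only a schematic under stated assumptions: the experts and router are toy callables, and the softmax-weighted sum is one plausible way to combine them, not a reproduction of the paper's router.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def ensemble_prediction(x, t, experts, router):
    """Combine expert outputs with weights produced by the router."""
    weights = softmax(router(x, t))
    outputs = np.stack([expert(x, t) for expert in experts])
    return np.tensordot(weights, outputs, axes=1), weights

# Hypothetical experts, each imagined as trained on its own data partition.
experts = [lambda x, t: -x, lambda x, t: 1.0 - x]
router = lambda x, t: np.array([0.0, 0.0])  # toy router: equal logits

x = np.array([0.0, 2.0])
pred, weights = ensemble_prediction(x, 0.5, experts, router)
print(pred, weights)
```

Because the experts only meet at inference time, nothing in this combination step requires them to have exchanged gradients during training.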
[CV-2] Explainable AI-Enhanced Deep Learning for Pumpkin Leaf Disease Detection: A Comparative Analysis of CNN Architectures
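[Quick read]: This paper addresses the significant threat that pumpkin leaf diseases pose to agricultural productivity, and in particular the fact that traditional identification methods are time-consuming and prone to human error. It proposes a deep-learning-based solution for automated disease diagnosis. The study uses the "Pumpkin Leaf Disease Dataset", which contains 2000 high-resolution images in five categories: downy mildew, powdery mildew, mosaic disease, bacterial leaf spot, and healthy leaves. Several deep learning architectures are evaluated, including DenseNet201, DenseNet121, DenseNet169, Xception, ResNet50, ResNet101, and InceptionResNetV2; ResNet50 performs best, with an accuracy of 90.5% and comparable precision, recall, and F1-score. The study also applies Explainable AI (XAI) methods such as Grad-CAM, Grad-CAM++, Score-CAM, and Layer-CAM to make the model's decision process interpretable and thereby increase trust in automated disease diagnosis. The results indicate that ResNet50 has considerable potential for pumpkin leaf disease detection, enabling earlier and more accurate treatment.
Link: https://arxiv.org/abs/2501.05449
Authors: Md. Arafat Alam Khandaker, Ziyan Shirin Raha, Shifat Islam, Tashreef Muhammad
Affiliations: Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh; Department of Computer Science and Engineering, Southeast University, Dhaka, Bangladesh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in 2024 27th International Conference on Computer and Information Technology (ICCIT)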
Abstract:Pumpkin leaf diseases are significant threats to agricultural productivity, requiring a timely and precise diagnosis for effective management. Traditional identification methods are laborious and susceptible to human error, emphasizing the necessity for automated solutions. This study employs the “Pumpkin Leaf Disease Dataset”, which comprises 2000 high-resolution images separated into five categories: downy mildew, powdery mildew, mosaic disease, bacterial leaf spot, and healthy leaves. The dataset was rigorously assembled from several agricultural fields to ensure a strong representation for model training. We explored many proficient deep learning architectures, including DenseNet201, DenseNet121, DenseNet169, Xception, ResNet50, ResNet101 and InceptionResNetV2, and observed that ResNet50 performed most effectively, with an accuracy of 90.5% and comparable precision, recall, and F1-Score. We used Explainable AI (XAI) approaches like Grad-CAM, Grad-CAM++, Score-CAM, and Layer-CAM to provide meaningful representations of model decision-making processes, which improved understanding and trust in automated disease diagnostics. These findings demonstrate ResNet50’s potential to revolutionize pumpkin leaf disease detection, allowing for earlier and more accurate treatments.
[CV-3] Relative Pose Estimation through Affine Corrections of Monocular Depth Priors
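[Quick read]: This paper addresses how to use Monocular Depth Estimation (MDE) models in geometric vision tasks, in particular relative pose estimation. Although MDE models have made notable progress in predicting both affine-invariant relative depth and metric (absolute) depth, how to use these predictions effectively to improve on classic keypoint-based solutions remains relatively underexplored. The paper proposes three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities and cover both calibrated and uncalibrated conditions. It further proposes a hybrid estimation pipeline that combines the proposed solvers with classic point-based solvers and epipolar constraints. The results show that modeling the affine correction benefits not only relative depth priors but also "metric" depth priors. Experiments show that the method outperforms classic keypoint-based baselines and PnP-based solutions on multiple datasets, and that it works with different feature matchers and MDE models for further gains.
Link: https://arxiv.org/abs/2501.05446
Authors: Yifan Yu, Shaohui Liu, Rémi Pautrat, Marc Pollefeys, Viktor Larsson
Affiliations: ETH Zurich; PICO; Microsoft Spatial AI Lab; Lund University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: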
Abstract:Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity from the monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial to not only the relative depth priors but also, surprisingly, the “metric” ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances on both modules. Code is available at this https URL.
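The "affine (scale and shift) ambiguity" means that a predicted depth map d is only trusted up to a transform a·d + b. A simple way to see what this implies is to fit a and b by least squares against reference depths. The sketch below does exactly that with made-up numbers; it illustrates the ambiguity only and is not one of the paper's pose solvers, which estimate the affine parameters jointly with the relative pose.

```python
import numpy as np

def fit_scale_shift(pred_depth, ref_depth):
    """Least-squares (scale, shift) such that scale * pred_depth + shift ~ ref_depth."""
    pred = np.asarray(pred_depth, dtype=float).ravel()
    ref = np.asarray(ref_depth, dtype=float).ravel()
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # columns: depth, constant
    (scale, shift), *_ = np.linalg.lstsq(A, ref, rcond=None)
    return scale, shift

# Hypothetical relative depths and the reference depths they should map onto.
pred = np.array([1.0, 2.0, 3.0, 4.0])
ref = 2.0 * pred + 0.5
scale, shift = fit_scale_shift(pred, ref)
print(scale, shift)
```

The finding reported in the abstract, that even "metric" predictions benefit from such a correction, suggests treating nominally absolute depths as carrying a residual scale and shift error as well.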
[CV-4] Consistent Flow Distillation for Text-to-3D Generation
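[Quick read]: This paper addresses the degraded visual quality and limited diversity that Score Distillation Sampling (SDS) exhibits in 3D generation because of its maximum-likelihood-seeking behavior. To overcome these limitations, it proposes Consistent Flow Distillation (CFD). The key idea is to use the gradient of the diffusion ODE (Ordinary Differential Equation) or SDE (Stochastic Differential Equation) sampling process to guide 3D generation, and to introduce multi-view consistent Gaussian noise so that the 2D image flows are consistent across viewpoints, which improves both the quality and the diversity of the generated 3D content. Experiments show that CFD significantly outperforms existing methods on text-to-3D generation.
Link: https://arxiv.org/abs/2501.05445
Authors: Runjie Yan, Yinbo Chen, Xiaolong Wang
Affiliations: UC San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL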
Abstract:Score Distillation Sampling (SDS) has made significant strides in distilling image-generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From the gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments demonstrate that CFD, through consistent flows, significantly outperforms previous methods in text-to-3D generation.
[CV-5] Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
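[Quick read]: This paper addresses the limited ability of Multimodal Large Language Models (MLLMs) to perform organic multimodal reasoning. Existing benchmarks tend to emphasize text-dominant reasoning or rely on shallow visual cues, and fail to adequately assess integrated visual and textual reasoning. The authors therefore introduce the EMMA (Enhanced MultiModal reAsoning) benchmark, which evaluates multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks require complex cross-modal reasoning that cannot be solved by reasoning independently in each modality, and so provide a more comprehensive test of MLLM reasoning. Evaluating state-of-the-art MLLMs on EMMA, the authors find significant limitations on complex multimodal and multi-step reasoning tasks; even advanced techniques such as Chain-of-Thought prompting and test-time compute scaling underperform. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
Link: https://arxiv.org/abs/2501.05444
Authors: Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng
Affiliations: University of Electronic Science and Technology of China; Sun Yat-sen University; University of Washington; Microsoft; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: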
Abstract:The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs’ reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks, even with advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperforming. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
[CV-6] Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces
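[Quick read]: This paper addresses how to push the temporal compression ratio of video tokenizers in latent video diffusion models beyond 4x without increasing channel capacity. With existing video tokenizers, reconstruction quality drops markedly once temporal compression exceeds 4x. The key observation is that a low-compression encoder reconstructs temporally subsampled videos better than a high-compression encoder reconstructs the original videos. Building on this, the authors propose a bootstrapped high-temporal-compression model that progressively trains high-compression blocks on top of a well-trained low-compression model, with a cross-level feature-mixing module that retains information from the pretrained low-compression model while guiding the high-compression blocks to capture the remaining details from the full video sequence. Experiments show that the method significantly improves reconstruction quality while increasing temporal compression, and that the resulting compact latent space effectively supports high-quality video generation.
Link: https://arxiv.org/abs/2501.05442
Authors: Aniruddha Mahapatra, Long Mai, Yitian Zhang, David Bourgin, Feng Liu
Affiliations: Adobe Research; Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project website: this https URL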
Abstract:Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation of video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to direct extensions of existing video tokenizers. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a reduced token budget.
[CV-7] The GAN is dead; long live the GAN! A Modern GAN Baseline NEURIPS2024
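[Quick read]: This paper addresses the claim that generative adversarial networks (GANs) are difficult to train and challenges the empirical tricks that pervade the literature. It proposes a more principled way of building a modern GAN baseline. The key points are as follows. First, it derives a well-behaved regularized relativistic GAN loss that addresses mode dropping and non-convergence, problems previously handled with a bag of ad-hoc tricks. Second, it proves mathematically that this loss admits local convergence guarantees, unlike most existing relativistic losses. Finally, building on the new loss, it discards all ad-hoc tricks and replaces the outdated backbones of common GANs with modern architectures. Using StyleGAN2 as an example, the paper lays out a roadmap of simplification and modernization that leads to a new minimalist baseline, R3GAN. Despite its simplicity, the approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST, and compares favorably with state-of-the-art GANs and diffusion models.
Link: https://arxiv.org/abs/2501.05441
Authors: Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin
Affiliations: Brown University; Cornell University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2024. Code available at this https URL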
Abstract:There is a widely-spread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline – R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.
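A relativistic pairing loss scores each fake sample against a real one, so the discriminator is trained on the difference of its outputs rather than on each output separately. The sketch below shows one common sign convention for such a loss using a softplus link. It is an assumption-laden illustration: the critic outputs are made-up numbers, the gradient-penalty regularizers that the abstract's "regularized" refers to are omitted, and the exact formulation in the paper may differ.

```python
import numpy as np

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return np.logaddexp(0.0, t)

def relativistic_losses(d_real, d_fake):
    """Paired relativistic losses: each fake score is compared with a real score."""
    d_loss = softplus(d_fake - d_real).mean()  # discriminator wants real > fake
    g_loss = softplus(d_real - d_fake).mean()  # generator wants fake > real
    return d_loss, g_loss

# Made-up critic outputs for a batch of two real/fake pairs.
d_real = np.array([2.0, 2.0])
d_fake = np.array([0.0, 0.0])
d_loss, g_loss = relativistic_losses(d_real, d_fake)
tie_d, tie_g = relativistic_losses(d_real, d_real)
print(d_loss, g_loss, tie_d, tie_g)
```

Since only the difference of scores enters the loss, the generator is rewarded for making fakes score at least as well as the reals they are paired with.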
[CV-8] DPF*: improved Depth Potential Function for scale-invariant sulcal depth estimation
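[Quick read]: This paper addresses how brain size affects geometric features of the cortical surface extracted from anatomical MRI (Magnetic Resonance Imaging), in particular sulcal depth. Although the interactions between brain size, cortical folding, and age are well documented, a quantitative analysis of how brain size influences sulcal depth measurements has been lacking. The key contributions are: 1) the first quantitative analysis of the effect of brain size on sulcal depth measurements; 2) a novel, scale-invariant method for sulcal depth estimation based on an original formalization of the problem; 3) a validation framework, with code and benchmark data shared with the community; 4) a demonstration of the biological relevance of the new sulcal depth measure on a large sample of 1,987 subjects spanning the period from 26 weeks post-conception to adulthood. These contributions offer a new perspective and method for sulcal depth measurement, with value for both basic research and clinical applications.
Link: https://arxiv.org/abs/2501.05436
Authors: Maxime Dieudonné (1), Guillaume Auzias (1), Julien Lefèvre (1) ((1) Institut de Neurosciences de la Timone, UMR 7289, CNRS, Aix-Marseille Université, 13005, Marseille, France)
Affiliations: Institut de Neurosciences de la Timone, UMR 7289, CNRS, Aix-Marseille Université, 13005, Marseille, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: GA and JL contributed equally to this work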
Abstract:The shape of the human brain is complex and highly variable, with interactions between brain size, cortical folding, and age well-documented in the literature. However, few studies have explored how global brain size influences geometric features of the cortical surface derived from anatomical MRI. In this work, we focus on sulcal depth, an imaging phenotype that has gained significant attention in both basic research and clinical applications. We make key contributions to the field by: 1) providing the first quantitative analysis of how brain size affects sulcal depth measurements; 2) introducing a novel, scale-invariant method for sulcal depth estimation based on an original formalization of the problem; 3) presenting a validation framework and sharing our code and benchmark data with the community; and 4) demonstrating the biological relevance of our new sulcal depth measure using a large sample of 1,987 subjects spanning the developmental period from 26 weeks post-conception to adulthood.
[CV-9] Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation
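【Quick read】: This paper tackles direct 3D generation, whose quality is limited by the scarcity and lower fidelity of 3D datasets. The proposed method, Zero-1-to-G, uses pretrained 2D diffusion models to perform single-view 3D generation. Specifically, it decomposes Gaussian splats, a 3D representation, into multi-view images, which reframes the difficult 3D generation task as a generation problem within a 2D diffusion framework. The paper also introduces cross-view and cross-attribute attention layers that capture complex correlations and enforce 3D consistency across the generated splats. According to the authors, this is the first direct image-to-3D generative model to make effective use of pretrained 2D diffusion priors, which enables efficient training and better generalization to unseen objects.
Link: https://arxiv.org/abs/2501.05427
Authors: Xuyi Meng,Chen Wang,Jiahui Lei,Kostas Daniilidis,Jiatao Gu,Lingjie Liu
Affiliations: University of Pennsylvania; Apple
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: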
Abstract:Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.
[CV-10] From Images to Insights: Transforming Brain Cancer Diagnosis with Explainable AI
【Quick read】: This paper addresses the challenges of diagnosing brain cancer, which is often difficult, time-consuming, and vulnerable to intraclass variability, especially where specialist radiologists are scarce. The team built the Bangladesh Brain Cancer MRI Dataset, which contains 6,056 MRI images in three categories: Brain Tumor, Brain Glioma, and Brain Menin. Among the deep learning models evaluated, DenseNet169 performed best, with accuracy, precision, recall, and F1-score all reaching 0.9983. The study also applies Explainable AI (XAI) methods, namely GradCAM, GradCAM++, ScoreCAM, and LayerCAM, to visualize the models' decision-making. According to the authors, these techniques improve diagnostic accuracy while making the model more transparent, which supports early diagnosis and better patient outcomes.
Link: https://arxiv.org/abs/2501.05426
Authors: Md. Arafat Alam Khandaker,Ziyan Shirin Raha,Salehin Bin Iqbal,M.F. Mridha,Jungpil Shin
Affiliations: Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh; Department of Computer Science, American International University Bangladesh, Dhaka, Bangladesh; Department of Computer Science and Engineering, University of Aizu, Aizuwakamatsu, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in 2024 27th International Conference on Computer and Information Technology (ICCIT)
Abstract:Brain cancer represents a major challenge in medical diagnostics, requiring precise and timely detection for effective treatment. Diagnosis initially relies on the proficiency of radiologists, which can cause difficulties and threats when the expertise is sparse. Despite the use of imaging resources, brain cancer diagnosis often remains difficult, time-consuming, and vulnerable to intraclass variability. This study presents the Bangladesh Brain Cancer MRI Dataset, containing 6,056 MRI images organized into three categories: Brain Tumor, Brain Glioma, and Brain Menin. The dataset was collected from several hospitals in Bangladesh, providing a diverse and realistic sample for research. We implemented advanced deep learning models, and DenseNet169 achieved exceptional results, with accuracy, precision, recall, and F1-Score all reaching 0.9983. In addition, Explainable AI (XAI) methods including GradCAM, GradCAM++, ScoreCAM, and LayerCAM were employed to provide visual representations of the decision-making processes of the models. In the context of brain cancer, these techniques highlight DenseNet169’s potential to enhance diagnostic accuracy while simultaneously offering transparency, facilitating early diagnosis and better patient outcomes.
[CV-11] Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation
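【Quick read】: This paper addresses the difficulty of obtaining the large number of semantically aligned audio-visual pairs needed to train audio-to-image generative models. Such pairs are usually curated from in-the-wild videos, which limits the scale, quality, and diversity of the data and in turn hurts model performance. The paper proposes a scalable image sonification framework in which instances from high-quality but disjoint uni-modal sources are artificially paired, using the reasoning capabilities of modern vision-language models. This removes the strict dependence on ground-truth audio-visual correspondence and improves the scale, quality, and diversity of the data. Experiments show that a generative model trained on the resulting audio-image pairs is competitive with the state of the art and exhibits implicit auditory capabilities such as semantic mixing and interpolation, loudness calibration, and reverberation modeling.
Link: https://arxiv.org/abs/2501.05413
Authors: Darius Petermann,Mahdi M. Kalayeh
Affiliations: Indiana University; Netflix
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: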
Abstract:Training audio-to-image generative models requires an abundance of diverse audio-visual pairs that are semantically aligned. Such data is almost always curated from in-the-wild videos, given the cross-modal semantic correspondence that is inherent to them. In this work, we hypothesize that insisting on the absolute need for ground truth audio-visual correspondence is not only unnecessary, but also leads to severe restrictions in scale, quality, and diversity of the data, ultimately impairing its use in modern generative models. To this end, we propose a scalable image sonification framework where instances from a variety of high-quality yet disjoint uni-modal origins can be artificially paired through a retrieval process that is empowered by reasoning capabilities of modern vision-language models. To demonstrate the efficacy of this approach, we use our sonified images to train an audio-to-image generative model that performs competitively against state-of-the-art. Finally, through a series of ablation studies, we exhibit several intriguing auditory capabilities like semantic mixing and interpolation, loudness calibration and acoustic space modeling through reverberation that our model has implicitly developed to guide the image generation process.
[CV-12] A Novel Pathology Foundation Model by Mayo Clinic, Charité, and Aignostics
【Quick read】: This paper concerns the effectiveness of foundation models across different applications in digital pathology. It presents a new vision foundation model based on the RudolfV approach, trained on a dataset of 1.2 million histopathology whole slide images collected from two medical institutions, Mayo Clinic and Charité - Universitätsmedizin Berlin. Although the model is neither the largest by parameter count nor by training dataset size, it achieves state-of-the-art performance on 21 public benchmark datasets. The key point is that a vision foundation model built with the RudolfV approach can reach this level of performance at a comparatively modest model and dataset scale.
Link: https://arxiv.org/abs/2501.05409
Authors: Maximilian Alber,Stephan Tietz,Jonas Dippel,Timo Milbich,Timothée Lesort,Panos Korfiatis,Moritz Krügener,Beatriz Perez Cancer,Neelay Shah,Alexander Möllers,Philipp Seegerer,Alexandra Carpen-Amarie,Kai Standvoss,Gabriel Dernbach,Edwin de Jong,Simon Schallenberg,Andreas Kunft,Helmut Hoffer von Ankershoffen,Gavin Schaeferle,Patrick Duffy,Matt Redlon,Philipp Jurmeister,David Horst,Lukas Ruff,Klaus-Robert Müller,Frederick Klauschen,Andrew Norgan
Affiliations: Aignostics, Germany; Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, US; Department of Radiology, Mayo Clinic, Rochester MN, US; Department of Information Technology, Mayo Clinic, Rochester MN, US; Systems Quality Office, Mayo Clinic, Rochester MN, US; Machine Learning Group, Technische Universität Berlin, Germany; BIFOLD – Berlin Institute for the Foundations of Learning and Data, Germany; Department of Artificial Intelligence, Korea University, Republic of Korea; Max-Planck Institute for Informatics, Germany; German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK), Berlin & Munich Partner Sites, Germany; Institute of Pathology, Ludwig-Maximilians-Universität München, Germany; Institute of Pathology, Charité – Universitätsmedizin Berlin, Germany; Bavarian Cancer Research Center (BZKF), Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in digital pathology have demonstrated the effectiveness of foundation models across diverse applications. In this report, we present a novel vision foundation model based on the RudolfV approach. Our model was trained on a dataset comprising 1.2 million histopathology whole slide images, collected from two medical institutions: Mayo Clinic and Charité - Universitätsmedizin Berlin. Comprehensive evaluations show that our model achieves state-of-the-art performance across twenty-one public benchmark datasets, even though it is neither the largest model by parameter count nor by training dataset size.
[CV-13] Performance of YOLOv7 in Kitchen Safety While Handling Knife
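【Quick read】: This paper addresses safety during knife handling in the kitchen, in particular the risks of improper finger placement and blade contact with the hand. It uses YOLOv7 (You Only Look Once version 7), an object detection model, to identify hazards during knife use, with the aim of reducing cuts and other serious accidents. The model is evaluated with precision, recall, mAP50, and mAP50-95. YOLOv7 performed best at epoch 31, with an mAP50-95 of 0.7879, precision of 0.9063, and recall of 0.7503, which indicates its potential for improving kitchen safety.
Link: https://arxiv.org/abs/2501.05399
Authors: Athulya Sundaresan Geetha
Affiliations: Department of Computer Science, Huddersfield University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: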
Abstract:Safe knife practices in the kitchen significantly reduce the risk of cuts, injuries, and serious accidents during food preparation. Using YOLOv7, an advanced object detection model, this study focuses on identifying safety risks during knife handling, particularly improper finger placement and blade contact with the hand. The model’s performance was evaluated using metrics such as precision, recall, mAP50, and mAP50-95. The results demonstrate that YOLOv7 achieved its best performance at epoch 31, with an mAP50-95 score of 0.7879, precision of 0.9063, and recall of 0.7503. These findings highlight YOLOv7’s potential to accurately detect knife-related hazards, promoting the development of improved kitchen safety.
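The abstract reports precision and recall but no F1 score. As a small illustration of how the two reported numbers relate, the sketch below computes their harmonic mean. The function name `f1_score` is our own, and the resulting value is derived arithmetic, not a figure from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for epoch 31.
precision, recall = 0.9063, 0.7503
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")
```

With these inputs the F1 comes out at about 0.82, noticeably below the precision, because the harmonic mean is pulled toward the lower recall.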
[CV-14] Arc2Avatar: Generating Expressive 3D Avatars from a Single Image via ID Guidance
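【Quick read】: This paper addresses the generation of realistic, identity-preserving 3D head avatars from a single image. Existing methods can reconstruct detailed 3D scenes in multi-view settings, but producing a head model with dense correspondence from a single input image remains challenging. The proposed Arc2Avatar is presented as the first method based on SDS (Score Distillation Sampling) that uses a human face foundation model as guidance with only a single image as input. The key innovations are: 1) fine-tuning on synthetic data and modifying the conditioning to extend the foundation model to diverse-view head generation; 2) a modified 3D Gaussian Splatting approach, connectivity regularizers, and a task-specific initialization strategy, which keep the avatar in dense correspondence with a human face mesh template and so allow blendshape-based expression generation; and 3) an optional SDS-based correction step that refines the blendshape expressions for greater realism and diversity. Experiments show that Arc2Avatar achieves state-of-the-art realism and identity preservation, and that its strong identity prior and initialization strategy address color issues without sacrificing detail.
Link: https://arxiv.org/abs/2501.05379
Authors: Dimitrios Gerogiannis,Foivos Paraperas Papantoniou,Rolandos Alexandros Potamias,Alexandros Lattas,Stefanos Zafeiriou
Affiliations: Imperial College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: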
Abstract:Inspired by the effectiveness of 3D Gaussian Splatting (3DGS) in reconstructing detailed 3D scenes within multi-view setups and the emergence of large 2D human foundation models, we introduce Arc2Avatar, the first SDS-based method utilizing a human face foundation model as guidance with just a single image as input. To achieve that, we extend such a model for diverse-view human head generation by fine-tuning on synthetic data and modifying its conditioning. Our avatars maintain a dense correspondence with a human face mesh template, allowing blendshape-based expression generation. This is achieved through a modified 3DGS approach, connectivity regularizers, and a strategic initialization tailored for our task. Additionally, we propose an optional efficient SDS-based correction step to refine the blendshape expressions, enhancing realism and diversity. Experiments demonstrate that Arc2Avatar achieves state-of-the-art realism and identity preservation, effectively addressing color issues by allowing the use of very low guidance, enabled by our strong identity prior and initialization strategy, without compromising detail.
[CV-15] 1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On
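【Quick read】: This paper addresses the difficulty that existing Virtual Try-On (VTON) methods have in preserving fine-grained garment details, and in particular the computational overhead of the dual-network paradigm. Early VTON methods relied on a single generative network, but limitations in feature extraction and fusion made garment details hard to preserve. Recent dual-network methods improve extraction and fusion by adding a “ReferenceNet”, at the cost of significant computational overhead that limits scalability to high-resolution and long-duration image/video VTON. The paper proposes a single-network VTON method, MNVTON, whose key innovation is a Modality-specific Normalization strategy that processes text, image, and video inputs separately so that they can share the same attention layers in the VTON network. Experiments show higher-quality, more detailed results on both image and video VTON tasks, suggesting that a single network can rival dual-network approaches while offering a more efficient, scalable solution.
Link: https://arxiv.org/abs/2501.05369
Authors: Shuliang Ning,Yipeng Qin,Xiaoguang Han
Affiliations: FNii, CUHKSZ; SSE, CUHKSZ; Cardiff University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL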
Abstract:Virtual Try-On (VTON) has become a crucial tool in ecommerce, enabling the realistic simulation of garments on individuals while preserving their original appearance and pose. Early VTON methods relied on single generative networks, but challenges remain in preserving fine-grained garment details due to limitations in feature extraction and fusion. To address these issues, recent approaches have adopted a dual-network paradigm, incorporating a complementary “ReferenceNet” to enhance garment feature extraction and fusion. While effective, this dual-network approach introduces significant computational overhead, limiting its scalability for high-resolution and long-duration image/video VTON applications. In this paper, we challenge the dual-network paradigm by proposing a novel single-network VTON method that overcomes the limitations of existing techniques. Our method, namely MNVTON, introduces a Modality-specific Normalization strategy that separately processes text, image and video inputs, enabling them to share the same attention layers in a VTON network. Extensive experimental results demonstrate the effectiveness of our approach, showing that it consistently achieves higher-quality, more detailed results for both image and video VTON tasks. Our results suggest that the single-network paradigm can rival the performance of dual-network approaches, offering a more efficient alternative for high-quality, scalable VTON applications.
[CV-16] CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models
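【Quick read】: This paper addresses the potential misuse of diffusion models to generate Not Safe For Work (NSFW) content. Stable Diffusion includes safety checkers that screen the initial text prompt and the final generated image, but these checkers are vulnerable to adversarial attacks, so NSFW images can still be produced. The key solution is CROPS (Circular or RandOm Prompts for Safety), a model-agnostic framework that defends against such attacks without additional training. The paper also develops CROPS-1, which uses one-step diffusion models for efficient NSFW detection and further reduces computational cost. Experiments indicate advantages in both performance and applicability.
Link: https://arxiv.org/abs/2501.05359
Authors: Junha Park,Ian Ryu,Jaehui Hwang,Hyungkeun Park,Jiyoon Kim,Jong-Seok Lee
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: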
Abstract:With advances in diffusion models, image generation has shown significant performance improvements. This raises concerns about the potential abuse of image generation, such as the creation of explicit or violent images, commonly referred to as Not Safe For Work (NSFW) content. To address this, the Stable Diffusion model includes several safety checkers to censor initial text prompts and final output images generated from the model. However, recent research has shown that these safety checkers have vulnerabilities against adversarial attacks, allowing them to generate NSFW images. In this paper, we find that these adversarial attacks are not robust to small changes in text prompts or input latents. Based on this, we propose CROPS (Circular or RandOm Prompts for Safety), a model-agnostic framework that easily defends against adversarial attacks generating NSFW images without requiring additional training. Moreover, we develop an approach that utilizes one-step diffusion models for efficient NSFW detection (CROPS-1), further reducing computational resources. We demonstrate the superiority of our method in terms of performance and applicability.
[CV-17] JAQ: Joint Efficient Architecture Design and Low-Bit Quantization with Hardware-Software Co-Exploration AAAI2025
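【Quick read】: This paper addresses how to balance neural network architecture, quantization precision, and hardware accelerator design when deploying models on resource-constrained edge devices. The specific challenges are memory overhead on the software side, where low-precision quantization-aware training stores large intermediate features and latent weights for back-propagation, and time-consuming search on the hardware side, caused by the discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators. The JAQ framework proposes two key solutions: a channel-wise sparse quantization (CSQ) scheme that selectively quantizes the most sensitive components of the model to reduce memory overhead, and BatchTile, which uses a hardware generation network to encode all possible tiling modes and thereby speeds up the search for the optimal compiler mapping strategy. Experiments show approximately 7% higher Top-1 accuracy on ImageNet than previous methods and a hardware search time of 0.15 seconds per iteration.
Link: https://arxiv.org/abs/2501.05339
Authors: Mingzi Wang,Yuan Meng,Chen Tang,Weixiang Zhang,Yijian Qin,Yang Yao,Yingxin Li,Tongtong Feng,Xin Wang,Xun Guan,Zhi Wang,Wenwu Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025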
Abstract:The co-design of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes the three critical dimensions. However, effectively automating the design process across the vast search space of those three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifically, the primary challenges include: (1) Memory overhead on the software side: Low-precision quantization-aware training can lead to significant memory usage due to storing large intermediate features and latent weights for back-propagation, potentially causing memory exhaustion. (2) Time-consuming search on the hardware side: The discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, achieving approximately 7% higher Top-1 accuracy on ImageNet compared to previous methods and reducing the hardware search time per iteration to 0.15 seconds.
[CV-18] Comparison Study: Glacier Calving Front Delineation in Synthetic Aperture Radar Images With Deep Learning
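【Quick read】: This paper addresses how to use Deep Learning (DL) systems to automatically extract the calving front position of marine-terminating glaciers from Synthetic Aperture Radar (SAR) imagery. Variation in calving front position is an indicator of ice mass loss and a key parameter in numerical glacier models. DL systems enable large-scale, continuous monitoring that is independent of weather and illumination. The study is the first to compare DL systems on a common calving front benchmark dataset, and a multi-annotator study with ten annotators contrasts the best-performing DL system with human performance. The best DL model deviates by 221 m on average, against 38 m for the human annotators, so current DL systems do not yet match human performance. Vision Transformers, foundation models, and the inclusion and processing of additional information are identified as directions for future research toward fully automated monitoring of glacier calving fronts.
Link: https://arxiv.org/abs/2501.05281
Authors: Nora Gourmelon,Konrad Heidler,Erik Loebel,Daniel Cheng,Julian Klink,Anda Dong,Fei Wu,Noah Maul,Moritz Koch,Marcel Dreier,Dakota Pyles,Thorsten Seehaus,Matthias Braun,Andreas Maier,Vincent Christlein
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: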
Abstract:Calving front position variation of marine-terminating glaciers is an indicator of ice mass loss and a crucial parameter in numerical glacier models. Deep Learning (DL) systems can automatically extract this position from Synthetic Aperture Radar (SAR) imagery, enabling continuous, weather- and illumination-independent, large-scale monitoring. This study presents the first comparison of DL systems on a common calving front benchmark dataset. A multi-annotator study with ten annotators is performed to contrast the best-performing DL system against human performance. The best DL model’s outputs deviate 221 m on average, while the average deviation of the human annotators is 38 m. This significant difference shows that current DL systems do not yet match human performance and that further research is needed to enable fully automated monitoring of glacier calving fronts. The study of Vision Transformers, foundation models, and the inclusion and processing strategy of more information are identified as avenues for future research.
[CV-19] Solving the Catastrophic Forgetting Problem in Generalized Category Discovery CVPR2024
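【Quick read】: This paper addresses known category forgetting in Generalized Category Discovery (GCD). In GCD, a model must identify both known and novel categories in unlabeled data, but existing methods such as SimGCD tend to forget patterns of known categories while adapting to novel ones, which leads to poor performance in novel category classification. The proposed approach, LegoGCD, introduces two techniques: Local Entropy Regularization (LER) and a Dual-views Kullback-Leibler divergence constraint (DKL). LER optimizes the distribution of potential known-class samples in the unlabeled data, which preserves knowledge of known categories while novel classes are learned. DKL uses the Kullback-Leibler divergence to encourage similar prediction distributions for two views of the same image, which avoids mismatched predictions and yields more reliable potential known-class samples. Experiments show that LegoGCD mitigates known category forgetting across multiple datasets and improves classification accuracy on both known and novel classes.
Link: https://arxiv.org/abs/2501.05272
Authors: Xinzi Cao,Xiawu Zheng,Guanhong Wang,Weijiang Yu,Yunhang Shen,Ke Li,Yutong Lu,Yonghong Tian
Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; Xiamen University; Zhejiang University; Tencent Youtu Lab; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2024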
Abstract:Generalized Category Discovery (GCD) aims to identify a mix of known and novel categories within unlabeled data sets, providing a more realistic setting for image recognition. Essentially, GCD needs to remember existing patterns thoroughly to recognize novel categories. Recent state-of-the-art method SimGCD transfers the knowledge from known-class data to the learning of novel classes through debiased learning. However, some patterns are catastrophically forgotten during adaptation and thus lead to poor performance in novel categories classification. To address this issue, we propose a novel learning approach, LegoGCD, which is seamlessly integrated into previous methods to enhance the discrimination of novel classes while maintaining performance on previously encountered known classes. Specifically, we design two types of techniques termed as Local Entropy Regularization (LER) and Dual-views Kullback-Leibler divergence constraint (DKL). The LER optimizes the distribution of potential known class samples in unlabeled data, thus ensuring the preservation of knowledge related to known categories while learning novel classes. Meanwhile, DKL introduces Kullback-Leibler divergence to encourage the model to produce a similar prediction distribution of two view samples from the same image. In this way, it successfully avoids mismatched prediction and generates more reliable potential known class samples simultaneously. Extensive experiments validate that the proposed LegoGCD effectively addresses the known category forgetting issue across all datasets, e.g., delivering a 7.74% and 2.51% accuracy boost on known and novel classes in CUB, respectively. Our code is available at: this https URL.
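To make the dual-view idea concrete, here is a minimal sketch of a Kullback-Leibler term between the predictions for two views of the same image. It is an illustration under our own assumptions, not the authors' implementation: the function names, the direction of the divergence (view 1 relative to view 2), and the toy logits are all hypothetical, and the abstract does not say how the term is weighted or which samples it is applied to.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dual_view_kl(logits_view1: List[float], logits_view2: List[float]) -> float:
    """KL between the class predictions of two augmented views (assumed direction)."""
    return kl_divergence(softmax(logits_view1), softmax(logits_view2))


# Toy logits for two views of one image (made-up numbers).
agreeing = dual_view_kl([2.0, 0.5, 0.1], [2.0, 0.5, 0.1])
disagreeing = dual_view_kl([2.0, 0.5, 0.1], [0.1, 0.5, 2.0])
print(agreeing, disagreeing)
```

The term is zero when both views yield the same distribution and grows as the predictions diverge, so minimizing it pushes the model toward consistent predictions across views.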
[CV-20] CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models
【Quick read】: This paper addresses two main problems in cell segmentation for digital pathology: existing methods usually require large annotated datasets for training, and they are usually restricted to a predefined cell classification scheme. The authors propose the CellViT++ framework, which uses Vision Transformers with foundation models as encoders to compute deep cell features and segmentation masks simultaneously. The key innovation is that the framework adapts to unseen cell types with only a small amount of training data, which drastically reduces the carbon footprint. CellViT++ performs well on seven datasets covering a broad range of cell types, organs, and clinical settings, and achieves notable zero-shot segmentation and data-efficient cell-type classification. It can also use immunofluorescence stainings to generate training datasets automatically, without pathologist annotations; networks trained on these datasets outperform networks trained on manually labeled data.
Link: https://arxiv.org/abs/2501.05269
Authors: Fabian Hörst,Moritz Rempe,Helmut Becker,Lukas Heine,Julius Keyl,Jens Kleesiek
Affiliations: Institute for AI in Medicine (IKIM), University Hospital Essen (AöR), Essen, Germany; Cancer Research Center Cologne Essen (CCCE), West German Cancer Center Essen, University Hospital Essen (AöR), Essen, Germany; Department of Physics, TU Dortmund University, Dortmund, Germany; Institute of Pathology, University Hospital Essen (AöR), Essen, Germany; German Cancer Consortium (DKTK, Partner site Essen), Heidelberg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Digital Pathology is a cornerstone in the diagnosis and treatment of diseases. A key task in this field is the identification and segmentation of cells in hematoxylin and eosin-stained images. Existing methods for cell segmentation often require extensive annotated datasets for training and are limited to a predefined cell classification scheme. To overcome these limitations, we propose CellViT++, a framework for generalized cell segmentation in digital pathology. CellViT++ utilizes Vision Transformers with foundation models as encoders to compute deep cell features and segmentation masks simultaneously. To adapt to unseen cell types, we rely on a computationally efficient approach. It requires minimal data for training and leads to a drastically reduced carbon footprint. We demonstrate excellent performance on seven different datasets, covering a broad spectrum of cell types, organs, and clinical settings. The framework achieves remarkable zero-shot segmentation and data-efficient cell-type classification. Furthermore, we show that CellViT++ can leverage immunofluorescence stainings to generate training datasets without the need for pathologist annotations. The automated dataset generation approach surpasses the performance of networks trained on manually labeled data, demonstrating its effectiveness in creating high-quality training datasets without expert annotations. To advance digital pathology, CellViT++ is available as an open-source framework featuring a user-friendly, web-based interface for visualization and annotation. The code is available under this https URL.
[CV-21] Patch-GAN Transfer Learning with Reconstructive Models for Cloud Removal
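【Quick read】: This paper addresses cloud removal in remote sensing image analysis, in particular the challenge of accurately reconstructing cloud-obscured regions. The key idea is a deep transfer learning approach built on a generative adversarial network (GAN) framework that explores the potential of the masked autoencoder (MAE) image reconstruction model for cloud removal. Because remote sensing imagery is complex, the paper further proposes a patch-wise discriminator that judges whether each patch of the image is real. The approach significantly improves cloud removal performance over other GAN-based methods and achieves competitive results on available benchmarks.
Link: https://arxiv.org/abs/2501.05265
Authors: Wanli Ma,Oktay Karakus,Paul L. Rosin
Affiliations: Cardiff University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: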
Abstract:Cloud removal plays a crucial role in enhancing remote sensing image analysis, yet accurately reconstructing cloud-obscured regions remains a significant challenge. Recent advancements in generative models have made the generation of realistic images increasingly accessible, offering new opportunities for this task. Given the conceptual alignment between image generation and cloud removal tasks, generative models present a promising approach for addressing cloud removal in remote sensing. In this work, we propose a deep transfer learning approach built on a generative adversarial network (GAN) framework to explore the potential of the novel masked autoencoder (MAE) image reconstruction model in cloud removal. Due to the complexity of remote sensing imagery, we further propose using a patch-wise discriminator to determine whether each patch of the image is real or not. The proposed reconstructive transfer learning approach demonstrates significant improvements in cloud removal performance compared to other GAN-based methods. Additionally, whilst direct comparisons with some of the state-of-the-art cloud removal techniques are limited due to unclear details regarding their train/test data splits, the proposed model achieves competitive results based on available benchmarks.
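[CV-22] Towards Balanced Continual Multi-Modal Learning in Human Pose Estimation
【Quick read】: This paper addresses modality imbalance and continual learning in multi-modal 3D human pose estimation (3D HPE). RGB images are sensitive to lighting conditions and may cause user discomfort; multi-modal sensing overcomes these limitations but still faces unbalanced modality contributions and catastrophic forgetting during continual learning. The paper proposes a Shapley value-based contribution algorithm to quantify the contribution of each modality and identify modality imbalance, and applies a re-learning strategy to correct the imbalance. Because raw data is prone to noise contamination, it also proposes a denoising continual learning approach with a noise identification and separation module that reduces the negative effect of noise on optimization. Finally, an adaptive EWC (Elastic Weight Consolidation) mechanism further alleviates catastrophic forgetting. Experiments on the multi-modal dataset MM-Fi show improved 3D pose estimation accuracy and reduced catastrophic forgetting in complex scenarios.
Link: https://arxiv.org/abs/2501.05264
Authors: Jiaxuan Peng,Mengshi Qi,Dong Zhao,Huadong Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: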
Abstract:3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, RGB images are susceptible to limitations such as sensitivity to lighting conditions and potential user discomfort. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance and the imperative for continual learning. In this work, we introduce a novel balanced continual multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to quantify the contribution of each modality and identify modality imbalance. To address this imbalance, we employ a re-learning strategy. Furthermore, recognizing that raw data is prone to noise contamination, we develop a novel denoising continual learning approach. This approach incorporates a noise identification and separation module to mitigate the adverse effects of noise and collaborates with the balanced learning strategy to enhance optimization. Additionally, an adaptive EWC mechanism is employed to alleviate catastrophic forgetting. We conduct extensive experiments on the widely-adopted multi-modal dataset, MM-Fi, which demonstrate the superiority of our approach in boosting 3D pose estimation and mitigating catastrophic forgetting in complex scenarios. We will release our codes.
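As a rough illustration of what a Shapley value-based contribution score looks like, the sketch below computes exact Shapley values for a small set of modalities. It is a generic textbook computation, not the paper's algorithm: the `toy_value` function and its scores are made-up placeholders, and the abstract does not specify how the authors define the value of a modality subset.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    players: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley value of each player under the given coalition value function."""
    n = len(players)
    result: Dict[str, float] = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1, excluding p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal gain from adding p to coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical per-modality scores (NOT from the paper), plus a small
# synergy bonus when RGB and LiDAR are used together.
TOY_SCORES = {"rgb": 0.40, "lidar": 0.30, "mmwave": 0.20, "wifi": 0.10}


def toy_value(subset: FrozenSet[str]) -> float:
    score = sum(TOY_SCORES[m] for m in subset)
    if "rgb" in subset and "lidar" in subset:
        score += 0.10
    return score


phi = shapley_values(list(TOY_SCORES), toy_value)
print(phi)
```

In this toy setting the synergy bonus is split evenly between RGB and LiDAR, and the values sum to the score of the full modality set. A large gap between the values would be one way to flag modality imbalance. Exact enumeration is only practical for a handful of modalities, since the number of coalitions grows exponentially.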
[CV-23] Domain-Incremental Semantic Segmentation for Autonomous Driving under Adverse Driving Conditions ICPR
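【Quick read】: This paper addresses the performance drop of semantic segmentation for autonomous driving under adverse driving conditions. Standard models trained on data recorded under ideal conditions deteriorate markedly in unfavorable weather or illumination, while conventional fine-tuning causes catastrophic forgetting, in which learning a new task or condition overwrites previously learned information. Traditional domain adaptation methods improve target-domain performance at the expense of the source domain. To address these issues, the paper proposes an architecture-based domain-incremental learning approach called Progressive Semantic Segmentation (PSS). PSS is a task-agnostic, dynamically growing collection of domain-specific segmentation models; a collection of convolutional autoencoders infers the current domain and selects the corresponding segmentation module. The approach is evaluated extensively on several datasets and is shown to generalize to similar and unseen domains.
Link: https://arxiv.org/abs/2501.05246
Authors: Shishir Muralidhara,René Schuster,Didier Stricker
Affiliations: German Research Center for Artificial Intelligence (DFKI); RPTU - University of Kaiserslautern-Landau
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICPRAM 2025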
Abstract:Semantic segmentation for autonomous driving is an even more challenging task when faced with adverse driving conditions. Standard models trained on data recorded under ideal conditions show a deteriorated performance in unfavorable weather or illumination conditions. Fine-tuning on the new task or condition would lead to overwriting the previously learned information resulting in catastrophic forgetting. Adapting to the new conditions through traditional domain adaptation methods improves the performance on the target domain at the expense of the source domain. Addressing these issues, we propose an architecture-based domain-incremental learning approach called Progressive Semantic Segmentation (PSS). PSS is a task-agnostic, dynamically growing collection of domain-specific segmentation models. The task of inferring the domain and subsequently selecting the appropriate module for segmentation is carried out using a collection of convolutional autoencoders. We extensively evaluate our proposed approach using several datasets at varying levels of granularity in the categorization of adverse driving conditions. Furthermore, we demonstrate the generalization of the proposed approach to similar and unseen domains.
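One plausible way to infer the domain with a collection of autoencoders is to pick the domain whose autoencoder reconstructs the input with the lowest error. The sketch below illustrates that selection rule only. The rule itself is our assumption (the abstract does not state the criterion), and the "autoencoders" are hypothetical clamping functions standing in for trained convolutional models.

```python
from typing import Callable, Dict, List

Image = List[float]  # flattened pixel intensities in [0, 1]


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two equally sized images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def make_toy_autoencoder(low: float, high: float) -> Callable[[Image], Image]:
    """Hypothetical stand-in: only intensities inside [low, high] survive reconstruction."""
    def reconstruct(image: Image) -> Image:
        return [min(max(x, low), high) for x in image]
    return reconstruct


def select_domain(image: Image, autoencoders: Dict[str, Callable[[Image], Image]]) -> str:
    """Return the domain whose autoencoder gives the lowest reconstruction error."""
    errors = {name: mse(image, ae(image)) for name, ae in autoencoders.items()}
    return min(errors, key=lambda name: errors[name])


# One toy autoencoder per known domain; a new domain would add a new entry.
autoencoders = {
    "clear_day": make_toy_autoencoder(0.5, 1.0),
    "night": make_toy_autoencoder(0.0, 0.3),
}

dark_image = [0.05, 0.10, 0.20, 0.15]
print(select_domain(dark_image, autoencoders))
```

The selected key would then index into the matching domain-specific segmentation model, which is what lets the collection grow without retraining existing modules.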
[CV-24] Scaffold-SLAM: Structured 3D Gaussians for Simultaneous Localization and Photorealistic Mapping
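[Quick read]: This paper addresses the inability of existing SLAM methods based on 3D Gaussian Splatting (3DGS) to deliver high-quality novel view rendering across monocular, stereo, and RGB-D cameras at once. In particular, some methods perform well with RGB-D cameras but degrade significantly in rendering quality with monocular cameras. The paper proposes Scaffold-SLAM, which achieves high-quality photorealistic mapping across all three camera types through two key innovations. First, an Appearance-from-Motion embedding lets 3D Gaussians better model image appearance variations across camera poses. Second, a frequency regularization pyramid guides the distribution of Gaussians so that the model captures finer scene details. Experiments show that Scaffold-SLAM significantly outperforms existing methods in photorealistic mapping quality; for example, PSNR is 16.76% higher on the TUM RGB-D datasets for monocular cameras.
Link: https://arxiv.org/abs/2501.05242
Authors: Wen Tianci, Liu Zhiang, Lu Biao, Fang Yongchun
Affiliations: Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures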
Abstract:3D Gaussian Splatting (3DGS) has recently revolutionized novel view synthesis in the Simultaneous Localization and Mapping (SLAM). However, existing SLAM methods utilizing 3DGS have failed to provide high-quality novel view rendering for monocular, stereo, and RGB-D cameras simultaneously. Notably, some methods perform well for RGB-D cameras but suffer significant degradation in rendering quality for monocular cameras. In this paper, we present Scaffold-SLAM, which delivers simultaneous localization and high-quality photorealistic mapping across monocular, stereo, and RGB-D cameras. We introduce two key innovations to achieve this state-of-the-art visual quality. First, we propose Appearance-from-Motion embedding, enabling 3D Gaussians to better model image appearance variations across different camera poses. Second, we introduce a frequency regularization pyramid to guide the distribution of Gaussians, allowing the model to effectively capture finer details in the scene. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that Scaffold-SLAM significantly outperforms state-of-the-art methods in photorealistic mapping quality, e.g., PSNR is 16.76% higher in the TUM RGB-D datasets for monocular cameras.
[CV-25] Contrast-Free Myocardial Scar Segmentation in Cine MRI using Motion and Texture Fusion
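[Quick read]: This paper addresses the drawbacks of detecting myocardial scars after myocardial infarction with late gadolinium enhancement MRI (LGE MRI), which requires injecting a contrast agent that carries potential side effects and increases scanning time and patient discomfort. It proposes a novel framework that combines cardiac motion observed in cine MRI with image texture information to segment the myocardium and scar tissue in the left ventricle. The key is to formulate cardiac motion tracking as a registration problem over the full cardiac image cycle and to solve it with deep neural networks. Experiments show that the method achieves scar segmentation from non-contrast cine images with accuracy comparable to LGE MRI, demonstrating its potential as an alternative to contrast-enhanced techniques.
Link: https://arxiv.org/abs/2501.05241
Authors: Guang Yang, Jingkun Chen, Xicheng Sheng, Shan Yang, Xiahai Zhuang, Betty Raman, Lei Li, Vicente Grau
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures, 2 tables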
Abstract:Late gadolinium enhancement MRI (LGE MRI) is the gold standard for the detection of myocardial scars for post myocardial infarction (MI). LGE MRI requires the injection of a contrast agent, which carries potential side effects and increases scanning time and patient discomfort. To address these issues, we propose a novel framework that combines cardiac motion observed in cine MRI with image texture information to segment the myocardium and scar tissue in the left ventricle. Cardiac motion tracking can be formulated as a full cardiac image cycle registration problem, which can be solved via deep neural networks. Experimental results prove that the proposed method can achieve scar segmentation based on non-contrasted cine images with comparable accuracy to LGE MRI. This demonstrates its potential as an alternative to contrast-enhanced techniques for scar detection.
[CV-26] Is Your Autonomous Vehicle Safe? Understanding the Threat of Electromagnetic Signal Injection Attacks on Traffic Scene Perception AAAI2025
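[Quick read]: This paper addresses the vulnerability of camera-based perception systems in autonomous vehicles to Electromagnetic Signal Injection Attacks (ESIA). ESIA distorts the images captured by cameras, leading AI models to make incorrect decisions and threatening vehicle safety. Despite these serious implications, the impact of ESIA on the robustness of AI models across varied and complex driving scenarios is poorly understood. To close this gap, the paper develops a novel ESIA simulation method and generates a simulated attack dataset for different driving scenarios. Analyzing the performance of different models under ESIA reveals their vulnerabilities. The key contribution is a comprehensive simulation and evaluation framework aimed at more robust AI models and safer, more reliable intelligent systems.
Link: https://arxiv.org/abs/2501.05239
Authors: Wenhao Liao, Sineng Yan, Youqian Zhang, Xinwei Zhai, Yuanyuan Wang, Eugene Yujun Fu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: To appear in AAAI 2025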
Abstract:Autonomous vehicles rely on camera-based perception systems to comprehend their driving environment and make crucial decisions, thereby ensuring vehicles to steer safely. However, a significant threat known as Electromagnetic Signal Injection Attacks (ESIA) can distort the images captured by these cameras, leading to incorrect AI decisions and potentially compromising the safety of autonomous vehicles. Despite the serious implications of ESIA, there is limited understanding of its impacts on the robustness of AI models across various and complex driving scenarios. To address this gap, our research analyzes the performance of different models under ESIA, revealing their vulnerabilities to the attacks. Moreover, due to the challenges in obtaining real-world attack data, we develop a novel ESIA simulation method and generate a simulated attack dataset for different driving scenarios. Our research provides a comprehensive simulation and evaluation framework, aiming to enhance the development of more robust AI models and secure intelligent systems, ultimately contributing to the advancement of safer and more reliable technology across various fields.
[CV-27] FOCUS: Towards Universal Foreground Segmentation
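[Quick read]: This paper addresses the lack of a unified framework for foreground segmentation in computer vision. Previous work typically designs a task-specific architecture for each task and focuses on recognizing foreground objects without effectively distinguishing them from the background. The paper emphasizes the importance of the background and its relationship with the foreground, and proposes FOCUS (Foreground ObjeCts Universal Segmentation), a universal framework that handles multiple foreground segmentation tasks. The solution has two key parts: 1) a multi-scale semantic network that uses object edge information to enhance image features; 2) a novel distillation method that integrates a contrastive learning strategy to refine the predicted mask in a multi-modal feature space, yielding boundary-aware segmentation. Across 13 datasets and 5 tasks, FOCUS outperforms state-of-the-art task-specific models on most metrics.
Link: https://arxiv.org/abs/2501.05238
Authors: Zuyao You, Lingyu Kong, Lingchen Meng, Zuxuan Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: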
Abstract:Foreground segmentation is a fundamental task in computer vision, encompassing various subdivision tasks. Previous research has typically designed task-specific architectures for each task, leading to a lack of unification. Moreover, they primarily focus on recognizing foreground objects without effectively distinguishing them from the background. In this paper, we emphasize the importance of the background and its relationship with the foreground. We introduce FOCUS, the Foreground ObjeCts Universal Segmentation framework that can handle multiple foreground tasks. We develop a multi-scale semantic network using the edge information of objects to enhance image features. To achieve boundary-aware segmentation, we propose a novel distillation method, integrating the contrastive learning strategy to refine the prediction mask in multi-modal feature space. We conduct extensive experiments on a total of 13 datasets across 5 tasks, and the results demonstrate that FOCUS consistently outperforms the state-of-the-art task-specific models on most metrics.
[CV-28] Automated external cervical resorption segmentation in cone-beam CT using local texture features
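[Quick read]: This paper addresses the automatic identification and quantification of external cervical resorption (ECR) in teeth. ECR is a resorptive process that can lead to tooth loss. Cone-beam computed tomography (CBCT) is the recommended imaging modality for assessing ECR, but manually identifying and measuring resorption is time-consuming and error-prone. The paper therefore proposes an ECR lesion segmentation method based on automatic binary classification of locally extracted voxel-wise texture features. By analyzing texture features in CBCT scans, the method accurately detects subtle signal changes caused by ECR; clustering texture features within a lesion further stratifies the defects and identifies patterns indicative of calcification. These methods are important steps toward prognostic biomarkers that predict whether ECR will progress or cease, and thus inform treatment decisions.
Link: https://arxiv.org/abs/2501.05236
Authors: Sadhana Ravikumar, Asma A. Khan, Matthew C. Davis, Beatriz Paniagua
Affiliations: UT Health San Antonio, San Antonio, TX, USA; Private Practice in Endodontics, Winnetka, IL, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 3 figures, 1 table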
Abstract:External cervical resorption (ECR) is a resorptive process affecting teeth. While in some patients, active resorption ceases and gets replaced by osseous tissue, in other cases, the resorption progresses and ultimately results in tooth loss. For proper ECR assessment, cone-beam computed tomography (CBCT) is the recommended imaging modality, enabling a 3-D characterization of these lesions. While it is possible to manually identify and measure ECR resorption in CBCT scans, this process can be time intensive and highly subject to human error. Therefore, there is an urgent need to develop an automated method to identify and quantify the severity of ECR resorption using CBCT. Here, we present a method for ECR lesion segmentation that is based on automatic, binary classification of locally extracted voxel-wise texture features. We evaluate our method on 6 longitudinal CBCT datasets and show that certain texture-features can be used to accurately detect subtle CBCT signal changes due to ECR. We also present preliminary analyses clustering texture features within a lesion to stratify the defects and identify patterns indicative of calcification. These methods are important steps in developing prognostic biomarkers to predict whether ECR will continue to progress or cease, ultimately informing treatment decisions.
[CV-29] Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection
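[Quick read]: This paper addresses a weakness of zero-shot out-of-distribution (OOD) detection: existing methods improve Far-OOD detection while potentially compromising Near-OOD detection. The key to the solution is combining Large Language Models (LLMs) with Vision-Language Models (VLMs). An LLM generates superclasses of the ID labels and their background descriptions, CLIP extracts features, and the core semantic features of the ID data are isolated by subtracting the background features from the superclass features. This refined representation helps select more appropriate negative labels for OOD data from the WordNet candidate label set, improving zero-shot OOD detection in both Far-OOD and Near-OOD scenarios. The paper also introduces few-shot prompt tuning and visual prompt tuning to adapt to the target distribution and strengthen robustness. Experiments show that the method outperforms current state-of-the-art methods on multiple benchmarks, improving AUROC by up to 2.9% and reducing FPR95 by up to 12.6%.
Link: https://arxiv.org/abs/2501.05228
Authors: Pei-Kang Lee, Jun-Cheng Chen, Ja-Ling Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures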
Abstract:Out-of-distribution (OOD) detection has seen significant advancements with zero-shot approaches by leveraging the powerful Vision-Language Models (VLMs) such as CLIP. However, prior research works have predominantly focused on enhancing Far-OOD performance, while potentially compromising Near-OOD efficacy, as observed from our pilot study. To address this issue, we propose a novel strategy to enhance zero-shot OOD detection performance for both Far-OOD and Near-OOD scenarios by innovatively harnessing Large Language Models (LLMs) and VLMs. Our approach first exploits an LLM to generate superclasses of the ID labels and their corresponding background descriptions, followed by feature extraction using CLIP. We then isolate the core semantic features for ID data by subtracting background features from the superclass features. The refined representation facilitates the selection of more appropriate negative labels for OOD data from a comprehensive candidate label set of WordNet, thereby enhancing the performance of zero-shot OOD detection in both scenarios. Furthermore, we introduce novel few-shot prompt tuning and visual prompt tuning to adapt the proposed framework to better align with the target distribution. Experimental results demonstrate that the proposed approach consistently outperforms current state-of-the-art methods across multiple benchmarks, with an improvement of up to 2.9% in AUROC and a reduction of up to 12.6% in FPR95. Additionally, our method exhibits superior robustness against covariate shift across different domains, further highlighting its effectiveness in real-world scenarios.
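The feature arithmetic in this abstract can be sketched roughly as follows. Random vectors stand in for CLIP embeddings, and two parts are assumptions rather than details from the paper: negative labels are taken to be the candidates least similar to the core features, and the score is the softmax mass assigned to ID labels. All function names here are hypothetical.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def core_features(superclass_feats, background_feats):
    """Subtract background text features from superclass text features."""
    return normalize(superclass_feats - background_feats)

def select_negative_labels(core, candidate_feats, k):
    """Assumed rule: keep the k candidates least similar to any core ID feature."""
    sims = normalize(candidate_feats) @ core.T   # (n_candidates, n_superclasses)
    closest = sims.max(axis=1)                   # similarity to the nearest ID concept
    return np.argsort(closest)[:k]

def id_score(image_feat, id_feats, neg_feats, temperature=0.01):
    """Softmax mass on ID labels; low values suggest an OOD input."""
    labels = normalize(np.vstack([id_feats, neg_feats]))
    logits = normalize(image_feat) @ labels.T / temperature
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[: len(id_feats)].sum())

rng = np.random.default_rng(0)
dim = 16
id_feats = rng.normal(size=(3, dim))            # stand-ins for ID label text features
superclass_feats = rng.normal(size=(2, dim))    # stand-ins for LLM-generated superclasses
background_feats = 0.3 * rng.normal(size=(2, dim))
candidate_feats = rng.normal(size=(50, dim))    # stand-ins for WordNet candidates
core = core_features(superclass_feats, background_feats)
neg_idx = select_negative_labels(core, candidate_feats, k=5)
neg_feats = candidate_feats[neg_idx]
score_id_like = id_score(id_feats[0], id_feats, neg_feats)
score_ood_like = id_score(neg_feats[0], id_feats, neg_feats)
```

An input that resembles an ID label receives a score near 1, while an input that resembles a selected negative label receives a score near 0, so a threshold on this score separates the two cases in this toy setting.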
[CV-30] Light Transport-aware Diffusion Posterior Sampling for Single-View Reconstruction of 3D Volumes
[Quick read]: This paper tackles single-view reconstruction of volumetric fields in which multiple light scattering effects are omnipresent, such as clouds. Classic approaches such as NeRF (Neural Radiance Fields) are limited here and struggle to capture light scattering accurately. The paper models the unknown distribution of volumetric fields with an unconditional diffusion model trained on a new benchmark dataset of 1,000 synthetically simulated volumetric density fields. The key innovations are a novel diffusion-friendly monoplanar representation and a tailored parametric diffusion posterior sampling technique. A physically-based differentiable volume renderer provides gradients with respect to light transport in the latent space, which makes the reconstructions better aligned with observed data. Experiments demonstrate single-view reconstruction of volumetric clouds at a previously unattainable quality.
Link: https://arxiv.org/abs/2501.05226
Authors: Ludwic Leonard, Nils Thuerey, Ruediger Westermann
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We introduce a single-view reconstruction technique of volumetric fields in which multiple light scattering effects are omnipresent, such as in clouds. We model the unknown distribution of volumetric fields using an unconditional diffusion model trained on a novel benchmark dataset comprising 1,000 synthetically simulated volumetric density fields. The neural diffusion model is trained on the latent codes of a novel, diffusion-friendly, monoplanar representation. The generative model is used to incorporate a tailored parametric diffusion posterior sampling technique into different reconstruction tasks. A physically-based differentiable volume renderer is employed to provide gradients with respect to light transport in the latent space. This stands in contrast to classic NeRF approaches and makes the reconstructions better aligned with observed data. Through various experiments, we demonstrate single-view reconstruction of volumetric clouds at a previously unattainable quality.
[CV-31] MHAFF: Multi-Head Attention Feature Fusion of CNN and Transformer for Cattle Identification
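[Quick read]: This paper addresses the difficulty convolutional neural networks (CNNs) have in capturing long-range dependencies within complex muzzle patterns when identifying cattle from muzzle images. The common feature fusion techniques, addition and concatenation, both have drawbacks: addition fails to preserve discriminative information, concatenation increases dimensionality, and neither can discover relationships or interactions between the fused features. The paper proposes Multi-Head Attention Feature Fusion (MHAFF), which captures relations between the different types of fused features while preserving their originality. Experiments show that MHAFF outperforms existing feature fusion techniques and cattle identification methods on two publicly available cattle datasets, reaching accuracies of 99.88% and 99.52%.
Link: https://arxiv.org/abs/2501.05209
Authors: Rabin Dulal, Lihong Zheng, Muhammad Ashad Kabir
Affiliations: Charles Sturt University (CSU)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages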
Abstract:Convolutional Neural Networks (CNNs) have drawn researchers’ attention to identifying cattle using muzzle images. However, CNNs often fail to capture long-range dependencies within the complex patterns of the muzzle. The transformers handle these challenges. This inspired us to fuse the strengths of CNNs and transformers in muzzle-based cattle identification. Addition and concatenation have been the most commonly used techniques for feature fusion. However, addition fails to preserve discriminative information, while concatenation results in an increase in dimensionality. Both methods are simple operations and cannot discover the relationships or interactions between fusing features. This research aims to overcome the issues faced by addition and concatenation. This research introduces a novel approach called Multi-Head Attention Feature Fusion (MHAFF) for the first time in cattle identification. MHAFF captures relations between the different types of fusing features while preserving their originality. The experiments show that MHAFF outperformed addition and concatenation techniques and the existing cattle identification methods in accuracy on two publicly available cattle datasets. MHAFF demonstrates excellent performance and quickly converges to achieve optimum accuracy of 99.88% and 99.52% in two cattle datasets simultaneously.
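To make the contrast with addition and concatenation concrete, here is a minimal sketch of generic multi-head cross-attention used as a fusion operator. It is not the MHAFF architecture itself: the choice of CNN tokens as queries, the residual connection, the random projection matrices, and the function name are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_fusion(cnn_tokens, vit_tokens, weights, num_heads):
    """Fuse two token sets: CNN tokens query the transformer tokens."""
    n_q, dim = cnn_tokens.shape
    head_dim = dim // num_heads

    def split(x):  # (n, dim) -> (heads, n, head_dim)
        return x.reshape(x.shape[0], num_heads, head_dim).transpose(1, 0, 2)

    q = split(cnn_tokens @ weights["q"])
    k = split(vit_tokens @ weights["k"])
    v = split(vit_tokens @ weights["v"])
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))  # (heads, n_q, n_kv)
    mixed = (attn @ v).transpose(1, 0, 2).reshape(n_q, dim)
    # Residual connection keeps the original CNN features in the output.
    return cnn_tokens + mixed @ weights["o"], attn

rng = np.random.default_rng(0)
dim, num_heads = 8, 2
weights = {name: rng.normal(scale=0.1, size=(dim, dim)) for name in ("q", "k", "v", "o")}
cnn_tokens = rng.normal(size=(4, dim))   # hypothetical CNN feature tokens
vit_tokens = rng.normal(size=(6, dim))   # hypothetical transformer feature tokens
fused, attn = multi_head_attention_fusion(cnn_tokens, vit_tokens, weights, num_heads)
```

Unlike concatenation, the fused output keeps the input dimensionality, and the attention weights expose which transformer tokens each CNN token draws on.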
[CV-32] Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
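[Quick read]: This paper asks whether a computational model that imitates the infant learning process can develop broad visual concepts that extend beyond the vocabulary it has heard, as infants naturally do. The key is a training-free framework that discovers visual concept neurons hidden in the model's internal representations. These neurons can classify objects outside the model's original vocabulary. The paper also compares the visual representations of infant-like models with those of modern computer vision models such as CLIP or ImageNet pre-trained models, highlighting key similarities and differences. By analyzing the internal representations of a computational model trained on an infant's visual and linguistic inputs, the work bridges cognitive science and computer vision.
Link: https://arxiv.org/abs/2501.05205
Authors: Xueyi Ke, Satoshi Tsutsui, Yayun Zhang, Bihan Wen
Affiliations: Nanyang Technological University; The Max Planck Institute for Psycholinguistics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 11 figures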
Abstract:Infants develop complex visual understanding rapidly, even preceding the acquisition of linguistic inputs. As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a recently published model in Science by Vong et al., which is trained on longitudinal, egocentric images of a single child paired with transcribed parental speech. We introduce a training-free framework that can discover visual concept neurons hidden in the model’s internal representations. Our findings show that these neurons can classify objects outside its original vocabulary. Furthermore, we compare the visual representations in infant-like models with those in modern computer vision models, such as CLIP or ImageNet pre-trained models, highlighting key similarities and differences. Ultimately, our work bridges cognitive science and computer vision by analyzing the internal representations of a computational model trained on an infant’s visual and linguistic inputs.
[CV-33] HipyrNet: Hypernet-Guided Feature Pyramid network for mixed-exposure correction
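[Quick read]: This paper addresses the challenge that extreme exposure variations pose for mixed-exposure image enhancement. Because of the inherent complexity and contrast inconsistencies across image regions, existing methods handle these variations poorly. The paper proposes HipyrNet, which integrates a HyperNetwork into a Laplacian Pyramid-based framework. The key innovation is using the HyperNetwork to generate weights dynamically so the model adapts to different exposure variations. Specifically, the HyperNetwork predicts optimal kernels for feature pyramid decomposition, giving each input image a tailored, adaptive decomposition. Through multiscale decomposition and reconstruction combined with dynamic kernel prediction, HipyrNet captures and manipulates features across scales. Experiments show that HipyrNet outperforms existing methods in both qualitative and quantitative evaluations, particularly under extreme exposure variations, setting a new benchmark for mixed-exposure image enhancement.
Link: https://arxiv.org/abs/2501.05195
Authors: Shaurya Singh Rathore, Aravind Shenoy, Krish Didwania, Aditya Kasliwal, Ujjwal Verma
Affiliations: Manipal Institute of Technology, MAHE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: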
Abstract:Recent advancements in image translation for enhancing mixed-exposure images have demonstrated the transformative potential of deep learning algorithms. However, addressing extreme exposure variations in images remains a significant challenge due to the inherent complexity and contrast inconsistencies across regions. Current methods often struggle to adapt effectively to these variations, resulting in suboptimal performance. In this work, we propose HipyrNet, a novel approach that integrates a HyperNetwork within a Laplacian Pyramid-based framework to tackle the challenges of mixed-exposure image enhancement. The inclusion of a HyperNetwork allows the model to adapt to these exposure variations. A HyperNetwork dynamically generates weights for another network, allowing dynamic changes during deployment. In our model, the HyperNetwork predicts optimal kernels for Feature Pyramid decomposition, which enables a tailored and adaptive decomposition process for each input image. Our enhanced translational network incorporates multiscale decomposition and reconstruction, leveraging dynamic kernel prediction to capture and manipulate features across varying scales. Extensive experiments demonstrate that HipyrNet outperforms existing methods, particularly in scenarios with extreme exposure variations, achieving superior results in both qualitative and quantitative evaluations. Our approach sets a new benchmark for mixed-exposure image enhancement, paving the way for future research in adaptive image translation.
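For readers unfamiliar with the Laplacian pyramid that HipyrNet builds on, the following is a minimal sketch of multiscale decomposition and exact reconstruction. It uses a fixed 2x2 average filter and nearest-neighbour upsampling as stand-ins; in the paper the decomposition kernels are predicted per image by a HyperNetwork, which this sketch does not attempt to reproduce.

```python
import numpy as np

def downsample(img):
    """2x2 average pooling (image sides must be even)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbour 2x upsampling."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian_pyramid(img, levels):
    pyramid, current = [], img
    for _ in range(levels):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller))  # high-frequency residual
        current = smaller
    pyramid.append(current)                          # low-frequency base
    return pyramid

def reconstruct(pyramid):
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = upsample(current) + residual
    return current

rng = np.random.default_rng(0)
image = rng.random((16, 16))            # hypothetical single-channel image
pyramid = build_laplacian_pyramid(image, levels=3)
restored = reconstruct(pyramid)
```

Because each level stores the residual against its own upsampled coarser version, the reconstruction is exact for any choice of filter, which is what makes it safe to swap in learned or predicted kernels.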
[CV-34] Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration
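[Quick read]: This paper addresses the inference efficiency of Multimodal Large Language Models (MLLMs) when processing high-resolution images. As multimodal contexts grow longer, computational complexity grows quadratically and inference efficiency drops sharply. The paper proposes GlobalCom^2, a novel token compression method tailored to high-resolution MLLMs that receive both a thumbnail and multiple crops. The key idea is to treat the tokens derived from the thumbnail as the "commander" of the whole compression process, directing the retention ratio and the specific compression applied to each crop. Redundant tokens are thereby eliminated while important local details are adaptively preserved as far as possible. Experiments across 10 benchmarks show that GlobalCom^2 achieves the best balance between performance and efficiency and outperforms state-of-the-art token compression methods.
Link: https://arxiv.org/abs/2501.05179
Authors: Xuyang Liu, Ziming Wang, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Bo Zheng, Linfeng Zhang, Siteng Huang, Honggang Chen
Affiliations: Sichuan University; Alibaba; Northeast Forestry University; Shanghai Jiao Tong University; Zhejiang University; Yunnan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our code is released at this https URL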
Abstract:Multimodal large language models (MLLMs) have attracted considerable attention due to their exceptional performance in visual content understanding and reasoning. However, their inference efficiency has been a notable concern, as the increasing length of multimodal contexts leads to quadratic complexity. Token compression techniques, which reduce the number of visual tokens, have demonstrated their effectiveness in reducing computational costs. Yet, these approaches have struggled to keep pace with the rapid advancements in MLLMs, especially the AnyRes strategy in the context of high-resolution image understanding. In this paper, we propose a novel token compression method, GlobalCom^2, tailored for high-resolution MLLMs that receive both the thumbnail and multiple crops. GlobalCom^2 treats the tokens derived from the thumbnail as the “commander” of the entire token compression process, directing the allocation of retention ratios and the specific compression for each crop. In this way, redundant tokens are eliminated while important local details are adaptively preserved to the highest extent feasible. Empirical results across 10 benchmarks reveal that GlobalCom^2 achieves an optimal balance between performance and efficiency, and consistently outperforms state-of-the-art token compression methods with LLaVA-NeXT-7B/13B models. Our code is released at this https URL.
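The idea of letting a global signal decide per-crop retention ratios can be sketched as below. The proportional allocation rule, the top-k token selection, and the function names are assumptions made for illustration; in the paper the importance signal comes from the thumbnail tokens, whereas the numbers here are hypothetical.

```python
import numpy as np

def allocate_retention(crop_importance, overall_ratio):
    """Split a global token budget across crops in proportion to importance."""
    weights = crop_importance / crop_importance.sum()
    ratios = overall_ratio * len(crop_importance) * weights
    # Clipping keeps ratios valid; it can leave part of the budget unused.
    return np.clip(ratios, 0.0, 1.0)

def compress_crop(tokens, scores, ratio):
    """Keep the highest-scoring tokens of one crop, preserving their order."""
    keep = max(1, int(round(ratio * len(tokens))))
    idx = np.sort(np.argsort(scores)[-keep:])
    return tokens[idx]

rng = np.random.default_rng(0)
crop_importance = np.array([0.1, 0.4, 0.3, 0.2])   # hypothetical thumbnail-derived scores
ratios = allocate_retention(crop_importance, overall_ratio=0.25)
crops = [rng.normal(size=(16, 8)) for _ in crop_importance]   # 16 tokens per crop
scores = [rng.random(16) for _ in crop_importance]            # per-token importance
kept = [compress_crop(t, s, r) for t, s, r in zip(crops, scores, ratios)]
kept_counts = [len(k) for k in kept]
```

With an overall ratio of 0.25 over 64 tokens, the crops keep 2, 6, 5, and 3 tokens respectively, so the more important crops retain more detail under the same total budget.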
[CV-35] FaceMe: Robust Blind Face Restoration with Personal Identification AAAI2025
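[Quick read]: This paper addresses identity consistency in blind face restoration. Existing methods produce high-quality images but often fail to faithfully preserve the individual's identity. The paper proposes FaceMe, a personalized face restoration method based on a diffusion model. The key is an identity encoder that extracts identity-related features from one or a few reference images; these features serve as prompts that guide the diffusion model toward high-quality, identity-consistent facial images. Combining identity-related features minimizes the influence of identity-irrelevant features during training and supports any number of reference images at inference. Thanks to the robustness of the identity encoder, synthesized images can serve as reference images during training, and changing identity at inference requires no fine-tuning. The paper also proposes a pipeline for constructing a training pool of reference images that simulates poses and expressions likely to appear in real-world scenarios. Experiments show that FaceMe restores high-quality facial images while maintaining identity consistency, with excellent performance and robustness.
Link: https://arxiv.org/abs/2501.05177
Authors: Siyu Liu, Zheng-Peng Duan, Jia OuYang, Jiayi Fu, Hyunhee Park, Zikun Liu, Chun-Le Guo, Chongyi Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at AAAI 2025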
Abstract:Blind face restoration is a highly ill-posed problem due to the lack of necessary context. Although existing methods produce high-quality outputs, they often fail to faithfully preserve the individual’s identity. In this paper, we propose a personalized face restoration method, FaceMe, based on a diffusion model. Given a single or a few reference images, we use an identity encoder to extract identity-related features, which serve as prompts to guide the diffusion model in restoring high-quality and identity-consistent facial images. By simply combining identity-related features, we effectively minimize the impact of identity-irrelevant features during training and support any number of reference image inputs during inference. Additionally, thanks to the robustness of the identity encoder, synthesized images can be used as reference images during training, and identity changing during inference does not require fine-tuning the model. We also propose a pipeline for constructing a reference image training pool that simulates the poses and expressions that may appear in real-world scenarios. Experimental results demonstrate that our FaceMe can restore high-quality facial images while maintaining identity consistency, achieving excellent performance and robustness.
[CV-36] A Systematic Literature Review on Deep Learning-based Depth Estimation in Computer Vision
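[Quick read]: This paper presents a systematic literature review (SLR) of depth estimation (DE) to fill a gap in the existing literature, which lacks a comprehensive review of DE techniques. Previous reviews focus mainly on either monocular or stereo-based techniques rather than the field as a whole. The authors searched electronic databases, which yielded 1,284 publications, and applied exclusion and quality criteria to select 59 high-quality primary studies for in-depth analysis.
The key to the approach is using the SLR methodology to survey recent progress in DE, especially deep learning (DL)-based methods. The study identifies three main types of DE (monocular, stereo, and multi-view) and reports on the use of 20 public datasets, 29 evaluation metrics, and 35 base models. KITTI, NYU Depth V2, and Make 3D are the most used datasets, and ResNet-50, ResNet-18, ResNet-101, U-Net, and VGG-16 are the most used base models. The study also identifies the main challenges facing the field, notably the lack of ground truth data.
Link: https://arxiv.org/abs/2501.05147
Authors: Ali Rohan, Md Junayed Hasan, Andrei Petrovski
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: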
Abstract:Depth estimation (DE) provides spatial information about a scene and enables tasks such as 3D reconstruction, object detection, and scene understanding. Recently, there has been an increasing interest in using deep learning (DL)-based methods for DE. Traditional techniques rely on handcrafted features that often struggle to generalise to diverse scenes and require extensive manual tuning. However, DL models for DE can automatically extract relevant features from input data, adapt to various scene conditions, and generalise well to unseen environments. Numerous DL-based methods have been developed, making it necessary to survey and synthesize the state-of-the-art (SOTA). Previous reviews on DE have mainly focused on either monocular or stereo-based techniques, rather than comprehensively reviewing DE. Furthermore, to the best of our knowledge, there is no systematic literature review (SLR) that comprehensively focuses on DE. Therefore, this SLR study is being conducted. Initially, electronic databases were searched for relevant publications, resulting in 1284 publications. Using defined exclusion and quality criteria, 128 publications were shortlisted and further filtered to select 59 high-quality primary studies. These studies were analysed to extract data and answer defined research questions. Based on the results, DL methods were developed for mainly three different types of DE: monocular, stereo, and multi-view. 20 publicly available datasets were used to train, test, and evaluate DL models for DE, with KITTI, NYU Depth V2, and Make 3D being the most used datasets. 29 evaluation metrics were used to assess the performance of DE. 35 base models were reported in the primary studies, and the top five most-used base models were ResNet-50, ResNet-18, ResNet-101, U-Net, and VGG-16. Finally, the lack of ground truth data was among the most significant challenges reported by primary studies.
[CV-37] CorrDiff: Adaptive Delay-aware Detector with Temporal Cue Inputs for Real-time Object Detection
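[Quick read]: This paper addresses delay in real-time object detection systems, particularly in applications such as collision avoidance and path planning for autonomous driving. It proposes CorrDiff, a novel real-time streaming perception method whose core contribution is an adaptive delay-aware detector. The detector uses runtime-estimated temporal cues to predict object locations for multiple future frames and selectively produces the predictions that match real-world time, compensating for communication and computational delays. By leveraging motion estimation and feature enhancement, CorrDiff outperforms current state-of-the-art methods both in single-frame detection (current or next frame), measured by mAP, and in multi-frame prediction, measured by sAP (a metric for object detection in streaming scenarios that accounts for both latency and accuracy). The model performs robustly on devices ranging from the powerful Tesla V100 to the more modest RTX 2080Ti and meets stringent real-time requirements, with the potential to significantly improve the safety and reliability of real-world systems such as autonomous driving.
Link: https://arxiv.org/abs/2501.05132
Authors: Xiang Zhang, Chenchen Fu, Yufei Cui, Lan Yi, Yuyang Sun, Weiwei Wu, Xue Liu
Affiliations: School of Computer Science, McGill University, Montreal, Quebec, Canada; School of Computer Science and Engineering, Southeast University, Nanjing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE JSAC Special Issue: Intelligent Communications for Real-Time Computer Vision (Comm4CV)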
Abstract:Real-time object detection plays an essential part in the decision-making process of numerous real-world applications, including collision avoidance and path planning in autonomous driving systems. This paper presents a novel real-time streaming perception method named CorrDiff, designed to tackle the challenge of delays in real-time detection systems. The main contribution of CorrDiff lies in its adaptive delay-aware detector, which is able to utilize runtime-estimated temporal cues to predict objects’ locations for multiple future frames, and selectively produce predictions that match real-world time, effectively compensating for any communication and computational delays. The proposed model outperforms current state-of-the-art methods by leveraging motion estimation and feature enhancement, both for 1) single-frame detection for the current frame or the next frame, in terms of the metric mAP, and 2) the prediction for (multiple) future frame(s), in terms of the metric sAP (the sAP metric evaluates object detection algorithms in streaming scenarios, factoring in both latency and accuracy). It demonstrates robust performance across a range of devices, from powerful Tesla V100 to modest RTX 2080Ti, achieving the highest level of perceptual accuracy on all platforms. Unlike most state-of-the-art methods that struggle to complete computation within a single frame on less powerful devices, CorrDiff meets the stringent real-time processing requirements on all kinds of devices. The experimental results emphasize the system’s adaptability and its potential to significantly improve the safety and reliability for many real-world systems, such as autonomous driving. Our code is completely open-sourced and is available at this https URL.
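The step of selectively producing the prediction that matches real-world time can be illustrated with a small sketch. The rule used here (round the estimated delay up to whole frames and clamp to the available horizon) is an assumption, not a detail from the paper, and select_prediction along with the placeholder prediction labels is hypothetical.

```python
import math

def select_prediction(future_predictions, delay_s, frame_interval_s):
    """Pick the future-frame prediction that matches the moment the result is emitted.

    future_predictions[0] is for the current frame, [1] for the next frame, and so on.
    """
    # Number of whole frames that will have elapsed once processing finishes.
    steps_ahead = math.ceil(delay_s / frame_interval_s)
    # Clamp to the horizon the detector actually predicted.
    steps_ahead = min(max(steps_ahead, 0), len(future_predictions) - 1)
    return steps_ahead, future_predictions[steps_ahead]

frame_interval = 1 / 30   # 30 FPS stream
predictions = ["boxes@t", "boxes@t+1", "boxes@t+2", "boxes@t+3"]   # placeholders
fast_device = select_prediction(predictions, delay_s=0.02, frame_interval_s=frame_interval)
slow_device = select_prediction(predictions, delay_s=0.05, frame_interval_s=frame_interval)
overloaded = select_prediction(predictions, delay_s=1.0, frame_interval_s=frame_interval)
```

A faster device is served the next-frame prediction, while a slower one is served a prediction further ahead, which is how a single model can adapt to hardware with different latencies.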
[CV-38] 3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
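[Quick read]: This paper addresses the resource cost in multi-instance generation (MIG) caused by retraining an adapter every time a new model is released. Current adapter-based methods support user-defined instance layouts and attributes, but each more advanced model requires a newly trained adapter, which adds compute and time overhead. Depth-Driven Decoupled Instance Synthesis (3DIS), introduced in earlier work, decouples MIG into two independent phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. 3DIS needs adapter training only in the scene construction phase, while detail rendering is training-free, which significantly reduces resource consumption. The paper extends this framework with 3DIS-FLUX, which integrates the FLUX model (specifically FLUX.1-Depth-dev) for stronger rendering and introduces a layout-based detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism to render the fine-grained attributes of each instance. Experiments show that 3DIS-FLUX outperforms both the original 3DIS method (based on SD2 and SDXL) and current state-of-the-art adapter-based methods in performance and image quality.
Link: https://arxiv.org/abs/2501.05131
Authors: Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang
Affiliations: RELER, CCAI, Zhejiang University; DBMI, HMS, Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: tech report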
Abstract:The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX’s Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. Project Page: this https URL.
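One way to picture a layout-driven attention mask over joint text and image tokens is sketched below. The masking rules are assumptions chosen for illustration (image tokens see all image tokens, and each instance's text tokens and box region see only each other); the abstract does not specify the exact rules used in 3DIS-FLUX, and layout_attention_mask is a hypothetical helper, not part of any FLUX API.

```python
import numpy as np

def layout_attention_mask(grid_h, grid_w, boxes, text_spans):
    """Boolean joint-attention mask over [text tokens | image tokens].

    boxes: (x0, y0, x1, y1) per instance in latent-grid cells, end-exclusive.
    text_spans: (start, end) token range of each instance's description.
    """
    n_text = max(end for _, end in text_spans)
    n_img = grid_h * grid_w
    mask = np.zeros((n_text + n_img, n_text + n_img), dtype=bool)
    mask[n_text:, n_text:] = True                 # image tokens see all image tokens
    for (x0, y0, x1, y1), (start, end) in zip(boxes, text_spans):
        inside = np.zeros((grid_h, grid_w), dtype=bool)
        inside[y0:y1, x0:x1] = True
        img_idx = n_text + np.flatnonzero(inside)
        txt_idx = np.arange(start, end)
        mask[start:end, start:end] = True         # an instance's text attends to itself
        mask[np.ix_(img_idx, txt_idx)] = True     # box region -> its own text
        mask[np.ix_(txt_idx, img_idx)] = True     # text -> its own box region
    return mask

# Two hypothetical instances on a 4x4 latent grid.
mask = layout_attention_mask(
    grid_h=4, grid_w=4,
    boxes=[(0, 0, 2, 2), (2, 2, 4, 4)],
    text_spans=[(0, 3), (3, 5)],
)
```

Under these rules the description of one instance cannot leak attributes into another instance's region, which is the kind of per-instance control the detail renderer aims for.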
[CV-39] Optimizing Multitask Industrial Processes with Predictive Action Guidance
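Quick read: This paper addresses the difficulty of accurate task anticipation and guidance in complex assembly processes, caused by variability in human actions and subjective task preferences. It proposes the Multi-Modal Transformer Fusion and Recurrent Units (MMTFRU) network, which uses multimodal fusion to improve prediction accuracy. The network is integrated with an Operator Action Monitoring Unit (OAMU) that provides proactive operator guidance and prevents deviations during assembly. OAMU uses two strategies: (1) Top-5 MMTF-RU predictions combined with a reference graph and an action dictionary for next-step recommendations; and (2) Top-1 MMTF-RU predictions combined with a reference graph to detect sequence deviations and predict anomaly scores through an entropy-informed confidence mechanism. The paper also introduces Time-Weighted Sequence Accuracy (TWSA) to evaluate operator efficiency and ensure timely task completion. The approach is validated on the industrial Meccano dataset and the large-scale EPIC-Kitchens-55 dataset.
Link: https://arxiv.org/abs/2501.05108
Authors: Naval Kishore Mehta, Arvind, Shyam Sunder Prasad, Sumeet Saurav, Sanjay Singh
Affiliations: CSIR-Central Electronics Engineering Research Institute (CSIR-CEERI), India; Academy of Scientific and Innovative Research (AcSIR), India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: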
Abstract:Monitoring complex assembly processes is critical for maintaining productivity and ensuring compliance with assembly standards. However, variability in human actions and subjective task preferences complicate accurate task anticipation and guidance. To address these challenges, we introduce the Multi-Modal Transformer Fusion and Recurrent Units (MMTFRU) Network for egocentric activity anticipation, utilizing multimodal fusion to improve prediction accuracy. Integrated with the Operator Action Monitoring Unit (OAMU), the system provides proactive operator guidance, preventing deviations in the assembly process. OAMU employs two strategies: (1) Top-5 MMTF-RU predictions, combined with a reference graph and an action dictionary, for next-step recommendations; and (2) Top-1 MMTF-RU predictions, integrated with a reference graph, for detecting sequence deviations and predicting anomaly scores via an entropy-informed confidence mechanism. We also introduce Time-Weighted Sequence Accuracy (TWSA) to evaluate operator efficiency and ensure timely task completion. Our approach is validated on the industrial Meccano dataset and the large-scale EPIC-Kitchens-55 dataset, demonstrating its effectiveness in dynamic environments.
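The second OAMU strategy pairs a Top-1 prediction with a reference graph and an entropy-informed confidence. The abstract does not define the score, so the sketch below is only one way such a mechanism could look: confidence is taken as one minus the normalized entropy of the predicted distribution, and a deviation from the reference graph is scored by that confidence. The names `normalized_entropy`, `anomaly_score`, and the dictionary form of the reference graph are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(K)."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def anomaly_score(probs, actions, current_step, reference_graph):
    """Score a Top-1 prediction against the allowed next steps.

    reference_graph: dict mapping a step to the set of valid next steps
    (an assumed representation). A confident prediction that leaves the
    graph scores high; a prediction that stays on the graph scores 0.
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    confidence = 1.0 - normalized_entropy(probs)
    deviates = actions[top] not in reference_graph.get(current_step, set())
    return (confidence if deviates else 0.0), actions[top]
```

Under this reading, a uniform prediction carries zero confidence and therefore never raises a strong alarm, while a peaked prediction of an out-of-sequence action does.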
[CV-40] Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset NEURIPS2023
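Quick read: This paper addresses limitations of existing motion datasets for whole-body motion: they mostly lack facial expressions, hand gestures, and fine-grained pose descriptions, and are typically confined to lab settings with manually labeled text descriptions, which limits scalability. The authors develop a scalable annotation pipeline that automatically captures 3D whole-body human motion and comprehensive text labels from RGB videos, and use it to build the Motion-X dataset of 81.1K text-motion pairs. They then extend Motion-X into Motion-X++ by improving the annotation pipeline, adding data modalities, and scaling up the data. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences, 80.8K RGB videos, 45.3K audios, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. The key elements are the automated annotation pipeline and the integration of multimodal data, which support downstream tasks such as text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoint estimation.
Link: https://arxiv.org/abs/2501.05098
Authors: Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang
Affiliations: International Digital Economy Academy (IDEA); Shenzhen International Graduate School, Tsinghua University; Department of Computer Vision and Robotics, IDEA; Whiting School of Engineering, Johns Hopkins University; School of Data Science, The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 14 figures, This work extends and enhances the research published in the NeurIPS 2023 paper, “Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset”. arXiv admin note: substantial text overlap with arXiv:2307.00818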
Abstract:In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions, and are typically limited to lab settings with manually labeled text descriptions, thereby restricting their scalability. To address this issue, we develop a scalable annotation pipeline that can automatically capture 3D whole-body human motion and comprehensive textual labels from RGB videos and build the Motion-X dataset comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data quantities. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from massive scenes, 80.8K RGB videos, 45.3K audios, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++'s significant benefits for generating expressive, precise, and natural motion with paired multimodal labels supporting several downstream tasks, including text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoints estimation, etc.
[CV-41] A 1Mb mixed-precision quantized encoder for image classification and patch-based compression
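Quick read: This paper addresses the limited applicability of application-specific integrated circuits (ASICs) for image processing at the edge, in particular how to support multiple tasks (image classification and compression) with very limited hardware. The key component is a reconfigurable mixed-precision (3b/2b/1b) encoder that combines weight and activation quantization with structural pruning of convolutional layers to lower hardware constraints (memory and computing). The paper introduces automatic adaptation of linear symmetric quantizer scaling factors to equalize quantized levels, which stabilizes the training of quinary and ternary weights. A proposed layer-shared Bit-Shift Normalization significantly simplifies the hardware-expensive Batch Normalization. In a configuration where the encoder requires only 1Mb, classification accuracy reaches 87.5% on CIFAR-10, and the encoder enables end-to-end image compression almost without block artifacts.
Link: https://arxiv.org/abs/2501.05097
Authors: Van Thien Nguyen, William Guicquero, Gilles Sicard
Affiliations: CEA-LETI, University Grenoble Alpes
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published at IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)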
Abstract:Even if Application-Specific Integrated Circuits (ASIC) have proven to be a relevant choice for integrating inference at the edge, they are often limited in terms of applicability. In this paper, we demonstrate that an ASIC neural network accelerator dedicated to image processing can be applied to multiple tasks of different levels: image classification and compression, while requiring very limited hardware. The key component is a reconfigurable, mixed-precision (3b/2b/1b) encoder that takes advantage of proper weight and activation quantizations combined with convolutional layer structural pruning to lower hardware-related constraints (memory and computing). We introduce an automatic adaptation of linear symmetric quantizer scaling factors to perform quantized levels equalization, aiming at stabilizing quinary and ternary weights training. In addition, a proposed layer-shared Bit-Shift Normalization significantly simplifies the implementation of the hardware-expensive Batch Normalization. For a specific configuration in which the encoder design only requires 1Mb, the classification accuracy reaches 87.5% on CIFAR-10. Besides, we also show that this quantized encoder can be used to compress images patch-by-patch while the reconstruction can be performed remotely, by a dedicated full-frame decoder. This solution typically enables an end-to-end compression almost without any block artifacts, outperforming patch-based state-of-the-art techniques employing a patch-constant bitrate.
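To make "linear symmetric quantizer" with ternary and quinary weights concrete, the sketch below maps real weights to integer codes in {-1, 0, 1} or {-2, ..., 2} for a given scaling factor, and counts how many weights land on each level. The level count is the quantity that an equalization of quantized levels would try to balance; the paper's automatic scale adaptation itself is not reproduced here. The function names and the round-half-up convention are assumptions of this sketch.

```python
import math
from collections import Counter


def quantize_symmetric(weights, scale, n_levels):
    """Map weights to integer codes on a symmetric grid of n_levels values.

    n_levels=3 gives ternary codes {-1, 0, 1}; n_levels=5 gives quinary
    codes {-2, ..., 2}. Rounding is half-up (an assumption), then clipped.
    """
    half = (n_levels - 1) // 2
    codes = []
    for w in weights:
        q = math.floor(w / scale + 0.5)
        codes.append(max(-half, min(half, q)))
    return codes


def dequantize(codes, scale):
    """Recover the quantized real values from integer codes."""
    return [c * scale for c in codes]


def level_histogram(codes):
    """Count how many weights fall on each quantized level."""
    return Counter(codes)
```

A small scale pushes most weights to the outer levels and a large scale collapses them onto zero, so the histogram gives a quick check of how evenly a chosen scale spreads the weights.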
[CV-42] Advancing ALS Applications with Large-Scale Pre-training: Dataset Development and Downstream Assessment
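Quick read: This paper addresses the underexplored use of the pre-training and fine-tuning paradigm in airborne laser scanning (ALS), an important technology for applications such as forest management and urban planning. It fills this gap by constructing a large-scale ALS point cloud dataset and evaluating its impact on downstream applications.
The key points of the solution are: 1) a large-scale ALS point cloud dataset covering the contiguous United States, sourced from the United States Geological Survey's 3D Elevation Program; 2) a geospatial sampling method based on land cover maps and digital elevation models (DEMs) that keeps data collection efficient and diverse; 3) BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point clouds, as the baseline self-supervised model, pre-trained on the constructed dataset; 4) fine-tuning of the pre-trained models for downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. The pre-trained models significantly outperform models trained from scratch on all downstream tasks, which demonstrates the transferability of the learned representations. Scaling the dataset with the geospatial sampling method consistently improves performance, whereas datasets built with random sampling do not yield similar gains.
Link: https://arxiv.org/abs/2501.05095
Authors: Haoyi Xiu, Xin Liu, Taehoon Kim, Kyoung-Sook Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: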
备注:
Abstract:The pre-training and fine-tuning paradigm has revolutionized satellite remote sensing applications. However, this approach remains largely underexplored for airborne laser scanning (ALS), an important technology for applications such as forest management and urban planning. In this study, we address this gap by constructing a large-scale ALS point cloud dataset and evaluating its impact on downstream applications. Our dataset comprises ALS point clouds collected across the contiguous United States, provided by the United States Geological Survey’s 3D Elevation Program. To ensure efficient data collection while capturing diverse land cover and terrain types, we introduce a geospatial sampling method that selects point cloud tiles based on land cover maps and digital elevation models. As a baseline self-supervised learning model, we adopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point clouds, and pre-train it on the constructed dataset. The pre-trained models are subsequently fine-tuned for downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. Our results show that the pre-trained models significantly outperform their scratch counterparts across all downstream tasks, demonstrating the transferability of the representations learned from the proposed dataset. Furthermore, we observe that scaling the dataset using our geospatial sampling method consistently enhances performance, whereas pre-training on datasets constructed with random sampling fails to achieve similar improvements. These findings highlight the utility of the constructed dataset and the effectiveness of our sampling strategy in the pre-training and fine-tuning paradigm. The source code and pre-trained models will be made publicly available at this https URL.
[CV-43] ResPanDiff: Diffusion Model with Disentangled Modulations for Image Fusion
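Quick read: This paper addresses the slow inference of diffusion-based pansharpening. Diffusion models need many sampling steps, and existing acceleration techniques often sacrifice performance when fusing multi-source images. The paper proposes ResPanDiff (Diffusion Model for Pansharpening by Inferring Residual Inference). Its key innovation is a Markov chain that transits from noisy residuals to the residuals between the low-resolution multispectral (LRMS) and high-resolution multispectral (HRMS) images, which reduces the number of sampling steps and improves performance. The model also designs a latent space to extract more features at the encoding stage, uses Shallow Cond-Injection (SC-I) to obtain condition-injected hidden features with higher dimensions, and uses loss functions that better guide residual generation. Experiments show that the method needs only 15 sampling steps, over 90% fewer than benchmark diffusion models, while outperforming recent state-of-the-art techniques.
Link: https://arxiv.org/abs/2501.05091
Authors: Shiqi Cao, Liangjian Deng, Shangqi Deng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: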
Abstract:The implementation of diffusion-based pansharpening is predominantly constrained by its slow inference speed, which results from numerous sampling steps. Although existing techniques aim to accelerate sampling, they often compromise performance when fusing multi-source images. To ease this limitation, we introduce a novel and efficient diffusion model named Diffusion Model for Pansharpening by Inferring Residual Inference (ResPanDiff), which significantly reduces the number of diffusion steps without sacrificing performance on the pansharpening task. In ResPanDiff, we innovatively propose a Markov chain that transits from noisy residuals to the residuals between the LRMS and HRMS images, thereby reducing the number of sampling steps and enhancing performance. Additionally, we design the latent space to help the model extract more features at the encoding stage, Shallow Cond-Injection (SC-I) to help the model fetch cond-injected hidden features with higher dimensions, and loss functions to give better guidance for the residual generation task, enabling the model to achieve superior performance in residual generation. Furthermore, experimental evaluations on pansharpening datasets demonstrate that the proposed method achieves superior outcomes compared to recent state-of-the-art (SOTA) techniques, requiring only 15 sampling steps, a reduction of over 90% in steps compared with the benchmark diffusion models. Our experiments also include thorough discussions and ablation studies to underscore the effectiveness of our approach.
[CV-44] End-to-End Deep Learning for Interior Tomography with Low-Dose X-ray CT
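Quick read: This paper addresses image artifacts in X-ray computed tomography (CT) that arise from dose-reducing scanning strategies such as sparse-view CT, low-dose CT, and region-of-interest (ROI) CT, also called interior tomography. Low-dose settings reduce radiation but introduce image noise, and truncated projections cause severe cupping artifacts. Image-domain deep learning (DL) methods remove each artifact well in isolation but are of limited use for coupled artifacts.
The key solution decouples the coupled problem into two sub-problems: (i) image-domain noise reduction inside the truncated projection, to solve the low-dose CT problem; and (ii) extrapolation of the projection outside the truncated projection, to solve the ROI CT problem. Both are solved directly by a novel end-to-end learning method using dual-domain convolutional neural networks (CNNs). Experiments show that the method outperforms conventional image-domain deep learning methods, and that a projection-domain CNN performs better than image-domain CNNs.
Link: https://arxiv.org/abs/2501.05085
Authors: Yoseob Han, Dufan Wu, Kyungsang Kim, Quanzheng Li
Affiliations: Department of Radiology, Center for Advanced Medical Computing and Analysis (CAMCA), Harvard Medical School and Massachusetts General Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Published by Physics in Medicine & Biology (2022.5)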
Abstract:Objective: There exist several X-ray computed tomography (CT) scanning strategies to reduce a radiation dose, such as (1) sparse-view CT, (2) low-dose CT, and (3) region-of-interest (ROI) CT (called interior tomography). To further reduce the dose, the sparse-view and/or low-dose CT settings can be applied together with interior tomography. Interior tomography has various advantages in terms of reducing the number of detectors and decreasing the X-ray radiation dose. However, a large patient or small field-of-view (FOV) detector can cause truncated projections, and then the reconstructed images suffer from severe cupping artifacts. In addition, although the low-dose CT can reduce the radiation exposure dose, analytic reconstruction algorithms produce image noise. Recently, many researchers have utilized image-domain deep learning (DL) approaches to remove each artifact and demonstrated impressive performances, and the theory of deep convolutional framelets supports the reason for the performance improvement. Approach: In this paper, we found that the image-domain convolutional neural network (CNN) is difficult to solve coupled artifacts, based on deep convolutional framelets. Significance: To address the coupled problem, we decouple it into two sub-problems: (i) image domain noise reduction inside truncated projection to solve low-dose CT problem and (ii) extrapolation of projection outside truncated projection to solve the ROI CT problem. The decoupled sub-problems are solved directly with a novel proposed end-to-end learning using dual-domain CNNs. Main results: We demonstrate that the proposed method outperforms the conventional image-domain deep learning methods, and a projection-domain CNN shows better performance than the image-domain CNNs which are commonly used by many researchers.
[CV-45] TipSegNet: Fingertip Segmentation in Contactless Fingerprint Imaging
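Quick read: This paper addresses the accuracy of fingertip detection and segmentation in contactless fingerprint recognition, particularly under challenging background conditions. Contactless systems offer a hygienic, user-friendly alternative to traditional contact-based methods, but their accuracy depends heavily on precise fingertip detection and segmentation. The proposed TipSegNet is a deep learning model that combines a ResNeXt-101 backbone for robust feature extraction with a Feature Pyramid Network (FPN) for multi-scale representation, which enables accurate segmentation across varying finger poses and image qualities. An extensive data augmentation strategy further improves generalizability and robustness. TipSegNet achieves a mean Intersection over Union (mIoU) of 0.987 and an accuracy of 0.999.
Link: https://arxiv.org/abs/2501.05076
Authors: Laurenz Ruzicka, Bernhard Kohn, Clemens Heitzinger
Affiliations: Austrian Institute of Technology, Vienna; TU Wien, Vienna
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: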
Abstract:Contactless fingerprint recognition systems offer a hygienic, user-friendly, and efficient alternative to traditional contact-based methods. However, their accuracy heavily relies on precise fingertip detection and segmentation, particularly under challenging background conditions. This paper introduces TipSegNet, a novel deep learning model that achieves state-of-the-art performance in segmenting fingertips directly from grayscale hand images. TipSegNet leverages a ResNeXt-101 backbone for robust feature extraction, combined with a Feature Pyramid Network (FPN) for multi-scale representation, enabling accurate segmentation across varying finger poses and image qualities. Furthermore, we employ an extensive data augmentation strategy to enhance the model’s generalizability and robustness. TipSegNet outperforms existing methods, achieving a mean Intersection over Union (mIoU) of 0.987 and an accuracy of 0.999, representing a significant advancement in contactless fingerprint segmentation. This enhanced accuracy has the potential to substantially improve the reliability and effectiveness of contactless biometric systems in real-world applications.
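For readers unfamiliar with the two reported metrics, here is a minimal sketch of how mean Intersection over Union and pixel accuracy are commonly computed from label masks. It assumes masks stored as nested lists of integer class labels and averages IoU over the classes that appear in either mask; the paper may aggregate differently (for example per image), so the numbers above should not be assumed to follow exactly this protocol.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU between two label masks of equal shape."""
    ious = []
    for c in range(num_classes):
        inter = 0
        union = 0
        for prow, trow in zip(pred, target):
            for p, t in zip(prow, trow):
                inter += int(p == c and t == c)
                union += int(p == c or t == c)
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious)


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the target."""
    total = 0
    correct = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            total += 1
            correct += int(p == t)
    return correct / total
```

Because background pixels usually dominate a hand image, pixel accuracy tends to sit closer to 1 than mIoU does, which is consistent with the two figures reported in the abstract.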
[CV-46] A Flexible and Scalable Framework for Video Moment Search
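Quick read: This paper addresses video moment search: efficiently retrieving the moments in a video corpus that match a user's text query. Existing methods often assume a single perfectly matching moment, suffer from inefficient inference, and have limitations with hour-long videos. The paper proposes a flexible and scalable framework, Segment-Proposal-Ranking (SPR), for retrieving a ranked list of moments from a collection of videos of any length, a task termed Ranked Video Moment Retrieval (RVMR). SPR splits the search into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking. Videos are divided into equal-length segments whose embeddings are precomputed and indexed offline; for online retrieval, segments and queries are projected into a shared feature space to enable approximate nearest neighbor (ANN) search; retrieved segments are merged into coarse-grained moment proposals, which a refinement and re-ranking module then reorders and adjusts. The framework achieves state-of-the-art performance on the TVR-Ranking dataset with significantly lower computational cost and processing time, and each stage can be improved independently.
Link: https://arxiv.org/abs/2501.05072
Authors: Chongzhi Zhang, Xizhou Zhu, Aixin Sun
Affiliations: Nanyang Technological University; SenseTime Research
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: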
Abstract:Video moment search, the process of finding relevant moments in a video corpus to match a user’s query, is crucial for various applications. Existing solutions, however, often assume a single perfect matching moment, struggle with inefficient inference, and have limitations with hour-long videos. This paper introduces a flexible and scalable framework for retrieving a ranked list of moments from a collection of videos of any length to match a text query, a task termed Ranked Video Moment Retrieval (RVMR). Our framework, called Segment-Proposal-Ranking (SPR), simplifies the search process into three independent stages: segment retrieval, proposal generation, and moment refinement with re-ranking. Specifically, videos are divided into equal-length segments with precomputed embeddings indexed offline, allowing efficient retrieval regardless of video length. For scalable online retrieval, both segments and queries are projected into a shared feature space to enable approximate nearest neighbor (ANN) search. Retrieved segments are then merged into coarse-grained moment proposals. Then a refinement and re-ranking module is designed to reorder and adjust timestamps of the coarse-grained proposals. Evaluations on the TVR-Ranking dataset demonstrate that our framework achieves state-of-the-art performance with significant reductions in computational cost and processing time. The flexible design also allows for independent improvements to each stage, making SPR highly adaptable for large-scale applications.
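The first two SPR stages can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: exact cosine search stands in for the ANN index, segment embeddings are toy vectors, adjacent retrieved segments of the same video are merged into one proposal, and proposals are ordered by their best segment score as a placeholder for the learned refinement and re-ranking module. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_segments(query_emb, index, top_k):
    """Stage 1: return the top_k (video_id, segment_idx, score) hits.

    index: list of (video_id, segment_idx, embedding). Brute-force search
    is used here in place of an approximate nearest neighbor index.
    """
    scored = [(vid, seg, cosine(query_emb, emb)) for vid, seg, emb in index]
    scored.sort(key=lambda hit: hit[2], reverse=True)
    return scored[:top_k]


def merge_into_proposals(hits, segment_len):
    """Stage 2: merge adjacent hit segments into (video, start, end, score)."""
    by_video = {}
    for vid, seg, score in hits:
        by_video.setdefault(vid, []).append((seg, score))
    proposals = []
    for vid, segs in by_video.items():
        segs.sort()
        start, prev, best = segs[0][0], segs[0][0], segs[0][1]
        for seg, score in segs[1:]:
            if seg == prev + 1:  # contiguous with the running proposal
                prev, best = seg, max(best, score)
            else:
                proposals.append((vid, start * segment_len, (prev + 1) * segment_len, best))
                start, prev, best = seg, seg, score
        proposals.append((vid, start * segment_len, (prev + 1) * segment_len, best))
    # Placeholder ranking; SPR uses a learned refinement and re-ranking module.
    return sorted(proposals, key=lambda p: p[3], reverse=True)
```

Because the index holds fixed-length segments rather than whole videos, the cost of the online search depends on the number of segments and not on how long any single video is, which is the property the abstract emphasizes.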
[CV-47] Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
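Quick read: This paper addresses spurious correlations learned by large visual-language models (VLMs) in commonsense video question answering (VQA). Despite remarkable progress, the black-box nature of VLMs and remaining benchmark biases mean that models may learn spurious correlations between videos and likely answers. The paper proposes a video-grounded entailment tree reasoning method with four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A key benefit is that the method generalizes to current video- and image-based VLMs across reasoning types. To support fair evaluation, the paper also devises a de-biasing procedure based on large language models that rewrites VQA benchmark answer sets to enforce model reasoning. Experiments on existing and de-biased benchmarks highlight the impact of the method's components across benchmarks, VLMs, and reasoning types.
Link: https://arxiv.org/abs/2501.05069
Authors: Huabin Liu, Filip Ilievski, Cees G. M. Snoek
Affiliations: Shanghai Jiao Tong University; Vrije Universiteit Amsterdam; University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: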
Abstract:This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
[CV-48] LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
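Quick read: This paper addresses how a video multimodal large language model can effectively combine features from different visual projectors. Different projectors behave differently on specific tasks: some excel at capturing static details, others are more effective at processing temporal information, and some are better suited to tasks that require temporal coherence. LLaVA-Octopus adaptively weights the features of different visual projectors based on user instructions, dynamically selecting and combining the most suitable features. Experiments show strong performance across multiple benchmarks, especially in multimodal understanding, visual question answering, and video understanding.
Link: https://arxiv.org/abs/2501.05067
Authors: Jiaxing Zhao, Boyuan Sun, Xiang Chen, Xihan Wei, Qibin Hou
Affiliations: Tongyi Group, Alibaba; VCIP, CS, Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: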
Abstract:In this paper, we introduce LLaVA-Octopus, a novel video multimodal large language model. LLaVA-Octopus adaptively weights features from different visual projectors based on user instructions, enabling us to leverage the complementary strengths of each projector. We observe that different visual projectors exhibit distinct characteristics when handling specific tasks. For instance, some projectors excel at capturing static details, while others are more effective at processing temporal information, and some are better suited for tasks requiring temporal coherence. By dynamically adjusting feature weights according to user instructions, LLaVA-Octopus dynamically selects and combines the most suitable features, significantly enhancing the model’s performance in multimodal tasks. Experimental results demonstrate that LLaVA-Octopus achieves excellent performance across multiple benchmarks, especially in tasks such as multimodal understanding, visual question answering, and video understanding, highlighting its broad application potential.
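The idea of weighting projector features by the user instruction can be illustrated with a small gating sketch. The abstract does not describe the actual fusion module, so everything below is an assumption: a linear gate turns an instruction embedding into one logit per projector, a softmax converts the logits into weights, and the fused feature is the weighted sum. The names `softmax` and `fuse_projector_features` and the gate parameterization are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_projector_features(instruction_emb, gate_weights, projector_feats):
    """Weight each projector's feature vector by an instruction-driven gate.

    gate_weights: one row per projector, same length as instruction_emb
    (an assumed linear gate). projector_feats: one feature vector per
    projector, all of the same dimension.
    """
    logits = [sum(w * x for w, x in zip(row, instruction_emb)) for row in gate_weights]
    weights = softmax(logits)
    dim = len(projector_feats[0])
    fused = [
        sum(w * feats[i] for w, feats in zip(weights, projector_feats))
        for i in range(dim)
    ]
    return weights, fused
```

With a gate that responds strongly to one instruction direction, almost all of the weight moves to the corresponding projector, while a neutral gate averages the projectors equally.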
[CV-49] Improving Skeleton-based Action Recognition with Interactive Object Information
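Quick read: This paper addresses the fact that skeleton-based action recognition ignores the objects humans interact with, which leads to poor performance on actions that involve object interactions. The key to the solution is to introduce object nodes that supply the missing interactive object information, together with a new framework, Spatial Temporal Variable Graph Convolutional Networks (ST-VGCN), that models the Variable Graph (VG) containing object nodes. The paper designs a Variable Graph construction method that accommodates a variable number of nodes, is the first to explore the overfitting introduced by the additional object information, and proposes a VG-based data augmentation method, Random Node Attack, to address it. On the network side, it introduces two fusion modules (CAF and WNPool) and a novel Node Balance Loss that fuse and balance skeleton and object node information. The method surpasses the previous state of the art on multiple skeleton-based action recognition benchmarks.
Link: https://arxiv.org/abs/2501.05066
Authors: Hao Wen, Ziqian Lu, Fengli Shen, Zhe-Ming Lu, Jialin Cui
Affiliations: Zhejiang University; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: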
Abstract:Human skeleton information is important in skeleton-based action recognition, which provides a simple and efficient way to describe human pose. However, existing skeleton-based methods focus more on the skeleton, ignoring the objects interacting with humans, resulting in poor performance in recognizing actions that involve object interactions. We propose a new action recognition framework introducing object nodes to supplement absent interactive object information. We also propose Spatial Temporal Variable Graph Convolutional Networks (ST-VGCN) to effectively model the Variable Graph (VG) containing object nodes. Specifically, in order to validate the role of interactive object information, by leveraging a simple self-training approach, we establish a new dataset, JXGC 24, and an extended dataset, NTU RGB+D+Object 60, including more than 2 million additional object nodes. At the same time, we design the Variable Graph construction method to accommodate a variable number of nodes for graph structure. Additionally, we are the first to explore the overfitting issue introduced by incorporating additional object information, and we propose a VG-based data augmentation method to address this issue, called Random Node Attack. Finally, regarding the network structure, we introduce two fusion modules, CAF and WNPool, along with a novel Node Balance Loss, to enhance the comprehensive performance by effectively fusing and balancing skeleton and object node information. Our method surpasses the previous state-of-the-art on multiple skeleton-based action recognition benchmarks. The accuracy of our method on NTU RGB+D 60 cross-subject split is 96.7%, and on cross-view split, it is 99.2%.
[CV-50] LongViTU: Instruction Tuning for Long-Form Video Understanding
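Quick read: This paper addresses the need for a high-quality, large-scale dataset for long-form video understanding and reasoning. The authors present LongViTU, a dataset with about 121,000 QA pairs and about 900 hours of video. The key elements are: 1) a systematic approach that organizes videos into a hierarchical tree structure so that QA pairs capture long-term context, with an average certificate length of 4.6 minutes; 2) self-revision mechanisms that improve QA quality; 3) explicit timestamp labels for relevant events, which support precise event localization. LongViTU also serves as a benchmark for long-form and streaming video understanding; evaluations of the open-source model LongVU and the commercial model Gemini-1.5-Pro show that both still face substantial challenges. Supervised fine-tuning (SFT) of LongVU improves performance on several benchmarks, which indicates LongViTU's data quality and out-of-distribution (OOD) generalizability.
Link: https://arxiv.org/abs/2501.05037
Authors: Rujie Wu, Xiaojian Ma, Hai Ci, Yue Fan, Yuxuan Wang, Haozhe Zhao, Qing Li, Yizhou Wang
Affiliations: Peking University; BIGAI; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: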
Abstract:This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos), automatically generated dataset for long-form video understanding. We developed a systematic approach that organizes videos into a hierarchical tree structure and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.); and 3) explicit timestamp labels for relevant events. LongViTU also serves as a benchmark for instruction following in long-form and streaming video understanding. We evaluate the open-source state-of-the-art long video understanding model, LongVU, and the commercial model, Gemini-1.5-Pro, on our benchmark. They achieve GPT-4 scores of 49.9 and 52.3, respectively, underscoring the substantial challenge posed by our benchmark. Further supervised fine-tuning (SFT) on LongVU led to performance improvements of 12.0% on our benchmark, 2.2% on the in-distribution (ID) benchmark EgoSchema, 1.0%, 2.2% and 1.2% on the out-of-distribution (OOD) benchmarks VideoMME (Long), WorldQA and OpenEQA, respectively. These outcomes demonstrate LongViTU’s high data quality and robust OOD generalizability.
[CV-51] Towards Fingerprint Mosaicking Artifact Detection: A Self-Supervised Deep Learning Approach
Quick read: This paper addresses mosaicking artifacts that arise in fingerprint mosaicking; these artifacts can significantly degrade fingerprint image quality and, in turn, the accuracy of biometric systems. It proposes a deep learning approach that trains a model on large-scale unlabeled fingerprint data within a self-supervised learning framework, so that mosaicking artifacts can be detected and scored without manual artifact annotation. The paper also introduces a mosaicking artifact score that quantifies the severity of errors and enables automated evaluation of fingerprint images. The approach achieves high accuracy on several fingerprint modalities (contactless, rolled, and pressed fingerprints) and is robust to different data sources.
Link: https://arxiv.org/abs/2501.05034
Authors: Laurenz Ruzicka, Alexander Spenke, Stephan Bergmann, Gerd Nolden, Bernhard Kohn, Clemens Heitzinger
Affiliations: Austrian Institute of Technology, Vienna; Federal Office for Information Security, Bundesamt für Sicherheit in der Informationstechnik, Bonn; Center for Artificial Intelligence and Machine Learning (CAIML) and Department of Computer Science, TU Wien, Vienna
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Fingerprint mosaicking, which is the process of combining multiple fingerprint images into a single master fingerprint, is an essential process in modern biometric systems. However, it is prone to errors that can significantly degrade fingerprint image quality. This paper proposes a novel deep learning-based approach to detect and score mosaicking artifacts in fingerprint images. Our method leverages a self-supervised learning framework to train a model on large-scale unlabeled fingerprint data, eliminating the need for manual artifact annotation. The proposed model effectively identifies mosaicking errors, achieving high accuracy on various fingerprint modalities, including contactless, rolled, and pressed fingerprints and furthermore proves to be robust to different data sources. Additionally, we introduce a novel mosaicking artifact score to quantify the severity of errors, enabling automated evaluation of fingerprint images. By addressing the challenges of mosaicking artifact detection, our work contributes to improving the accuracy and reliability of fingerprint-based biometric systems.
[CV-52] ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark
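Quick read: This paper addresses the lack of a comprehensive, systematic framework for evaluating the embodied cognitive abilities of large vision-language models (LVLMs) on egocentric videos. Existing embodied video question answering datasets rarely cover critical issues such as robotic self-cognition, dynamic scene perception, and hallucination. The paper proposes ECBench, a high-quality benchmark that features diverse scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question screening. The paper also introduces ECEval, a comprehensive evaluation system intended to keep the indicators fair and reasonable. Using ECBench, the authors evaluate proprietary, open-source, and task-specific LVLMs.
Link: https://arxiv.org/abs/2501.05031
Authors: Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, Lidong Bing
Affiliations: Alibaba DAMO Academy; Zhejiang University; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: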
Abstract:The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question screening strategies. Additionally, we introduce ECEval, a comprehensive evaluation system that ensures the fairness and rationality of the indicators. Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents. All data and code are available at this https URL.
[CV-53] Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
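Quick read: This paper addresses collaborative control of camera and object motion in image animation, especially with adaptive control granularity. Existing methods have made progress in controlling either camera or object motion but struggle to control both together. The paper introduces a 3D-aware motion representation and an image animation framework called Perception-as-Control. The framework constructs the 3D-aware motion representation from a reference image, manipulates it according to interpreted user intentions, and perceives it from different viewpoints, so that camera and object motions become intuitive, consistent visual changes. Using these perception results as motion control signals, the framework supports a variety of motion-related video synthesis tasks in a unified and flexible way.
Link: https://arxiv.org/abs/2501.05020
Authors: Yingjie Chen, Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo
Affiliations: Institute for Intelligent Computing, Alibaba Tongyi Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: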
Abstract:Motion-controllable image animation is a fundamental task with a wide range of potential applications. Recent works have made progress in controlling camera or object motion via various motion representations, while they still struggle to support collaborative camera and object motion control with adaptive control granularity. To this end, we introduce 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct 3D-aware motion representation from a reference image, manipulate it based on interpreted user intentions, and perceive it from different viewpoints. In this way, camera and object motions are transformed into intuitive, consistent visual changes. Then, the proposed framework leverages the perception results as motion control signals, enabling it to support various motion-related video synthesis tasks in a unified and flexible way. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our project webpage: this https URL.
[CV-54] Continuous Knowledge-Preserving Decomposition for Few-Shot Continual Learning
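[Quick Read]: This paper addresses catastrophic forgetting in Few-shot Class-Incremental Learning (FSCIL): when learning new classes from limited data, a model tends to forget previously learned knowledge. Existing methods either freeze the backbone to preserve knowledge, which limits adaptability, or rely on additional modules or prompts, which add inference overhead. The paper proposes Continuous Knowledge-Preserving Decomposition for FSCIL (CKPD-FSCIL). The key idea is to decompose the model's weights into two parts: one that compacts existing knowledge (knowledge-sensitive components) and one that keeps redundant capacity for new abilities (redundant-capacity components). The decomposition is guided by the covariance matrix of replay samples, so that the principal components align with classification ability. When adapting to a new task, only the redundant-capacity components are updated while the knowledge-sensitive components stay frozen, which promotes plasticity and minimizes interference without changing the architecture or adding overhead. CKPD also introduces an adaptive layer selection strategy that dynamically identifies layers with redundant capacity and allocates adapters to them. Experiments show that CKPD-FSCIL outperforms state-of-the-art methods on multiple benchmarks.
Link: https://arxiv.org/abs/2501.05017
Authors: Xiaojie Li,Yibo Yang,Jianlong Wu,David A. Clifton,Yue Yu,Bernard Ghanem,Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory; King Abdullah University of Science and Technology; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL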
Abstract:Few-shot class-incremental learning (FSCIL) involves learning new classes from limited data while retaining prior knowledge, and often results in catastrophic forgetting. Existing methods either freeze backbone networks to preserve knowledge, which limits adaptability, or rely on additional modules or prompts, introducing inference overhead. To this end, we propose Continuous Knowledge-Preserving Decomposition for FSCIL (CKPD-FSCIL), a framework that decomposes a model’s weights into two parts: one that compacts existing knowledge (knowledge-sensitive components) and another that carries redundant capacity to accommodate new abilities (redundant-capacity components). The decomposition is guided by a covariance matrix from replay samples, ensuring principal components align with classification abilities. During adaptation, we freeze the knowledge-sensitive components and only adapt the redundant-capacity components, fostering plasticity while minimizing interference without changing the architecture or increasing overhead. Additionally, CKPD introduces an adaptive layer selection strategy to identify layers with redundant capacity, dynamically allocating adapters. Experiments on multiple benchmarks show that CKPD-FSCIL outperforms state-of-the-art methods.
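As a rough illustration of a covariance-guided weight decomposition, the sketch below splits a weight matrix into a frozen part and an adaptable part using an SVD of the weight multiplied by the replay-activation covariance. This is only one plausible reading of the abstract, not the authors' procedure: the function name, the regularization constant `eps`, and the choice of freezing the leading components and adapting the trailing `r` are all assumptions.

```python
import numpy as np

def covariance_guided_split(W, X, r, eps=1e-6):
    """Split W (out x in) into a frozen part and an adaptable part.

    X holds replay activations (n x in). Hypothetical helper, for illustration.
    """
    n, d = X.shape
    C = X.T @ X / n + eps * np.eye(d)        # regularized covariance of replay inputs
    U, S, Vt = np.linalg.svd(W @ C, full_matrices=False)
    C_inv = np.linalg.inv(C)
    k = len(S) - r                           # number of leading components to freeze
    frozen = (U[:, :k] * S[:k]) @ Vt[:k] @ C_inv        # assumed "knowledge-sensitive" part
    adaptable = (U[:, k:] * S[k:]) @ Vt[k:] @ C_inv     # assumed "redundant-capacity" part
    return frozen, adaptable

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
X = rng.normal(size=(200, 6))
frozen, adaptable = covariance_guided_split(W, X, r=2)
```

The two parts sum back to the original weight, so the split by itself leaves the model's function unchanged; only the adaptable part would then be trained.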
[CV-55] A Scalable System for Visual Analysis of Ocean Data
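[Quick Read]: This paper addresses the challenges oceanographers face with ocean data of ever-increasing resolution and complexity, in particular how to visualize and interactively analyze it effectively. Because ocean data is dynamic and involves multivariate relationships, conventional visualization tools struggle to keep up. The proposed solution is pyParaOcean, a scalable and interactive visualization system built specifically for ocean data analysis. Its key contribution is a set of specialized modules, such as eddy identification and salinity movement tracking, that integrate seamlessly with ParaView and take advantage of its parallelization capabilities and built-in general-purpose visualization functionality. In addition, an auxiliary dataset stored as a Cinema database helps relieve I/O and network bandwidth bottlenecks while supporting quick generation of overview visualizations. The paper demonstrates the system's utility with a case study on the Bay of Bengal and evaluates its efficiency with scaling studies.
Link: https://arxiv.org/abs/2501.05009
Authors: Toshit Jain,Upkar Singh,Varun Singh,Vijay Kumar Boda,Ingrid Hotz,Sathish S. Vadhiyar,P. N. Vinayachandran,Vijay Natarajan
Affiliations: Department of Computer Science and Automation (CSA), Indian Institute of Science Bangalore, India; Department of Science and Technology (ITN), Linköping University, Norrköping, Sweden; Department of Computational and Data Sciences (CDS), Indian Institute of Science Bangalore, India; Centre for Atmospheric and Oceanic Sciences (CAOS), Indian Institute of Science Bangalore, India; Zuse Institute Berlin, Germany
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: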
Abstract:Oceanographers rely on visual analysis to interpret model simulations, identify events and phenomena, and track dynamic ocean processes. The ever increasing resolution and complexity of ocean data due to its dynamic nature and multivariate relationships demands a scalable and adaptable visualization tool for interactive exploration. We introduce pyParaOcean, a scalable and interactive visualization system designed specifically for ocean data analysis. pyParaOcean offers specialized modules for common oceanographic analysis tasks, including eddy identification and salinity movement tracking. These modules seamlessly integrate with ParaView as filters, ensuring a user-friendly and easy-to-use system while leveraging the parallelization capabilities of ParaView and a plethora of inbuilt general-purpose visualization functionalities. The creation of an auxiliary dataset stored as a Cinema database helps address I/O and network bandwidth bottlenecks while supporting the generation of quick overview visualizations. We present a case study on the Bay of Bengal (BoB) to demonstrate the utility of the system and scaling studies to evaluate the efficiency of the system.
[CV-56] A CT Image Classification Network Framework for Lung Tumors Based on Pre-trained MobileNetV2 Model and Transfer learning And Its Application and Market Analysis in the Medical field
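[Quick Read]: This paper addresses the limited accuracy and efficiency of traditional manual analysis in lung cancer diagnosis. It proposes a deep learning framework based on a pre-trained MobileNetV2 model. The key steps are initializing the model with ImageNet-1K (version 2) weights, replacing its last (fully connected) layer, and adding a softmax activation to classify three types of lung cancer CT scan images efficiently. The reported test-set accuracy is 99.6%, with a notable improvement in feature extraction over traditional methods. By bringing deep learning into the workflow, the approach is intended to improve the efficiency and accuracy of lung cancer diagnosis, and the authors argue it has significant clinical value.
Link: https://arxiv.org/abs/2501.04996
Authors: Ziyang Gao,Yong Tian,Shih-Chi Lin,Junghua Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: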
Abstract:In the medical field, accurate diagnosis of lung cancer is crucial for treatment. Traditional manual analysis methods have significant limitations in terms of accuracy and efficiency. To address this issue, this paper proposes a deep learning network framework based on the pre-trained MobileNetV2 model, initialized with weights from the ImageNet-1K dataset (version 2). The last layer of the model (the fully connected layer) is replaced with a new fully connected layer, and a softmax activation function is added to efficiently classify three types of lung cancer CT scan images. Experimental results show that the model achieves an accuracy of 99.6% on the test set, with significant improvements in feature extraction compared to traditional methods. With the rapid development of artificial intelligence technologies, deep learning applications in medical image processing are bringing revolutionary changes to the healthcare industry. AI-based lung cancer detection systems can significantly improve diagnostic efficiency, reduce the workload of doctors, and occupy an important position in the global healthcare market. The potential of AI to improve diagnostic accuracy, reduce medical costs, and promote precision medicine will have a profound impact on the future development of the healthcare industry.
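The replaced classification head described above (a new fully connected layer followed by softmax over three classes) can be sketched in plain Python. This is a toy stand-in rather than the paper's model: the 4-dimensional feature vector, the weights, and the biases are made-up values, and in practice the features would come from the pre-trained backbone.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, biases):
    """New fully connected layer (one weight row per class) followed by softmax."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

# Toy stand-in for pooled backbone features (hypothetical values).
features = [0.5, -1.0, 2.0, 0.0]
weights = [[0.1, 0.2, 0.3, 0.0],
           [0.0, -0.1, 0.2, 0.1],
           [-0.3, 0.1, 0.0, 0.2]]
biases = [0.0, 0.1, -0.1]

probs = classify(features, weights, biases)
predicted = max(range(len(probs)), key=lambda i: probs[i])
```

Only the head's weights and biases are new; the backbone keeps its pre-trained parameters as the starting point.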
[CV-57] IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation AAAI2025
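[Quick Read]: This paper targets two main challenges in 3D Referring Expression Segmentation (3D-RES): feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition, caused by limitations such as lighting and viewpoint. Intent ambiguity refers to the model treating all queries equally during decoding, without top-down, task-specific guidance. The paper proposes the Image enhanced Prompt Decoding Network (IPDN), built on two modules: the Multi-view Semantic Embedding (MSE) module and the Prompt-Aware Decoder (PAD). MSE injects multi-view 2D image information into the 3D scene to compensate for potential loss of spatial information; PAD guides decoding with task-driven signals derived from the interaction between the expression and visual features. Experiments show that IPDN improves mIoU over the previous best methods by 1.9 and 4.2 points on the 3D-RES and 3D-GRES tasks, respectively.
Link: https://arxiv.org/abs/2501.04995
Authors: Qi Chen,Changli Wu,Jiayi Ji,Yiwei Ma,Danni Yang,Xiaoshuai Sun
Affiliations: 1; 2
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI 2025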
Abstract:3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model’s equal treatment of all queries during the decoding process, lacking top-down task-specific guidance. In this paper, we introduce an Image enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model’s reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we designed a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state-of-the-art by 1.9 and 4.2 points in mIoU metrics on the 3D-RES and 3D-GRES tasks, respectively.
[CV-58] V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer AAAI2025
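[Quick Read]: This paper addresses two main problems of Concept Bottleneck Models (CBMs) in visual recognition: first, concept annotation requires extensive expert knowledge and manual labor, which limits the adoption of CBMs; second, concepts generated by Large Language Models (LLMs) in existing methods can be verbose and include non-visual attributes, which hurts accuracy and interpretability. The paper proposes building CBMs directly from multimodal models. The key idea is to use common words as the base concept vocabulary and to use unlabeled auxiliary images to build a Vision-to-Concept (V2C) tokenizer that explicitly quantizes an image into its most relevant visual concepts, yielding a vision-oriented concept bottleneck tightly coupled with the multimodal model. The resulting V2C-CBM is training-efficient, interpretable, and accurate, and it matches or outperforms LLM-supervised CBMs on several visual classification benchmarks.
Link: https://arxiv.org/abs/2501.04975
Authors: Hangzhou He,Lei Zhu,Xinliang Zhang,Shuang Zeng,Qian Chen,Yanye Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025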
Abstract:Concept Bottleneck Models (CBMs) offer inherent interpretability by initially translating images into human-comprehensible concepts, followed by a linear combination of these concepts for classification. However, the annotation of concepts for visual recognition tasks requires extensive expert knowledge and labor, constraining the broad adoption of CBMs. Recent approaches have leveraged the knowledge of large language models to construct concept bottlenecks, with multimodal models like CLIP subsequently mapping image features into the concept feature space for classification. Despite this, the concepts produced by language models can be verbose and may introduce non-visual attributes, which hurts accuracy and interpretability. In this study, we investigate how to avoid these issues by constructing CBMs directly from multimodal models. To this end, we adopt common words as base concept vocabulary and leverage auxiliary unlabeled images to construct a Vision-to-Concept (V2C) tokenizer that can explicitly quantize images into their most relevant visual concepts, thus creating a vision-oriented concept bottleneck tightly coupled with the multimodal model. This leads to our V2C-CBM, which is training-efficient and interpretable with high accuracy. Our V2C-CBM has matched or outperformed LLM-supervised CBMs on various visual classification benchmarks, validating the efficacy of our approach.
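To make "quantizing an image into its most relevant visual concepts" concrete, here is a minimal sketch that picks the top-k vocabulary words whose embeddings are most similar to an image embedding. It assumes the image and the words already live in a shared embedding space (as with a CLIP-style model); the 3-dimensional vectors, the word list, and the function names are hypothetical, and the paper's actual tokenizer may be built differently.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_concepts(image_emb, concept_embs, k):
    """Return the k concept words most similar to the image embedding."""
    ranked = sorted(concept_embs,
                    key=lambda word: cosine(image_emb, concept_embs[word]),
                    reverse=True)
    return ranked[:k]

# Toy shared embedding space (made-up values).
concept_embs = {
    "striped": [1.0, 0.0, 0.0],
    "furry": [0.0, 1.0, 0.0],
    "metallic": [0.0, 0.0, 1.0],
}
image_emb = [0.9, 0.4, 0.1]

tokens = top_k_concepts(image_emb, concept_embs, k=2)
```

A linear classifier over such concept activations would then form the interpretable bottleneck.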
[CV-59] AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding Predictive Architecture for Autonomous Driving with LiDAR Data
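[Quick Read]: This paper aims to reduce the dependence of current autonomous driving systems on large amounts of labeled data. It proposes AD-L-JEPA (Autonomous Driving with LiDAR data via a Joint Embedding Predictive Architecture), a self-supervised pre-training framework intended to lower data requirements and improve the system's understanding of complex real-world environments. Unlike existing methods, AD-L-JEPA is neither generative nor contrastive; it learns spatial world models with a joint embedding predictive architecture. The core of the method is to predict Bird's Eye View (BEV) embeddings that represent the diversity of driving scenes, which avoids explicitly generating masked unknown regions. AD-L-JEPA also needs no manually constructed positive and negative pairs, which simplifies implementation and improves the learned representations. Experiments show that AD-L-JEPA performs well on downstream tasks such as LiDAR 3D object detection and the associated transfer learning, outperforming state-of-the-art (SOTA) methods such as Occupancy-MAE and ALSO.
Link: https://arxiv.org/abs/2501.04969
Authors: Haoran Zhu,Zhenyuan Dong,Kristi Topollai,Anna Choromanska
Affiliations: Learning Systems Laboratory, Department of Electrical and Computer Engineering, New York University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: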
Abstract:As opposed to human drivers, current autonomous driving systems still require vast amounts of labeled data to train. Recently, world models have been proposed to simultaneously enhance autonomous driving capabilities by improving the way these systems understand complex real-world environments and reduce their data demands via self-supervised pre-training. In this paper, we present AD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding Predictive Architecture), a novel self-supervised pre-training framework for autonomous driving with LiDAR data that, as opposed to existing methods, is neither generative nor contrastive. Our method learns spatial world models with a joint embedding predictive architecture. Instead of explicitly generating masked unknown regions, our self-supervised world models predict Bird’s Eye View (BEV) embeddings to represent the diverse nature of autonomous driving scenes. Our approach furthermore eliminates the need to manually create positive and negative pairs, as is the case in contrastive learning. AD-L-JEPA leads to simpler implementation and enhanced learned representations. We qualitatively and quantitatively demonstrate high-quality of embeddings learned with AD-L-JEPA. We furthermore evaluate the accuracy and label efficiency of AD-L-JEPA on popular downstream tasks such as LiDAR 3D object detection and associated transfer learning. Our experimental evaluation demonstrates that AD-L-JEPA is a plausible approach for self-supervised pre-training in autonomous driving applications and is the best available approach outperforming SOTA, including most recently proposed Occupancy-MAE [1] and ALSO [2]. The source code of AD-L-JEPA is available at this https URL.
[CV-60] Emergence of Painting Ability via Recognition-Driven Evolution
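[Quick Read]: This paper asks how the complexity and detail of human painting can be reproduced by simulating the evolutionary pressures that improve visual communication efficiency. The key is a model with a stroke branch and a palette branch. The palette branch learns a limited colour palette, while the stroke branch parameterises each stroke with Bézier curves to render an image. A high-level recognition module then evaluates the image to quantify visual communication efficiency. The model optimises each stroke's control points and colour choices to reach the highest recognition accuracy with the fewest strokes and colours. Experiments show strong performance on high-level recognition tasks, with artistic expression and aesthetic appeal especially in abstract sketches. The approach also shows promise as an efficient bit-level image compression technique, outperforming traditional methods.
Link: https://arxiv.org/abs/2501.04966
Authors: Yi Lin,Lin Gu,Ziteng Cui,Shenghan Su,Yumo Hao,Yingtao Tian,Tatsuya Harada,Jianfei Yang
Affiliations: Nanyang Technological University; National University of Singapore; RIKEN; The University of Tokyo; Tianjin Academy of Fine Art; Sakana AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: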
Abstract:From Paleolithic cave paintings to Impressionism, human painting has evolved to depict increasingly complex and detailed scenes, conveying more nuanced messages. This paper attempts to emerge this artistic capability by simulating the evolutionary pressures that enhance visual communication efficiency. Specifically, we present a model with a stroke branch and a palette branch that together simulate human-like painting. The palette branch learns a limited colour palette, while the stroke branch parameterises each stroke using Bézier curves to render an image, subsequently evaluated by a high-level recognition module. We quantify the efficiency of visual communication by measuring the recognition accuracy achieved with machine vision. The model then optimises the control points and colour choices for each stroke to maximise recognition accuracy with minimal strokes and colours. Experimental results show that our model achieves superior performance in high-level recognition tasks, delivering artistic expression and aesthetic appeal, especially in abstract sketches. Additionally, our approach shows promise as an efficient bit-level image compression technique, outperforming traditional methods.
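For readers unfamiliar with Bézier-parameterised strokes, the following sketch evaluates a stroke from its control points. It assumes cubic curves with four 2D control points, which the abstract does not specify, and it only samples points along the curve; rendering and the recognition-driven optimisation are out of scope here.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bezier curve at parameter t in [0, 1] (Bernstein form)."""
    u = 1.0 - t
    coeffs = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    points = (p0, p1, p2, p3)
    return tuple(sum(c * p[axis] for c, p in zip(coeffs, points))
                 for axis in range(2))

def sample_stroke(control_points, n=5):
    """Sample n evenly spaced parameter values along one stroke."""
    return [cubic_bezier(*control_points, i / (n - 1)) for i in range(n)]

# Hypothetical stroke: an arch from (0, 0) to (1, 0).
control_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
stroke = sample_stroke(control_points)
```

Because every sampled point is a smooth function of the control points, a gradient-based optimiser can adjust a stroke by moving its control points.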
[CV-61] Addressing Domain Shift via Imbalance-Aware Domain Adaptation in Embryo Development Assessment
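[Quick Read]: This paper addresses two main challenges for deep learning models in medical imaging: domain shift and class imbalance. Domain shift means a model performs poorly when deployed in settings that differ from its training environment; class imbalance means certain disease conditions are naturally underrepresented in the data. The authors propose the Imbalance-Aware Domain Adaptation (IADA) framework, which has three core components: (1) adaptive feature learning with class-specific attention mechanisms, (2) balanced domain alignment with dynamic weighting, and (3) adaptive threshold optimization. In embryo development assessment experiments across four imaging modalities, the framework clearly outperforms existing methods, with up to 25.19% higher accuracy, and it generalizes robustly to low-quality imaging systems, with AUC improvements of up to 12.56%. These results suggest that IADA has potential for building reliable and equitable medical imaging systems for diverse clinical settings.
Link: https://arxiv.org/abs/2501.04958
Authors: Lei Li,Xinglin Zhang,Jun Liang,Tao Chen
Affiliations: University of Copenhagen, Copenhagen, Denmark; University of Washington, Seattle, WA, USA; Shanghai Medical Image Insights Intelligent Technology Co., Ltd., Shanghai 200032, China; University of Waterloo, Waterloo, ON, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages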
Abstract:Deep learning models in medical imaging face dual challenges: domain shift, where models perform poorly when deployed in settings different from their training environment, and class imbalance, where certain disease conditions are naturally underrepresented. We present Imbalance-Aware Domain Adaptation (IADA), a novel framework that simultaneously tackles both challenges through three key components: (1) adaptive feature learning with class-specific attention mechanisms, (2) balanced domain alignment with dynamic weighting, and (3) adaptive threshold optimization. Our theoretical analysis establishes convergence guarantees and complexity bounds. Through extensive experiments on embryo development assessment across four imaging modalities, IADA demonstrates significant improvements over existing methods, achieving up to 25.19% higher accuracy while maintaining balanced performance across classes. In challenging scenarios with low-quality imaging systems, IADA shows robust generalization with AUC improvements of up to 12.56%. These results demonstrate IADA’s potential for developing reliable and equitable medical imaging systems for diverse clinical settings. The code is made publicly available at this https URL.
[CV-62] MORDA: A Synthetic Dataset to Facilitate Adaptation of Object Detectors to Unseen Real-target Domain While Preserving Performance on Real-source Domain ICRA2025
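[Quick Read]: This paper addresses the dependence of deep neural network (DNN) based perception models for autonomous vehicles (AVs) on large-scale, high-quality data. The problem is most acute when a model is deployed to a new geographic region (the real-target domain) whose characteristics the existing dataset (the real-source domain) does not cover, so data must be collected and labeled again at additional time and cost. The paper proposes using a synthetic environment as an auxiliary domain: synthetic data that reproduces the characteristics of the real domains provides indirect experience of the real-target domain. Concretely, nuScenes and South Korea represent the real-source and real-target domains, respectively. The authors build digital twins of several regions of South Korea, reproduce the nuScenes data-acquisition framework, and blend these components in a simulator to generate a new driving dataset named MORDA (Mixture Of Real-domain characteristics for synthetic-data-assisted Domain Adaptation). In experiments, 2D/3D detectors trained only on nuScenes and MORDA achieve significantly higher mean Average Precision (mAP) on a real South Korean dataset (AI-Hub) while retaining their performance on nuScenes.
Link: https://arxiv.org/abs/2501.04950
Authors: Hojun Lim,Heecheol Yoo,Jinwoo Lee,Seungmin Jeon,Hyeongseok Jeon
Affiliations: MORAI Inc., Republic of Korea
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 6 figures, 4 tables, This work has been submitted to the IEEE for possible publication (the paper is submitted to the conference ICRA2025 and is under review)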
Abstract:Deep neural network (DNN) based perception models are indispensable in the development of autonomous vehicles (AVs). However, their reliance on large-scale, high-quality data is broadly recognized as a burdensome necessity due to the substantial cost of data acquisition and labeling. Further, the issue is not a one-time concern, as AVs might need a new dataset if they are to be deployed to another region (real-target domain) that the in-hand dataset within the real-source domain cannot incorporate. To mitigate this burden, we propose leveraging synthetic environments as an auxiliary domain where the characteristics of real domains are reproduced. This approach could enable indirect experience about the real-target domain in a time- and cost-effective manner. As a practical demonstration of our methodology, nuScenes and South Korea are employed to represent real-source and real-target domains, respectively. That means we construct digital twins for several regions of South Korea, and the data-acquisition framework of nuScenes is reproduced. Blending the aforementioned components within a simulator allows us to obtain a synthetic-fusion domain in which we forge our novel driving dataset, MORDA: Mixture Of Real-domain characteristics for synthetic-data-assisted Domain Adaptation. To verify the value of synthetic features that MORDA provides in learning about driving environments of South Korea, 2D/3D detectors are trained solely on a combination of nuScenes and MORDA. Afterward, their performance is evaluated on the unforeseen real-world dataset (AI-Hub) collected in South Korea. Our experiments show that MORDA can significantly improve mean Average Precision (mAP) on the AI-Hub dataset while performance on nuScenes is retained or slightly enhanced.
[CV-63] Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments
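[Quick Read]: This paper addresses prediction uncertainty and hallucination in indoor scene recognition based on Vision Language Models (VLMs) for assistive robotics serving people with disabilities (PWD). Language interfaces can produce inaccurate predictions about visual scenes, and human language instructions are often ambiguous, lacking precise details about specific locations, objects, or actions, which makes the uncertainty worse. The proposed solution is the Seeing with Partial Certainty (SwPC) framework. Built on conformal prediction theory, it measures and aligns the VLM's uncertainty in scene recognition so that the model can recognize when it lacks confidence and ask for help when needed. In experiments on the Matterport3D dataset, SwPC significantly increases the scene recognition success rate and reduces the human intervention required. The framework needs no fine-tuning of the VLM and offers a lightweight approach to uncertainty modeling that complements, and scales alongside, the growing capabilities of foundation models.
Link: https://arxiv.org/abs/2501.04947
Authors: Yifan Xu,Vineet Kamat,Carol Menassa
Affiliations: Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI, 48109-2125
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 Figures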
Abstract:In assistive robotics serving people with disabilities (PWD), accurate place recognition in built environments is crucial to ensure that robots navigate and interact safely within diverse indoor spaces. Language interfaces, particularly those powered by Large Language Models (LLM) and Vision Language Models (VLM), hold significant promise in this context, as they can interpret visual scenes and correlate them with semantic information. However, such interfaces are also known for their hallucinated predictions. In addition, language instructions provided by humans can also be ambiguous and lack precise details about specific locations, objects, or actions, exacerbating the hallucination issue. In this work, we introduce Seeing with Partial Certainty (SwPC) - a framework designed to measure and align uncertainty in VLM-based place recognition, enabling the model to recognize when it lacks confidence and seek assistance when necessary. This framework is built on the theory of conformal prediction to provide statistical guarantees on place recognition while minimizing requests for human help in complex indoor environment settings. Through experiments on the widely used richly-annotated scene dataset Matterport3D, we show that SwPC significantly increases the success rate and decreases the amount of human intervention required relative to the prior art. SwPC can be utilized with any VLMs directly without requiring model fine-tuning, offering a promising, lightweight approach to uncertainty modeling that complements and scales alongside the expanding capabilities of foundational models.
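The idea of "recognizing when the model lacks confidence" can be illustrated with generic split conformal prediction. The sketch below is not SwPC itself: the nonconformity score (one minus the probability of the true label), the calibration scores, the room labels, and the rule "ask for help unless the prediction set has exactly one label" are all assumptions chosen for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Quantile of calibration nonconformity scores for target coverage 1 - alpha."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))   # 1-indexed rank with finite-sample correction
    if rank > n:
        return float("inf")                   # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, qhat):
    """Keep every label whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in probs.items() if 1 - p <= qhat]

def needs_help(labels):
    """Assumed policy: act only when exactly one label remains."""
    return len(labels) != 1

# Hypothetical calibration scores: 1 - P(true label) on held-out scenes.
cal_scores = [0.1, 0.3, 0.5, 0.7, 0.2, 0.6, 0.4]
qhat = conformal_threshold(cal_scores, alpha=0.25)

confident = prediction_set({"kitchen": 0.8, "dining room": 0.15, "bathroom": 0.05}, qhat)
ambiguous = prediction_set({"kitchen": 0.5, "dining room": 0.45, "bathroom": 0.05}, qhat)
```

With these toy numbers the first scene yields a single label and the second yields two, so only the second would trigger a request for human assistance.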
[CV-64] MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification
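[Quick Read]: This paper addresses the speed and memory challenges of Transformer models in hyperspectral image (HSI) classification, which stem from their quadratic computational complexity. It proposes MambaHSI, a new HSI classification model based on the Mamba model. The key is that it models long-range interactions across the whole image while integrating spatial and spectral information adaptively. Specifically, a spatial Mamba block (SpaMB) models long-range interactions of the whole image at the pixel level; a spectral Mamba block (SpeMB) splits the spectral vector into multiple groups, mines the relations across spectral groups, and extracts spectral features; and a spatial-spectral fusion module (SSFM) adaptively integrates the spatial and spectral features of the HSI. According to the authors, this is the first image-level HSI classification model based on Mamba, and it shows the potential of Mamba as a next-generation backbone for HSI models.
Link: https://arxiv.org/abs/2501.04944
Authors: Yapeng Li,Yong Luo,Lefei Zhang,Zengmao Wang,Bo Du
Affiliations: National Engineering Research Center for Multimedia Software, School of Computer Science, Institute of Artificial Intelligence, and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by IEEE TGRS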
Abstract:Transformer has been extensively explored for hyperspectral image (HSI) classification. However, transformer poses challenges in terms of speed and memory usage because of its quadratic computational complexity. Recently, the Mamba model has emerged as a promising approach, which has strong long-distance modeling capabilities while maintaining a linear computational complexity. However, representing the HSI is challenging for the Mamba due to the requirement for an integrated spatial and spectral understanding. To remedy these drawbacks, we propose a novel HSI classification model based on a Mamba model, named MambaHSI, which can simultaneously model long-range interaction of the whole image and integrate spatial and spectral information in an adaptive manner. Specifically, we design a spatial Mamba block (SpaMB) to model the long-range interaction of the whole image at the pixel level. Then, we propose a spectral Mamba block (SpeMB) to split the spectral vector into multiple groups, mine the relations across different spectral groups, and extract spectral features. Finally, we propose a spatial-spectral fusion module (SSFM) to adaptively integrate spatial and spectral features of an HSI. To the best of our knowledge, this is the first image-level HSI classification model based on the Mamba. We conduct extensive experiments on four diverse HSI datasets. The results demonstrate the effectiveness and superiority of the proposed model for HSI classification. This reveals the great potential of Mamba to be the next-generation backbone for HSI models. Codes are available at this https URL.
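The first step of the spectral block, splitting a pixel's spectral vector into groups, can be pictured with a small helper. This is a guess at the simplest possible grouping (contiguous, near-equal bands); the group count, the handling of leftover bands, and the function name are assumptions, and the sequence model that relates the groups is not shown.

```python
def split_spectral_groups(spectrum, num_groups):
    """Split a 1D spectral vector into contiguous groups of near-equal size."""
    size, remainder = divmod(len(spectrum), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        end = start + size + (1 if g < remainder else 0)  # earlier groups absorb leftovers
        groups.append(spectrum[start:end])
        start = end
    return groups

# Toy pixel with 10 spectral bands (values are just band indices).
spectrum = list(range(10))
groups = split_spectral_groups(spectrum, num_groups=4)
```

Each group would then be treated as one token in a sequence, so relations across groups can be modeled in order.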
[CV-65] Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation
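[Quick Read]: This paper addresses two main problems in Referring Video Object Segmentation: query inconsistency and limited consideration of context. Query inconsistency produces unstable masks of different objects in the middle of a video, while limited consideration of context leads to segmenting objects that do not match the given text description. The paper proposes the Multi-context Temporal Consistency Module (MTCM), which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner removes noise from queries and aligns them to achieve query consistency; the MCE predicts text-relevant queries by considering multiple contexts. Applied to four different models, the module improves performance in every case, reaching 47.6 JF on the MeViS dataset.
Link: https://arxiv.org/abs/2501.04939
Authors: Sun-Hyuk Choi,Hayoung Jo,Seong-Whan Lee
Affiliations: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: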
Abstract:Referring video object segmentation aims to segment objects within a video corresponding to a given text description. Existing transformer-based temporal modeling approaches face challenges related to query inconsistency and the limited consideration of context. Query inconsistency produces unstable masks of different objects in the middle of the video. The limited consideration of context leads to the segmentation of incorrect objects by failing to adequately account for the relationship between the given text and instances. To address these issues, we propose the Multi-context Temporal Consistency Module (MTCM), which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner removes noise from queries and aligns them to achieve query consistency. The MCE predicts text-relevant queries by considering multi-context. We applied MTCM to four different models, increasing performance across all of them, particularly achieving 47.6 JF on the MeViS. Code is available at this https URL.
[CV-66] Plug-and-Play DISep: Separating Dense Instances for Scene-to-Pixel Weakly-Supervised Change Detection in High-Resolution Remote Sensing Images
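[Quick Read]: This paper addresses the "instance lumping" problem that Weakly-Supervised Change Detection (WSCD) methods exhibit under scene-level supervision. When changed instances (i.e., changed objects) are densely distributed, unchanged pixels between them are easily misidentified as changed, so several changed instances are treated as one and the number of changes cannot be counted accurately. The paper proposes Dense Instance Separation (DISep) as a plug-and-play solution. Its core is a three-step iterative training process that refines pixel features: 1) Instance Localization, which locates candidate regions for changed pixels using high-pass class activation maps; 2) Instance Retrieval, which groups changed pixels and assigns instance IDs through connectivity searching, then extracts pixel-level features per instance ID; 3) Instance Separation, which introduces a separation loss to enforce intra-instance pixel consistency in the embedding space and thereby keep instance features separable. The method adds only a small training cost and no inference cost, integrates seamlessly into existing WSCD methods, and achieves state-of-the-art performance on several datasets.
Link: https://arxiv.org/abs/2501.04934
Authors: Zhenghui Zhao,Chen Wu,Lixiang Ru,Di Wang,Hongruixuan Chen,Cuiqun Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ISPRS Journal of Photogrammetry and Remote Sensing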
Abstract:Existing Weakly-Supervised Change Detection (WSCD) methods often encounter the problem of “instance lumping” under scene-level supervision, particularly in scenarios with a dense distribution of changed instances (i.e., changed objects). In these scenarios, unchanged pixels between changed instances are also mistakenly identified as changed, causing multiple changes to be mistakenly viewed as one. In practical applications, this issue prevents the accurate quantification of the number of changes. To address this issue, we propose a Dense Instance Separation (DISep) method as a plug-and-play solution, refining pixel features from a unified instance perspective under scene-level supervision. Specifically, our DISep comprises a three-step iterative training process: 1) Instance Localization: We locate instance candidate regions for changed pixels using high-pass class activation maps. 2) Instance Retrieval: We identify and group these changed pixels into different instance IDs through connectivity searching. Then, based on the assigned instance IDs, we extract corresponding pixel-level features on a per-instance basis. 3) Instance Separation: We introduce a separation loss to enforce intra-instance pixel consistency in the embedding space, thereby ensuring separable instance feature representations. The proposed DISep adds only minimal training cost and no inference cost. It can be seamlessly integrated to enhance existing WSCD methods. We achieve state-of-the-art performance by enhancing three Transformer-based and four ConvNet-based methods on the LEVIR-CD, WHU-CD, DSIFN-CD, SYSU-CD, and CDD datasets. Additionally, our DISep can be used to improve fully-supervised change detection methods. Code is available at this https URL.
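The Instance Retrieval step, grouping changed pixels into instance IDs through connectivity searching, corresponds to connected-component labeling. The sketch below is a generic breadth-first version on a binary mask; the 4-connectivity choice and the function name are assumptions, since the abstract does not state which connectivity or implementation the authors use.

```python
from collections import deque

def label_instances(mask):
    """Assign instance IDs (1, 2, ...) to 4-connected groups of changed pixels."""
    height, width = len(mask), len(mask[0])
    ids = [[0] * width for _ in range(height)]
    next_id = 0
    for r in range(height):
        for c in range(width):
            if mask[r][c] and ids[r][c] == 0:
                next_id += 1                       # start a new instance
                ids[r][c] = next_id
                queue = deque([(r, c)])
                while queue:                       # flood-fill the connected region
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and ids[ny][nx] == 0):
                            ids[ny][nx] = next_id
                            queue.append((ny, nx))
    return ids, next_id

# Toy change mask: 1 marks a changed pixel.
mask = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
ids, count = label_instances(mask)
```

With per-pixel IDs in hand, features can be pooled per instance and a separation loss applied between instances.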
[CV-67] Image2CADSeq: Computer-Aided Design Sequence and Knowledge Inference from Product Images
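[Quick Read]: This paper addresses how to reconstruct 3D CAD models from 2D image data through Reverse Engineering (RE) when digital CAD files are unavailable. Conventional RE approaches mainly convert 3D data such as point clouds into 3D models in boundary representation (B-rep) format, but 3D data is hard to obtain and B-rep models do not reveal knowledge about the 3D modeling process behind a design. The paper proposes a data-driven approach built around the Image2CADSeq neural network model. The key is to take 2D images as input and generate a CAD sequence, which a solid modeling kernel can then translate into a B-rep model. Compared with B-rep models, CAD sequences are more flexible: individual steps of model creation can be modified, which gives deeper insight into how the CAD model is constructed. To evaluate the model's predictive performance quantitatively and rigorously, the authors develop a multi-level evaluation framework and train and optimize the model on a specially synthesized dataset. Experimental and validation results indicate strong potential for generating CAD sequences from 2D image data.
Link: https://arxiv.org/abs/2501.04928
Authors: Xingang Li,Zhenghui Sha
Affiliations: Walker Department of Mechanical Engineering, The University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 10 figures, and 6 tables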
Abstract:Computer-aided design (CAD) tools empower designers to design and modify 3D models through a series of CAD operations, commonly referred to as a CAD sequence. In scenarios where digital CAD files are not accessible, reverse engineering (RE) has been used to reconstruct 3D CAD models. Recent advances have seen the rise of data-driven approaches for RE, with a primary focus on converting 3D data, such as point clouds, into 3D models in boundary representation (B-rep) format. However, obtaining 3D data poses significant challenges, and B-rep models do not reveal knowledge about the 3D modeling process of designs. To this end, our research introduces a novel data-driven approach with an Image2CADSeq neural network model. This model aims to reverse engineer CAD models by processing images as input and generating CAD sequences. These sequences can then be translated into B-rep models using a solid modeling kernel. Unlike B-rep models, CAD sequences offer enhanced flexibility to modify individual steps of model creation, providing a deeper understanding of the construction process of CAD models. To quantitatively and rigorously evaluate the predictive performance of the Image2CADSeq model, we have developed a multi-level evaluation framework for model assessment. The model was trained on a specially synthesized dataset, and various network architectures were explored to optimize the performance. The experimental and validation results show great potential for the model in generating CAD sequences from 2D image data.
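To show why a CAD sequence is easier to edit than a final solid, here is a toy sequence of operations with a stand-in evaluator. Everything in it is hypothetical: the `SketchRect` and `Extrude` operation types, the sketch-then-extrude vocabulary, and the `volume` function, which merely imitates what a real solid modeling kernel would do when building a B-rep.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SketchRect:
    """Hypothetical operation: draw a width x height rectangle profile."""
    width: float
    height: float

@dataclass(frozen=True)
class Extrude:
    """Hypothetical operation: extrude the latest profile by depth."""
    depth: float

def volume(sequence):
    """Toy stand-in for a kernel: evaluate the sequence to the solid's volume."""
    area, total = 0.0, 0.0
    for op in sequence:
        if isinstance(op, SketchRect):
            area = op.width * op.height
        elif isinstance(op, Extrude):
            total += area * op.depth
    return total

sequence = [SketchRect(2.0, 3.0), Extrude(4.0)]
# Editing one step of the construction history changes the resulting solid.
edited = [sequence[0], replace(sequence[1], depth=5.0)]
```

Changing a single step and re-evaluating is all it takes to get a variant design, which is the flexibility the summary attributes to CAD sequences.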
[CV-68] From Mesh Completion to AI Designed Crown
【Quick Read】: This paper targets the time-consuming, labor-intensive process of dental crown design, aiming to simplify crown design and reduce tedious manual adjustment while preserving high accuracy and consistency. The authors propose a new end-to-end deep learning approach, Dental Mesh Completion (DMC), which generates a crown mesh conditioned on a point cloud context. The key idea is to cast crown generation as completion of that point cloud context. First, a feature extractor converts the input point cloud into feature vectors representing local regions; a transformer then takes these vectors and predicts new feature vectors for the missing region (the crown); next, a point reconstruction head followed by a multi-layer perceptron predicts a dense set of points with normals; finally, a differentiable point-to-mesh layer reconstructs the crown surface mesh. Compared with a graph-based convolutional neural network baseline, DMC proves more effective in experiments, reaching an average Chamfer distance of 0.062.
Link: https://arxiv.org/abs/2501.04914
Authors: Golriz Hosseinimanesh,Farnoosh Ghadiri,Francois Guibault,Farida Cheriet,Julia Keren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Designing a dental crown is a time-consuming and labor-intensive process. Our goal is to simplify crown design and minimize the tediousness of making manual adjustments while still ensuring the highest level of accuracy and consistency. To this end, we present a new end-to-end deep learning approach, coined Dental Mesh Completion (DMC), to generate a crown mesh conditioned on a point cloud context. The dental context includes the tooth prepared to receive a crown and its surroundings, namely the two adjacent teeth and the three closest teeth in the opposing jaw. We formulate crown generation in terms of completing this point cloud context. A feature extractor first converts the input point cloud into a set of feature vectors that represent local regions in the point cloud. The set of feature vectors is then fed into a transformer to predict a new set of feature vectors for the missing region (crown). Subsequently, a point reconstruction head, followed by a multi-layer perceptron, is used to predict a dense set of points with normals. Finally, a differentiable point-to-mesh layer serves to reconstruct the crown surface mesh. We compare our DMC method to a graph-based convolutional neural network which learns to deform a crown mesh from a generic crown shape to the target geometry. Extensive experiments on our dataset demonstrate the effectiveness of our method, which attains an average of 0.062 Chamfer Distance. The code is available at: this https URL
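The Chamfer distance quoted as the evaluation metric can be illustrated with a small sketch. This is not the authors' evaluation code: the function name `chamfer_distance` is hypothetical, and the convention used here (mean of squared nearest-neighbour distances, summed over both directions) is an assumption, since the abstract does not say which variant the paper uses.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Each set is a list of coordinate tuples. Assumed convention: mean of
    squared nearest-neighbour distances, summed over both directions.
    """
    def mean_nearest(src, dst):
        # For every point in src, squared distance to its closest point in dst.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return mean_nearest(a, b) + mean_nearest(b, a)
```

This brute-force version is quadratic in the number of points, so it only suits small point sets; a spatial index would be the usual choice for dense crown surfaces.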
[CV-69] A Machine Learning Model for Crowd Density Classification in Hajj Video Frames
【Quick Read】: This paper addresses crowd management during Hajj and Umrah, particularly in critical areas such as the Grand Mosque during the Tawaf ritual, where very large numbers of pilgrims gather. Such dense crowds can lead to public-safety threats including stampedes, fires, and pandemics. The paper proposes a machine learning model that classifies crowd density in video frames into three levels: moderate crowd, overcrowded, and very dense crowd. When a very dense crowd is detected, the system alerts organizers in real time with a flashing red light. The key to the model is its combination of Local Binary Pattern (LBP) texture analysis with edge density and area-based features, which improves discrimination between density levels. Tested on the KAU-Smart Crowd 'HAJJv2' dataset, the model reached 87% accuracy, demonstrating that it can detect and classify different crowd conditions and thereby support crowd management and safety at large-scale events.
Link: https://arxiv.org/abs/2501.04911
Authors: Afnan A.Shah
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:
Abstract:Managing the massive annual gatherings of Hajj and Umrah presents significant challenges, particularly as the Saudi government aims to increase the number of pilgrims. Currently, around two million pilgrims attend Hajj and 26 million attend Umrah, making crowd control, especially in critical areas like the Grand Mosque during Tawaf, a major concern. Additional risks arise in managing dense crowds at key sites such as Arafat, where the potential for stampedes, fires and pandemics poses serious threats to public safety. This research proposes a machine learning model to classify crowd density into three levels: moderate crowd, overcrowded and very dense crowd in video frames recorded during Hajj, with a flashing red light to alert organizers in real-time when a very dense crowd is detected. While current research efforts in processing Hajj surveillance videos focus solely on using CNN to detect abnormal behaviors, this research focuses more on high-risk crowds that can lead to disasters. Hazardous crowd conditions require a robust method, as incorrect classification could trigger unnecessary alerts and government intervention, while failure to classify could result in disaster. The proposed model integrates Local Binary Pattern (LBP) texture analysis, which enhances feature extraction for differentiating crowd density levels, along with edge density and area-based features. The model was tested on the KAU-Smart Crowd ‘HAJJv2’ dataset which contains 18 videos from various key locations during Hajj including ‘Massaa’, ‘Jamarat’, ‘Arafat’ and ‘Tawaf’. The model achieved an accuracy rate of 87% with a 2.14% error percentage (misclassification rate), demonstrating its ability to detect and classify various crowd conditions effectively. This contributes to enhanced crowd management and safety during large-scale events like Hajj.
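For readers unfamiliar with Local Binary Patterns, the texture feature named above can be sketched as follows. This is a generic 8-neighbour LBP on a grayscale image stored as a list of rows, not the paper's pipeline; the function names, the neighbour ordering, and the plain 256-bin histogram are assumptions made for illustration, and the edge-density and area-based features are left out.

```python
def lbp_code(img, r, c):
    """8-neighbour LBP code of the interior pixel (r, c) of a 2D gray image."""
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """Normalised 256-bin histogram of LBP codes over all interior pixels.

    Assumes the image is at least 3x3 so that interior pixels exist.
    """
    rows, cols = len(img), len(img[0])
    hist = [0] * 256
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            hist[lbp_code(img, r, c)] += 1
    total = (rows - 2) * (cols - 2)
    return [h / total for h in hist]
```

A histogram like this would typically be concatenated with the other features and passed to a classifier; which classifier the paper uses is not stated in the abstract.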
[CV-70] Topological Classification of points in Z2 by using Topological Numbers for 2D discrete binary images
【Quick Read】: This paper addresses the topological classification of points in 2D discrete binary images. The core of the solution is the computation of topological numbers, whose values sort the points of an image into six classes: isolated point, interior point, simple point, curve point, point of intersection of 3 curves, and point of intersection of 4 curves. Beyond characterizing each class, the paper gives the number of configurations in each class, providing a more precise topological description for image analysis and processing.
Link: https://arxiv.org/abs/2501.04878
Authors: Christophe Lohou
Affiliations: Université Clermont Auvergne; Clermont Auvergne INP; CNRS, Institut Pascal
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
Comments: arXiv admin note: substantial text overlap with arXiv:2410.21588
Abstract:In this paper, we propose a topological classification of points for 2D discrete binary images. This classification is based on the values of the calculus of topological numbers. Six classes of points are proposed: isolated point, interior point, simple point, curve point, point of intersection of 3 curves, point of intersection of 4 curves. The number of configurations of each class is also given.
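A minimal sketch of how topological numbers can be computed on a 3x3 neighbourhood, assuming the usual digital-topology convention of 8-connectivity for the foreground and 4-connectivity for the background. The helper names are hypothetical, and the mapping from number pairs to the six classes is the paper's own; only the computation of the two numbers is shown here.

```python
# Offsets of the 8 neighbours and of the 4 direct neighbours of the centre.
N8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
N4 = [(-1, 0), (0, -1), (0, 1), (1, 0)]


def count_components(cells, adjacency, seeds=None):
    """Count connected components of `cells` under `adjacency`.

    If `seeds` is given, only components containing a seed are counted.
    """
    cells = set(cells)
    seen = set()
    count = 0
    for start in cells:
        if start in seen:
            continue
        # Flood fill from `start` to collect its component.
        comp, stack = {start}, [start]
        while stack:
            r, c = stack.pop()
            for dr, dc in adjacency:
                nb = (r + dr, c + dc)
                if nb in cells and nb not in comp:
                    comp.add(nb)
                    stack.append(nb)
        seen |= comp
        if seeds is None or comp & seeds:
            count += 1
    return count


def topological_numbers(patch):
    """Return (T8, T4_bar) for the centre of a 3x3 patch of 0/1 values.

    T8: 8-connected foreground components among the 8 neighbours.
    T4_bar: 4-connected background components among the 8 neighbours
    that are 4-adjacent to the centre.
    """
    fg = {(dr, dc) for dr, dc in N8 if patch[1 + dr][1 + dc] == 1}
    bg = set(N8) - fg
    return count_components(fg, N8), count_components(bg, N4, seeds=set(N4))
```

Under this convention a centre with no foreground neighbours gives (0, 1), a fully surrounded centre gives (1, 0), and a point in the middle of a straight one-pixel-wide line gives (2, 2).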
[CV-71] Back Home: A Machine Learning Approach to Seashell Classification and Ecosystem Restoration
【Quick Read】: This paper addresses the roughly 5 tons of seashells extracted from Costa Rican ecosystems each year: confiscated shells cannot be returned to their ecosystems because their origin cannot be identified. To address this, the team developed a convolutional neural network (CNN) specifically for seashell identification. The key steps were building a dataset of approximately 19,000 images of shells from the Pacific and Caribbean coasts and training a model on it that exceeds 85% classification accuracy. The model has been integrated into a user-friendly application that processes shell images in real time, returning results within 3 seconds per image. To further improve accuracy, an anomaly detection mechanism filters out irrelevant or anomalous inputs so that only valid seashell images are processed.
Link: https://arxiv.org/abs/2501.04873
Authors: Alexander Valverde,Luis Solano
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:In Costa Rica, an average of 5 tons of seashells are extracted from ecosystems annually. Confiscated seashells cannot be returned to their ecosystems due to the lack of origin recognition. To address this issue, we developed a convolutional neural network (CNN) specifically for seashell identification. We built a dataset from scratch, consisting of approximately 19,000 images from the Pacific and Caribbean coasts. Using this dataset, the model achieved a classification accuracy exceeding 85%. The model has been integrated into a user-friendly application, which has classified over 36,000 seashells to date, delivering real-time results within 3 seconds per image. To further enhance the system’s accuracy, an anomaly detection mechanism was incorporated to filter out irrelevant or anomalous inputs, ensuring only valid seashell images are processed.
[CV-72] LayerMix: Enhanced Data Augmentation through Fractal Integration for Robust Deep Learning
【Quick Read】: This paper addresses the performance drop of deep learning models on Out-of-Distribution (OOD) samples, including natural image corruptions, adversarial perturbations, and anomalous patterns. Despite sophisticated architectures, existing models often fail to perform consistently on such samples. The proposed solution is LayerMix, a data augmentation approach that improves robustness through structured fractal-based image synthesis. Its key component is a structured mixing pipeline that preserves the semantics of the original image while introducing controlled variability, producing semantically consistent synthetic samples that improve generalization. Unlike traditional augmentation techniques that rely on random transformations, LayerMix systematically integrates structural complexity into the training set, improving classification accuracy as well as ML safety metrics: resilience to natural image corruptions, robustness against adversarial attacks, model calibration, and prediction consistency.
Link: https://arxiv.org/abs/2501.04861
Authors: Hafiz Mughees Ahmad,Dario Morle,Afshin Rahimi
Affiliations: University of Windsor, Canada; IFIVEO CANADA INC.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deep learning models have demonstrated remarkable performance across various computer vision tasks, yet their vulnerability to distribution shifts remains a critical challenge. Despite sophisticated neural network architectures, existing models often struggle to maintain consistent performance when confronted with Out-of-Distribution (OOD) samples, including natural corruptions, adversarial perturbations, and anomalous patterns. We introduce LayerMix, an innovative data augmentation approach that systematically enhances model robustness through structured fractal-based image synthesis. By meticulously integrating structural complexity into training datasets, our method generates semantically consistent synthetic samples that significantly improve neural network generalization capabilities. Unlike traditional augmentation techniques that rely on random transformations, LayerMix employs a structured mixing pipeline that preserves original image semantics while introducing controlled variability. Extensive experiments across multiple benchmark datasets, including CIFAR-10, CIFAR-100, ImageNet-200, and ImageNet-1K, demonstrate LayerMix's superior performance in classification accuracy and show that it substantially enhances critical Machine Learning (ML) safety metrics, including resilience to natural image corruptions, robustness against adversarial attacks, improved model calibration and enhanced prediction consistency. LayerMix represents a significant advancement toward developing more reliable and adaptable artificial intelligence systems by addressing the fundamental challenges of deep learning generalization. The code is available at this https URL.
[CV-73] EDMB: Edge Detector with Mamba
【Quick Read】: This paper addresses the prohibitive computational cost of Transformer-based models in edge detection. The authors propose a new edge detector, EDMB, built on vision Mamba's ability to capture long-range dependencies efficiently. EDMB combines Mamba with a global-local architecture so that it attends to both global information and fine-grained cues; the latter are crucial for edge detection but usually ignored by ordinary Mamba. To produce high-quality multi-granularity edges, a new decoder fuses global and fine-grained features to construct learnable Gaussian distributions, and multi-granularity edges are generated by sampling from them. To make multi-granularity edges applicable to single-label data, an Evidence Lower Bound loss supervises the learning of the distributions. On the multi-label dataset BSDS500, EDMB achieves a competitive single-granularity ODS of 0.837 and multi-granularity ODS of 0.851 without multi-scale testing or extra PASCAL-VOC data. EDMB also extends to single-label datasets such as NYUDv2 and BIPED.
Link: https://arxiv.org/abs/2501.04846
Authors: Yachuan Li,Xavier Soria Poma,Yun Bai,Qian Xiao,Chaozhi Yang,Guanlin Li,Zongmin Li
Affiliations: China University of Petroleum (East China); Polytechnic School of Chimborazo (ESPOCH)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Transformer-based models have made significant progress in edge detection, but their high computational cost is prohibitive. Recently, vision Mamba has shown excellent ability in efficiently capturing long-range dependencies. Drawing inspiration from this, we propose a novel edge detector with Mamba, termed EDMB, to efficiently generate high-quality multi-granularity edges. In EDMB, Mamba is combined with a global-local architecture, so it can focus on both global information and fine-grained cues. The fine-grained cues play a crucial role in edge detection, but are usually ignored by ordinary Mamba. We design a novel decoder to construct learnable Gaussian distributions by fusing global features and fine-grained features. The multi-granularity edges are then generated by sampling from the distributions. In order to make multi-granularity edges applicable to single-label data, we introduce Evidence Lower Bound loss to supervise the learning of the distributions. On the multi-label dataset BSDS500, our proposed EDMB achieves competitive single-granularity ODS 0.837 and multi-granularity ODS 0.851 without multi-scale test or extra PASCAL-VOC data. Remarkably, EDMB can be extended to single-label datasets such as NYUDv2 and BIPED. The source code is available at this https URL.
[CV-74] Towards Generalizable Trajectory Prediction Using Dual-Level Representation Learning And Adaptive Prompting
【Quick Read】: This paper addresses the weaknesses of existing vehicle trajectory prediction models in generalizability, prediction uncertainty, and handling of complex interactions, which often stem from complex architectures tailored to a specific dataset and inefficient multimodal handling. It proposes Perceiver with Register queries (PerReg+), a trajectory prediction framework whose key components are: (1) dual-level representation learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details; (2) enhanced multimodality through register-based queries and pretraining, removing the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, which freezes the main architecture and optimizes a small number of prompts for efficient adaptation. PerReg+ sets a new state of the art on multiple datasets and markedly reduces error in cross-domain tests.
Link: https://arxiv.org/abs/2501.04815
Authors: Kaouther Messaoud,Matthieu Cord,Alexandre Alahi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainties, and handling complex interactions. It is often due to limitations like complex architectures customized for a specific dataset and inefficient multimodal handling. We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details. Additionally, our approach of reconstructing segment-level trajectories and lane segments from masked inputs with query drop enables effective use of contextual information and improves generalization; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation. PerReg+ sets a new state-of-the-art performance on nuScenes [1], Argoverse 2 [2], and Waymo Open Motion Dataset (WOMD) [3]. Remarkably, our pretrained model reduces the error by 6.8% on smaller datasets, and multi-dataset training enhances generalization. In cross-domain tests, PerReg+ reduces B-FDE by 11.8% compared to its non-pretrained variant.
[CV-75] Leveraging Registers in Vision Transformers for Robust Adaptation ICASSP2025
【Quick Read】: This paper starts from the observation that high-norm tokens in Vision Transformers (ViTs) interfere with unsupervised object discovery, and examines how "registers" generalize in out-of-distribution (OOD) scenarios. Registers are additional tokens that isolate high-norm patch tokens while capturing global image-level information. The key proposal is to combine the commonly used CLS token embedding with the average-pooled register embeddings into a feature representation used to train a downstream classifier. Experiments show that this improves OOD generalization and anomaly rejection without additional computational overhead, while maintaining in-distribution (ID) performance.
Link: https://arxiv.org/abs/2501.04784
Authors: Srikar Yellapragada,Kowshik Thopalli,Vivek Narayanaswamy,Wesam Sakla,Yang Liu,Yamen Mubarka,Dimitris Samaras,Jayaraman J. Thiagarajan
Affiliations: Stony Brook University; Lawrence Livermore National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICASSP 2025
Abstract:Vision Transformers (ViTs) have shown success across a variety of tasks due to their ability to capture global image representations. Recent studies have identified the existence of high-norm tokens in ViTs, which can interfere with unsupervised object discovery. To address this, the use of “registers”, which are additional tokens that isolate high-norm patch tokens while capturing global image-level information, has been proposed. While registers have been studied extensively for object discovery, their generalization properties, particularly in out-of-distribution (OOD) scenarios, remain underexplored. In this paper, we examine the utility of register token embeddings in providing additional features for improving generalization and anomaly rejection. To that end, we propose a simple method that combines the special CLS token embedding commonly employed in ViTs with the average-pooled register embeddings to create feature representations which are subsequently used for training a downstream classifier. We find that this enhances OOD generalization and anomaly rejection, while maintaining in-distribution (ID) performance. Extensive experiments across multiple ViT backbones trained with and without registers reveal consistent improvements of 2-4% in top-1 OOD accuracy and a 2-3% reduction in false positive rates for anomaly detection. Importantly, these gains are achieved without additional computational overhead.
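The feature construction described above can be sketched in a few lines. The abstract only says the CLS embedding is "combined" with the average-pooled register embeddings; concatenation is an assumption here, and the function name `pooled_feature` is hypothetical.

```python
def pooled_feature(cls_embedding, register_embeddings):
    """Build a classifier input from a CLS embedding and register embeddings.

    Registers are averaged element-wise, then appended to the CLS vector.
    Concatenation is an assumed way of combining the two parts.
    """
    n = len(register_embeddings)
    dim = len(cls_embedding)
    # Element-wise mean over all register tokens.
    mean_register = [sum(reg[i] for reg in register_embeddings) / n
                     for i in range(dim)]
    return list(cls_embedding) + mean_register
```

Because both inputs already come out of the backbone's forward pass, building this feature needs no extra model evaluation, which is consistent with the abstract's claim of no additional computational overhead.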
[CV-76] GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting
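【Quick Read】: This paper addresses the high memory usage, lengthy training times, and temporal-consistency problems that efficient neural representations of dynamic video scenes face. It proposes a neural video representation combining 3D Gaussian splatting with continuous camera motion modeling. Neural ODEs learn smooth camera trajectories, while Gaussians maintain an explicit 3D scene representation. A spatiotemporal hierarchical learning strategy progressively refines spatial and temporal features to improve reconstruction quality and speed up convergence. The approach is memory-efficient and renders at high quality and high speed. Experiments show that hierarchical learning combined with robust camera motion modeling captures complex dynamic scenes and achieves state-of-the-art performance in both high- and low-motion scenarios.
Link: https://arxiv.org/abs/2501.04782
Authors: Andrew Bond,Jui-Hsien Wang,Long Mai,Erkut Erdem,Aykut Erdem
Affiliations: Koç University; Adobe Research; Hacettepe University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 10 figures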
Abstract:Efficient neural representations for dynamic video scenes are critical for applications ranging from video compression to interactive simulations. Yet, existing methods often face challenges related to high memory usage, lengthy training times, and temporal consistency. To address these issues, we introduce a novel neural video representation that combines 3D Gaussian splatting with continuous camera motion modeling. By leveraging Neural ODEs, our approach learns smooth camera trajectories while maintaining an explicit 3D scene representation through Gaussians. Additionally, we introduce a spatiotemporal hierarchical learning strategy, progressively refining spatial and temporal features to enhance reconstruction quality and accelerate convergence. This memory-efficient approach achieves high-quality rendering at impressive speeds. Experimental results show that our hierarchical learning, combined with robust camera motion modeling, captures complex dynamic scenes with strong temporal consistency, achieving state-of-the-art performance across diverse video datasets in both high- and low-motion scenarios.
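[CV-77] TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
【Quick Read】: This paper addresses the sample inefficiency and high training cost of diffusion models for visual generation, which is especially pronounced in the standard Diffusion Transformer (DiT) architecture because of its quadratic complexity in input length. The key idea is to use predefined routes that store token information and reintroduce it at deeper layers of the model, instead of discarding those tokens entirely. The method also combines multiple routes and introduces an adapted auxiliary loss that accounts for all applied routes. It applies to common transformer-based models and to state-space models, without architectural modifications. On class-conditional synthesis on the standard ImageNet-1K 256 x 256 benchmark, it reduces computational cost while improving performance, giving a convergence speedup of 9.55x over DiT at 400K training iterations and 25.39x over DiT's best benchmark performance at 7M training iterations.
Link: https://arxiv.org/abs/2501.04765
Authors: Felix Krause,Timy Phan,Vincent Tao Hu,Björn Ommer
Affiliations: CompVis @ LMU Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: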
Abstract:Diffusion models have emerged as the mainstream approach for visual generation. However, these models usually suffer from sample inefficiency and high training costs. This issue is particularly pronounced in the standard diffusion transformer architecture due to its quadratic complexity relative to input length. Recent works have addressed this by reducing the number of tokens processed in the model, often through masking. In contrast, this work aims to improve the training efficiency of the diffusion backbone by using predefined routes that store this information until it is reintroduced to deeper layers of the model, rather than discarding these tokens entirely. Further, we combine multiple routes and introduce an adapted auxiliary loss that accounts for all applied routes. Our method is not limited to the common transformer-based model - it can also be applied to state-space models. Unlike most current approaches, TREAD achieves this without architectural modifications. Finally, we show that our method reduces the computational cost and simultaneously boosts model performance on the standard benchmark ImageNet-1K 256 x 256 in class-conditional synthesis. Both of these benefits multiply to a convergence speedup of 9.55x at 400K training iterations compared to DiT and 25.39x compared to the best benchmark performance of DiT at 7M training iterations.
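The routing idea, where some tokens skip a span of layers and rejoin the sequence later instead of being dropped, can be sketched as below. This is a toy illustration rather than the TREAD implementation: layers are modelled as plain functions on token lists, and the random selection, the parameter names, and the single route are all assumptions. The auxiliary loss is not shown.

```python
import random


def route_tokens(tokens, layers, start, end, route_ratio, rng):
    """Run `tokens` through `layers`, letting a random subset bypass
    layers[start:end] and rejoin, unchanged, at their original positions.

    `layers` is a list of functions mapping a token list to a token list.
    """
    for layer in layers[:start]:
        tokens = layer(tokens)

    n = len(tokens)
    routed = set(rng.sample(range(n), int(n * route_ratio)))
    kept_idx = [i for i in range(n) if i not in routed]

    # Only the kept tokens pay the cost of the skipped layers.
    kept = [tokens[i] for i in kept_idx]
    for layer in layers[start:end]:
        kept = layer(kept)

    # Reintroduce routed tokens by writing processed ones back in place.
    merged = list(tokens)
    for i, tok in zip(kept_idx, kept):
        merged[i] = tok

    for layer in layers[end:]:
        merged = layer(merged)
    return merged
```

Since attention cost grows quadratically with sequence length, processing only the kept tokens in the routed span is where the compute saving in this sketch would come from.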
[CV-78] Video Summarisation with Incident and Context Information using Generative AI
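【Quick Read】: This paper addresses the inefficient analysis and poor resource utilization caused by the surge in video content production. It proposes an approach based on Generative Artificial Intelligence (GenAI) that streamlines video analysis by generating tailored textual summaries for user-defined queries. The tool uses YOLO-V8 for object detection and Gemini for comprehensive video and text analysis, which improves contextual accuracy. With this combination, users can quickly locate and verify relevant events in large volumes of CCTV footage without exhaustive manual review. Quantitative evaluation showed a similarity of 72.8% and qualitative assessment an accuracy of 85%.
Link: https://arxiv.org/abs/2501.04764
Authors: Ulindu De Silva,Leon Fernando,Kalinga Bandara,Rashmika Nawaratne
Affiliations: Department of Electronic and Telecommunication Engineering, University of Moratuwa, Sri Lanka; Research Centre for Data Analytics and Cognition, La Trobe University, Victoria, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: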
Abstract:The proliferation of video content production has led to vast amounts of data, posing substantial challenges in terms of analysis efficiency and resource utilization. Addressing this issue calls for the development of robust video analysis tools. This paper proposes a novel approach leveraging Generative Artificial Intelligence (GenAI) to facilitate streamlined video analysis. Our tool aims to deliver tailored textual summaries of user-defined queries, offering a focused insight amidst extensive video datasets. Unlike conventional frameworks that offer generic summaries or limited action recognition, our method harnesses the power of GenAI to distil relevant information, enhancing analysis precision and efficiency. Employing YOLO-V8 for object detection and Gemini for comprehensive video and text analysis, our solution achieves heightened contextual accuracy. By combining YOLO with Gemini, our approach furnishes textual summaries extracted from extensive CCTV footage, enabling users to swiftly navigate and verify pertinent events without the need for exhaustive manual review. The quantitative evaluation revealed a similarity of 72.8%, while the qualitative assessment rated an accuracy of 85%, demonstrating the capability of the proposed method.
[CV-79] Efficient License Plate Recognition in Videos Using Visual Rhythm and Accumulative Line Analysis
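【Quick Read】: This paper addresses the high computational demands of video-based Automatic License Plate Recognition (ALPR). Traditional systems typically depend on high-end computing resources and process multiple frames per plate, which adds overhead. The paper proposes two methods that extract exactly one frame per vehicle: the first uses Visual Rhythm (VR) to generate time-spatial images from the video; the second uses Accumulative Line Analysis (ALA), an algorithm based on single-line video processing that runs in real time. Both use YOLO for license plate detection and a Convolutional Neural Network (CNN) for Optical Character Recognition (OCR) to read the plate text from that single frame. Experiments show recognition results comparable to traditional frame-by-frame processing at three times the processing speed.
Link: https://arxiv.org/abs/2501.04750
Authors: Victor Nascimento Ribeiro,Nina S. T. Hirata
Affiliations: University of São Paulo - USP; Institute of Mathematics and Statistics
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2024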
Abstract:Video-based Automatic License Plate Recognition (ALPR) involves extracting vehicle license plate text information from video captures. Traditional systems typically rely heavily on high-end computing resources and utilize multiple frames to recognize license plates, leading to increased computational overhead. In this paper, we propose two methods capable of efficiently extracting exactly one frame per vehicle and recognizing its license plate characters from this single image, thus significantly reducing computational demands. The first method uses Visual Rhythm (VR) to generate time-spatial images from videos, while the second employs Accumulative Line Analysis (ALA), a novel algorithm based on single-line video processing for real-time operation. Both methods leverage YOLO for license plate detection within the frame and a Convolutional Neural Network (CNN) for Optical Character Recognition (OCR) to extract textual information. Experiments on real videos demonstrate that the proposed methods achieve results comparable to traditional frame-by-frame approaches, with processing speeds three times faster.
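A visual rhythm image of the kind mentioned above is commonly built by sampling one fixed line of pixels from every frame and stacking those lines over time. The sketch below assumes a horizontal line and frames stored as lists of rows; the function name and the choice of line are illustrative assumptions, not the paper's configuration.

```python
def visual_rhythm(frames, line_index):
    """Build a time-spatial image from a video.

    Row t of the result is row `line_index` of frame t, so the vertical
    axis of the output is time and the horizontal axis is image width.
    """
    return [list(frame[line_index]) for frame in frames]
```

A vehicle crossing the sampled line leaves a compact mark in this image, so one detection pass over a single 2D image can stand in for scanning every frame; how the paper maps marks back to a specific frame is not detailed in the abstract.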
[CV-80] Flatland Vision
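【Quick Read】: This paper asks under what conditions two sets of labeled points lying in a pair of projective planes can be projected to the same image on a projective line. The main result: projection centers yielding a common image exist if and only if the two point sets are themselves images of a common pointset in projective space. In other words, whether the two labeled sets can agree after projection depends on whether they originate from a single point set in projective space.
Link: https://arxiv.org/abs/2501.05429
Authors: Sameer Agarwal,Erin Connelly,Annalisa Crannell,Timothy Duff,Rekha R. Thomas
Affiliations: Unknown
Categories: Algebraic Geometry (math.AG); Computer Vision and Pattern Recognition (cs.CV)
Comments: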
Abstract:When is it possible to project two sets of labeled points lying in a pair of projective planes to the same image on a projective line? We give a complete answer to this question and describe the loci of the projection centers that enable a common image. In particular, we find that there exists a solution to this problem if and only if these two sets are themselves images of a common pointset in projective space.
[CV-81] Optimized Sampling for Non-Line-of-Sight Imaging Using Modified Fast Fourier Transforms
Link: https://arxiv.org/abs/2501.05244
Authors: Talha Sultan,Alex Bocchieri,Chaoying Gu,Xiaochun Liu,Pavel Polynkin,Andreas Velten
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments:
[CV-82] Improving the U-Net Configuration for Automated Delineation of Head and Neck Cancer on MRI
【Quick Read】: This paper addresses automated segmentation of head and neck tumors on MRI, a task that is usually performed manually in clinical settings and is time-consuming and challenging. Rather than designing a new task-specific convolutional neural network, the work improves the configuration of the traditional U-Net architecture. The improvements are: (1) patch-wise normalization in both training and sliding window inference; (2) a scheduled data augmentation policy during training; and (3) Gaussian weighting to combine the predictions of individual patches during sliding window inference. With these changes, the model obtained an aggregated Dice Similarity Coefficient (DSCagg) of 0.749 on Task 1 and 0.710 on Task 2 of the MICCAI HNTS-MRG 2024 Challenge in five-fold cross-validation, and showed consistent results on a private test set of 50 patients.
Link: https://arxiv.org/abs/2501.05120
Authors: Andrei Iantsen
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Tumor volume segmentation on MRI is a challenging and time-consuming process that is performed manually in typical clinical settings. This work presents an approach to automated delineation of head and neck tumors on MRI scans, developed in the context of the MICCAI Head and Neck Tumor Segmentation for MR-Guided Applications (HNTS-MRG) 2024 Challenge. Rather than designing a new, task-specific convolutional neural network, the focus of this research was to propose improvements to the configuration commonly used in medical segmentation tasks, relying solely on the traditional U-Net architecture. The empirical results presented in this article suggest the superiority of patch-wise normalization used for both training and sliding window inference. They also indicate that the performance of segmentation models can be enhanced by applying a scheduled data augmentation policy during training. Finally, it is shown that a small improvement in quality can be achieved by using Gaussian weighting to combine predictions for individual patches during sliding window inference. The model with the best configuration obtained an aggregated Dice Similarity Coefficient (DSCagg) of 0.749 in Task 1 and 0.710 in Task 2 on five cross-validation folds. The ensemble of five models (one best model per validation fold) showed consistent results on a private test set of 50 patients with a DSCagg of 0.752 in Task 1 and 0.718 in Task 2 (team name: this http URL). The source code and model weights are freely available at this http URL.
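Gaussian weighting during sliding window inference can be sketched in one dimension: overlapping patch predictions are averaged with weights that peak at the patch centre, so unreliable border predictions contribute less. The function names, the `sigma_scale` value, and the 1D setting are assumptions for illustration; the paper works on 3D MRI volumes.

```python
import math


def gaussian_weights(size, sigma_scale=0.125):
    """Gaussian window over a patch, peaking at the patch centre.

    `sigma_scale` (sigma as a fraction of patch size) is an assumed value.
    """
    sigma = size * sigma_scale
    center = (size - 1) / 2
    return [math.exp(-((i - center) ** 2) / (2 * sigma ** 2))
            for i in range(size)]


def blend_patches(length, patch_size, stride, predict):
    """Sliding-window inference over a 1D signal of `length` samples.

    `predict(start)` returns `patch_size` predictions for the patch that
    begins at `start`. Assumes the windows cover every position.
    """
    weights = gaussian_weights(patch_size)
    acc = [0.0] * length   # weighted sum of predictions
    norm = [0.0] * length  # sum of weights, for normalisation
    for start in range(0, length - patch_size + 1, stride):
        for i, (p, w) in enumerate(zip(predict(start), weights)):
            acc[start + i] += w * p
            norm[start + i] += w
    return [a / n for a, n in zip(acc, norm)]
```

With uniform weights every overlapping patch would count equally; the Gaussian window instead lets the patch in which a voxel sits nearest the centre dominate the blended value.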
[CV-83] A Steerable Deep Network for Model-Free Diffusion MRI Registration
【Quick Read】: This paper addresses nonrigid registration in diffusion MRI (dMRI), which is challenging because the data is high-dimensional and orientation-dependent. Classical methods are accurate but computationally demanding, while deep learning has been comparatively underexplored for nonrigid dMRI registration. The paper proposes a deep learning framework for model-free nonrigid registration directly on raw dMRI data, with no explicit reorientation. The key is to formulate registration as an equivariant diffeomorphism of position-and-orientation space, using an SE(3)-equivariant UNet that generates velocity fields while preserving the geometric properties of the raw dMRI domain. A new loss function based on the maximum mean discrepancy in Fourier space implicitly matches ensemble average propagators across images. On Human Connectome Project dMRI data the method is competitive with state-of-the-art approaches and avoids the overhead of estimating derived representations, laying a foundation for data-driven, geometry-aware dMRI registration directly in the acquisition space.
Link: https://arxiv.org/abs/2501.04794
Authors: Gianfranco Cortes,Baba C. Vemuri
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Nonrigid registration is vital to medical image analysis but remains challenging for diffusion MRI (dMRI) due to its high-dimensional, orientation-dependent nature. While classical methods are accurate, they are computationally demanding, and deep neural networks, though efficient, have been underexplored for nonrigid dMRI registration compared to structural imaging. We present a novel deep learning framework for model-free, nonrigid registration of raw diffusion MRI data that does not require explicit reorientation. Unlike previous methods relying on derived representations such as diffusion tensors or fiber orientation distribution functions, in our approach, we formulate the registration as an equivariant diffeomorphism of position-and-orientation space. Central to our method is an SE(3)-equivariant UNet that generates velocity fields while preserving the geometric properties of a raw dMRI’s domain. We introduce a new loss function based on the maximum mean discrepancy in Fourier space, implicitly matching ensemble average propagators across images. Experimental results on Human Connectome Project dMRI data demonstrate competitive performance compared to state-of-the-art approaches, with the added advantage of bypassing the overhead for estimating derived representations. This work establishes a foundation for data-driven, geometry-aware dMRI registration directly in the acquisition space.
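As background for the loss mentioned above, the maximum mean discrepancy (MMD) between two sample sets can be estimated with a kernel. The sketch below is the generic biased estimator with a Gaussian kernel on raw sample vectors; it is not the paper's Fourier-space formulation, and the function names and the `bandwidth` parameter are assumptions.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets `xs` and `ys`."""
    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        return sum(gaussian_kernel(p, q, bandwidth)
                   for p in a for q in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)
```

The estimate is zero when the two sample sets coincide and grows as the distributions move apart, which is what makes it usable as a registration loss.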
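[CV-84] Topology-based deep-learning segmentation method for deep anterior lamellar keratoplasty (DALK) surgical guidance using M-mode OCT data
【Quick Read】: This paper addresses the poor performance of conventional deep-learning segmentation when an Optical Coherence Tomography (OCT)-guided robot segments corneal layers during Deep Anterior Lamellar Keratoplasty (DALK): signal noise and instability lead to rough, inaccurate layer detection, which compromises surgical precision and stability.
The key to the solution is a topology-based deep-learning segmentation method that combines a topological loss function with a modified network architecture, reducing the effect of noise and improving segmentation speed, precision, and stability. Validated on in vivo, ex vivo, and hybrid rabbit eye datasets, the method segments the epithelium and Descemet's membrane (DM) well and outperforms traditional loss-based techniques, providing fast, accurate, and robust guidance for surgery.
Link: https://arxiv.org/abs/2501.04735
Authors: J. Yu,H. Yi,Y. Wang,J. D. Opfermann,W. G. Gensheimer,A. Krieger,J. U. Kang
Affiliations: Johns Hopkins University; White River Junction VA Medical Center; Dartmouth-Hitchcock Medical Center
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: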
Abstract:Deep Anterior Lamellar Keratoplasty (DALK) is a partial-thickness corneal transplant procedure used to treat corneal stromal diseases. A crucial step in this procedure is the precise separation of the deep stroma from Descemet’s membrane (DM) using the Big Bubble technique. To simplify the tasks of needle insertion and pneumo-dissection in this technique, we previously developed an Optical Coherence Tomography (OCT)-guided, eye-mountable robot that uses real-time tracking of corneal layers from M-mode OCT signals for control. However, signal noise and instability during manipulation of the OCT fiber sensor-integrated needle have hindered the performance of conventional deep-learning segmentation methods, resulting in rough and inaccurate detection of corneal layers. To address these challenges, we have developed a topology-based deep-learning segmentation method that integrates a topological loss function with a modified network architecture. This approach effectively reduces the effects of noise and improves segmentation speed, precision, and stability. Validation using in vivo, ex vivo, and hybrid rabbit eye datasets demonstrates that our method outperforms traditional loss-based techniques, providing fast, accurate, and robust segmentation of the epithelium and DM to guide surgery.
Artificial Intelligence
[AI-0] From Simple to Complex Skills: The Case of In-Hand Object Reorientation
Link: https://arxiv.org/abs/2501.05439
Authors: Haozhi Qi,Brent Yi,Mike Lambeta,Yi Ma,Roberto Calandra,Jitendra Malik
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: website: this https URL
Abstract:Learning policies in simulation and transferring them to the real world has become a promising approach in dexterous manipulation. However, bridging the sim-to-real gap for each new task requires substantial human effort, such as careful reward engineering, hyperparameter tuning, and system identification. In this work, we present a system that leverages low-level skills to address these challenges for more complex tasks. Specifically, we introduce a hierarchical policy for in-hand object reorientation based on previously acquired rotation skills. This hierarchical policy learns to select which low-level skill to execute based on feedback from both the environment and the low-level skill policies themselves. Compared to learning from scratch, the hierarchical policy is more robust to out-of-distribution changes and transfers easily from simulation to real-world environments. Additionally, we propose a generalizable object pose estimator that uses proprioceptive information, low-level skill predictions, and control errors as inputs to estimate the object pose over time. We demonstrate that our system can reorient objects, including symmetrical and textureless ones, to a desired pose.
[AI-1] Neuro-Symbolic AI in 2024: A Systematic Review
Link: https://arxiv.org/abs/2501.05435
Authors: Brandon C. Colelough, William Regli
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages
Abstract:Background: The field of Artificial Intelligence has undergone cyclical periods of growth and decline, known as AI summers and winters. Currently, we are in the third AI summer, characterized by significant advancements and commercialization, particularly in the integration of Symbolic AI and Sub-Symbolic AI, leading to the emergence of Neuro-Symbolic AI. Methods: The review followed the PRISMA methodology, utilizing databases such as IEEE Explore, Google Scholar, arXiv, ACM, and SpringerLink. The inclusion criteria targeted peer-reviewed papers published between 2020 and 2024. Papers were screened for relevance to Neuro-Symbolic AI, with further inclusion based on the availability of associated codebases to ensure reproducibility. Results: From an initial pool of 1,428 papers, 167 met the inclusion criteria and were analyzed in detail. The majority of research efforts are concentrated in the areas of learning and inference (63%), logic and reasoning (35%), and knowledge representation (44%). Explainability and trustworthiness are less represented (28%), with Meta-Cognition being the least explored area (5%). The review identifies significant interdisciplinary opportunities, particularly in integrating explainability and trustworthiness with other research areas. Conclusion: Neuro-Symbolic AI research has seen rapid growth since 2020, with concentrated efforts in learning and inference. Significant gaps remain in explainability, trustworthiness, and Meta-Cognition. Addressing these gaps through interdisciplinary research will be crucial for advancing the field towards more intelligent, reliable, and context-aware AI systems. 
[AI-2] TimeRL: Efficient Deep Reinforcement Learning with Polyhedral Dependence Graphs
Link: https://arxiv.org/abs/2501.05408
Authors: Pedro F. Silvestre, Peter Pietzuch
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 11 figures, 5 bibliography pages
Abstract:Modern deep learning (DL) workloads increasingly use complex deep reinforcement learning (DRL) algorithms that generate training data within the learning loop. This results in programs with several nested loops and dynamic data dependencies between tensors. While DL systems with eager execution support such dynamism, they lack the optimizations and smart scheduling of graph-based execution. Graph-based execution, however, cannot express dynamic tensor shapes, instead requiring the use of multiple static subgraphs. Either execution model for DRL thus leads to redundant computation, reduced parallelism, and less efficient memory management. We describe TimeRL, a system for executing dynamic DRL programs that combines the dynamism of eager execution with the whole-program optimizations and scheduling of graph-based execution. TimeRL achieves this by introducing the declarative programming model of recurrent tensors, which allows users to define dynamic dependencies as intuitive recurrence equations. TimeRL translates recurrent tensors into a polyhedral dependence graph (PDG) with dynamic dependencies as symbolic expressions. Through simple PDG transformations, TimeRL applies whole-program optimizations, such as automatic vectorization, incrementalization, and operator fusion. The PDG also allows for the computation of an efficient program-wide execution schedule, which decides on buffer deallocations, buffer donations, and GPU/CPU memory swapping. We show that TimeRL executes current DRL algorithms up to 47× faster than existing DRL systems, while using 16× less GPU peak memory.
ACM classes: D.4.7; I.1.1; I.1.3; I.2.5
[AI-3] On-line Policy Improvement using Monte-Carlo Search NEURIPS1996 NIPS
Link: https://arxiv.org/abs/2501.05407
Authors: Gerald Tesauro, Gregory R. Galperin
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accompanied by oral presentation by Gregory Galperin at NeurIPS 1996 (then known as NIPS*96)
Abstract:We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers. We have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which it is possible to simulate the environment.
Journal reference: Advances in Neural Information Processing 9 (NIPS 1996 Proceedings published 1997)
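The rollout idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' backgammon implementation: the toy corridor environment, the `step` / `actions` simulator interface, and every name and parameter value below are assumptions made for illustration. For each candidate action, the sketch simulates many episodes that start with that action and then follow the base policy, and it returns the action with the highest average return.

```python
import random

# Hypothetical toy environment: a corridor with states 0..4.
# Reaching state 4 pays reward 1, reaching state 0 pays nothing; both end the episode.
GOAL, PIT, SLIP = 4, 0, 0.1


def actions(state):
    """Legal moves in every non-terminal state: step left or step right."""
    return (-1, +1)


def step(state, action, rng):
    """Simulator: with probability SLIP the move is reversed."""
    move = -action if rng.random() < SLIP else action
    nxt = state + move
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt in (GOAL, PIT)


def random_policy(state, rng):
    """Base (initial) policy used inside the simulations."""
    return rng.choice((-1, +1))


def rollout_action(state, actions, step, base_policy,
                   n_rollouts=500, horizon=50, rng=None):
    """Return the action with the best Monte-Carlo estimate of long-term reward."""
    rng = rng or random.Random(0)
    best_action, best_value = None, float("-inf")
    for a in actions(state):
        total = 0.0
        for _ in range(n_rollouts):
            # First move is the candidate action; the base policy plays the rest.
            s, r, done = step(state, a, rng)
            ret, t = r, 1
            while not done and t < horizon:
                s, r, done = step(s, base_policy(s, rng), rng)
                ret += r
                t += 1
            total += ret
        value = total / n_rollouts
        if value > best_value:
            best_action, best_value = a, value
    return best_action


choice = rollout_action(2, actions, step, random_policy, rng=random.Random(0))
```

Because the rollouts for different actions are independent, the two outer loops are the natural place to parallelize, which is in the spirit of the parallel implementation the abstract mentions.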
[AI-4] TimeDP: Learning to Generate Multi-Domain Time Series with Domain Prompts AAAI2025
Link: https://arxiv.org/abs/2501.05403
Authors: Yu-Hao Huang, Chang Xu, Yueying Wu, Wu-Jun Li, Jiang Bian
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AAAI 2025
Abstract:Time series generation models are crucial for applications like data augmentation and privacy preservation. Most existing time series generation models are typically designed to generate data from one specified domain. While leveraging data from other domain for better generalization is proved to work in other application areas, this approach remains challenging for time series modeling due to the large divergence in patterns among different real world time series categories. In this paper, we propose a multi-domain time series diffusion model with domain prompts, named TimeDP. In TimeDP, we utilize a time series semantic prototype module which defines time series prototypes to represent time series basis, each prototype vector serving as “word” representing some elementary time series feature. A prototype assignment module is applied to extract the extract domain specific prototype weights, for learning domain prompts as generation condition. During sampling, we extract “domain prompt” with few-shot samples from the target domain and use the domain prompts as condition to generate time series samples. Experiments demonstrate that our method outperforms baselines to provide the state-of-the-art in-domain generation quality and strong unseen domain generation capability.
[AI-5] BRATI: Bidirectional Recurrent Attention for Time-Series Imputation
Link: https://arxiv.org/abs/2501.05401
Authors: Armando Collado-Villaverde, Pablo Muñoz, Maria D. R-Moreno
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Missing data in time-series analysis poses significant challenges, affecting the reliability of downstream applications. Imputation, the process of estimating missing values, has emerged as a key solution. This paper introduces BRATI, a novel deep-learning model designed to address multivariate time-series imputation by combining Bidirectional Recurrent Networks and Attention mechanisms. BRATI processes temporal dependencies and feature correlations across long and short time horizons, utilizing two imputation blocks that operate in opposite temporal directions. Each block integrates recurrent layers and attention mechanisms to effectively resolve long-term dependencies. We evaluate BRATI on three real-world datasets under diverse missing-data scenarios: randomly missing values, fixed-length missing sequences, and variable-length missing sequences. Our findings demonstrate that BRATI consistently outperforms state-of-the-art models, delivering superior accuracy and robustness in imputing multivariate time-series data.
[AI-6] Mechanistic understanding and validation of large AI models with SemanticLens
Link: https://arxiv.org/abs/2501.05398
Authors: Maximilian Dreyer, Jim Berend, Tobias Labarta, Johanna Vielhaben, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 74 pages (18 pages manuscript, 7 pages references, 49 pages appendix)
[AI-7] The global consensus on the risk management of autonomous driving
Link: https://arxiv.org/abs/2501.05391
Authors: Sebastian Krügel, Matthias Uhl
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:
Abstract:Every maneuver of a vehicle redistributes risks between road users. While human drivers do this intuitively, autonomous vehicles allow and require deliberative algorithmic risk management. But how should traffic risks be distributed among road users? In a global experimental study in eight countries with different cultural backgrounds and almost 11,000 participants, we compared risk distribution preferences. It turns out that risk preferences in road traffic are strikingly similar between the cultural zones. The vast majority of participants in all countries deviates from a guiding principle of minimizing accident probabilities in favor of weighing up the probability and severity of accidents. At the national level, the consideration of accident probability and severity hardly differs between countries. The social dilemma of autonomous vehicles detected in deterministic crash scenarios disappears in risk assessments of everyday traffic situations in all countries. In no country do cyclists receive a risk bonus that goes beyond their higher vulnerability. In sum, our results suggest that a global consensus on the risk ethics of autonomous driving is easier to establish than on the ethics of crashing.
[AI-8] Developing a Foundation of Vector Symbolic Architectures Using Category Theory
Link: https://arxiv.org/abs/2501.05368
Authors: Nolan P Shaw, P Michael Furlong, Britt Anderson, Jeff Orchard
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, no figures, 2 tables, one appendix
Abstract:At the risk of overstating the case, connectionist approaches to machine learning, i.e. neural networks, are enjoying a small vogue right now. However, these methods require large volumes of data and produce models that are uninterpretable to humans. An alternative framework that is compatible with neural networks and gradient-based learning, but explicitly models compositionality, is Vector Symbolic Architectures (VSAs). VSAs are a family of algebras on high-dimensional vector representations. They arose in cognitive science from the need to unify neural processing and the kind of symbolic reasoning that humans perform. While machine learning methods have benefited from category theoretical analyses, VSAs have not yet received similar treatment. In this paper, we present a first attempt at applying category theory to VSAs. Specifically, we conduct a brief literature survey demonstrating the lacking intersection of these two topics, provide a list of desiderata for VSAs, and propose that VSAs may be understood as a (division) rig in a category enriched over a monoid in Met (the category of Lawvere metric spaces). This final contribution suggests that VSAs may be generalised beyond current implementations. It is our hope that grounding VSAs in category theory will lead to more rigorous connections with other research, both within and beyond, learning and cognition.
[AI-9] On Corrigibility and Alignment in Multi Agent Games
Link: https://arxiv.org/abs/2501.05360
Authors: Edmund Dable-Heath, Boyko Vodenicharski, James Bishop
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments:
Abstract:Corrigibility of autonomous agents is an under explored part of system design, with previous work focusing on single agent systems. It has been suggested that uncertainty over the human preferences acts to keep the agents corrigible, even in the face of human irrationality. We present a general framework for modelling corrigibility in a multi-agent setting as a 2 player game in which the agents always have a move in which they can ask the human for supervision. This is formulated as a Bayesian game for the purpose of introducing uncertainty over the human beliefs. We further analyse two specific cases. First, a two player corrigibility game, in which we want corrigibility displayed in both agents for both common payoff (monotone) games and harmonic games. Then we investigate an adversary setting, in which one agent is considered to be a 'defending' agent and the other an 'adversary'. A general result is provided for what belief over the games and human rationality the defending agent is required to have to induce corrigibility.
[AI-10] The Bakers and Millers Game with Restricted Locations AAMAS2025
Link: https://arxiv.org/abs/2501.05334
Authors: Simon Krogmann, Pascal Lenzner, Alexander Skopalik
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: To appear at the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)
Abstract:We study strategic location choice by customers and sellers, termed the Bakers and Millers Game in the literature. In our generalized setting, each miller can freely choose any location for setting up a mill, while each baker is restricted in the choice of location for setting up a bakery. For optimal bargaining power, a baker would like to select a location with many millers to buy flour from and with little competition from other bakers. Likewise, a miller aims for a location with many bakers and few competing millers. Thus, both types of agents choose locations to optimize the ratio of agents of opposite type divided by agents of the same type at their chosen location. Originally raised in the context of Fractional Hedonic Games, the Bakers and Millers Game has applications that range from commerce to product design. We study the impact of location restrictions on the properties of the game. While pure Nash equilibria trivially exist in the setting without location restrictions, we show via a sophisticated, efficient algorithm that even the more challenging restricted setting admits equilibria. Moreover, the computed equilibrium approximates the optimal social welfare by a factor of at most $2\left(\frac{e}{e-1}\right)$. Furthermore, we give tight bounds on the price of anarchy/stability. On the conceptual side, the location choice feature adds a new layer to the standard setting of Hedonic Games, in the sense that agents that select the same location form a coalition. This allows to naturally restrict the possible coalitions that can be formed. With this, our model generalizes simple symmetric Fractional Hedonic Games on complete bipartite valuation graphs and also Hedonic Diversity Games with utilities single-peaked at 0. We believe that this generalization is also a very interesting direction for other types of Hedonic Games.
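The utility stated in this abstract (agents of the opposite type divided by agents of the same type at the chosen location) is simple enough to sketch in Python. The function name, the list-based encoding of location choices, and the example locations are assumptions for illustration; the paper's equilibrium algorithm is not reproduced here.

```python
from collections import Counter


def utilities(baker_locs, miller_locs):
    """Per-agent utilities for one profile of location choices.

    baker_locs[i] is the location chosen by baker i, miller_locs[j] the
    location chosen by miller j. Each agent's utility is
    (#agents of the opposite type) / (#agents of the same type) at its location.
    """
    bakers = Counter(baker_locs)
    millers = Counter(miller_locs)
    baker_u = [millers[loc] / bakers[loc] for loc in baker_locs]
    miller_u = [bakers[loc] / millers[loc] for loc in miller_locs]
    return baker_u, miller_u


# Hypothetical profile: two bakers at A, one at B; one miller at A, three at B.
baker_u, miller_u = utilities(["A", "A", "B"], ["A", "B", "B", "B"])
```

In this profile the lone baker at B faces three millers and no competing baker, so its utility is 3, while each baker at A shares a single miller and gets 0.5.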
[AI-11] AnCoGen: Analysis Control and Generation of Speech with a Masked Autoencoder
Link: https://arxiv.org/abs/2501.05332
Authors: Samir Sadok, Simon Leglaive, Laurent Girin, Gaël Richard, Xavier Alameda-Pineda
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 5 pages, this https URL
Abstract:This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allow precise control of the synthesized speech by modifying them. Extensive experiments demonstrated the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement.
[AI-12] Off-Policy Evaluation and Counterfactual Methods in Dynamic Auction Environments
Link: https://arxiv.org/abs/2501.05278
Authors: Ritam Guha, Nilavra Pathak
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 15 figures, IEEE format
Abstract:Counterfactual estimators are critical for learning and refining policies using logged data, a process known as Off-Policy Evaluation (OPE). OPE allows researchers to assess new policies without costly experiments, speeding up the evaluation process. Online experimental methods, such as A/B tests, are effective but often slow, thus delaying the policy selection and optimization process. In this work, we explore the application of OPE methods in the context of resource allocation in dynamic auction environments. Given the competitive nature of environments where rapid decision-making is crucial for gaining a competitive edge, the ability to quickly and accurately assess algorithmic performance is essential. By utilizing counterfactual estimators as a preliminary step before conducting A/B tests, we aim to streamline the evaluation process, reduce the time and resources required for experimentation, and enhance confidence in the chosen policies. Our investigation focuses on the feasibility and effectiveness of using these estimators to predict the outcomes of potential resource allocation strategies, evaluate their performance, and facilitate more informed decision-making in policy selection. Motivated by the outcomes of our initial study, we envision an advanced analytics system designed to seamlessly and dynamically assess new resource allocation strategies and policies.
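The abstract does not say which counterfactual estimators the authors use. As one standard example of the family, here is a minimal Python sketch of inverse propensity scoring (IPS), which reweights each logged reward by how much more (or less) likely the target policy is to take the logged action than the logging policy was. The log format, the function names, and the toy data are assumptions for illustration only.

```python
def ips_estimate(logs, target_prob):
    """Inverse-propensity-scoring estimate of a target policy's average reward.

    logs: iterable of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability with which the logging policy chose
          `action` in `context` (must be > 0).
    target_prob: function (context, action) -> probability under the new policy.
    """
    total, n = 0.0, 0
    for context, action, reward, logging_prob in logs:
        weight = target_prob(context, action) / logging_prob
        total += weight * reward
        n += 1
    if n == 0:
        raise ValueError("cannot evaluate a policy on empty logs")
    return total / n


# Hypothetical logs from a uniform logging policy over two allocation actions.
logs = [
    ("ctx", "a", 1.0, 0.5),
    ("ctx", "b", 0.0, 0.5),
    ("ctx", "a", 1.0, 0.5),
    ("ctx", "b", 0.0, 0.5),
]


def always_a(context, action):
    """Candidate policy that always picks action 'a'."""
    return 1.0 if action == "a" else 0.0


estimate = ips_estimate(logs, always_a)
```

IPS is unbiased when the logged propensities are correct and non-zero wherever the target policy acts, but its variance grows when the two policies differ sharply, which is one reason such estimates are typically a screening step before an A/B test, not a replacement for it.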
[AI-13] Automating the Detection of Code Vulnerabilities by Analyzing GitHub Issues
Link: https://arxiv.org/abs/2501.05258
Authors: Daniele Cipollone, Changjie Wang, Mariano Scazzariello, Simone Ferlin, Maliheh Izadi, Dejan Kostic, Marco Chiesa
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:In today’s digital landscape, the importance of timely and accurate vulnerability detection has significantly increased. This paper presents a novel approach that leverages transformer-based models and machine learning techniques to automate the identification of software vulnerabilities by analyzing GitHub issues. We introduce a new dataset specifically designed for classifying GitHub issues relevant to vulnerability detection. We then examine various classification techniques to determine their effectiveness. The results demonstrate the potential of this approach for real-world application in early vulnerability detection, which could substantially reduce the window of exploitation for software vulnerabilities. This research makes a key contribution to the field by providing a scalable and computationally efficient framework for automated detection, enabling the prevention of compromised software usage before official notifications. This work has the potential to enhance the security of open-source software ecosystems.
[AI-14] From Scientific Texts to Verifiable Code: Automating the Process with Transformers
Link: https://arxiv.org/abs/2501.05252
Authors: Changjie Wang, Mariano Scazzariello, Marco Chiesa
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:
Abstract:Despite the vast body of research literature proposing algorithms with formal guarantees, the amount of verifiable code in today’s systems remains minimal. This discrepancy stems from the inherent difficulty of verifying code, particularly due to the time-consuming nature and strict formalism of proof details that formal verification tools require. However, the emergence of transformers in Large Language Models presents a promising solution to this challenge. In this position paper, we believe that transformers have the potential to read research papers that propose algorithms with formal proofs and translate these proofs into verifiable code. We leverage transformers to first build a formal structure of the proof using the original text from the paper, and then to handle the tedious, low-level aspects of proofs that are often omitted by humans. We argue that this approach can significantly reduce the barrier to formal verification. The above idea of reading papers to write verifiable code opens new avenues for automating the verification of complex systems, enabling a future where formally verified algorithms from academic research can more seamlessly transition into real-world software systems, thereby improving code reliability and security.
[AI-15] RAG-WM: An Efficient Black-Box Watermarking Approach for Retrieval-Augmented Generation of Large Language Models
Link: https://arxiv.org/abs/2501.05249
Authors: Peizhuo Lv, Mengjie Sun, Hao Wang, Xiaofeng Wang, Shengzhi Zhang, Yuxuan Chen, Kai Chen, Limin Sun
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:In recent years, tremendous success has been witnessed in Retrieval-Augmented Generation (RAG), widely used to enhance Large Language Models (LLMs) in domain-specific, knowledge-intensive, and privacy-sensitive tasks. However, attackers may steal those valuable RAGs and deploy or commercialize them, making it essential to detect Intellectual Property (IP) infringement. Most existing ownership protection solutions, such as watermarks, are designed for relational databases and texts. They cannot be directly applied to RAGs because relational database watermarks require white-box access to detect IP infringement, which is unrealistic for the knowledge base in RAGs. Meanwhile, post-processing by the adversary’s deployed LLMs typically destructs text watermark information. To address those problems, we propose a novel black-box “knowledge watermark” approach, named RAG-WM, to detect IP infringement of RAGs. RAG-WM uses a multi-LLM interaction framework, comprising a Watermark Generator, Shadow LLM RAG, and Watermark Discriminator, to create watermark texts based on watermark entity-relationship tuples and inject them into the target RAG. We evaluate RAG-WM across three domain-specific and two privacy-sensitive tasks on four benchmark LLMs. Experimental results show that RAG-WM effectively detects the stolen RAGs in various deployed LLMs. Furthermore, RAG-WM is robust against paraphrasing, unrelated content removal, knowledge insertion, and knowledge expansion attacks. Lastly, RAG-WM can also evade watermark detection approaches, highlighting its promising application in detecting IP infringement of RAG systems.
[AI-16] Deriving Coding-Specific Sub-Models from LLMs using Resource-Efficient Pruning
Link: https://arxiv.org/abs/2501.05248
Authors: Laura Puccioni, Alireza Farshin, Mariano Scazzariello, Changjie Wang, Marco Chiesa, Dejan Kostic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:
Abstract:Large Language Models (LLMs) have demonstrated their exceptional performance in various complex code generation tasks. However, their broader adoption is limited by significant computational demands and high resource requirements, particularly memory and processing power. To mitigate such requirements, model pruning techniques are used to create more compact models with significantly fewer parameters. However, current approaches do not focus on the efficient extraction of programming-language-specific sub-models. In this work, we explore the idea of efficiently deriving coding-specific sub-models through unstructured pruning (i.e., Wanda). We investigate the impact of different domain-specific calibration datasets on pruning outcomes across three distinct domains and extend our analysis to extracting four language-specific sub-models: Python, Java, C++, and JavaScript. We are the first to efficiently extract programming-language-specific sub-models using appropriate calibration datasets while maintaining acceptable accuracy w.r.t. full models. We are also the first to provide analytical evidence that domain-specific tasks activate distinct regions within LLMs, supporting the creation of specialized sub-models through unstructured pruning. We believe that this work has significant potential to enhance LLM accessibility for coding by reducing computational requirements to enable local execution on consumer-grade hardware, and supporting faster inference times critical for real-time development feedback.
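The abstract names Wanda as the pruning method and stresses the role of the calibration dataset. The following Python sketch shows the Wanda criterion as it is commonly described: each weight is scored by its magnitude times the L2 norm of the matching input feature over the calibration data, and the lowest-scoring weights are zeroed within each output row. This is a plain-list illustration, not the authors' code; the function name, the tiny matrices, and the 50% sparsity are assumptions.

```python
import math


def wanda_prune(weight, calib_inputs, sparsity=0.5):
    """Zero the lowest-scoring weights in each output row.

    weight:       list of rows, shape (out_features, in_features)
    calib_inputs: list of calibration activations, shape (n_tokens, in_features)
    score[i][j] = |weight[i][j]| * L2 norm of input feature j over the calibration set
    """
    n_in = len(weight[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in calib_inputs))
             for j in range(n_in)]
    n_prune = int(n_in * sparsity)
    pruned = []
    for w_row in weight:
        scores = [abs(w) * norms[j] for j, w in enumerate(w_row)]
        # Indices of the n_prune smallest scores in this row.
        drop = set(sorted(range(n_in), key=scores.__getitem__)[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned


# Hypothetical layer with one output row and four input features.
W = [[1.0, -0.5, 0.2, 3.0]]
# Calibration activations: feature 1 is very active, feature 3 is almost silent.
X = [[0.1, 10.0, 1.0, 0.01],
     [0.1, 10.0, 1.0, 0.01]]

pruned = wanda_prune(W, X, sparsity=0.5)
```

In this toy case the largest weight (3.0) is removed because its input feature is nearly inactive on the calibration data, while the small weight 0.2 survives. That dependence on activations is why a different calibration set (for example, code in one programming language) can yield a different sub-model.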
[AI-17] Online Prompt and Solver Selection for Program Synthesis AAAI AAAI-25
Link: https://arxiv.org/abs/2501.05247
Authors: Yixuan Li, Lewis Frampton, Federico Mora, Elizabeth Polgreen
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the 39th AAAI Conference on Artificial Intelligence (AAAI-25) Main Track
Abstract:Large Language Models (LLMs) demonstrate impressive capabilities in the domain of program synthesis. This level of performance is not, however, universal across all tasks, all LLMs and all prompting styles. There are many areas where one LLM dominates, one prompting style dominates, or where calling a symbolic solver is a better choice than an LLM. A key challenge for the user then, is to identify not only when an LLM is the right choice of solver, and the appropriate LLM to call for a given synthesis task, but also the right way to call it. A non-expert user who makes the wrong choice, incurs a cost both in terms of results (number of tasks solved, and the time it takes to solve them) and financial cost, if using a closed-source language model via a commercial API. We frame this choice as an online learning problem. We use a multi-armed bandit algorithm to select which symbolic solver, or LLM and prompt combination to deploy in order to maximize a given reward function (which may prioritize solving time, number of synthesis tasks solved, or financial cost of solving). We implement an instance of this approach, called CYANEA, and evaluate it on synthesis queries from the literature in ranking function synthesis, from the syntax-guided synthesis competition, and fresh, unseen queries generated from SMT problems. CYANEA solves 37.2% more queries than the best single solver and achieves results within 4% of the virtual best solver.
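The abstract frames solver and prompt selection as a multi-armed bandit problem but does not say which bandit algorithm CYANEA uses. As an illustration of the framing only, here is a minimal Python sketch using UCB1, a standard choice; the arm names and the fixed reward values are hypothetical stand-ins for a real reward function over solving time, tasks solved, or cost.

```python
import math


class UCB1:
    """UCB1 bandit: try each arm once, then pick the highest upper confidence bound."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}

    def select(self):
        # Play every arm once before using the confidence bounds.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        total = sum(self.counts.values())
        return max(
            self.arms,
            key=lambda a: self.means[a]
            + math.sqrt(2.0 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental running mean of the observed rewards for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical arms and per-query rewards in [0, 1]; real rewards would be noisy.
REWARDS = {"symbolic-solver": 0.2, "llm+prompt-A": 0.9, "llm+prompt-B": 0.5}

bandit = UCB1(REWARDS)
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, REWARDS[arm])

most_used = max(bandit.counts, key=bandit.counts.get)
```

Each synthesis query plays the role of one round: the bandit picks a solver or an LLM and prompt combination, observes the reward, and gradually concentrates on the option that works best while still occasionally re-checking the others.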
[AI-18] An Algorithmic Approach for Causal Health Equity: A Look at Race Differentials in Intensive Care Unit (ICU) Outcomes
Link: https://arxiv.org/abs/2501.05197
Authors: Drago Plecko, Paul Secombe, Andrea Clarke, Amelia Fiske, Samarra Toby, Donisha Duff, David Pilcher, Leo Anthony Celi, Rinaldo Bellomo, Elias Bareinboim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME)
Comments:
Abstract:The new era of large-scale data collection and analysis presents an opportunity for diagnosing and understanding the causes of health inequities. In this study, we describe a framework for systematically analyzing health disparities using causal inference. The framework is illustrated by investigating racial and ethnic disparities in intensive care unit (ICU) outcome between majority and minority groups in Australia (Indigenous vs. Non-Indigenous) and the United States (African-American vs. White). We demonstrate that commonly used statistical measures for quantifying inequity are insufficient, and focus on attributing the observed disparity to the causal mechanisms that generate it. We find that minority patients are younger at admission, have worse chronic health, are more likely to be admitted for urgent and non-elective reasons, and have higher illness severity. At the same time, however, we find a protective direct effect of belonging to a minority group, with minority patients showing improved survival compared to their majority counterparts, with all other variables kept equal. We demonstrate that this protective effect is related to the increased probability of being admitted to ICU, with minority patients having an increased risk of ICU admission. We also find that minority patients, while showing improved survival, are more likely to be readmitted to ICU. Thus, due to worse access to primary health care, minority patients are more likely to end up in ICU for preventable conditions, causing a reduction in the mortality rates and creating an effect that appears to be protective. Since the baseline risk of ICU admission may serve as proxy for lack of access to primary care, we developed the Indigenous Intensive Care Equity (IICE) Radar, a monitoring system for tracking the over-utilization of ICU resources by the Indigenous population of Australia across geographical areas.
[AI-19] Explainable AI based System for Supply Air Temperature Forecast
Link: https://arxiv.org/abs/2501.05163
Authors: Marika Eik, Ahmet Kose, Hossein Nourollahi Hokmabad, Juri Belikov
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: 5 pages, 7 figures, 1 table, conference paper
Abstract:This paper explores the application of Explainable AI (XAI) techniques to improve the transparency and understanding of predictive models in control of automated supply air temperature (ASAT) of Air Handling Unit (AHU). The study focuses on forecasting of ASAT using a linear regression with Huber loss. However, having only a control curve without semantic and/or physical explanation is often not enough. The present study employs one of the XAI methods: Shapley values, which allows to reveal the reasoning and highlight the contribution of each feature to the final ASAT forecast. In comparison to other XAI methods, Shapley values have solid mathematical background, resulting in interpretation transparency. The study demonstrates the contrastive explanations–slices, for each control value of ASAT, which makes it possible to give the client objective justifications for curve changes.
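The two ingredients named in this abstract, Huber loss and Shapley values for a linear model, can be sketched briefly in Python. For a linear model, and under the additional assumption that features are treated as independent, the Shapley value of feature i has the closed form w_i (x_i - mean_i), and the values sum to the gap between the prediction and the average prediction. The feature meanings, weights, and numbers below are hypothetical and are not taken from the paper.

```python
def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def predict(weights, intercept, x):
    """Linear forecast: intercept + sum_i w_i * x_i."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))


def linear_shapley(weights, x, background_mean):
    """Shapley values of a linear model, assuming independent features.

    phi_i = w_i * (x_i - mean_i); the values add up to
    predict(x) - predict(background_mean).
    """
    return [w * (xi - mu) for w, xi, mu in zip(weights, x, background_mean)]


# Hypothetical two-feature model, e.g. outdoor temperature and return-air temperature.
weights, intercept = [0.5, -0.2], 18.0
x = [10.0, 30.0]       # current reading
mean = [8.0, 20.0]     # average reading over a background dataset

phi = linear_shapley(weights, x, mean)
gap = predict(weights, intercept, x) - predict(weights, intercept, mean)
```

Here the first feature pushes the forecast up by 1.0 and the second pulls it down by 2.0, which accounts for the whole one-degree drop relative to the average forecast. This additive accounting is the kind of per-feature justification the abstract refers to.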
[AI-20] Multimodal-to-Text Prompt Engineering in Large Language Models Using Feature Embeddings for GNSS Interference Characterization
Link: https://arxiv.org/abs/2501.05079
Authors: Harshith Manjunath, Lucas Heublein, Tobias Feigl, Felix Ott
Subjects: Artificial Intelligence (cs.AI)
Abstract:Large language models (LLMs) are advanced AI systems applied across various domains, including NLP, information retrieval, and recommendation systems. Despite their adaptability and efficiency, LLMs have not been extensively explored for signal processing tasks, particularly in the domain of global navigation satellite system (GNSS) interference monitoring. GNSS interference monitoring is essential to ensure the reliability of vehicle localization on roads, a critical requirement for numerous applications. However, GNSS-based positioning is vulnerable to interference from jamming devices, which can compromise its accuracy. The primary objective is to identify, classify, and mitigate these interferences. Interpreting GNSS snapshots and the associated interferences presents significant challenges due to the inherent complexity, including multipath effects, diverse interference types, varying sensor characteristics, and satellite constellations. In this paper, we extract features from a large GNSS dataset and employ LLaVA to retrieve relevant information from an extensive knowledge base. We employ prompt engineering to interpret the interferences and environmental factors, and utilize t-SNE to analyze the feature embeddings. Our findings demonstrate that the proposed method is capable of visual and logical reasoning within the GNSS context. Furthermore, our pipeline outperforms state-of-the-art machine learning models in interference classification tasks.
[AI-21] Analyzing Memorization in Large Language Models through the Lens of Model Attribution
Link: https://arxiv.org/abs/2501.05078
Authors: Tarun Ram Menta, Susmit Agrawal, Chirag Agarwal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Abstract:Large Language Models (LLMs) are prevalent in modern applications but often memorize training data, leading to privacy breaches and copyright issues. Existing research has mainly focused on post-hoc analyses, such as extracting memorized content or developing memorization metrics, without exploring the underlying architectural factors that contribute to memorization. In this work, we investigate memorization from an architectural lens by analyzing how attention modules at different layers affect a model’s memorization and generalization performance. Using attribution techniques, we systematically intervene in the LLM architecture by bypassing attention modules at specific blocks while keeping other components like layer normalization and MLP transformations intact. We provide theorems analyzing our intervention mechanism from a mathematical view, bounding the difference in layer outputs with and without our attributions. Our theoretical and empirical analyses reveal that attention modules in deeper transformer blocks are primarily responsible for memorization, whereas earlier blocks are crucial for the model’s generalization and reasoning capabilities. We validate our findings through comprehensive experiments on different LLM families (Pythia and GPTNeo) and five benchmark datasets. Our insights offer a practical approach to mitigate memorization in LLMs while preserving their performance, contributing to safer and more ethical deployment in real-world applications.
[AI-22] A Text-Based Knowledge-Embedded Soft Sensing Modeling Approach for General Industrial Process Tasks Based on Large Language Model
Link: https://arxiv.org/abs/2501.05075
Authors: Shuo Tong, Han Liu, Runyuan Guo, Xueqiong Tian, Wenqing Wang, Ding Liu, Youmin Zhang
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract:Data-driven soft sensors (DDSS) have become mainstream methods for predicting key performance indicators in process industries. However, DDSS development requires complex and costly customized designs tailored to various tasks during the modeling process. Moreover, DDSS are constrained to a single structured data modality, limiting their ability to incorporate additional contextual knowledge. Furthermore, DDSSs’ limited representation learning leads to weak predictive performance with scarce data. To address these challenges, we propose a general framework named LLM-TKESS (large language model for text-based knowledge-embedded soft sensing), harnessing the powerful general problem-solving capabilities, cross-modal knowledge transfer abilities, and few-shot capabilities of LLM for enhanced soft sensing modeling. Specifically, an auxiliary variable series encoder (AVS Encoder) is proposed to unleash LLM’s potential for capturing temporal relationships within series and spatial semantic relationships among auxiliary variables. Then, we propose a two-stage fine-tuning alignment strategy: in the first stage, employing parameter-efficient fine-tuning through autoregressive training adjusts LLM to rapidly accommodate process variable data, resulting in a soft sensing foundation model (SSFM). Subsequently, by training adapters, we adapt the SSFM to various downstream tasks without modifying its architecture. Then, we propose two text-based knowledge-embedded soft sensors, integrating new natural language modalities to overcome the limitations of pure structured data models. Furthermore, benefiting from LLM’s pre-existing world knowledge, our model demonstrates outstanding predictive capabilities in small sample conditions. Using the thermal deformation of air preheater rotor as a case study, we validate through extensive experiments that LLM-TKESS exhibits outstanding performance.
[AI-23] D3RM: A Discrete Denoising Diffusion Refinement Model for Piano Transcription ICASSP2025
Link: https://arxiv.org/abs/2501.05068
Authors: Hounsu Kim, Taegyun Kwon, Juhan Nam
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted to ICASSP 2025
Abstract:Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on the refinement capabilities of discrete diffusion models and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the finetuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy that applies distinct transition states during the training and inference stages of discrete diffusion models. Experiments on the MAESTRO dataset show that our approach outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available at this https URL.
[AI-24] TAPFed: Threshold Secure Aggregation for Privacy-Preserving Federated Learning
Link: https://arxiv.org/abs/2501.05053
Authors: Runhua Xu, Bo Li, Chao Li, James B.D. Joshi, Shuai Ma, Jianxin Li
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: The paper has been published in IEEE TDSC
Abstract:Federated learning is a computing paradigm that enhances privacy by enabling multiple parties to collaboratively train a machine learning model without revealing personal data. However, current research indicates that traditional federated learning platforms are unable to ensure privacy due to privacy leaks caused by the interchange of gradients. To achieve privacy-preserving federated learning, integrating secure aggregation mechanisms is essential. Unfortunately, existing solutions are vulnerable to recently demonstrated inference attacks such as the disaggregation attack. This paper proposes TAPFed, an approach for achieving privacy-preserving federated learning in the context of multiple decentralized aggregators with malicious actors. TAPFed uses a proposed threshold functional encryption scheme and allows for a certain number of malicious aggregators while maintaining security and privacy. We provide formal security and privacy analyses of TAPFed and compare it to various baselines through experimental evaluation. Our results show that TAPFed offers equivalent performance in terms of model quality compared to state-of-the-art approaches while reducing transmission overhead by 29%-45% across different model training scenarios. Most importantly, TAPFed can defend against recently demonstrated inference attacks caused by curious aggregators, which the majority of existing approaches are susceptible to.
[AI-25] A General Retrieval-Augmented Generation Framework for Multimodal Case-Based Reasoning Applications
Link: https://arxiv.org/abs/2501.05030
Authors: Ofir Marom
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures
Abstract:Case-based reasoning (CBR) is an experience-based approach to problem solving, where a repository of solved cases is adapted to solve new cases. Recent research shows that Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) can support the Retrieve and Reuse stages of the CBR pipeline by retrieving similar cases and using them as additional context to an LLM query. Most studies have focused on text-only applications; however, in many real-world problems the components of a case are multimodal. In this paper we present MCBR-RAG, a general RAG framework for multimodal CBR applications. The MCBR-RAG framework converts non-text case components into text-based representations, allowing it to: 1) learn application-specific latent representations that can be indexed for retrieval, and 2) enrich the query provided to the LLM by incorporating all case components for better context. We demonstrate MCBR-RAG’s effectiveness through experiments conducted on a simplified Math-24 application and a more complex Backgammon application. Our empirical results show that MCBR-RAG improves generation quality compared to a baseline LLM with no contextual information provided.
[AI-26] Finding Needles in Emb(a)dding Haystacks: Legal Document Retrieval via Bagging and SVR Ensembles
Link: https://arxiv.org/abs/2501.05018
Authors: Kevin Bönisch, Alexander Mehler
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Abstract:We introduce a retrieval approach leveraging Support Vector Regression (SVR) ensembles, bootstrap aggregation (bagging), and embedding spaces on the German Dataset for Legal Information Retrieval (GerDaLIR). By conceptualizing the retrieval task in terms of multiple binary needle-in-a-haystack subtasks, we show improved recall over the baselines (0.849 > 0.803 | 0.829) using our voting ensemble, suggesting promising initial results, without training or fine-tuning any deep learning models. Our approach holds potential for further enhancement, particularly through refining the encoding models and optimizing hyperparameters.
[AI-27] On Measuring Unnoticeability of Graph Adversarial Attacks: Observations, New Measure, and Applications KDD2025
Link: https://arxiv.org/abs/2501.05015
Authors: Hyeonsoo Jo, Hyunjin Hwang, Fanchen Bu, Soo Yong Lee, Chanyoung Park, Kijung Shin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: KDD 2025
Abstract:Adversarial attacks are allegedly unnoticeable. Prior studies have designed attack noticeability measures on graphs, primarily using statistical tests to compare the topology of original and (possibly) attacked graphs. However, we observe two critical limitations in the existing measures. First, because the measures rely on simple rules, attackers can readily enhance their attacks to bypass them, reducing their attack “noticeability” while maintaining their attack performance. Second, because the measures naively leverage global statistics, such as degree distributions, they may entirely overlook attacks until severe perturbations occur, letting the attacks be almost “totally unnoticeable.” To address the limitations, we introduce HideNSeek, a learnable measure for graph attack noticeability. First, to mitigate the bypass problem, HideNSeek learns to distinguish the original and (potential) attack edges using a learnable edge scorer (LEO), which scores each edge on its likelihood of being an attack. Second, to mitigate the overlooking problem, HideNSeek conducts imbalance-aware aggregation of all the edge scores to obtain the final noticeability score. Using six real-world graphs, we empirically demonstrate that HideNSeek effectively alleviates the observed limitations, and LEO (i.e., our learnable edge scorer) outperforms eleven competitors in distinguishing attack edges under five different attack methods. As an additional application, we show that LEO boosts the performance of robust GNNs by removing attack-like edges.
[AI-28] UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation
Link: https://arxiv.org/abs/2501.05014
Authors: Oleg Sautenkov, Yasheerah Yaqoot, Artem Lykov, Muhammad Ahsan Mustafa, Grik Tadevosyan, Aibek Akhmetkazy, Miguel Altamirano Cabrera, Mikhail Martynov, Sausar Karaf, Dzmitry Tsetserukou
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: HRI 2025
Abstract:The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with a Visual Language Model (VLM) and the capabilities of GPT, UAV-VLA enables users to generate general flight path-and-action plans through simple text requests. This system leverages the rich contextual information provided by satellite images, allowing for enhanced decision-making and mission planning. The combination of visual analysis by the VLM and natural language processing by GPT can provide the user with the path-and-action set, making aerial operations more efficient and accessible. The newly developed method showed a 22% difference in the length of the created trajectory and a mean error of 34.22 m in locating the objects of interest on a map, measured by Euclidean distance with the K-Nearest Neighbors (KNN) approach.
[AI-29] GiNet: Integrating Sequential and Context-Aware Learning for Battery Capacity Prediction
Link: https://arxiv.org/abs/2501.04997
Authors: Sara Sameer, Wei Zhang, Xin Lou, Qingyu Yan, Terence Goh, Yulin Gao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages
Abstract:The surging demand for batteries requires advanced battery management systems, where battery capacity modelling is a key functionality. In this paper, we aim to achieve accurate battery capacity prediction by learning from historical measurements of battery dynamics. We propose GiNet, a gated recurrent units enhanced Informer network, for predicting battery capacity. The novelty and competitiveness of GiNet lie in its capability of capturing sequential and contextual information from raw battery data and reflecting the battery’s complex behaviors with both temporal dynamics and long-term dependencies. We conducted an experimental study based on a publicly available dataset to showcase GiNet’s strength in gaining a holistic understanding of battery behavior and predicting battery capacity accurately. GiNet achieves 0.11 mean absolute error for predicting the battery capacity in a sequence of future time slots without knowing the historical battery capacity. It also outperforms the latest algorithms significantly, with 27% error reduction on average compared to Informer. The promising results highlight the importance of customized and optimized integration of algorithm and battery knowledge and shed light on other industry applications as well.
[AI-30] CuRLA: Curriculum Learning Based Deep Reinforcement Learning for Autonomous Driving
Link: https://arxiv.org/abs/2501.04982
Authors: Bhargava Uppuluri, Anjel Patel, Neil Mehta, Sridhar Kamath, Pratyush Chakraborty
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To be published in the 17th International Conference on Agents and Artificial Intelligence (ICAART), Feb 2025
Abstract:In autonomous driving, traditional Computer Vision (CV) agents often struggle in unfamiliar situations due to biases in the training data. Deep Reinforcement Learning (DRL) agents address this by learning from experience and maximizing rewards, which helps them adapt to dynamic environments. However, ensuring their generalization remains challenging, especially with static training environments. Additionally, DRL models lack transparency, making it difficult to guarantee safety in all scenarios, particularly those not seen during training. To tackle these issues, we propose a method that combines DRL with Curriculum Learning for autonomous driving. Our approach uses a Proximal Policy Optimization (PPO) agent and a Variational Autoencoder (VAE) to learn safe driving in the CARLA simulator. The agent is trained using two-fold curriculum learning, progressively increasing environment difficulty and incorporating a collision penalty in the reward function to promote safety. This method improves the agent’s adaptability and reliability in complex environments and offers insight into the nuances of balancing multiple reward components from different feedback signals in a single scalar reward function. Keywords: Computer Vision, Deep Reinforcement Learning, Variational Autoencoder, Proximal Policy Optimization, Curriculum Learning, Autonomous Driving.
[AI-31] Battling the Non-stationarity in Time Series Forecasting via Test-time Adaptation AAAI2025
Link: https://arxiv.org/abs/2501.04970
Authors: HyunGi Kim, Siwon Kim, Jisoo Mok, Sungroh Yoon
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2025
Abstract:Deep Neural Networks have spearheaded remarkable advancements in time series forecasting (TSF), one of the major tasks in time series modeling. Nonetheless, the non-stationarity of time series undermines the reliability of pre-trained source time series forecasters in mission-critical deployment settings. In this study, we introduce a pioneering test-time adaptation framework tailored for TSF (TSF-TTA). TAFAS, the proposed approach to TSF-TTA, flexibly adapts source forecasters to continuously shifting test distributions while preserving the core semantic information learned during pre-training. The novel utilization of partially-observed ground truth and gated calibration module enables proactive, robust, and model-agnostic adaptation of source forecasters. Experiments on diverse benchmark datasets and cutting-edge architectures demonstrate the efficacy and generality of TAFAS, especially in long-term forecasting scenarios that suffer from significant distribution shifts. The code is available at this https URL.
[AI-32] Quantifying Itch and its Impact on Sleep Using Machine Learning and Radio Signals
Link: https://arxiv.org/abs/2501.04896
Authors: Michail Ouroutzoglou, Mingmin Zhao, Joshua Hellerstein, Hariharan Rahul, Asima Badic, Brian S. Kim, Dina Katabi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Abstract:Chronic itch affects 13% of the US population, is highly debilitating, and underlies many medical conditions. A major challenge in clinical care and new therapeutics development is the lack of an objective measure for quantifying itch, leading to reliance on subjective measures like patients’ self-assessment of itch severity. In this paper, we show that a home radio device paired with artificial intelligence (AI) can concurrently capture scratching and evaluate its impact on sleep quality by analyzing radio signals bouncing in the environment. The device eliminates the need for wearable sensors or skin contact, enabling monitoring of chronic itch over extended periods at home without burdening patients or interfering with their skin condition. To validate the technology, we conducted an observational clinical study of chronic pruritus patients, monitored at home for one month using both the radio device and an infrared camera. Comparing the output of the device to ground truth data from the camera demonstrates its feasibility and accuracy (ROC AUC = 0.997, sensitivity = 0.825, specificity = 0.997). The results reveal a significant correlation between scratching and low sleep quality, manifested as a reduction in sleep efficiency (R = 0.6, p < 0.001) and an increase in sleep latency (R = 0.68, p < 0.001). Our study underscores the potential of passive, long-term, at-home monitoring of chronic scratching and its sleep implications, offering a valuable tool for both clinical care of chronic itch patients and pharmaceutical clinical trials.
[AI-33] Reach Measurement Optimization and Frequency Capping In Targeted Online Advertising Under k-Anonymity
Link: https://arxiv.org/abs/2501.04882
Authors: Yuan Gao, Mu Qiao
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Abstract:The growth in the use of online advertising to foster brand awareness over recent years is largely attributable to the ubiquity of social media. One pivotal technology contributing to the success of online brand advertising is frequency capping, a mechanism that enables marketers to control the number of times an ad is shown to a specific user. However, the very foundation of this technology is being scrutinized as the industry gravitates towards advertising solutions that prioritize user privacy. This paper delves into the issue of reach measurement and optimization within the context of k-anonymity, a privacy-preserving model gaining traction across major online advertising platforms. We outline how to report reach within this new privacy landscape and demonstrate how probabilistic discounting, a probabilistic adaptation of traditional frequency capping, can be employed to optimize campaign performance. Experiments are performed to assess the trade-off between user privacy and the efficacy of online brand advertising. Notably, we discern a significant dip in performance as soon as privacy is introduced, yet this comes with a limited additional cost for advertising platforms to offer their users more privacy.
[AI-34] Exploring Large Language Models for Semantic Analysis and Categorization of Android Malware
Link: https://arxiv.org/abs/2501.04848
Authors: Brandon J Walton, Mst Eshita Khatun, James M Ghawaly, Aisha Ali-Gombe
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Abstract:Malware analysis is a complex process of examining and evaluating malicious software’s functionality, origin, and potential impact. This arduous process typically involves dissecting the software to understand its components, infection vector, propagation mechanism, and payload. Over the years, deep reverse engineering of malware has become increasingly tedious, mainly due to modern malicious codebases’ fast evolution and sophistication. Essentially, analysts are tasked with identifying the elusive needle in the haystack within the complexities of zero-day malware, all while under tight time constraints. Thus, in this paper, we explore leveraging Large Language Models (LLMs) for semantic malware analysis to expedite the analysis of known and novel samples. Built on the GPT-4o-mini model, the proposed tool is designed to augment malware analysis for Android through a hierarchical-tiered summarization chain and strategic prompt engineering. Additionally, it performs malware categorization, distinguishing potential malware from benign applications, thereby saving time during the malware reverse engineering process. Despite not being fine-tuned for Android malware analysis, we demonstrate that through optimized and advanced prompt engineering the tool can achieve up to 77% classification accuracy while providing highly robust summaries at functional, class, and package levels. In addition, leveraging the backward tracing of the summaries from package to function levels allowed us to pinpoint the precise code snippets responsible for malicious behavior.
[AI-35] Do Code LLMs Understand Design Patterns? ICSE2025
Link: https://arxiv.org/abs/2501.04835
Authors: Zhenyu Pan, Xuefeng Song, Yunkun Wang, Rongyu Cao, Binhua Li, Yongbin Li, Han Liu
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by the LLM4Code workshop at ICSE 2025
Abstract:Code Large Language Models (LLMs) demonstrate great versatility in adapting to various downstream tasks, including code generation and completion, as well as bug detection and fixing. However, Code LLMs often fail to capture existing coding standards, leading to the generation of code that conflicts with the required design patterns for a given project. As a result, developers must post-process to adapt the generated code to the project’s design norms. In this work, we empirically investigate the biases of Code LLMs in software development. Through carefully designed experiments, we assess the models’ understanding of design patterns across recognition, comprehension, and generation. Our findings reveal that biases in Code LLMs significantly affect the reliability of downstream tasks.
[AI-36] ActPC-Geom: Towards Scalable Online Neural-Symbolic Learning via Accelerating Active Predictive Coding with Information Geometry Diverse Cognitive Mechanisms
Link: https://arxiv.org/abs/2501.04832
Authors: Ben Goertzel
Subjects: Artificial Intelligence (cs.AI)
Abstract:This paper introduces ActPC-Geom, an approach to accelerate Active Predictive Coding (ActPC) in neural networks by integrating information geometry, specifically using Wasserstein-metric-based methods for measure-dependent gradient flows. We propose replacing KL-divergence in ActPC’s predictive error assessment with the Wasserstein metric, suggesting this may enhance network robustness. To make this computationally feasible, we present strategies including: (1) neural approximators for inverse measure-dependent Laplacians, (2) approximate kernel PCA embeddings for low-rank approximations feeding into these approximators, and (3) compositional hypervector embeddings derived from kPCA outputs, with algebra optimized for fuzzy FCA lattices learned through neural architectures analyzing network states. This results in an ActPC architecture capable of real-time online learning and integrating continuous (e.g., transformer-like or Hopfield-net-like) and discrete symbolic ActPC networks, including frameworks like OpenCog Hyperon or ActPC-Chem for algorithmic chemistry evolution. Shared probabilistic, concept-lattice, and hypervector models enable symbolic-subsymbolic integration. Key features include (1) compositional reasoning via hypervector embeddings in transformer-like architectures for tasks like commonsense reasoning, and (2) Hopfield-net dynamics enabling associative long-term memory and attractor-driven cognitive features. We outline how ActPC-Geom combines few-shot learning with online weight updates, enabling deliberative thinking and seamless symbolic-subsymbolic reasoning. Ideas from Galois connections are explored for efficient hybrid ActPC/ActPC-Chem processing. Finally, we propose a specialized HPC design optimized for real-time focused attention and deliberative reasoning tailored to ActPC-Geom’s demands. 
[AI-37] Intelligent Gradient Boosting Algorithms for Estimating Strength of Modified Subgrade Soil
Link: https://arxiv.org/abs/2501.04826
Authors: Ismail B. Mustapha, Muyideen Abdulkareem, Shafaatunnur Hasan, Abideen Ganiyu, Hatem Nabus, Jin Chai Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 17 pages
Abstract:The performance of pavement under loading depends on the strength of the subgrade. However, experimental estimation of pavement strength properties such as California bearing ratio (CBR), unconfined compressive strength (UCS) and resistance value (R) is often tedious, time-consuming and costly, thereby inspiring a growing interest in machine learning based tools, which are simple, cheap and fast alternatives. Thus, the potential application of two boosting techniques, categorical boosting (CatBoost) and extreme gradient boosting (XGBoost), together with support vector regression (SVR), is explored in this study for estimating properties of subgrade soil modified with hydrated lime activated rice husk ash (HARSH). Using 121 experimental data samples of varying proportions of HARSH, plastic limit, liquid limit, plasticity index, clay activity, optimum moisture content, and maximum dry density as input for CBR, UCS and R estimation, four evaluation metrics, namely coefficient of determination (R²), root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), are used to evaluate the models’ performance. The results indicate that XGBoost outperformed CatBoost and SVR in estimating these properties, yielding R² of 0.9994, 0.9995 and 0.9999 in estimating the CBR, UCS and R respectively. Also, SVR outperformed CatBoost in estimating the CBR and R with R² of 0.9997 respectively. On the other hand, CatBoost outperformed SVR in estimating the UCS with R² of 0.9994. Feature sensitivity analysis shows that the three machine learning techniques are unanimous that increasing HARSH proportion lead to values of the estimated properties respectively. A comparison with previous results also shows superiority of XGBoost in estimating subgrade properties.
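The four evaluation metrics named in this abstract have standard definitions, sketched below in plain Python. The sample values are made up for illustration and are not taken from the paper; the function name `regression_metrics` is likewise an assumption.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and MAPE (in percent) for paired observations.

    MAPE divides by the true values, so y_true must not contain zeros.
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "mape": 100.0 * sum(abs(r / t) for r, t in zip(residuals, y_true)) / n,
    }


# Hypothetical measurements and model estimates.
measured = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(measured, estimated)
print(metrics)
```

R² values very close to 1 on a small dataset, as reported here, are worth reading alongside the error metrics, since R² alone does not show the size of the errors in physical units.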
[AI-38] Planing It by Ear: Convolutional Neural Networks for Acoustic Anomaly Detection in Industrial Wood Planers
Link: https://arxiv.org/abs/2501.04819
Authors: Anthony Deschênes,Rémi Georges,Cem Subakan,Bruna Ugulino,Antoine Henry,Michael Morin
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:
Abstract:In recent years, the wood product industry has been facing a skilled labor shortage. The result is more frequent sudden failures, resulting in additional costs for these companies already operating in a very competitive market. Moreover, sawmills are challenging environments for machinery and sensors. Given that experienced machine operators may be able to diagnose defects or malfunctions, one possible way of assisting novice operators is through acoustic monitoring. As a step towards the automation of wood-processing equipment and decision support systems for machine operators, in this paper, we explore using a deep convolutional autoencoder for acoustic anomaly detection of wood planers on a new real-life dataset. Specifically, our convolutional autoencoder with skip connections (Skip-CAE) and our Skip-CAE transformer outperform the DCASE autoencoder baseline, one-class SVM, isolation forest and a published convolutional autoencoder architecture, respectively obtaining an area under the ROC curve of 0.846 and 0.875 on a dataset of real-factory planer sounds. Moreover, we show that adding skip connections and attention mechanism under the form of a transformer encoder-decoder helps to further improve the anomaly detection capabilities.
[AI-39] Decentralised Resource Sharing in TinyML: Wireless Bilayer Gossip Parallel SGD for Collaborative Learning
Link: https://arxiv.org/abs/2501.04817
Authors: Ziyuan Bao,Eiman Kanjo,Soumya Banerjee,Hasib-Al Rashid,Tinoosh Mohsenin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the growing computational capabilities of microcontroller units (MCUs), edge devices can now support machine learning models. However, deploying decentralised federated learning (DFL) on such devices presents key challenges, including intermittent connectivity, limited communication range, and dynamic network topologies. This paper proposes a novel framework, bilayer Gossip Decentralised Parallel Stochastic Gradient Descent (GD PSGD), designed to address these issues in resource-constrained environments. The framework incorporates a hierarchical communication structure using Distributed Kmeans (DKmeans) clustering for geographic grouping and a gossip protocol for efficient model aggregation across two layers: intra-cluster and inter-cluster. We evaluate the framework’s performance against the Centralised Federated Learning (CFL) baseline using the MCUNet model on the CIFAR-10 dataset under IID and Non-IID conditions. Results demonstrate that the proposed method achieves comparable accuracy to CFL on IID datasets, requiring only 1.8 additional rounds for convergence. On Non-IID datasets, the accuracy loss remains under 8% for moderate data imbalance. These findings highlight the framework’s potential to support scalable and privacy-preserving learning on edge devices with minimal performance trade-offs.
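The two-layer gossip aggregation described above can be illustrated with a toy sketch. It is not the authors' implementation: each model is reduced to a single scalar, the clusters are given rather than found by DKmeans, and treating the first node of each cluster as its head is an assumption made only for this example.

```python
import random

def gossip_round(values, members, rng):
    """One gossip step: two random members replace their values with the pair average."""
    if len(members) < 2:
        return
    i, j = rng.sample(members, 2)
    avg = (values[i] + values[j]) / 2.0
    values[i] = values[j] = avg

def bilayer_gossip(values, clusters, rounds, rng):
    """Alternate intra-cluster gossip with gossip among cluster heads."""
    values = list(values)
    heads = [members[0] for members in clusters]  # hypothetical head choice
    for _ in range(rounds):
        for members in clusters:          # layer 1: inside each cluster
            gossip_round(values, members, rng)
        gossip_round(values, heads, rng)  # layer 2: between cluster heads
    return values

rng = random.Random(0)
initial = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # stand-ins for model parameters
clusters = [[0, 1, 2], [3, 4, 5]]          # node indices per cluster
final = bilayer_gossip(initial, clusters, rounds=300, rng=rng)
print(final)
```

Pairwise averaging never changes the sum of the values, so every node drifts toward the global mean without any central server.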
[AI-40] Discovering new robust local search algorithms with neuro-evolution
Link: https://arxiv.org/abs/2501.04747
Authors: Mohamed Salim Amri Sakhri,Adrien Goëffon,Olivier Goudet,Frédéric Saubion,Chaïmaâ Touhami
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper explores a novel approach aimed at overcoming existing challenges in the realm of local search algorithms. Our aim is to improve the decision process that takes place within a local search algorithm so as to make the best possible transitions in the neighborhood at each iteration. To improve this process, we propose to use a neural network that has the same input information as conventional local search algorithms. In this paper, which is an extension of the work [Goudet et al. 2024] presented at EvoCOP2024, we investigate different ways of representing this information so as to make the algorithm as efficient as possible but also robust to monotonic transformations of the problem objective function. To assess the efficiency of this approach, we develop an experimental setup centered around NK landscape problems, offering the flexibility to adjust problem size and ruggedness. This approach offers a promising avenue for the emergence of new local search algorithms and the improvement of their problem-solving capabilities for black-box problems.
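For readers unfamiliar with the benchmark, the sketch below builds a random NK landscape and runs a conventional best-improvement hill climber on it. It illustrates the kind of baseline local search the abstract refers to, not the neural decision policy the paper proposes; the random choice of interacting bits and all function names are assumptions.

```python
import itertools
import random

def make_nk_landscape(n, k, rng):
    """Return a fitness function over bit lists of length n, with k interactions per bit."""
    # Bit i contributes a value that depends on itself and k other randomly chosen bits.
    neighbours = [[i] + rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    tables = [
        {key: rng.random() for key in itertools.product((0, 1), repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(bits):
        total = sum(tables[i][tuple(bits[j] for j in neighbours[i])] for i in range(n))
        return total / n

    return fitness

def hill_climb(fitness, bits):
    """Best-improvement local search over one-bit flips; stops at a local optimum."""
    bits = list(bits)
    current = fitness(bits)
    while True:
        best_i, best_f = None, current
        for i in range(len(bits)):
            bits[i] ^= 1            # try flipping bit i
            f = fitness(bits)
            bits[i] ^= 1            # undo the flip
            if f > best_f:
                best_i, best_f = i, f
        if best_i is None:
            return bits, current
        bits[best_i] ^= 1
        current = best_f

rng = random.Random(0)
fitness = make_nk_landscape(n=10, k=2, rng=rng)
start = [rng.randint(0, 1) for _ in range(10)]
end, f_end = hill_climb(fitness, start)
print(fitness(start), f_end)
```

Raising `k` makes the landscape more rugged, which is the knob the abstract mentions for adjusting problem difficulty.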
[AI-41] AI-Driven Reinvention of Hydrological Modeling for Accurate Predictions and Interpretation to Transform Earth System Modeling
Link: https://arxiv.org/abs/2501.04733
Authors: Cuihui Xia,Lei Yue,Deliang Chen,Yuyang Li,Hongqiang Yang,Ancheng Xue,Zhiqiang Li,Qing He,Guoqing Zhang,Dambaru Ballab Kattel,Lei Lei,Ming Zhou
Subjects: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:
Abstract:Traditional equation-driven hydrological models often struggle to accurately predict streamflow in challenging regional Earth systems like the Tibetan Plateau, while hybrid and existing algorithm-driven models face difficulties in interpreting hydrological behaviors. This work introduces HydroTrace, an algorithm-driven, data-agnostic model that substantially outperforms these approaches, achieving a Nash-Sutcliffe Efficiency of 98% and demonstrating strong generalization on unseen data. Moreover, HydroTrace leverages advanced attention mechanisms to capture spatial-temporal variations and feature-specific impacts, enabling the quantification and spatial resolution of streamflow partitioning as well as the interpretation of hydrological behaviors such as glacier-snow-streamflow interactions and monsoon dynamics. Additionally, a large language model (LLM)-based application allows users to easily understand and apply HydroTrace’s insights for practical purposes. These advancements position HydroTrace as a transformative tool in hydrological and broader Earth system modeling, offering enhanced prediction accuracy and interpretability.
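The Nash-Sutcliffe Efficiency quoted above compares the squared error of a simulation with the variance of the observations: a value of 1 is a perfect match, and 0 means the model is no better than predicting the observed mean. A minimal sketch with invented streamflow numbers follows; the function name is an assumption.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    n = len(observed)
    if n == 0 or n != len(simulated):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / n
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - residual / spread

# Hypothetical streamflow values, invented for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.1, 1.9, 3.2, 3.8]
nse = nash_sutcliffe_efficiency(observed, simulated)
print(round(nse, 4))
```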
[AI-42] SNR-EQ-JSCC: Joint Source-Channel Coding with SNR-Based Embedding and Query
Link: https://arxiv.org/abs/2501.04732
Authors: Hongwei Zhang,Meixia Tao
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments:
Abstract:Coping with the impact of dynamic channels is a critical issue in joint source-channel coding (JSCC)-based semantic communication systems. In this paper, we propose a lightweight channel-adaptive semantic coding architecture called SNR-EQ-JSCC. It is built upon the generic Transformer model and achieves channel adaptation (CA) by Embedding the signal-to-noise ratio (SNR) into the attention blocks and dynamically adjusting attention scores through channel-adaptive Queries. Meanwhile, penalty terms are introduced in the loss function to stabilize the training process. Considering that instantaneous SNR feedback may be imperfect, we propose an alternative method that uses only the average SNR, which requires no retraining of SNR-EQ-JSCC. Simulation results conducted on image transmission demonstrate that the proposed SNR-EQ-JSCC outperforms the state-of-the-art SwinJSCC in peak signal-to-noise ratio (PSNR) and perception metrics while only requiring 0.05% of the storage overhead and 6.38% of the computational complexity for CA. Moreover, the channel-adaptive query method demonstrates significant improvements in perception metrics. When instantaneous SNR feedback is imperfect, SNR-EQ-JSCC using only the average SNR still surpasses baseline schemes.
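Two quantities in this abstract have simple closed forms: the noise level implied by a channel SNR given in decibels, and the PSNR used to score reconstructed images. The sketch below uses the standard definitions only and does not reproduce the paper's architecture; the function names and sample pixel values are assumptions.

```python
import math

def noise_variance(signal_power, snr_db):
    """Noise variance of an additive Gaussian channel for a given SNR in dB."""
    return signal_power / (10.0 ** (snr_db / 10.0))

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(original) == 0 or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)

# Hypothetical 8-bit pixel values, invented for illustration only.
sent = [10, 20, 30, 40]
received = [11, 19, 31, 39]
print(noise_variance(1.0, 10.0), psnr(sent, received))
```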
[AI-43] One Node One Model: Featuring the Missing-Half for Graph Clustering AAAI2025
Link: https://arxiv.org/abs/2412.09902
Authors: Xuanting Xie,Bingheng Li,Erlin Pan,Zhaochen Guo,Zhao Kang,Wenyu Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Social and Information Networks (cs.SI)
Comments: Accepted by AAAI 2025
Abstract:Most existing graph clustering methods primarily focus on exploiting topological structure, often neglecting the "missing-half" node feature information, especially how these features can enhance clustering performance. This issue is further compounded by the challenges associated with high-dimensional features. Feature selection in graph clustering is particularly difficult because it requires simultaneously discovering clusters and identifying the relevant features for these clusters. To address this gap, we introduce a novel paradigm called "one node one model", which builds an exclusive model for each node and defines the node label as a combination of predictions for node groups. Specifically, the proposed "Feature Personalized Graph Clustering (FPGC)" method identifies cluster-relevant features for each node using a squeeze-and-excitation block, integrating these features into each model to form the final representations. Additionally, the concept of feature cross is developed as a data augmentation technique to learn low-order feature interactions. Extensive experimental results demonstrate that FPGC outperforms state-of-the-art clustering methods. Moreover, the plug-and-play nature of our method provides a versatile solution to enhance GNN-based models from a feature perspective.
[AI-44] Large Physics Models: Towards a collaborative approach with Large Language Models and Foundation Models
Link: https://arxiv.org/abs/2501.05382
Authors: Kristian G. Barman,Sascha Caron,Emily Sullivan,Henk W. de Regt,Roberto Ruiz de Austri,Mieke Boon,Michael Färber,Stefan Fröse,Faegheh Hasibi,Andreas Ipp,Rukshak Kapoor,Gregor Kasieczka,Daniel Kostić,Michael Krämer,Tobias Golling,Luis G. Lopez,Jesus Marco,Sydney Otten,Pawel Pawlowski,Pietro Vischia,Erik Weber,Christoph Weniger
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph); Computational Physics (physics.comp-ph); History and Philosophy of Physics (physics.hist-ph)
Comments:
Abstract:This paper explores ideas and provides a potential roadmap for the development and evaluation of physics-specific large-scale AI models, which we call Large Physics Models (LPMs). These models, based on foundation models such as Large Language Models (LLMs) - trained on broad data - are tailored to address the demands of physics research. LPMs can function independently or as part of an integrated framework. This framework can incorporate specialized tools, including symbolic reasoning modules for mathematical manipulations, frameworks to analyse specific experimental and simulated data, and mechanisms for synthesizing theories and scientific literature. We begin by examining whether the physics community should actively develop and refine dedicated models, rather than relying solely on commercial LLMs. We then outline how LPMs can be realized through interdisciplinary collaboration among experts in physics, computer science, and philosophy of science. To integrate these models effectively, we identify three key pillars: Development, Evaluation, and Philosophical Reflection. Development focuses on constructing models capable of processing physics texts, mathematical formulations, and diverse physical data. Evaluation assesses accuracy and reliability by testing and benchmarking. Finally, Philosophical Reflection encompasses the analysis of broader implications of LLMs in physics, including their potential to generate new scientific understanding and what novel collaboration dynamics might arise in research. Inspired by the organizational structure of experimental collaborations in particle physics, we propose a similarly interdisciplinary and collaborative approach to building and refining Large Physics Models. This roadmap provides specific objectives, defines pathways to achieve them, and identifies challenges that must be addressed to realise physics-specific large scale AI models.
[AI-45] Constrained Optimization of Charged Particle Tracking with Multi-Agent Reinforcement Learning
Link: https://arxiv.org/abs/2501.05113
Authors: Tobias Kortus,Ralf Keidel,Nicolas R. Gauger,Jan Kieseler (for the Bergen pCT Collaboration)
Subjects: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning demonstrated immense success in modelling complex physics-driven systems, providing end-to-end trainable solutions by interacting with a simulated or real environment, maximizing a scalar reward signal. In this work, we propose, building upon previous work, a multi-agent reinforcement learning approach with assignment constraints for reconstructing particle tracks in pixelated particle detectors. Our approach optimizes collaboratively a parametrized policy, functioning as a heuristic to a multidimensional assignment problem, by jointly minimizing the total amount of particle scattering over the reconstructed tracks in a readout frame. To satisfy constraints, guaranteeing a unique assignment of particle hits, we propose a safety layer solving a linear assignment problem for every joint action. Further, to enforce cost margins, increasing the distance of the local policies predictions to the decision boundaries of the optimizer mappings, we recommend the use of an additional component in the blackbox gradient estimation, forcing the policy to solutions with lower total assignment costs. We empirically show on simulated data, generated for a particle detector developed for proton imaging, the effectiveness of our approach, compared to multiple single- and multi-agent baselines. We further demonstrate the effectiveness of constraints with cost margins for both optimization and generalization, introduced by wider regions with high reconstruction performance as well as reduced predictive instabilities. Our results form the basis for further developments in RL-based tracking, offering both enhanced performance with constrained policies and greater flexibility in optimizing tracking algorithms through the option for individual and team rewards.
[AI-46] Simultaneous emulation and downscaling with physically-consistent deep learning-based regional ocean emulators
Link: https://arxiv.org/abs/2501.05058
Authors: Leonard Lupin-Jimenez,Moein Darman,Subhashis Hazarika,Tianning Wu,Michael Gray,Ruyoing He,Anthony Wong,Ashesh Chattopadhyay
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Geophysics (physics.geo-ph)
Comments:
Abstract:Building on the success of AI-based atmospheric emulation, we propose an AI-based ocean emulation and downscaling framework focusing on the high-resolution regional ocean over the Gulf of Mexico. Regional ocean emulation presents unique challenges owing to the complex bathymetry and lateral boundary conditions as well as fundamental biases in deep learning-based frameworks, such as instability and hallucinations. In this paper, we develop a deep learning-based framework to autoregressively integrate ocean-surface variables over the Gulf of Mexico at 8 km spatial resolution without unphysical drifts over decadal time scales and simultaneously downscale and bias-correct it to 4 km resolution using a physics-constrained generative model. The framework shows both short-term skill and accurate long-term statistics in terms of mean and variability.
[AI-47] Quantum-enhanced causal discovery for a small number of samples
Link: https://arxiv.org/abs/2501.05007
Authors: Yota Maeda,Ken Arai,Yu Tanaka,Yu Terada,Hiroshi Ueno,Hiroyuki Tezuka
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 19 pages, 8 figures
Abstract:The discovery of causal relationships from observed data has attracted significant interest from disciplines such as economics, social sciences, epidemiology, and biology. In practical applications, considerable knowledge of the underlying systems is often unavailable, and real data are often associated with nonlinear causal structures, which make the direct use of most conventional causality analysis methods difficult. This study proposes a novel quantum Peter-Clark (qPC) algorithm for causal discovery that does not assume any underlying model structures. Based on conditional independence tests in a class of reproducing kernel Hilbert spaces characterized by quantum circuits, the proposed qPC algorithm can explore causal relationships from the observed data drawn from arbitrary distributions. We conducted systematic experiments on fundamental graph parts of causal structures, demonstrating that the qPC algorithm exhibits significantly better performance than its classical counterpart, particularly with smaller sample sizes. Furthermore, we proposed a novel optimization approach based on Kernel Target Alignment (KTA) for determining hyperparameters of quantum kernels. This method effectively reduced the risk of false positives in causal discovery, enabling more reliable inference. Our theoretical and experimental results demonstrate that the proposed quantum algorithm can empower classical algorithms for robust and accurate inference in causal discovery, supporting them in regimes where classical algorithms typically fail. Additionally, the effectiveness of this method was validated using the Boston Housing dataset as a real-world application. These findings demonstrate the new potential of quantum circuit-based causal discovery methods in addressing practical challenges, particularly in small-sample scenarios where traditional approaches have shown limitations.
[AI-48] Generative Style Transfer for MRI Image Segmentation: A Case of Glioma Segmentation in Sub-Saharan Africa
Link: https://arxiv.org/abs/2501.04734
Authors: Rancy Chepchirchir,Jill Sunday,Raymond Confidence,Dong Zhang,Talha Chaudhry,Udunna C. Anazodo,Kendi Muchungi,Yujing Zou
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments:
Abstract:In Sub-Saharan Africa (SSA), the utilization of lower-quality Magnetic Resonance Imaging (MRI) technology raises questions about the applicability of machine learning methods for clinical tasks. This study aims to provide a robust deep learning-based brain tumor segmentation (BraTS) method tailored for the SSA population using a threefold approach. Firstly, the impact of domain shift from the SSA training data on model efficacy was examined, revealing no significant effect. Secondly, a comparative analysis of 3D and 2D full-resolution models using the nnU-Net framework indicates similar performance of both the models trained for 300 epochs achieving a five-fold cross-validation score of 0.93. Lastly, addressing the performance gap observed in SSA validation as opposed to the relatively larger BraTS glioma (GLI) validation set, two strategies are proposed: fine-tuning SSA cases using the GLI+SSA best-pretrained 2D fullres model at 300 epochs, and introducing a novel neural style transfer-based data augmentation technique for the SSA cases. This investigation underscores the potential of enhancing brain tumor prediction within SSA’s unique healthcare landscape.
[AI-49] Calculating Customer Lifetime Value and Churn using Beta Geometric Negative Binomial and Gamma-Gamma Distribution in a NFT based setting
Link: https://arxiv.org/abs/2501.04719
Authors: Sagarnil Das
Subjects: Applications (stat.AP); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures
Abstract:Customer Lifetime Value (CLV) is an important metric that measures the total value a customer will bring to a business over their lifetime. The Beta Geometric Negative Binomial Distribution (BGNBD) and Gamma Gamma Distribution are two models that can be used to calculate CLV, taking into account both the frequency and value of customer transactions. This article explains the BGNBD and Gamma Gamma Distribution models, and how they can be used to calculate CLV for NFT (Non-Fungible Token) transaction data in a blockchain setting. By estimating the parameters of these models using historical transaction data, businesses can gain insights into the lifetime value of their customers and make data-driven decisions about marketing and customer retention strategies.
[AI-50] Knowledge-Guided Biomarker Identification for Label-Free Single-Cell RNA-Seq Data: A Reinforcement Learning Perspective
Link: https://arxiv.org/abs/2501.04718
Authors: Meng Xiao,Weiliang Zhang,Xiaohan Huang,Hengshu Zhu,Min Wu,Xiaoli Li,Yuanchun Zhou
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
Comments: 20 pages. arXiv admin note: substantial text overlap with arXiv:2406.07418
Abstract:Gene panel selection aims to identify the most informative genomic biomarkers in label-free genomic datasets. Traditional approaches, which rely on domain expertise, embedded machine learning models, or heuristic-based iterative optimization, often introduce biases and inefficiencies, potentially obscuring critical biological signals. To address these challenges, we present an iterative gene panel selection strategy that harnesses ensemble knowledge from existing gene selection algorithms to establish preliminary boundaries or prior knowledge, which guide the initial search space. Subsequently, we incorporate reinforcement learning through a reward function shaped by expert behavior, enabling dynamic refinement and targeted selection of gene panels. This integration mitigates biases stemming from initial boundaries while capitalizing on RL’s stochastic adaptability. Comprehensive comparative experiments, case studies, and downstream analyses demonstrate the effectiveness of our method, highlighting its improved precision and efficiency for label-free biomarker discovery. Our results underscore the potential of this approach to advance single-cell genomics data analysis.
Machine Learning
[LG-0] Entangled Mean Estimation in High-Dimensions
Link: https://arxiv.org/abs/2501.05425
Authors: Ilias Diakonikolas,Daniel M. Kane,Sihan Liu,Thanasis Pittas
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:
Abstract:We study the task of high-dimensional entangled mean estimation in the subset-of-signals model. Specifically, given $N$ independent random points $x_1,\ldots,x_N$ in $\mathbb{R}^D$ and a parameter $\alpha \in (0, 1)$ such that each $x_i$ is drawn from a Gaussian with mean $\mu$ and unknown covariance, and an unknown $\alpha$-fraction of the points have identity-bounded covariances, the goal is to estimate the common mean $\mu$. The one-dimensional version of this task has received significant attention in theoretical computer science and statistics over the past decades. Recent work [LY20; CV24] has given near-optimal upper and lower bounds for the one-dimensional setting. On the other hand, our understanding of even the information-theoretic aspects of the multivariate setting has remained limited. In this work, we design a computationally efficient algorithm achieving an information-theoretically near-optimal error. Specifically, we show that the optimal error (up to polylogarithmic factors) is $f(\alpha,N) + \sqrt{D/(\alpha N)}$, where the term $f(\alpha,N)$ is the error of the one-dimensional problem and the second term is the sub-Gaussian error rate. Our algorithmic approach employs an iterative refinement strategy, whereby we progressively learn more accurate approximations $\hat{\mu}$ to $\mu$. This is achieved via a novel rejection sampling procedure that removes points significantly deviating from $\hat{\mu}$, as an attempt to filter out unusually noisy samples. A complication that arises is that rejection sampling introduces bias in the distribution of the remaining points. To address this issue, we perform a careful analysis of the bias, develop an iterative dimension-reduction strategy, and employ a novel subroutine inspired by list-decodable learning that leverages the one-dimensional result.
[LG-1] Using LLMs to Infer Non-Binary COVID-19 Sentiments of Chinese Micro-bloggers
Link: https://arxiv.org/abs/2501.05423
Authors: Jerry Chongyi Hu,Mohammed Shahid Modi,Boleslaw K. Szymanski
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Comments: 11 pages, 4 figures
Abstract:Studying public sentiment during crises is crucial for understanding how opinions and sentiments shift, resulting in polarized societies. We study Weibo, the most popular microblogging site in China, using posts made during the outbreak of the COVID-19 crisis. The study period includes the pre-COVID-19 stage, the outbreak stage, and the early stage of epidemic prevention. We use Llama 3 8B, a Large Language Model, to analyze users’ sentiments on the platform by classifying them into positive, negative, sarcastic, and neutral categories. Analyzing sentiment shifts on Weibo provides insights into how social events and government actions influence public opinion. This study contributes to understanding the dynamics of social sentiments during health crises, filling a gap in sentiment analysis for Chinese platforms. By examining these dynamics, we aim to offer valuable perspectives on digital communication’s role in shaping society’s responses during unprecedented global challenges.
[LG-2] Uncertainty-aware Knowledge Tracing AAAI2025
Link: https://arxiv.org/abs/2501.05415
Authors: Weihua Cheng,Hanwen Du,Chunxiao Li,Ersheng Ni,Liangdi Tan,Tianqi Xu,Yongxin Ni
Subjects: Machine Learning (cs.LG)
Comments: Accepted by AAAI 2025
Abstract:Knowledge Tracing (KT) is crucial in education assessment, which focuses on depicting students’ learning states and assessing students’ mastery of subjects. With the rise of modern online learning platforms, particularly massive open online courses (MOOCs), an abundance of interaction data has greatly advanced the development of the KT technology. Previous research commonly adopts deterministic representation to capture students’ knowledge states, which neglects the uncertainty during student interactions and thus fails to model the true knowledge state in learning process. In light of this, we propose an Uncertainty-Aware Knowledge Tracing model (UKT) which employs stochastic distribution embeddings to represent the uncertainty in student interactions, with a Wasserstein self-attention mechanism designed to capture the transition of state distribution in student learning behaviors. Additionally, we introduce the aleatory uncertainty-aware contrastive learning loss, which strengthens the model’s robustness towards different types of uncertainties. Extensive experiments on six real-world datasets demonstrate that UKT not only significantly surpasses existing deep learning-based models in KT prediction, but also shows unique advantages in handling the uncertainty of student interactions.
[LG-3] Integrating Explainable AI for Effective Malware Detection in Encrypted Network Traffic
Link: https://arxiv.org/abs/2501.05387
Authors: Sileshi Nibret Zeleke,Amsalu Fentie Jember,Mario Bochicchio
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted and presented at PanAfriCon AI 2024
Abstract:Encrypted network communication ensures confidentiality, integrity, and privacy between endpoints. However, attackers are increasingly exploiting encryption to conceal malicious behavior. Detecting unknown encrypted malicious traffic without decrypting the payloads remains a significant challenge. In this study, we investigate the integration of explainable artificial intelligence (XAI) techniques to detect malicious network traffic. We employ ensemble learning models to identify malicious activity using multi-view features extracted from various aspects of encrypted communication. To effectively represent malicious communication, we compiled a robust dataset with 1,127 unique connections, more than any other available open-source dataset, and spanning 54 malware families. Our models were benchmarked against the CTU-13 dataset, achieving performance of over 99% accuracy, precision, and F1-score. Additionally, the eXtreme Gradient Boosting (XGB) model demonstrated 99.32% accuracy, 99.53% precision, and 99.43% F1-score on our custom dataset. By leveraging Shapley Additive Explanations (SHAP), we identified that the maximum packet size, mean inter-arrival time of packets, and transport layer security version used are the most critical features for the global model explanation. Furthermore, key features were identified as important for local explanations across both datasets for individual traffic samples. These insights provide a deeper understanding of the model decision-making process, enhancing the transparency and reliability of detecting malicious encrypted traffic.
[LG-4] Accelerated Diffusion Models via Speculative Sampling
Link: https://arxiv.org/abs/2501.05370
Authors: Valentin De Bortoli,Alexandre Galashov,Arthur Gretton,Arnaud Doucet
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Speculative sampling is a popular technique for accelerating inference in Large Language Models by generating candidate tokens using a fast draft model and accepting or rejecting them based on the target model’s distribution. While speculative sampling was previously limited to discrete sequences, we extend it to diffusion models, which generate samples via continuous, vector-valued Markov chains. In this context, the target model is a high-quality but computationally expensive diffusion model. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model and is applicable out of the box to any diffusion model. Our experiments demonstrate significant generation speedup on various diffusion models, halving the number of function evaluations, while generating exact samples from the target model.
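The accept/reject rule mentioned in the first sentence can be sketched for the familiar discrete-token case. This is the standard speculative sampling step for categorical distributions, shown as background; it is not the continuous diffusion extension the paper develops, and the distributions and function names below are invented for illustration.

```python
import random

def sample_categorical(probs, rng):
    """Draw an index from a list of probabilities that sums to 1."""
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def speculative_sample(target, draft, rng):
    """Return one sample distributed exactly according to `target`."""
    x = sample_categorical(draft, rng)            # cheap proposal from the draft model
    if rng.random() < min(1.0, target[x] / draft[x]):
        return x                                  # proposal accepted
    # Proposal rejected: resample from the normalised positive part of (target - draft).
    residual = [max(t - d, 0.0) for t, d in zip(target, draft)]
    total = sum(residual)
    return sample_categorical([r / total for r in residual], rng)

rng = random.Random(0)
target = [0.2, 0.3, 0.5]   # hypothetical target-model distribution
draft = [0.5, 0.3, 0.2]    # hypothetical draft-model distribution
draws = 20000
counts = [0, 0, 0]
for _ in range(draws):
    counts[speculative_sample(target, draft, rng)] += 1
empirical = [c / draws for c in counts]
print(empirical)
```

The empirical frequencies approach the target distribution even though every proposal comes from the draft, which is the sense in which the accelerated samples are exact.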
[LG-5] No-Regret Linear Bandits under Gap-Adjusted Misspecification
Link: https://arxiv.org/abs/2501.05361
Authors: Chong Liu,Dan Qiao,Ming Yin,Ilija Bogunovic,Yu-Xiang Wang
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2302.13252
Abstract:This work studies linear bandits under a new notion of gap-adjusted misspecification and is an extension of Liu et al. (2023). When the underlying reward function is not linear, existing linear bandits work usually relies on a uniform misspecification parameter $\epsilon$ that measures the sup-norm error of the best linear approximation. This results in an unavoidable linear regret whenever $\epsilon > 0$. We propose a more natural model of misspecification which only requires the approximation error at each input $x$ to be proportional to the suboptimality gap at $x$. It captures the intuition that, for optimization problems, near-optimal regions should matter more and we can tolerate larger approximation errors in suboptimal regions. Quite surprisingly, we show that the classical LinUCB algorithm, designed for the realizable case, is automatically robust against such $\rho$-gap-adjusted misspecification with parameter $\rho$ diminishing at $O(1/(d \sqrt{\log T}))$. It achieves a near-optimal $O(\sqrt{T})$ regret for problems where the best-known regret is almost linear in time horizon $T$. We further advance this frontier by presenting a novel phased elimination-based algorithm whose gap-adjusted misspecification parameter $\rho = O(1/\sqrt{d})$ does not scale with $T$. This algorithm attains optimal $O(\sqrt{T})$ regret and is deployment-efficient, requiring only $\log T$ batches of exploration. It also enjoys an adaptive $O(\log T)$ regret when a constant suboptimality gap exists. Technically, our proof relies on a novel self-bounding argument that bounds the part of the regret due to misspecification by the regret itself, and a new inductive lemma that limits the misspecification error within the suboptimality gap for all valid actions in each batch selected by G-optimal design.
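As background for the classical LinUCB algorithm named in this abstract, here is a minimal sketch in the well-specified (realizable) setting. It does not model gap-adjusted misspecification or the paper's phased elimination algorithm; the exploration weight `beta`, the ridge parameter `reg`, the arm vectors and the noise level are all assumptions chosen for the example.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(m, v):
    return [dot(row, v) for row in m]

class LinUCB:
    """Ridge-regression estimate plus an upper-confidence bonus for each arm."""

    def __init__(self, dim, beta=1.0, reg=1.0):
        # a_inv holds the inverse of (reg * I + sum of x x^T over played arms).
        self.a_inv = [[(1.0 / reg) if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim   # running sum of reward * x
        self.beta = beta       # exploration weight (assumed constant here)

    def select(self, arms):
        theta = mat_vec(self.a_inv, self.b)  # current ridge estimate

        def ucb(x):
            width = math.sqrt(max(dot(x, mat_vec(self.a_inv, x)), 0.0))
            return dot(theta, x) + self.beta * width

        return max(range(len(arms)), key=lambda i: ucb(arms[i]))

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of the inverse design matrix.
        ax = mat_vec(self.a_inv, x)
        denom = 1.0 + dot(x, ax)
        dim = len(x)
        for i in range(dim):
            for j in range(dim):
                self.a_inv[i][j] -= ax[i] * ax[j] / denom
        for i in range(dim):
            self.b[i] += reward * x[i]

rng = random.Random(0)
true_theta = [1.0, 0.2]                        # hypothetical unknown parameter
arms = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]    # hypothetical arm features
bandit = LinUCB(dim=2)
pulls = [0, 0, 0]
for _ in range(1000):
    choice = bandit.select(arms)
    reward = dot(true_theta, arms[choice]) + rng.gauss(0.0, 0.1)
    bandit.update(arms[choice], reward)
    pulls[choice] += 1
print(pulls)
```

In this toy run the first arm has the highest expected reward, so the pull counts should concentrate on it as the confidence widths shrink.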
[LG-6] Stability and List-Replicability for Agnostic Learners
Link: https://arxiv.org/abs/2501.05333
Authors: Ari Blonda,Shan Gao,Hamed Hatami,Pooya Hatami
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Two seminal papers–Alon, Livni, Malliaris, Moran (STOC 2019) and Bun, Livni, and Moran (FOCS 2020)–established the equivalence between online learnability and globally stable PAC learnability in binary classification. However, Chase, Chornomaz, Moran, and Yehudayoff (STOC 2024) recently showed that this equivalence does not hold in the agnostic setting. Specifically, they proved that in the agnostic setting, only finite hypothesis classes are globally stable learnable. Therefore, agnostic global stability is too restrictive to capture interesting hypothesis classes. To address this limitation, Chase et al. introduced two relaxations of agnostic global stability. In this paper, we characterize the classes that are learnable under their proposed relaxed conditions, resolving the two open problems raised in their work. First, we prove that in the setting where the stability parameter can depend on the excess error (the gap between the learner’s error and the best achievable error by the hypothesis class), agnostic stability is fully characterized by the Littlestone dimension. Consequently, as in the realizable case, this form of learnability is equivalent to online learnability. As part of the proof of this theorem, we strengthen the celebrated result of Bun et al. by showing that classes with infinite Littlestone dimension are not stably PAC learnable, even if we allow the stability parameter to depend on the excess error. For the second relaxation proposed by Chase et al., we prove that only finite hypothesis classes are globally stable learnable even if we restrict the agnostic setting to distributions with small population loss.
[LG-7] Knowledge Transfer in Model-Based Reinforcement Learning Agents for Efficient Multi-Task Learning AAMAS2025
Link: https://arxiv.org/abs/2501.05329
Authors: Dmytro Kuzmenko,Nadiya Shvai
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: Preprint of an extended abstract accepted to AAMAS 2025
Abstract:We propose an efficient knowledge transfer approach for model-based reinforcement learning, addressing the challenge of deploying large world models in resource-constrained environments. Our method distills a high-capacity multi-task agent (317M parameters) into a compact 1M parameter model, achieving state-of-the-art performance on the MT30 benchmark with a normalized score of 28.45, a substantial improvement over the original 1M parameter model’s score of 18.93. This demonstrates the ability of our distillation technique to consolidate complex multi-task knowledge effectively. Additionally, we apply FP16 post-training quantization, reducing the model size by 50% while maintaining performance. Our work bridges the gap between the power of large models and practical deployment constraints, offering a scalable solution for efficient and accessible multi-task reinforcement learning in robotics and other resource-limited domains.
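The 50% size reduction quoted for FP16 post-training quantization follows directly from storing each weight in 2 bytes instead of 4. The sketch below illustrates this with NumPy on randomly generated weights; the layer names and shapes are hypothetical and have nothing to do with the paper's actual agent.

```python
import numpy as np

def quantize_fp16(params):
    """Cast every float32 weight array to float16 (post-training, no retraining)."""
    return {name: w.astype(np.float16) for name, w in params.items()}

def size_bytes(params):
    """Total storage of all weight arrays in bytes."""
    return sum(w.nbytes for w in params.values())

# Hypothetical toy "model": two dense layers with standard-normal weights.
rng = np.random.default_rng(0)
params = {
    "layer1": rng.normal(size=(256, 128)).astype(np.float32),
    "layer2": rng.normal(size=(128, 32)).astype(np.float32),
}
quantized = quantize_fp16(params)

ratio = size_bytes(quantized) / size_bytes(params)
max_abs_err = max(
    float(np.max(np.abs(params[k] - quantized[k].astype(np.float32))))
    for k in params
)
```

Here `ratio` is exactly 0.5, and the rounding error per weight stays small for weights of ordinary magnitude. Whether task performance is preserved still has to be checked empirically, as the authors report doing.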
[LG-8] The explanation dialogues: an expert focus study to understand requirements towards explanations within the GDPR
Link: https://arxiv.org/abs/2501.05325
Authors: Laura State,Alejandra Bringas Colmenarejo,Andrea Beretta,Salvatore Ruggieri,Franco Turini,Stephanie Law
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
*Comments: Artificial Intelligence and Law (Springer Nature)
Abstract:Explainable AI (XAI) provides methods to understand non-interpretable machine learning models. However, we have little knowledge about what legal experts expect from these explanations, including their legal compliance with, and value against European Union legislation. To close this gap, we present the Explanation Dialogues, an expert focus study to uncover the expectations, reasoning, and understanding of legal experts and practitioners towards XAI, with a specific focus on the European General Data Protection Regulation. The study consists of an online questionnaire and follow-up interviews, and is centered around a use-case in the credit domain. We extract both a set of hierarchical and interconnected codes using grounded theory, and present the standpoints of the participating experts towards XAI. We find that the presented explanations are hard to understand and lack information, and discuss issues that can arise from the different interests of the data controller and subject. Finally, we present a set of recommendations for developers of XAI methods, and indications of legal areas of discussion. Among others, recommendations address the presentation, choice, and content of an explanation, technical risks as well as the end-user, while we provide legal pointers to the contestability of explanations, transparency thresholds, intellectual property rights as well as the relationship between involved parties.
[LG-9] Distributed Learning and Inference Systems: A Networking Perspective
Link: https://arxiv.org/abs/2501.05323
Authors: Hesham G. Moussa,Arashmid Akhavain,S. Maryam Hosseini,Bill McCormick
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: This paper has been submitted to IEEE Network magazine and is still under review
Abstract:Machine learning models have achieved, and in some cases surpassed, human-level performance in various tasks, mainly through centralized training of static models and the use of large models stored in centralized clouds for inference. However, this centralized approach has several drawbacks, including privacy concerns, high storage demands, a single point of failure, and significant computing requirements. These challenges have driven interest in developing alternative decentralized and distributed methods for AI training and inference. Distribution introduces additional complexity, as it requires managing multiple moving parts. To address these complexities and fill a gap in the development of distributed AI systems, this work proposes a novel framework, Data and Dynamics-Aware Inference and Training Networks (DA-ITN). The different components of DA-ITN and their functions are explored, and the associated challenges and research areas are highlighted.
[LG-10] Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing
Link: https://arxiv.org/abs/2501.05313
Authors: Mengfan Liu,Wei Wang,Chuan Wu
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments:
Abstract:With the advancement of serverless computing, running machine learning (ML) inference services over a serverless platform has been advocated, given its labor-free scalability and cost effectiveness. Mixture-of-Experts (MoE) models have been a dominant type of model architecture to enable large models nowadays, with parallel expert networks. Serving large MoE models on serverless computing is potentially beneficial, but has been underexplored due to substantial challenges in handling the skewed expert popularity and scatter-gather communication bottleneck in MoE model execution, for cost-efficient serverless MoE deployment and performance guarantee. We study optimized MoE model deployment and distributed inference serving on a serverless platform that effectively predicts expert selection, pipelines communication with model execution, and minimizes the overall billed cost of serving MoE models. Specifically, we propose a Bayesian optimization framework with multi-dimensional epsilon-greedy search to learn expert selections and optimal MoE deployment achieving optimal billed cost, including: 1) a Bayesian decision-making method for predicting expert popularity; 2) flexibly pipelined scatter-gather communication; and 3) an optimal model deployment algorithm for distributed MoE serving. Extensive experiments on AWS Lambda show that our designs reduce the billed cost of all MoE layers by at least 75.67% compared to CPU clusters while maintaining satisfactory inference throughput. As compared to LambdaML in serverless computing, our designs achieve 43.41% lower cost with a throughput decrease of at most 18.76%.
[LG-11] Private Selection with Heterogeneous Sensitivities
Link: https://arxiv.org/abs/2501.05309
Authors: Daniela Antonova,Allegra Laro,Audra McMillan,Lorenz Wolf
Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Comments: 21 pages, 18 figures
Abstract:Differentially private (DP) selection involves choosing a high-scoring candidate from a finite candidate pool, where each score depends on a sensitive dataset. This problem arises naturally in a variety of contexts including model selection, hypothesis testing, and within many DP algorithms. Classical methods, such as Report Noisy Max (RNM), assume all candidates’ scores are equally sensitive to changes in a single individual’s data, but this often isn’t the case. To address this, algorithms like the Generalised Exponential Mechanism (GEM) leverage variability in candidate sensitivities. However, we observe that while these algorithms can outperform RNM in some situations, they may underperform in others - they can even perform worse than random selection. In this work, we explore how the distribution of scores and sensitivities impacts DP selection mechanisms. In all settings we study, we find that there exists a mechanism that utilises heterogeneity in the candidate sensitivities that outperforms standard mechanisms like RNM. However, no single mechanism uniformly outperforms RNM. We propose using the correlation between the scores and sensitivities as the basis for deciding which DP selection mechanism to use. Further, we design a slight variant of GEM, modified GEM, that generally performs well whenever GEM performs poorly. Relying on the correlation heuristic, we propose combined GEM, which adaptively chooses between GEM and modified GEM and outperforms both in polarised settings.
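As background for the baseline named in the abstract, the following is a minimal sketch of Report Noisy Max with Laplace noise. It is the classical mechanism, not the paper's GEM variants. The noise scale `2 * Δ / epsilon`, with `Δ` taken as the largest candidate sensitivity, is a conservative textbook choice and an assumption here. Treating every candidate as if it had the worst-case sensitivity is exactly the simplification the paper questions.

```python
import numpy as np

def report_noisy_max(scores, sensitivities, epsilon, rng=None):
    """Return the index of the candidate with the largest Laplace-noised score.

    All candidates receive noise calibrated to the single worst-case
    sensitivity, so heterogeneity in `sensitivities` is ignored.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    delta = float(np.max(sensitivities))  # worst-case sensitivity
    noise = rng.laplace(loc=0.0, scale=2.0 * delta / epsilon, size=scores.shape)
    return int(np.argmax(scores + noise))

# With a very large epsilon the noise is negligible and the true argmax is returned.
pick = report_noisy_max([1.0, 5.0, 3.0], [1.0, 1.0, 1.0], epsilon=1e6,
                        rng=np.random.default_rng(0))
```

With a realistic privacy budget (for example `epsilon=1.0`) the returned index is random, and a candidate with a low score but a low sensitivity is penalised by noise it does not need.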
[LG-12] Learning convolution operators on compact Abelian groups
Link: https://arxiv.org/abs/2501.05279
Authors: Emilia Magnani,Ernesto De Vito,Philipp Hennig,Lorenzo Rosasco
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:We consider the problem of learning convolution operators associated to compact Abelian groups. We study a regularization-based approach and provide corresponding learning guarantees, discussing natural regularity conditions on the convolution kernel. More precisely, we assume the convolution kernel is a function in a translation invariant Hilbert space and analyze a natural ridge regression (RR) estimator. Building on existing results for RR, we characterize the accuracy of the estimator in terms of finite sample bounds. Interestingly, regularity assumptions that are classical in the analysis of RR have a novel and natural interpretation in terms of space/frequency localization. Theoretical results are illustrated by numerical simulations.
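To make the setting concrete, here is a minimal sketch for the simplest compact Abelian group, the cyclic group $\mathbb{Z}_n$, where convolution is circular and the discrete Fourier transform diagonalizes it. Ridge regression then decouples across frequencies: $\hat{K}(\omega) = \sum_i \overline{X_i(\omega)}\, Y_i(\omega) / (\sum_i |X_i(\omega)|^2 + \lambda)$. The plain $\ell_2$ penalty on the kernel is an assumption; the paper works with a translation invariant Hilbert space norm, which would weight frequencies differently.

```python
import numpy as np

def circ_conv(k, x):
    """Circular convolution on Z_n computed through the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(x)))

def ridge_kernel_estimate(X, Y, lam):
    """Ridge estimate of the convolution kernel from input/output pairs.

    X, Y: arrays of shape (m, n), with Y[i] approximately circ_conv(k, X[i]).
    Uses an l2 penalty lam * ||k||^2 (an assumption, see lead-in).
    """
    Xf = np.fft.fft(X, axis=1)
    Yf = np.fft.fft(Y, axis=1)
    numerator = np.sum(np.conj(Xf) * Yf, axis=0)
    denominator = np.sum(np.abs(Xf) ** 2, axis=0) + lam
    return np.real(np.fft.ifft(numerator / denominator))

# Noiseless toy data: recover a random kernel from 20 random inputs.
rng = np.random.default_rng(0)
n, m = 16, 20
k_true = rng.normal(size=n)
X = rng.normal(size=(m, n))
Y = np.stack([circ_conv(k_true, x) for x in X])
k_hat = ridge_kernel_estimate(X, Y, lam=1e-9)
```

On this noiseless example `k_hat` matches `k_true` up to numerical precision. With noisy outputs, larger `lam` trades bias for variance frequency by frequency.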
[LG-13] EVA-S2PLoR: A Secure Element-wise Multiplication Meets Logistic Regression on Heterogeneous Database
Link: https://arxiv.org/abs/2501.05223
Authors: Tianle Tao,Shizhao Peng,Tianyu Mei,Shoumo Li,Haogang Zhu
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:
Abstract:Accurate nonlinear computation is a key challenge in privacy-preserving machine learning (PPML). Most existing frameworks approximate it through linear operations, resulting in significant precision loss. This paper proposes an efficient, verifiable and accurate secure 2-party logistic regression framework (EVA-S2PLoR), which achieves accurate nonlinear function computation through a novel secure element-wise multiplication protocol and its derived protocols. Our framework primarily includes secure 2-party vector element-wise multiplication, addition to multiplication, reciprocal, and sigmoid function based on data disguising technology, where high efficiency and accuracy are guaranteed by the simple computation flow based on the real number domain and the small number of fixed communication rounds. We provide secure and robust anomaly detection through dimension transformation and Monte Carlo methods. EVA-S2PLoR outperforms many advanced frameworks in terms of precision (improving the performance of the sigmoid function by about 10 orders of magnitude compared to most frameworks) and delivers the best overall performance in secure logistic regression experiments.
[LG-14] CoDe: Communication Delay-Tolerant Multi-Agent Collaboration via Dual Alignment of Intent and Timeliness AAAI2025
Link: https://arxiv.org/abs/2501.05207
Authors: Shoucheng Song,Youfang Lin,Sheng Han,Chang Yao,Hao Wu,Shuo Wang,Kai Lv
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*Comments: AAAI 2025 Accepted
Abstract:Communication has been widely employed to enhance multi-agent collaboration. Previous research has typically assumed delay-free communication, a strong assumption that is challenging to meet in practice. However, real-world agents suffer from channel delays, receiving messages sent at different time points, termed Asynchronous Communication, leading to cognitive biases and breakdowns in collaboration. This paper first defines two communication delay settings in MARL and emphasizes their harm to collaboration. To handle the above delays, this paper proposes a novel framework, Communication Delay-tolerant Multi-Agent Collaboration (CoDe). First, CoDe learns an intent representation as messages through future action inference, reflecting the stable future behavioral trends of the agents. Then, CoDe devises a dual alignment mechanism of intent and timeliness to strengthen the fusion process of asynchronous messages. In this way, agents can extract the long-term intent of others, even from delayed messages, and selectively utilize the most recent messages that are relevant to their intent. Experimental results demonstrate that CoDe outperforms baseline algorithms in three MARL benchmarks without delay and exhibits robustness under fixed and time-varying delays.
[LG-15] Design and Control of a Bipedal Robotic Character
Link: https://arxiv.org/abs/2501.05204
Authors: Ruben Grandia,Espen Knoop,Michael A. Hopkins,Georg Wiedebach,Jared Bishop,Steven Pickles,David Müller,Moritz Bächer
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:
Abstract:Legged robots have achieved impressive feats in dynamic locomotion in challenging unstructured terrain. However, in entertainment applications, the design and control of these robots face additional challenges in appealing to human audiences. This work aims to unify expressive, artist-directed motions and robust dynamic mobility for legged robots. To this end, we introduce a new bipedal robot, designed with a focus on character-driven mechanical features. We present a reinforcement learning-based control architecture to robustly execute artistic motions conditioned on command signals. During runtime, these command signals are generated by an animation engine which composes and blends between multiple animation sources. Finally, an intuitive operator interface enables real-time show performances with the robot. The complete system results in a believable robotic character, and paves the way for enhanced human-robot engagement in various contexts, in entertainment robotics and beyond.
[LG-16] De-centering the (Traditional) User: Multistakeholder Evaluation of Recommender Systems
Link: https://arxiv.org/abs/2501.05170
Authors: Robin Burke,Gediminas Adomavicius,Toine Bogers,Tommaso Di Noia,Dominik Kowald,Julia Neidhardt,Özlem Özgöbek,Maria Soledad Pera,Nava Tintarev,Jürgen Ziegler
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments: Preprint submitted to Elsevier, “Re-centering the User in Recommender System Research” special issue of the International Journal of Human-Computer Studies (IJHCS)
Abstract:Multistakeholder recommender systems are those that account for the impacts and preferences of multiple groups of individuals, not just the end users receiving recommendations. Due to their complexity, evaluating these systems cannot be restricted to the overall utility of a single stakeholder, as is often the case of more mainstream recommender system applications. In this article, we focus our discussion on the intricacies of the evaluation of multistakeholder recommender systems. We bring attention to the different aspects involved in the evaluation of multistakeholder recommender systems - from the range of stakeholders involved (including but not limited to producers and consumers) to the values and specific goals of each relevant stakeholder. Additionally, we discuss how to move from theoretical principles to practical implementation, providing specific use case examples. Finally, we outline open research directions for the RecSys community to explore. We aim to provide guidance to researchers and practitioners about how to think about these complex and domain-dependent issues of evaluation in the course of designing, developing, and researching applications with multistakeholder aspects.
[LG-17] Learning In-Distribution Representations for Anomaly Detection
Link: https://arxiv.org/abs/2501.05130
Authors: William T. Lunardi,Abdulrahman Banabila,Dania Herzalla,Martin L. Andreoni
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Anomaly detection involves identifying data patterns that deviate from the anticipated norm. Traditional methods struggle in high-dimensional spaces due to the curse of dimensionality. In recent years, self-supervised learning, particularly through contrastive objectives, has driven advances in anomaly detection. However, vanilla contrastive learning struggles to align with the unique demands of anomaly detection, as it lacks a pretext task tailored to the homogeneous nature of In-Distribution (ID) data and the diversity of Out-of-Distribution (OOD) anomalies. Methods that attempt to address these challenges, such as introducing hard negatives through synthetic outliers, Outlier Exposure (OE), and supervised objectives, often rely on pretext tasks that fail to balance compact clustering of ID samples with sufficient separation from OOD data. In this work, we propose Focused In-distribution Representation Modeling (FIRM), a contrastive learning objective specifically designed for anomaly detection. Unlike existing approaches, FIRM incorporates synthetic outliers into its pretext task in a way that actively shapes the representation space, promoting compact clustering of ID samples while enforcing strong separation from outliers. This formulation addresses the challenges of class collision, enhancing both the compactness of ID representations and the discriminative power of the learned feature space. We show that FIRM surpasses other contrastive methods in standard benchmarks, significantly enhancing anomaly detection compared to both traditional and supervised contrastive learning objectives. Our ablation studies confirm that FIRM consistently improves the quality of representations and shows robustness across a range of scoring methods. The code is available at: this https URL.
[LG-18] EquiBoost: An Equivariant Boosting Approach to Molecular Conformation Generation
Link: https://arxiv.org/abs/2501.05109
Authors: Yixuan Yang,Xingyu Fang,Zhaowen Cheng,Pengju Yan,Xiaolin Li
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*Comments:
Abstract:Molecular conformation generation plays a key role in computational drug design. Recently developed deep learning methods, particularly diffusion models, have reached competitive performance over traditional cheminformatical approaches. However, these methods are often time-consuming or require extra support from traditional methods. We propose EquiBoost, a boosting model that stacks several equivariant graph transformers as weak learners, to iteratively refine 3D conformations of molecules. Without relying on diffusion techniques, EquiBoost balances accuracy and efficiency more effectively than diffusion-based methods. Notably, compared to the previous state-of-the-art diffusion method, EquiBoost improves generation quality and preserves diversity, achieving considerably better precision of Average Minimum RMSD (AMR) on the GEOM datasets. This work rejuvenates boosting and sheds light on its potential to be a robust alternative to diffusion models in certain scenarios.
[LG-19] Hierarchical Decomposed Dual-domain Deep Learning for Sparse-View CT Reconstruction
Link: https://arxiv.org/abs/2501.05093
Authors: Yoseob Han
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: Published by Physics in Medicine & Biology (2024.4)
Abstract:Objective: X-ray computed tomography employing sparse projection views has emerged as a contemporary technique to mitigate radiation dose. However, due to the inadequate number of projection views, an analytic reconstruction method utilizing filtered backprojection results in severe streaking artifacts. Recently, deep learning strategies employing image-domain networks have demonstrated remarkable performance in eliminating the streaking artifact caused by analytic reconstruction methods with sparse projection views. Nevertheless, it is difficult to clarify the theoretical justification for applying deep learning to sparse view CT reconstruction, and it has been understood as restoration by removing image artifacts, not reconstruction. Approach: By leveraging the theory of deep convolutional framelets and the hierarchical decomposition of measurement, this research reveals the constraints of conventional image- and projection-domain deep learning methodologies. Subsequently, the research proposes a novel dual-domain deep learning framework utilizing hierarchical decomposed measurements. Specifically, the research elucidates how the performance of the projection-domain network can be enhanced through a low-rank property of deep convolutional framelets and a bowtie support of hierarchical decomposed measurement in the Fourier domain. Main Results: This study demonstrated performance improvement of the proposed framework based on the low-rank property, resulting in superior reconstruction performance compared to conventional analytic and deep learning methods. Significance: By providing a theoretically justified deep learning approach for sparse-view CT reconstruction, this study not only offers a superior alternative to existing methods but also opens new avenues for research in medical imaging.
[LG-20] Enhanced Quantile Regression with Spiking Neural Networks for Long-Term System Health Prognostics
Link: https://arxiv.org/abs/2501.05087
Authors: David J Poland
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:
Abstract:This paper presents a novel predictive maintenance framework centered on Enhanced Quantile Regression Neural Networks (EQRNNs) for anticipating system failures in industrial robotics. We address the challenge of early failure detection through a hybrid approach that combines advanced neural architectures. The system leverages dual computational stages: first implementing an EQRNN optimized for processing multi-sensor data streams including vibration, thermal, and power signatures, followed by an integrated Spiking Neural Network (SNN) layer that enables microsecond-level response times. This architecture achieves notable accuracy rates of 92.3% in component failure prediction with a 90-hour advance warning window. Field testing conducted on an industrial scale with 50 robotic systems demonstrates significant operational improvements, yielding a 94% decrease in unexpected system failures and 76% reduction in maintenance-related downtimes. The framework’s effectiveness in processing complex, multi-modal sensor data while maintaining computational efficiency validates its applicability for Industry 4.0 manufacturing environments.
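Quantile regression networks are usually trained with the pinball (quantile) loss, $L_\tau(y, \hat{y}) = \max(\tau (y - \hat{y}), (\tau - 1)(y - \hat{y}))$. The sketch below shows that standard loss only. The abstract does not say what the "enhanced" variant changes, so nothing here should be read as the paper's objective.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile level tau in (0, 1).

    Under-prediction (y_true > y_pred) is weighted by tau and
    over-prediction by (1 - tau).
    """
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

# For tau = 0.9, under-predicting by 1 costs nine times more than over-predicting by 1.
under = pinball_loss([1.0], [0.0], tau=0.9)
over = pinball_loss([0.0], [1.0], tau=0.9)
```

The asymmetry is what makes the minimizer a quantile rather than a mean: with `tau=0.9` the model is pushed to predict a value that the target exceeds only about 10% of the time, which is a natural fit for conservative failure-time estimates.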
[LG-21] DriVLM: Domain Adaptation of Vision-Language Models in Autonomous Driving
Link: https://arxiv.org/abs/2501.05081
Authors: Xuran Zheng,Chang D. Yoo
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:In recent years, large language models have had a very impressive performance, which largely contributed to the development and application of artificial intelligence, and the parameters and performance of the models are still growing rapidly. In particular, multimodal large language models (MLLM) can combine multiple modalities such as pictures, videos, sounds, texts, etc., and have great potential in various tasks. However, most MLLMs require very high computational resources, which is a major challenge for most researchers and developers. In this paper, we explored the utility of small-scale MLLMs and applied small-scale MLLMs to the field of autonomous driving. We hope that this will advance the application of MLLMs in real-world scenarios.
[LG-22] LearningFlow: Automated Policy Learning Workflow for Urban Driving with Large Language Models
Link: https://arxiv.org/abs/2501.05057
Authors: Zengqi Peng,Yubin Wang,Xu Han,Lei Zheng,Jun Ma
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:
Abstract:Recent advancements in reinforcement learning (RL) demonstrate the significant potential in autonomous driving. Despite this promise, challenges such as the manual design of reward functions and low sample efficiency in complex environments continue to impede the development of safe and effective driving policies. To tackle these issues, we introduce LearningFlow, an innovative automated policy learning workflow tailored to urban driving. This framework leverages the collaboration of multiple large language model (LLM) agents throughout the RL training process. LearningFlow includes a curriculum sequence generation process and a reward generation process, which work in tandem to guide the RL policy by generating tailored training curricula and reward functions. Particularly, each process is supported by an analysis agent that evaluates training progress and provides critical insights to the generation agent. Through the collaborative efforts of these LLM agents, LearningFlow automates policy learning across a series of complex driving tasks, and it significantly reduces the reliance on manual reward function design while enhancing sample efficiency. Comprehensive experiments are conducted in the high-fidelity CARLA simulator, along with comparisons with other existing methods, to demonstrate the efficacy of our proposed approach. The results demonstrate that LearningFlow excels in generating rewards and curricula. It also achieves superior performance and robust generalization across various driving tasks, as well as commendable adaptation to different RL algorithms.
[LG-23] A High-accuracy Calibration Method of Transient TSEPs for Power Semiconductor Devices
Link: https://arxiv.org/abs/2501.05005
Authors: Qinghao Zhang,Wenrui Li,Pinjia Zhang
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:The thermal sensitive electrical parameter (TSEP) method is crucial for enhancing the reliability of power devices through junction temperature monitoring. The TSEP method comprises three key processes: calibration, regression, and application. While significant efforts have been devoted to improving regression algorithms and increasing TSEP sensitivity to enhance junction temperature monitoring accuracy, these approaches have reached a bottleneck. In reality, the calibration method significantly influences monitoring accuracy, an aspect often overlooked in conventional TSEP methods. To address this issue, we propose a high-accuracy calibration method for transient TSEPs. First, a temperature compensation strategy based on thermal analysis is introduced to mitigate the temperature difference caused by load current during dual pulse tests. Second, the impact of stray parameters is analyzed to identify coupled parameters, which are typically neglected in existing methods. Third, it is observed that random errors follow a logarithm Gaussian distribution, covering a hidden variable. A neural network is used to obtain the junction temperature predictive model. The proposed calibration method is experimentally validated using threshold voltage as an example. Compared with conventional calibration methods, the mean absolute error is reduced by over 30%. Moreover, this method does not require additional hardware cost and has good generalization.
[LG-24] Load Forecasting for Households and Energy Communities: Are Deep Learning Models Worth the Effort?
Link: https://arxiv.org/abs/2501.05000
Authors: Lukas Moosbrugger,Valentin Seiler,Philipp Wohlgenannt,Sebastian Hegenbart,Sashko Ristov,Peter Kepplinger
Subjects: Machine Learning (cs.LG)
*Comments: This preprint was submitted to the Elsevier journal Energy and AI on December 18, 2024
Abstract:Accurate load forecasting is crucial for predictive control in many energy domain applications, with significant economic and ecological implications. To address these implications, this study provides an extensive benchmark of state-of-the-art deep learning models for short-term load forecasting in energy communities. Namely, LSTM, xLSTM, and Transformers are compared with benchmarks such as KNNs, synthetic load models, and persistence forecasting models. This comparison considers different scales of aggregation (e.g., number of household loads) and varying training data availability (e.g., training data time spans). Further, the impact of transfer learning from synthetic (standard) load profiles and the deep learning model size (i.e., parameter count) is investigated in terms of forecasting error. Implementations are publicly available and other researchers are encouraged to benchmark models using this framework. Additionally, a comprehensive case study, comprising an energy community of 50 households and a battery storage demonstrates the beneficial financial implications of accurate predictions. Key findings of this research include: (1) Simple persistence benchmarks outperform deep learning models for short-term load forecasting when the available training data is limited to six months or less; (2) Pretraining with publicly available synthetic load profiles improves the normalized Mean Absolute Error (nMAE) by an average of 1.28%pt during the first nine months of training data; (3) Increased aggregation significantly enhances the performance of deep learning models relative to persistence benchmarks; (4) Improved load forecasting, with an nMAE reduction of 1.1%pt, translates to an economic benefit of approximately 600EUR per year in an energy community comprising 50 households.
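To make the persistence benchmark and the nMAE metric mentioned in this abstract concrete, here is a minimal sketch. The abstract does not define the normalization, so dividing the MAE by the mean absolute load is an assumption, and the toy series and the helper names `persistence_forecast` and `nmae` are hypothetical.

```python
import numpy as np

def persistence_forecast(load, lag):
    """Predict each value as the value observed `lag` steps earlier."""
    load = np.asarray(load, dtype=float)
    return load[:-lag]  # aligned with the targets load[lag:]

def nmae(y_true, y_pred):
    """MAE divided by the mean absolute load (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred)) / np.mean(np.abs(y_true))

# Hypothetical toy series with an exactly repeating 4-step pattern.
load = np.array([1.0, 2.0, 3.0, 2.0] * 3)
lag = 4
targets = load[lag:]
forecast = persistence_forecast(load, lag)
error = nmae(targets, forecast)
```

On a perfectly periodic series the persistence forecast is exact, which illustrates why such a simple baseline can be hard to beat when load patterns repeat and training data is scarce.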
[LG-25] Self-Adaptive Ising Machines for Constrained Optimization
Link: https://arxiv.org/abs/2501.04971
Authors: Corentin Delacour
Subjects: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Ising machines (IM) are physics-inspired alternatives to von Neumann architectures for solving hard optimization tasks. By mapping binary variables to coupled Ising spins, IMs can naturally solve unconstrained combinatorial optimization problems such as finding maximum cuts in graphs. However, despite their importance in practical applications, constrained problems remain challenging to solve for IMs that require large quadratic energy penalties to ensure the correspondence between energy ground states and constrained optimal solutions. To relax this requirement, we propose a self-adaptive IM that iteratively shapes its energy landscape using a Lagrange relaxation of constraints and avoids prior tuning of penalties. Using a probabilistic-bit (p-bit) IM emulated in software, we benchmark our algorithm with multidimensional knapsack problems (MKP) and quadratic knapsack problems (QKP), the latter being an Ising problem with linear constraints. For QKP with 300 variables, the proposed algorithm finds better solutions than state-of-the-art IMs such as Fujitsu’s Digital Annealer and requires 7,500x fewer samples. Our results show that adapting the energy landscape during the search can speed up IMs for constrained optimization.
[LG-26] Targeted Adversarial Denoising Autoencoders (TADA) for Neural Time Series Filtration AAAI2025
Link: https://arxiv.org/abs/2501.04967
Authors: Benjamin J. Choi(1),Griffin Milsap(2),Clara A. Scholl(2),Francesco Tenore(2),Mattson Ogg(2) ((1) Harvard John A. Paulson School of Engineering and Applied Sciences, Cambridge, MA, United States of America, (2) Johns Hopkins University Applied Physics Laboratory, Laurel, MD, United States of America)
Subjects: Machine Learning (cs.LG)
Comments: [Accepted] Artificial Intelligence for Time Series Analysis (AI4TS): Theory, Algorithms, and Applications @ AAAI 2025, Philadelphia, PA, USA
Abstract:Current machine learning (ML)-based algorithms for filtering electroencephalography (EEG) time series data face challenges related to cumbersome training times, regularization, and accurate reconstruction. To address these shortcomings, we present an ML filtration algorithm driven by a logistic covariance-targeted adversarial denoising autoencoder (TADA). We hypothesize that the expressivity of a targeted, correlation-driven convolutional autoencoder will enable effective time series filtration while minimizing compute requirements (e.g., runtime, model size). Furthermore, we expect that adversarial training with covariance rescaling will minimize signal degradation. To test this hypothesis, a TADA system prototype was trained and evaluated on the task of removing electromyographic (EMG) noise from EEG data in the EEGdenoiseNet dataset, which includes EMG and EEG data from 67 subjects. The TADA filter surpasses conventional signal filtration algorithms across quantitative metrics (Correlation Coefficient, Temporal RRMSE, Spectral RRMSE), and performs competitively against other deep learning architectures at a reduced model size of less than 400,000 trainable parameters. Further experimentation will be necessary to assess the viability of TADA on a wider range of deployment cases.
[LG-27] Open Problems in Machine Unlearning for AI Safety
Link: https://arxiv.org/abs/2501.04952
Authors: Fazl Barez,Tingchen Fu,Ameya Prabhu,Stephen Casper,Amartya Sanyal,Adel Bibi,Aidan O’Gara,Robert Kirk,Ben Bucknall,Tim Fist,Luke Ong,Philip Torr,Kwok-Yan Lam,Robert Trager,David Krueger,Sören Mindermann,José Hernandez-Orallo,Mor Geva,Yarin Gal
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlearning – the ability to selectively forget or suppress specific types of knowledge – has shown promise for privacy and data removal tasks, which has been the primary focus of existing research. More recently, its potential application to AI safety has gained attention. In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear (CBRN) safety. In these contexts, information can be both beneficial and harmful, and models may combine seemingly harmless information for harmful purposes – unlearning this information could strongly affect beneficial uses. We provide an overview of inherent constraints and open problems, including the broader side effects of unlearning dangerous knowledge, as well as previously unexplored tensions between unlearning and existing safety mechanisms. Finally, we investigate challenges related to evaluation, robustness, and the preservation of safety features during unlearning. By mapping these limitations and open challenges, we aim to guide future research toward realistic applications of unlearning within a broader AI safety framework, acknowledging its limitations and highlighting areas where alternative approaches may be required.
[LG-28] A New Perspective on Privacy Protection in Federated Learning with Granular-Ball Computing
Link: https://arxiv.org/abs/2501.04940
Authors: Guannan Lai,Yihui Feng,Xin Yang,Xiaoyu Deng,Hao Yu,Shuyin Xia,Guoyin Wang,Tianrui Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Federated Learning (FL) facilitates collaborative model training while prioritizing privacy by avoiding direct data sharing. However, most existing articles attempt to address challenges within the model’s internal parameters and corresponding outputs, while neglecting to solve them at the input level. To address this gap, we propose a novel framework called Granular-Ball Federated Learning (GrBFL) for image classification. GrBFL diverges from traditional methods that rely on the finest-grained input data. Instead, it segments images into multiple regions with optimal coarse granularity, which are then reconstructed into a graph structure. We designed a two-dimensional binary search segmentation algorithm based on variance constraints for GrBFL, which effectively removes redundant information while preserving key representative features. Extensive theoretical analysis and experiments demonstrate that GrBFL not only safeguards privacy and enhances efficiency but also maintains robust utility, consistently outperforming other state-of-the-art FL methods. The code is available at this https URL.
[LG-29] SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection
Link: https://arxiv.org/abs/2501.04916
Authors: Jake H. Lee,Michael Kiper,David R. Thompson,Philip G. Brodrick
Subjects: Machine Learning (cs.LG)
Comments: 23 pages, 5 figures, in review. Code repository: this https URL
Abstract:Current and upcoming generations of visible-shortwave infrared (VSWIR) imaging spectrometers promise unprecedented capacity to quantify Earth System processes across the globe. However, reliable cloud screening remains a fundamental challenge for these instruments, where traditional spatial and temporal approaches are limited by cloud variability and limited temporal coverage. The Spectroscopic Transformer (SpecTf) addresses these challenges with a spectroscopy-specific deep learning architecture that performs cloud detection using only spectral information (no spatial or temporal data are required). By treating spectral measurements as sequences rather than image channels, SpecTf learns fundamental physical relationships without relying on spatial context. Our experiments demonstrate that SpecTf significantly outperforms the current baseline approach implemented for the EMIT instrument, and performs comparably with other machine learning methods with orders of magnitude fewer learned parameters. Critically, we demonstrate SpecTf’s inherent interpretability through its attention mechanism, revealing physically meaningful spectral features the model has learned. Finally, we present SpecTf’s potential for cross-instrument generalization by applying it to a different instrument on a different platform without modifications, opening the door to instrument agnostic data driven algorithms for future imaging spectroscopy tasks.
[LG-30] Online Continual Learning: A Systematic Literature Review of Approaches, Challenges, and Benchmarks
Link: https://arxiv.org/abs/2501.04897
Authors: Seyed Amir Bidaki,Amir Mohammadkhah,Kiyan Rezaee,Faeze Hassani,Sadegh Eskandari,Maziar Salahi,Mohammad M. Ghassemi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Online Continual Learning (OCL) is a critical area in machine learning, focusing on enabling models to adapt to evolving data streams in real-time while addressing challenges such as catastrophic forgetting and the stability-plasticity trade-off. This study conducts the first comprehensive Systematic Literature Review (SLR) on OCL, analyzing 81 approaches, extracting over 1,000 features (specific tasks addressed by these approaches), and identifying more than 500 components (sub-models within approaches, including algorithms and tools). We also review 83 datasets spanning applications like image classification, object detection, and multimodal vision-language tasks. Our findings highlight key challenges, including reducing computational overhead, developing domain-agnostic solutions, and improving scalability in resource-constrained environments. Furthermore, we identify promising directions for future research, such as leveraging self-supervised learning for multimodal and sequential data, designing adaptive memory mechanisms that integrate sparse retrieval and generative replay, and creating efficient frameworks for real-world applications with noisy or evolving task boundaries. By providing a rigorous and structured synthesis of the current state of OCL, this review offers a valuable resource for advancing this field and addressing its critical challenges and opportunities. The complete SLR methodology steps and extracted data are publicly available through the provided link: this https URL Systematic-Literature-Review-on-Online-Continual-Learning
[LG-31] A Look into How Machine Learning is Reshaping Engineering Models: the Rise of Analysis Paralysis, Optimal yet Infeasible Solutions, and the Inevitable Rashomon Paradox
Link: https://arxiv.org/abs/2501.04894
Authors: MZ Naser
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:The widespread acceptance of empirically derived codal provisions and equations in civil engineering stands in stark contrast to the skepticism facing machine learning (ML) models, despite their shared statistical foundations. This paper examines this philosophical tension through the lens of structural engineering and explores how integrating ML challenges traditional engineering philosophies and professional identities. Recent efforts have documented how ML enhances predictive accuracy, optimizes designs, and analyzes complex behaviors. However, one might also raise concerns about the diminishing role of human intuition and the interpretability of algorithms. To showcase this rarely explored front, this paper presents how ML can be successfully integrated into various engineering problems by means of formulation via deduction, induction, and abduction. Then, this paper identifies three principal paradoxes that could arise when adopting ML: analysis paralysis (increased prediction accuracy leading to a reduced understanding of physical mechanisms), infeasible solutions (optimization resulting in unconventional designs that challenge engineering intuition), and the Rashomon effect (where contradictions in explainability methods and physics arise). This paper concludes by addressing these paradoxes and arguing the need to rethink epistemological shifts in engineering and engineering education and methodologies to harmonize traditional principles with ML.
[LG-32] Multilinear Tensor Low-Rank Approximation for Policy-Gradient Methods in Reinforcement Learning
Link: https://arxiv.org/abs/2501.04879
Authors: Sergio Rozada,Hoi-To Wai,Antonio G. Marques
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement learning (RL) aims to estimate the action to take given a (time-varying) state, with the goal of maximizing a cumulative reward function. Predominantly, there are two families of algorithms to solve RL problems: value-based and policy-based methods, with the latter designed to learn a probabilistic parametric policy from states to actions. Most contemporary approaches implement this policy using a neural network (NN). However, NNs usually face issues related to convergence, architectural suitability, hyper-parameter selection, and underutilization of the redundancies of the state-action representations (e.g. locally similar states). This paper postulates multi-linear mappings to efficiently estimate the parameters of the RL policy. More precisely, we leverage the PARAFAC decomposition to design tensor low-rank policies. The key idea involves collecting the policy parameters into a tensor and leveraging tensor-completion techniques to enforce low rank. We establish theoretical guarantees of the proposed methods for various policy classes and validate their efficacy through numerical experiments. Specifically, we demonstrate that tensor low-rank policy models reduce computational and sample complexities in comparison to NN models while achieving similar rewards.
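The idea of collecting policy parameters into a tensor and constraining it to low rank via PARAFAC can be sketched as follows. This is an illustration only, not the authors' implementation: the two-dimensional discrete state space, the softmax policy class, the sizes, and the factor names `U1`, `U2`, `V` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s1, n_s2, n_actions, rank = 5, 4, 3, 2  # hypothetical sizes

# One factor matrix per tensor dimension (two state dimensions, one action dimension).
U1 = rng.normal(size=(n_s1, rank))
U2 = rng.normal(size=(n_s2, rank))
V = rng.normal(size=(n_actions, rank))

def policy(s1, s2):
    """Softmax policy whose logits are one fiber of a rank-`rank` PARAFAC tensor."""
    # logits[a] = sum_r U1[s1, r] * U2[s2, r] * V[a, r]
    logits = (U1[s1] * U2[s2]) @ V.T
    logits = logits - logits.max()  # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Parameter count: full logit tensor versus its low-rank factors.
full_params = n_s1 * n_s2 * n_actions
low_rank_params = rank * (n_s1 + n_s2 + n_actions)
```

With these toy sizes the factorized model stores 24 numbers instead of 60; the gap widens as the number of state dimensions grows, since the full tensor scales multiplicatively and the factors scale additively.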
[LG-33] Probabilistic Skip Connections for Deterministic Uncertainty Quantification in Deep Neural Networks
Link: https://arxiv.org/abs/2501.04816
Authors: Felix Jimenez,Matthias Katzfuss
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 15 pages, 9 figures
Abstract:Deterministic uncertainty quantification (UQ) in deep learning aims to estimate uncertainty with a single pass through a network by leveraging outputs from the network’s feature extractor. Existing methods require that the feature extractor be both sensitive and smooth, ensuring meaningful input changes produce meaningful changes in feature vectors. Smoothness enables generalization, while sensitivity prevents feature collapse, where distinct inputs are mapped to identical feature vectors. To meet these requirements, current deterministic methods often retrain networks with spectral normalization. Instead of modifying training, we propose using measures of neural collapse to identify an existing intermediate layer that is both sensitive and smooth. We then fit a probabilistic model to the feature vector of this intermediate layer, which we call a probabilistic skip connection (PSC). Through empirical analysis, we explore the impact of spectral normalization on neural collapse and demonstrate that PSCs can effectively disentangle aleatoric and epistemic uncertainty. Additionally, we show that PSCs achieve uncertainty quantification and out-of-distribution (OOD) detection performance that matches or exceeds existing single-pass methods requiring training modifications. By retrofitting existing models, PSCs enable high-quality UQ and OOD capabilities without retraining.
[LG-34] Fast Fine-Grained Equivalence Checking for Neural Decompilers
Link: https://arxiv.org/abs/2501.04811
Authors: Luke Dramko,Claire Le Goues,Edward J. Schwartz
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Software Engineering (cs.SE)
Comments:
Abstract:Neural decompilers are machine learning models that reconstruct the source code from an executable program. Critical to the lifecycle of any machine learning model is an evaluation of its effectiveness. However, existing techniques for evaluating neural decompilation models have substantial weaknesses, especially when it comes to showing the correctness of the neural decompiler’s predictions. To address this, we introduce codealign, a novel instruction-level code equivalence technique designed for neural decompilers. We provide a formal definition of a relation between equivalent instructions, which we term an equivalence alignment. We show how codealign generates equivalence alignments, then evaluate codealign by comparing it with symbolic execution. Finally, we show how the information codealign provides (which parts of the functions are equivalent and how well the variable names match) is substantially more detailed than existing state-of-the-art evaluation metrics, which report unitless numbers measuring similarity.
[LG-35] Efficient and Responsible Adaptation of Large Language Models for Robust and Equitable Top-k Recommendations
Link: https://arxiv.org/abs/2501.04762
Authors: Kirandeep Kaur,Manya Chadha,Vinayak Gupta,Chirag Shah
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2405.00824
Abstract:Conventional recommendation systems (RSs) are typically optimized to enhance performance metrics uniformly across all training samples, inadvertently overlooking the needs of diverse user populations. The performance disparity among various populations can harm the model’s robustness to sub-populations due to the varying user properties. While large language models (LLMs) show promise in enhancing RS performance, their practical applicability is hindered by high costs, inference latency, and degraded performance on long user queries. To address these challenges, we propose a hybrid task allocation framework designed to promote social good by equitably serving all user groups. By adopting a two-phase approach, we promote a strategic assignment of tasks for efficient and responsible adaptation of LLMs. Our strategy works by first identifying the weak and inactive users that receive a suboptimal ranking performance by RSs. Next, we use an in-context learning approach for such users, wherein each user interaction history is contextualized as a distinct ranking task. We evaluate our hybrid framework by incorporating eight different recommendation algorithms and three different LLMs – both open and close-sourced. Our results on three real-world datasets show a significant reduction in weak users and improved robustness to subpopulations without disproportionately escalating costs.
[LG-36] DAREK – Distance Aware Error for Kolmogorov Networks ICASSP25
Link: https://arxiv.org/abs/2501.04757
Authors: Masoud Ataei,Mohammad Javad Khojasteh,Vikas Dhiman
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ICASSP25, 5 pages + 2 pages supplementary material, 3 figures
Abstract:In this paper, we provide distance-aware error bounds for Kolmogorov Arnold Networks (KANs). We call our new error bounds estimator DAREK – Distance Aware Error for Kolmogorov networks. Z. Liu et al. provide error bounds, which may be loose, lack distance-awareness, and are defined only up to an unknown constant of proportionality. We review the error bounds for Newton’s polynomial, which is then generalized to an arbitrary spline, under Lipschitz continuity assumptions. We then extend these bounds to nested compositions of splines, arriving at error bounds for KANs. We evaluate our method by estimating an object’s shape from sparse laser scan points. We use KAN to fit a smooth function to the scans and provide error bounds for the fit. We find that our method is faster than Monte Carlo approaches, and that our error bounds enclose the true obstacle shape reliably.
[LG-37] RadioTransformer: Accurate Radio Map Construction and Coverage Prediction
Link: https://arxiv.org/abs/2501.05190
Authors: Yuxuan Li,Cheng Zhang,Wen Wang,Yongming Huang
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Submitted to IEEE VTC 2025 Spring
Abstract:Radio map, or pathloss map prediction, is a crucial method for wireless network modeling and management. By leveraging deep learning to construct pathloss patterns from geographical maps, an accurate digital replica of the transmission environment could be established with less computational overhead and lower prediction error compared to traditional model-driven techniques. While existing state-of-the-art (SOTA) methods predominantly rely on convolutional architectures, this paper introduces a hybrid transformer-convolution model, termed RadioTransformer, to enhance the accuracy of radio map prediction. The proposed model features a multi-scale transformer-based encoder for efficient feature extraction and a convolution-based decoder for precise pixel-level image reconstruction. Simulation results demonstrate that the proposed scheme significantly improves prediction accuracy, and over a 30% reduction in root mean square error (RMSE) is achieved compared to typical SOTA approaches.
[LG-38] Robust Score Matching
Link: https://arxiv.org/abs/2501.05105
Authors: Richard Schwank,Andrew McCormack,Mathias Drton
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Proposed in Hyvärinen (2005), score matching is a parameter estimation procedure that does not require computation of distributional normalizing constants. In this work we utilize the geometric median of means to develop a robust score matching procedure that yields consistent parameter estimates in settings where the observed data has been contaminated. A special appeal of the proposed method is that it retains convexity in exponential family models. The new method is therefore particularly attractive for non-Gaussian, exponential family graphical models where evaluation of normalizing constants is intractable. Support recovery guarantees for such models when contamination is present are provided. Additionally, support recovery is studied in numerical experiments and on a precipitation dataset. We demonstrate that the proposed robust score matching estimator performs comparably to the standard score matching estimator when no contamination is present but greatly outperforms this estimator in a setting with contamination.
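The geometric median of means named in this abstract can be illustrated on plain location estimation. The sketch below is not the paper's score matching procedure: it only shows the aggregation step, using Weiszfeld iterations as one standard way to compute a geometric median. The function names, block count, and toy contamination are assumptions.

```python
import numpy as np

def geometric_median(points, n_iter=200, eps=1e-9):
    """Weiszfeld iterations for the point minimizing the sum of Euclidean distances."""
    points = np.asarray(points, dtype=float)
    median = points.mean(axis=0)
    for _ in range(n_iter):
        dist = np.linalg.norm(points - median, axis=1)
        weights = 1.0 / np.maximum(dist, eps)  # clamp to avoid division by zero
        new_median = (weights[:, None] * points).sum(axis=0) / weights.sum()
        shift = np.linalg.norm(new_median - median)
        median = new_median
        if shift < eps:
            break
    return median

def geometric_median_of_means(samples, n_blocks):
    """Split the samples into blocks, average each block, take the geometric median."""
    samples = np.asarray(samples, dtype=float)
    blocks = np.array_split(samples, n_blocks)
    block_means = np.stack([block.mean(axis=0) for block in blocks])
    return geometric_median(block_means)

# Hypothetical contaminated data: 200 standard normal points, 5 replaced by outliers.
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 2))
samples[:5] = 1000.0
robust = geometric_median_of_means(samples, n_blocks=20)
naive = samples.mean(axis=0)
```

Because the five outliers fall into a single block, only one of the twenty block means is corrupted, and the geometric median stays near the origin while the plain sample mean is pulled far away.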
[LG-39] Supervised Learning with Evolving Tasks and Performance Guarantees
Link: https://arxiv.org/abs/2501.05089
Authors: Verónica Álvarez,Santiago Mazuelas,Jose A. Lozano
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2310.15974
Abstract:Multiple supervised learning scenarios are composed of a sequence of classification tasks. For instance, multi-task learning and continual learning aim to learn a sequence of tasks that is either fixed or grows over time. Existing techniques for learning tasks that are in a sequence are tailored to specific scenarios, lacking adaptability to others. In addition, most existing techniques consider situations in which the order of the tasks in the sequence is not relevant. However, it is common that tasks in a sequence are evolving in the sense that consecutive tasks often have a higher similarity. This paper presents a learning methodology that is applicable to multiple supervised learning scenarios and adapts to evolving tasks. Unlike existing techniques, we provide computable tight performance guarantees and analytically characterize the increase in the effective sample size. Experiments on benchmark datasets show the performance improvement of the proposed methodology in multiple scenarios and the reliability of the presented performance guarantees.
[LG-40] Non-asymptotic analysis of the performance of the penalized least trimmed squares in sparse models
Link: https://arxiv.org/abs/2501.04946
Authors: Yijun Zuo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: MSC classes: Primary 62J07, 62G35, Secondary 62J99, 62G99
Abstract:The least trimmed squares (LTS) estimator is a renowned robust alternative to the classic least squares estimator and is popular in location, regression, machine learning, and AI literature. Many studies exist on LTS, including its robustness, computation algorithms, extension to non-linear cases, asymptotics, etc. The LTS has been applied in the penalized regression in a high-dimensional real-data sparse-model setting where dimension p (in thousands) is much larger than sample size n (in tens, or hundreds). In such a practical setting, the sample size n often is the count of a sub-population that has a special attribute (e.g. the count of patients of Alzheimer’s, Parkinson’s, Leukemia, or ALS, etc.) among a population with a finite fixed size N. Asymptotic analysis assuming that n tends to infinity is not practically convincing and legitimate in such a scenario. A non-asymptotic or finite sample analysis will be more desirable and feasible. This article establishes some finite sample (non-asymptotic) error bounds for estimating and predicting based on LTS with high probability for the first time.
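For readers unfamiliar with LTS, the objective is the sum of the h smallest squared residuals rather than all n of them. The sketch below evaluates that objective and a penalized variant; the abstract does not say which penalty is used, so the L1 (lasso-type) penalty, the parameter name `lam`, and the toy data are assumptions.

```python
import numpy as np

def lts_objective(beta, X, y, h):
    """Sum of the h smallest squared residuals."""
    residuals_sq = (y - X @ beta) ** 2
    return np.sort(residuals_sq)[:h].sum()

def penalized_lts_objective(beta, X, y, h, lam):
    """LTS objective plus an L1 penalty (the penalty choice is an assumption)."""
    return lts_objective(beta, X, y, h) + lam * np.abs(beta).sum()

# Hypothetical toy data: y = x exactly, except for one gross outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.0, 3.0, 100.0])
beta = np.array([1.0])

trimmed = lts_objective(beta, X, y, h=3)           # outlier residual is discarded
untrimmed = lts_objective(beta, X, y, h=4)         # ordinary sum of squares
penalized = penalized_lts_objective(beta, X, y, h=3, lam=0.5)
```

At the true slope the trimmed objective is zero because the single outlier is excluded, whereas the untrimmed sum of squares is dominated by it.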
[LG-41] Towards understanding the bias in decision trees
Link: https://arxiv.org/abs/2501.04903
Authors: Nathan Phelps,Daniel J. Lizotte,Douglas G. Woolford
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:There is a widespread and longstanding belief that machine learning models are biased towards the majority (or negative) class when learning from imbalanced data, leading them to neglect or ignore the minority (or positive) class. In this study, we show that this belief is not necessarily correct for decision trees, and that their bias can actually be in the opposite direction. Motivated by a recent simulation study that suggested that decision trees can be biased towards the minority class, our paper aims to reconcile the conflict between that study and decades of other works. First, we critically evaluate past literature on this problem, finding that failing to consider the data generating process has led to incorrect conclusions about the bias in decision trees. We then prove that, under specific conditions related to the predictors, decision trees fit to purity and trained on a dataset with only one positive case are biased towards the minority class. Finally, we demonstrate that splits in a decision tree are also biased when there is more than one positive case. Our findings have implications on the use of popular tree-based models, such as random forests.
[LG-42] Optimality and Adaptivity of Deep Neural Features for Instrumental Variable Regression
Link: https://arxiv.org/abs/2501.04898
Authors: Juno Kim,Dimitri Meunier,Arthur Gretton,Taiji Suzuki,Zhu Li
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 46 pages, 1 figure, 2 tables
Abstract:We provide a convergence analysis of deep feature instrumental variable (DFIV) regression (Xu et al., 2021), a nonparametric approach to IV regression using data-adaptive features learned by deep neural networks in two stages. We prove that the DFIV algorithm achieves the minimax optimal learning rate when the target structural function lies in a Besov space. This is shown under standard nonparametric IV assumptions, and an additional smoothness assumption on the regularity of the conditional distribution of the covariate given the instrument, which controls the difficulty of Stage 1. We further demonstrate that DFIV, as a data-adaptive algorithm, is superior to fixed-feature (kernel or sieve) IV methods in two ways. First, when the target function possesses low spatial homogeneity (i.e., it has both smooth and spiky/discontinuous regions), DFIV still achieves the optimal rate, while fixed-feature methods are shown to be strictly suboptimal. Second, comparing with kernel-based two-stage regression estimators, DFIV is provably more data efficient in the Stage 1 samples.
[LG-43] Geophysical inverse problems with measurement-guided diffusion models
Link: https://arxiv.org/abs/2501.04881
Authors: Matteo Ravasi
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments:
Abstract:Solving inverse problems with the reverse process of a diffusion model represents an appealing avenue to produce highly realistic, yet diverse solutions from incomplete and possibly noisy measurements, ultimately enabling uncertainty quantification at scale. However, because of the intractable nature of the score function of the likelihood term (i.e., $\nabla_{\mathbf{x}_t} p(\mathbf{y} | \mathbf{x}_t)$), various samplers have been proposed in the literature that use different (more or less accurate) approximations of such a gradient to guide the diffusion process towards solutions that match the observations. In this work, I consider two sampling algorithms recently proposed under the name of Diffusion Posterior Sampling (DPS) and Pseudo-inverse Guided Diffusion Model (PGDM), respectively. In DPS, the guidance term used at each step of the reverse diffusion process is obtained by applying the adjoint of the modeling operator to the residual obtained from a one-step denoising estimate of the solution. On the other hand, PGDM utilizes a pseudo-inverse operator that originates from the fact that the one-step denoised solution is not assumed to be deterministic, rather modeled as a Gaussian distribution. Through an extensive set of numerical examples on two geophysical inverse problems (namely, seismic interpolation and seismic inversion), I show that two key aspects for the success of any measurement-guided diffusion process are: i) our ability to re-parametrize the inverse problem such that the sought-after model is bounded between -1 and 1 (a pre-requisite for any diffusion model); ii) the choice of the training dataset used to learn the implicit prior that guides the reverse diffusion process. Numerical examples on synthetic and field datasets reveal that PGDM outperforms DPS in both scenarios at limited additional cost.
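The DPS guidance term described in this abstract (the adjoint of the modeling operator applied to the data residual of a one-step denoised estimate) can be sketched for a linear operator. This is a simplified illustration under stated assumptions: the operator is a small explicit matrix that samples two of four traces, `x0_hat` stands in for the output of a denoiser that is not implemented here, and step sizes and any differentiation through the denoiser are omitted.

```python
import numpy as np

def adjoint_residual_guidance(A, y, x0_hat):
    """Apply the adjoint (transpose) of A to the residual y - A @ x0_hat."""
    residual = y - A @ x0_hat
    return A.T @ residual

# Hypothetical interpolation-style operator: keep samples 0 and 2 of a length-4 model.
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
x_true = np.array([1.0, 2.0, 3.0, 4.0])
y = A @ x_true

x0_hat = np.zeros(4)  # placeholder for a one-step denoised estimate
guidance = adjoint_residual_guidance(A, y, x0_hat)
```

For a sampling operator like this one, the guidance only pushes the observed entries toward the data; the unobserved entries receive zero guidance and must be filled in by the learned prior.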
[LG-44] RieszBoost: Gradient Boosting for Riesz Regression
Link: https://arxiv.org/abs/2501.04871
Authors: Kaitlyn J. Lee,Alejandro Schuler
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Abstract:Answering causal questions often involves estimating linear functionals of conditional expectations, such as the average treatment effect or the effect of a longitudinal modified treatment policy. By the Riesz representation theorem, these functionals can be expressed as the expected product of the conditional expectation of the outcome and the Riesz representer, a key component in doubly robust estimation methods. Traditionally, the Riesz representer is estimated indirectly by deriving its explicit analytical form, estimating its components, and substituting these estimates into the known form (e.g., the inverse propensity score). However, deriving or estimating the analytical form can be challenging, and substitution methods are often sensitive to practical positivity violations, leading to higher variance and wider confidence intervals. In this paper, we propose a novel gradient boosting algorithm to directly estimate the Riesz representer without requiring its explicit analytical form. This method is particularly suited for tabular data, offering a flexible, nonparametric, and computationally efficient alternative to existing methods for Riesz regression. Through simulation studies, we demonstrate that our algorithm performs on par with or better than indirect estimation techniques across a range of functionals, providing a user-friendly and robust solution for estimating causal quantities.
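The abstract does not spell out the objective used for direct estimation, so the following is only a sketch of the standard Riesz regression loss from the literature, specialized to the average treatment effect: the empirical mean of `alpha(A, X)**2 - 2 * (alpha(1, X) - alpha(0, X))`. The simulated data, the propensity model, and the function names are all hypothetical, and no boosting is performed. The sketch only checks that the known inverse-propensity representer attains a lower loss than two other candidates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.uniform(0.0, 1.0, n)
a = rng.binomial(1, 0.3 + 0.4 * x)  # hypothetical propensity in [0.3, 0.7]


def alpha_true(a, x):
    """Inverse-propensity form of the ATE Riesz representer (known here by construction)."""
    p = 0.3 + 0.4 * x
    return a / p - (1 - a) / (1 - p)


def riesz_loss(alpha, a, x):
    """Empirical Riesz loss for the ATE functional."""
    ones = np.ones_like(a)
    zeros = np.zeros_like(a)
    return np.mean(alpha(a, x) ** 2 - 2.0 * (alpha(ones, x) - alpha(zeros, x)))


loss_true = riesz_loss(alpha_true, a, x)
loss_scaled = riesz_loss(lambda a, x: 0.5 * alpha_true(a, x), a, x)
loss_zero = riesz_loss(lambda a, x: np.zeros_like(x), a, x)
```

Because the loss depends on a candidate only through its evaluations, any learner that can minimize it (gradient boosting in the paper) can target the representer without its analytical form.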
[LG-45] Deep Transfer Q-Learning for Offline Non-Stationary Reinforcement Learning
Link: https://arxiv.org/abs/2501.04870
Authors: Jinhang Chai,Elynn Chen,Jianqing Fan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:In dynamic decision-making scenarios across business and healthcare, leveraging sample trajectories from diverse populations can significantly enhance reinforcement learning (RL) performance for specific target populations, especially when sample sizes are limited. While existing transfer learning methods primarily focus on linear regression settings, they lack direct applicability to reinforcement learning algorithms. This paper pioneers the study of transfer learning for dynamic decision scenarios modeled by non-stationary finite-horizon Markov decision processes, utilizing neural networks as powerful function approximators and backward inductive learning. We demonstrate that naive sample pooling strategies, effective in regression settings, fail in Markov decision processes. To address this challenge, we introduce a novel "re-weighted targeting procedure" to construct "transferable RL samples" and propose "transfer deep $Q^*$-learning", enabling neural network approximation with theoretical guarantees. We assume that the reward functions are transferable and deal with both situations in which the transition densities are transferable or nontransferable. Our analytical techniques for transfer learning in neural network approximation and transition density transfers have broader implications, extending to supervised transfer learning with neural networks and domain shift scenarios. Empirical experiments on both synthetic and real datasets corroborate the advantages of our method, showcasing its potential for improving decision-making through strategically constructing transferable RL samples in non-stationary reinforcement learning contexts.
[LG-46] Intelligent experiments through real-time AI: Fast Data Processing and Autonomous Detector Control for sPHENIX and future EIC detectors
Link: https://arxiv.org/abs/2501.04845
Authors: J. Kvapil(1),G. Borca-Tasciuc(2),H. Bossi(3),K. Chen(4),Y. Chen(4),Y. Corrales Morales(3),H. Da Costa(1),C. Da Silva(1),C. Dean(3),J. Durham(1),S. Fu(5),C. Hao(6),P. Harris(3),O. Hen(3),H. Jheng(3),Y. Lee(3),P. Li(6),X. Li(1),Y. Lin(1),M. X. Liu(1),V. Loncar(3),J. P. Mitrevski(8),A. Olvera(5),M. L. Purschke(7),J. S. Renck(1),G. Roland(3),J. Schambach(9),Z. Shi(1),N. Tran(8),N. Wuerfel(10),B. Xu(6),D. Yu(11),H. Zhang(6) ((1) Los Alamos National Laboratory, (2) Rensselaer Polytechnic Institute, (3) Massachusetts Institute of Technology, (4) Central China Normal University, (5) University of North Texas, (6) Georgia Institute of Technology, (7) Brookhaven National Laboratory, (8) Fermilab, (9) Oak Ridge National Laboratory, (10) University of Michigan, (11) New Jersey Institute of Technology)
Categories: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex)
Comments: Proceedings for the 42nd International Conference on High Energy Physics (ICHEP2024), 18-24 July 2024, Prague, Czech Republic
Abstract:This R&D project, initiated by the DOE Nuclear Physics AI-Machine Learning initiative in 2022, leverages AI to address data processing challenges in high-energy nuclear experiments (RHIC, LHC, and future EIC). Our focus is on developing a demonstrator for real-time processing of high-rate data streams from sPHENIX experiment tracking detectors. The limitations of a 15 kHz maximum trigger rate imposed by the calorimeters can be negated by intelligent use of streaming technology in the tracking system. The approach efficiently identifies low-momentum rare heavy-flavor events in high-rate p+p collisions (3 MHz), using Graph Neural Network (GNN) and High Level Synthesis for Machine Learning (hls4ml). Success at sPHENIX promises immediate benefits, minimizing resources and accelerating the heavy-flavor measurements. The approach is transferable to other fields. For the EIC, we develop a DIS-electron tagger using Artificial Intelligence - Machine Learning (AI-ML) algorithms for real-time identification, showcasing the transformative potential of AI and FPGA technologies in real-time data processing pipelines for high-energy nuclear and particle experiments.
[LG-47] Quantum Hybrid Support Vector Machines for Stress Detection in Older Adults
Link: https://arxiv.org/abs/2501.04831
Authors: Md Saif Hassan Onim,Travis S. Humble,Himanshu Thapliyal
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Abstract:Stress can increase the possibility of cognitive impairment and decrease the quality of life in older adults. Smart healthcare can deploy quantum machine learning to enable preventive and diagnostic support. This work introduces a unique technique to address stress detection as an anomaly detection problem that uses quantum hybrid support vector machines. With the help of a wearable smartwatch, we mapped baseline sensor reading as normal data and stressed sensor reading as anomaly data using cortisol concentration as the ground truth. We have used quantum computing techniques to explore the complex feature spaces with kernel-based preprocessing. We illustrate the usefulness of our method by doing experimental validation on 40 older adults with the help of the TSST protocol. Our findings highlight that using a limited number of features, quantum machine learning provides improved accuracy compared to classical methods. We also observed that the recall value using quantum machine learning is higher compared to the classical method. The higher recall value illustrates the potential of quantum machine learning in healthcare, as missing anomalies could result in delayed diagnostics or treatment.
[LG-48] Guiding Treatment Strategies: The Role of Adjuvant Anti-Her2 Neu Therapy and Skin/Nipple Involvement in Local Recurrence-Free Survival in Breast Cancer Patients
Link: https://arxiv.org/abs/2501.04724
Authors: Joe Omatoi,Abdul M Mohammed,Dennis Trujillo
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Abstract:This study explores how causal inference models, specifically the Linear Non-Gaussian Acyclic Model (LiNGAM), can extract causal relationships between demographic factors, treatments, conditions, and outcomes from observational patient data, enabling insights beyond correlation. Unlike traditional randomized controlled trials (RCTs), which establish causal relationships within narrowly defined populations, our method leverages broader observational data, improving generalizability. Using over 40 features in the Duke MRI Breast Cancer dataset, we found that Adjuvant Anti-Her2 Neu Therapy increased local recurrence-free survival by 169 days, while Skin/Nipple involvement reduced it by 351 days. These findings highlight the therapy’s importance for Her2-positive patients and the need for targeted interventions for high-risk cases, informing personalized treatment strategies.
[LG-49] A Shape-Based Functional Index for Objective Assessment of Pediatric Motor Function
Link: https://arxiv.org/abs/2501.04721
Authors: Shashwat Kumar,Arafat Rahman,Robert Gutierrez,Sarah Livermon,Allison N. McCrady,Silvia Blemker,Rebecca Scharf,Anuj Srivastava,Laura E. Barnes
Categories: Applications (stat.AP); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: 13 pages
Abstract:Clinical assessments for neuromuscular disorders, such as Spinal Muscular Atrophy (SMA) and Duchenne Muscular Dystrophy (DMD), continue to rely on subjective measures to monitor treatment response and disease progression. We introduce a novel method using wearable sensors to objectively assess motor function during daily activities in 19 patients with DMD, 9 with SMA, and 13 age-matched controls. Pediatric movement data is complex due to confounding factors such as limb length variations in growing children and variability in movement speed. Our approach uses Shape-based Principal Component Analysis to align movement trajectories and identify distinct kinematic patterns, including variations in motion speed and asymmetry. Both DMD and SMA cohorts have individuals with motor function on par with healthy controls. Notably, patients with SMA showed greater activation of the motion asymmetry pattern. We further combined projections on these principal components with partial least squares (PLS) to identify a covariation mode with a canonical correlation of r = 0.78 (95% CI: [0.34, 0.94]) with muscle fat infiltration, the Brooke score (a motor function score), and age-related degenerative changes, proposing a novel motor function index. This data-driven method can be deployed in home settings, enabling better longitudinal tracking of treatment efficacy for children with neuromuscular disorders.
[LG-50] Pressing Intensity: An Intuitive Measure for Pressing in Soccer
Link: https://arxiv.org/abs/2501.04712
Authors: Joris Bekkers
Categories: Applications (stat.AP); Machine Learning (cs.LG)
Abstract:Pressing is a fundamental defensive strategy in football, characterized by applying pressure on the ball-owning team to regain possession. Despite its significance, existing metrics for measuring pressing often lack precision or comprehensive consideration of positional data, player movement and speed. This research introduces an innovative framework for quantifying pressing intensity, leveraging advancements in positional tracking data and components from Spearman’s Pitch Control model. Our method integrates player velocities, movement directions, and reaction times to compute the time required for a defender to intercept an attacker or the ball. This time-to-intercept measure is then transformed into probabilistic values using a logistic function, enabling dynamic and intuitive analysis of pressing situations at the individual frame level. The model captures how every player’s movement influences pressure on the field, offering actionable insights for coaches, analysts, and decision-makers. By providing a robust and interpretable metric, our approach facilitates the identification of pressing strategies, advanced situational analyses, and the derivation of metrics, advancing the analytical capabilities for modern football.
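The two-step computation described in the abstract (a time-to-intercept estimate passed through a logistic function) can be sketched roughly as follows. This is an illustrative simplification, not the paper's model: the target is treated as stationary, the defender is assumed to keep its current velocity during the reaction time and then run straight at top speed, and the values of `reaction_time`, `max_speed`, `time_threshold`, and `sigma` are hypothetical placeholders.

```python
import math


def time_to_intercept(defender_pos, defender_vel, target_pos,
                      reaction_time=0.7, max_speed=7.0):
    """Seconds for a defender to reach a stationary target (simplified)."""
    # Position after coasting at the current velocity during the reaction time.
    rx = defender_pos[0] + defender_vel[0] * reaction_time
    ry = defender_pos[1] + defender_vel[1] * reaction_time
    distance = math.hypot(target_pos[0] - rx, target_pos[1] - ry)
    return reaction_time + distance / max_speed


def pressing_probability(tti, time_threshold=1.5, sigma=0.45):
    """Logistic transform: a shorter time-to-intercept gives a higher value."""
    scale = math.pi / math.sqrt(3.0) / sigma
    return 1.0 / (1.0 + math.exp(-scale * (time_threshold - tti)))


# One frame: a nearby defender and a distant one, both pressing the ball at the origin.
p_near = pressing_probability(time_to_intercept((1.0, 0.0), (0.0, 0.0), (0.0, 0.0)))
p_far = pressing_probability(time_to_intercept((20.0, 0.0), (0.0, 0.0), (0.0, 0.0)))
```

Evaluating these values for every defender and attacker pair in a frame would give a per-frame matrix of pressing values, which matches the frame-level analysis the abstract mentions.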
Information Retrieval
[IR-0] Unraveling the Impact of Visual Complexity on Search as Learning
Link: https://arxiv.org/abs/2501.05289
Authors: Wolfgang Gritz,Anett Hoppe,Ralph Ewerth
Categories: Information Retrieval (cs.IR)
Abstract:Information search has become essential for learning and knowledge acquisition, offering broad access to information and learning resources. The visual complexity of web pages is known to influence search behavior, with previous work suggesting that searchers make evaluative judgments within the first second on a page. However, there is a significant gap in our understanding of how visual complexity impacts searches specifically conducted with a learning intent. This gap is particularly relevant for the development of optimized information retrieval (IR) systems that effectively support educational objectives. To address this research need, we model visual complexity and aesthetics via a diverse set of features, investigating their relationship with search behavior during learning-oriented web sessions. Our study utilizes a publicly available dataset from a lab study where participants learned about thunderstorm formation. Our findings reveal that while content relevance is the most significant predictor for knowledge gain, sessions with less visually complex pages are associated with higher learning success. This observation applies to features associated with the layout of web pages rather than to simpler features (e.g., number of images). The reported results shed light on the impact of visual complexity on learning-oriented searches, informing the design of more effective IR systems for educational contexts. To foster reproducibility, we release our source code (this https URL).