This post lists the latest papers fetched from Arxiv.org on 2025-01-27, updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR.
Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-01-27)
436 papers were updated today, including:
- Natural Language Processing: 65 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 133 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 106 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 146 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Mitigating GenAI-powered Evidence Pollution for Out-of-Context Multimodal Misinformation Detection
[Quick Read]: The paper tackles the threat that generative AI (GenAI) models pose to online information security by producing deceptive content, focusing on evidence pollution in out-of-context multimodal misinformation detection. Existing detectors degrade sharply when reasoning over GenAI-polluted evidence, with performance drops exceeding 9 percentage points. The paper proposes two key strategies: cross-modal evidence reranking and cross-modal claim-evidence reasoning. By re-assessing the relevance of evidence and strengthening cross-modal links between claims and evidence, these strategies substantially improve the robustness of existing detectors against polluted evidence, as experiments on benchmark datasets confirm.
Link: https://arxiv.org/abs/2501.14728
Authors: Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee
Affiliations: Unknown
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 12 pages, 11 figures
Abstract:While large generative artificial intelligence (GenAI) models have achieved significant success, they also raise growing concerns about online information security due to their potential misuse for generating deceptive content. Out-of-context (OOC) multimodal misinformation detection, which often retrieves Web evidence to identify the repurposing of images in false contexts, faces the issue of reasoning over GenAI-polluted evidence to derive accurate predictions. Existing works simulate GenAI-powered pollution at the claim level with stylistic rewriting to conceal linguistic cues, and ignore evidence-level pollution for such information-seeking applications. In this work, we investigate how polluted evidence affects the performance of existing OOC detectors, revealing a performance degradation of more than 9 percentage points. We propose two strategies, cross-modal evidence reranking and cross-modal claim-evidence reasoning, to address the challenges posed by polluted evidence. Extensive experiments on two benchmark datasets show that these strategies can effectively enhance the robustness of existing out-of-context detectors amidst polluted evidence.
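As a rough illustration of what cross-modal evidence reranking involves, the sketch below scores each piece of retrieved evidence against both the claim text and the claim image in a shared embedding space and keeps the top-k. It is a minimal sketch, not the authors' implementation; the embeddings are assumed to come from a CLIP-style encoder, and all names here are placeholders.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rerank_evidence(claim_text_emb, claim_image_emb, evidence, k=5):
    """evidence: list of (embedding, metadata) pairs, with embeddings in the
    same text-image space as the claim; purely illustrative."""
    scored = []
    for emb, meta in evidence:
        # reward agreement with both modalities of the claim
        score = 0.5 * cosine(emb, claim_text_emb) + 0.5 * cosine(emb, claim_image_emb)
        scored.append((score, meta))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:k]
```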
[NLP-1] Comparable Corpora: Opportunities for New Research Directions ACL COLING-2025
[Quick Read]: This paper explores how comparable corpora can drive the field forward and encourages the community to think about their potential applications. Rather than presenting new results, it reviews the history of the area and suggests future research directions, aiming to spark broad reflection and new contributions. The key idea is to guide researchers beyond traditional applications toward new directions and opportunities that can sustain progress in the field.
Link: https://arxiv.org/abs/2501.14721
Authors: Kenneth Church
Affiliations: Northeastern University
Subjects: Computation and Language (cs.CL)
Comments: Keynote in this https URL, workshop associated with Coling-2025
Abstract:Most conference papers present new results, but this paper will focus more on opportunities for the audience to make their own contributions. This paper is intended to challenge the community to think more broadly about what we can do with comparable corpora. We will start with a review of the history, and then suggest new directions for future research. This was a keynote at BUCC-2025, a workshop associated with Coling-2025.
[NLP-2] Do LLMs Provide Consistent Answers to Health-Related Questions across Languages? ECIR2025
[Quick Read]: The paper examines the consistency of large language models (LLMs) when providing health-related information across languages. Because the quality of online health resources varies by language, LLM answers may be inconsistent across languages, risking the spread of healthcare misinformation. The authors extend the HealthFC dataset with Turkish and Chinese translations and categorize health-related questions by disease type, yielding a multilingual health-inquiry dataset. They also propose a prompt-based evaluation workflow that enables sub-dimensional comparisons between two languages through parsing. The study surfaces key challenges in deploying LLM tools in multilingual settings and underscores the need for better cross-lingual alignment to ensure accurate and equitable health information.
Link: https://arxiv.org/abs/2501.14719
Authors: Ipek Baris Schlicht, Zhixue Zhao, Burcu Sayin, Lucie Flek, Paolo Rosso
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: 9 pages. Short paper appeared at 47th European Conference on Information Retrieval (ECIR 2025)
Abstract:Equitable access to reliable health information is vital for public health, but the quality of online health resources varies by language, raising concerns about inconsistencies in Large Language Models (LLMs) for healthcare. In this study, we examine the consistency of responses provided by LLMs to health-related questions across English, German, Turkish, and Chinese. We largely expand the HealthFC dataset by categorizing health-related questions by disease type and broadening its multilingual scope with Turkish and Chinese translations. We reveal significant inconsistencies in responses that could spread healthcare misinformation. Our main contributions are 1) a multilingual health-related inquiry dataset with meta-information on disease categories, and 2) a novel prompt-based evaluation workflow that enables sub-dimensional comparisons between two languages through parsing. Our findings highlight key challenges in deploying LLM-based tools in multilingual contexts and emphasize the need for improved cross-lingual alignment to ensure accurate and equitable healthcare information.
[NLP-3] Towards Better Understanding Table Instruction Tuning: Decoupling the Effects from Data versus Models
[Quick Read]: The paper addresses the lack of consistent performance evaluation for LLMs on table-related tasks: prior work trains different base models on different training data, precluding apples-to-apples comparisons. To fix this, the authors fine-tune base models from the Mistral, OLMo, and Phi families on existing public training datasets, matching or surpassing existing table LLMs and setting a new state of the art on the table question-answering dataset Hitab. The key contribution is a systematic out-of-domain evaluation that decouples the contributions of training data and base model, clarifying their individual effects. The paper also assesses how table-specific instruction tuning affects general-purpose benchmarks, revealing trade-offs between specialization and generalization.
Link: https://arxiv.org/abs/2501.14717
Authors: Naihao Deng, Sheng Zhang, Henghui Zhu, Shuaichen Chang, Jiani Zhang, Alexander Hanbo Li, Chung-Wei Hang, Hideo Kobayashi, Yiqun Hu, Patrick Ng
Affiliations: University of Michigan; AWS AI Labs, New York, USA
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent advances in natural language processing have leveraged instruction tuning to enhance Large Language Models (LLMs) for table-related tasks. However, previous works train different base models with different training data, lacking an apples-to-apples comparison across the result table LLMs. To address this, we fine-tune base models from the Mistral, OLMo, and Phi families on existing public training datasets. Our replication achieves performance on par with or surpassing existing table LLMs, establishing new state-of-the-art performance on Hitab, a table question-answering dataset. More importantly, through systematic out-of-domain evaluation, we decouple the contributions of training data and the base model, providing insight into their individual impacts. In addition, we assess the effects of table-specific instruction tuning on general-purpose benchmarks, revealing trade-offs between specialization and generalization.
[NLP-4] FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing NAACL2025
[Quick Read]: The paper targets efficient deployment of large language models (LLMs) on memory-constrained devices without sacrificing performance. As LLMs spread across NLP applications, deploying them under tight resource budgets has become a key challenge. The authors propose importance-score-based selective pruning of model blocks, replacing pruned blocks with a low-parameter substitution strategy: a principled metric drives a weight-sharing mechanism that reuses unpruned counterparts from the model together with block-specific low-rank adapters. Learning of the replacement blocks is further aided by output feature normalization and an adapter initialization scheme built on low-rank SVD reconstructions. Experiments show substantial gains over existing methods, with state-of-the-art results on 5/6 benchmarks at a 30% compression rate and 6/6 at 40%. The method can also extend smaller models, improving performance on 6/6 benchmarks with only ~0.3% additional training tokens and minimal extra parameters.
Link: https://arxiv.org/abs/2501.14713
Authors: James Seale Smith, Chi-Heng Lin, Shikhar Tuli, Haris Jeelani, Shangqian Gao, Yilin Shen, Hongxia Jin, Yen-Chang Hsu
Affiliations: Samsung Research America
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to NAACL 2025 - Main Conference
Abstract:The rapid proliferation of large language models (LLMs) in natural language processing (NLP) has created a critical need for techniques that enable efficient deployment on memory-constrained devices without compromising performance. We present a method to prune LLMs that selectively prunes model blocks based on an importance score and replaces them with a low-parameter replacement strategy. Specifically, we propose a principled metric to replace each pruned block using a weight-sharing mechanism that leverages unpruned counterparts from the model and block-specific low-rank adapters. Furthermore, we facilitate the learning of these replacement blocks with output feature normalization and an adapter initialization scheme built on low-rank SVD reconstructions. Empirical evaluations demonstrate substantial performance gains over existing methods, achieving state-of-the-art performance on 5/6 benchmarks for a compression rate of 30% and 6/6 benchmarks for a compression rate of 40%. We also demonstrate that our approach can extend smaller models, boosting performance on 6/6 benchmarks using only ~0.3% tokens of extended training with minimal additional parameter costs.
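The adapter initialization from low-rank SVD reconstructions mentioned in the abstract can be pictured as follows; this is a hedged sketch under assumed shapes and rank, not the paper's code.

```python
import torch

def lowrank_init(weight: torch.Tensor, rank: int = 16):
    """Factor a block's weight so that A @ B approximates it (A: out x r, B: r x in)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    sqrt_s = torch.sqrt(S[:rank])
    A = U[:, :rank] * sqrt_s             # (out, rank)
    B = sqrt_s.unsqueeze(1) * Vh[:rank]  # (rank, in)
    return A, B

W = torch.randn(1024, 1024)              # stand-in for a pruned block's weight
A, B = lowrank_init(W, rank=16)
print((torch.norm(W - A @ B) / torch.norm(W)).item())  # relative reconstruction error
```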
[NLP-5] The Karp Dataset NEURIPS2024
[Quick Read]: The paper studies how to understand and evaluate the mathematical reasoning abilities of large language models (LLMs), focusing on NP-completeness reductions. It introduces the Karp dataset, the first dataset of detailed NP-completeness reduction proofs, ranging from simple undergraduate exercises to more challenging reductions from academic papers. Using this dataset, the authors compare state-of-the-art models and show that fine-tuning on the Karp dataset improves reasoning capacity. The key contribution is a dataset dedicated to complex mathematical reasoning, providing a new benchmark for training and evaluating LLMs.
Link: https://arxiv.org/abs/2501.14705
Authors: Mason DiCicco, Eamon Worden, Conner Olsen, Nikhil Gangaram, Daniel Reichman, Neil Heffernan
Affiliations: Worcester Polytechnic Institute
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to the 4th workshop on mathematical reasoning and AI at NeurIPS 2024
Abstract:Understanding the mathematical reasoning capabilities of Large Language Models (LLMs) is a central topic in the study of artificial intelligence. This new domain necessitates the creation of datasets of reasoning tasks for both training and benchmarking the performance of LLMs. To this end, we introduce the Karp dataset: The first dataset composed of detailed proofs of NP-completeness reductions. The reductions vary in difficulty, ranging from simple exercises of undergraduate courses to more challenging reductions from academic papers. We compare the performance of state-of-the-art models on this task and demonstrate the effect of fine-tuning with the Karp dataset on reasoning capacity.
[NLP-6] NLP-based assessment of prescription appropriateness from Italian referrals
[Quick Read]: The paper addresses prescription appropriateness assessment for Italian referrals, where the reason for a prescription is recorded only as free text, making automated comparison with guidelines difficult. The key is a Natural Language Processing (NLP) pipeline that clusters referral texts using embeddings from a transformer-based model, maps clusters to labels, and aligns those labels with existing guidelines. This yields, for the first time, a comprehensive summary of referral reasons and a quantification of their appropriateness. Validated in a case study of 496,971 referrals, the pipeline performs strongly on both referral reasons and appropriateness, offering health authorities a valuable tool to strengthen recommendations and reduce the burden of inappropriate referrals.
Link: https://arxiv.org/abs/2501.14701
Authors: Vittorio Torri, Annamaria Bottelli, Michele Ercolanoni, Olivia Leoni, Francesca Ieva
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Objective: This study proposes a Natural Language Processing pipeline to evaluate prescription appropriateness in Italian referrals, where reasons for prescriptions are recorded only as free text, complicating automated comparisons with guidelines. The pipeline aims to derive, for the first time, a comprehensive summary of the reasons behind these referrals and a quantification of their appropriateness. While demonstrated in a specific case study, the approach is designed to generalize to other types of examinations. Methods: Leveraging embeddings from a transformer-based model, the proposed approach clusters referral texts, maps clusters to labels, and aligns these labels with existing guidelines. We present a case study on a dataset of 496,971 referrals, consisting of all referrals for venous echocolordopplers of the lower limbs between 2019 and 2021 in the Lombardy Region. A sample of 1,000 referrals was manually annotated to validate the results. Results: The pipeline exhibited high performance for referrals’ reasons (Prec=92.43%, Rec=83.28%) and excellent results for referrals’ appropriateness (Prec=93.58%, Rec=91.52%) on the annotated subset. Analysis of the entire dataset identified clusters matching guideline-defined reasons - both appropriate and inappropriate - as well as clusters not addressed in the guidelines. Overall, 34.32% of referrals were marked as appropriate, 34.07% inappropriate, 14.37% likely inappropriate, and 17.24% could not be mapped to guidelines. Conclusions: The proposed pipeline effectively assessed prescription appropriateness across a large dataset, serving as a valuable tool for health authorities. Findings have informed the Lombardy Region’s efforts to strengthen recommendations and reduce the burden of inappropriate referrals.
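The embed-and-cluster stage of such a pipeline might look like the toy sketch below; the embedding model, referral texts, and cluster count are illustrative assumptions, not the paper's actual configuration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

referrals = [
    "sospetta insufficienza venosa",    # suspected venous insufficiency
    "controllo varici arti inferiori",  # follow-up of lower-limb varices
    "edema gamba sinistra",             # left-leg edema
]
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
X = embedder.encode(referrals)                      # one embedding per referral
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
# Each cluster would then be mapped to a label and aligned with
# guideline-defined reasons (appropriate / inappropriate).
print(list(zip(referrals, labels)))
```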
[NLP-7] Rethinking Table Instruction Tuning
[Quick Read]: The paper addresses two gaps in current table-understanding research: prior work overlooks the impact of hyperparameter choices, and lacks a comprehensive evaluation of table LLMs' out-of-domain table understanding and general capabilities. Through systematic analysis, the authors show that hyperparameters such as the learning rate significantly affect both table-specific and general abilities, and that smaller learning rates with fewer training instances can improve table understanding while preserving general capabilities. Building on these findings, they introduce TAMA, a table LLM instruction-tuned from LLaMA 3.1 8B Instruct, which matches or surpasses GPT-3.5 and GPT-4 on table tasks while retaining strong out-of-domain generalization and general capabilities. The key takeaway is that careful hyperparameter selection can cut data-annotation costs and speed up model development.
Link: https://arxiv.org/abs/2501.14693
Authors: Naihao Deng, Rada Mihalcea
Affiliations: University of Michigan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices and lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and reveal significant declines in both out-of-domain table understanding and general capabilities compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the existing table instruction-tuning works, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection.
[NLP-8] State Space Models for Extractive Summarization in Low Resource Scenarios
[Quick Read]: The paper aims to improve extractive summarization in low-resource settings; the core task is selecting the most relevant sentences from a text. The proposed MPoincareSum method combines the Mamba state space model with Poincare compression: Mamba generates semantic representations of reviews and sentences, which are concatenated; Poincare compression selects the most meaningful features; a linear layer then predicts each sentence's relevance to its review; finally, the relevant sentences are paraphrased into the summary. On the Amazon review dataset, MPoincareSum outperforms several existing approaches, as ROUGE scores confirm.
Link: https://arxiv.org/abs/2501.14673
Authors: Nisrine Ait Khayi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Extractive summarization involves selecting the most relevant sentences from a text. Recently, researchers have focused on advancing methods to improve state-of-the-art results in low-resource settings. Motivated by these advancements, we propose the MPoincareSum method. This method applies the Mamba state space model to generate the semantics of reviews and sentences, which are then concatenated. A Poincare compression is used to select the most meaningful features, followed by the application of a linear layer to predict sentence relevance based on the corresponding review. Finally, we paraphrase the relevant sentences to create the final summary. To evaluate the effectiveness of MPoincareSum, we conducted extensive experiments using the Amazon review dataset. The performance of the method was assessed using ROUGE scores. The experimental results demonstrate that MPoincareSum outperforms several existing approaches in the literature.
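The relevance-prediction step (concatenate review and sentence representations, then score with a linear layer) can be sketched as below; the Mamba encoder and Poincare compression are replaced by placeholder tensors, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

d = 256                                       # embedding size (assumed)
scorer = nn.Sequential(nn.Linear(2 * d, 1), nn.Sigmoid())

review_emb = torch.randn(d)                   # stand-in for the review's Mamba encoding
sentence_embs = torch.randn(12, d)            # stand-ins for 12 sentence encodings

pairs = torch.cat([review_emb.expand(12, d), sentence_embs], dim=1)
relevance = scorer(pairs).squeeze(1)          # one relevance score per sentence
top3 = relevance.topk(3).indices              # candidate sentences for the summary
print(top3)
```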
[NLP-9] Investigating the (De)Composition Capabilities of Large Language Models in Natural-to-Formal Language Conversion NAACL2025
[Quick Read]: The paper investigates the decomposition and composition capabilities of large language models (LLMs) in natural-to-formal language conversion (N2F), specifically whether LLMs can cope with compositional gaps and counter-intuitive symbolic names when faced with an unfamiliar formal language. The proposed DEDC framework semi-automatically constructs samples and tasks, enabling decoupled evaluation of decomposition and composition in N2F. Using it, the authors find that LLMs are deficient in both abilities, exhibit a wide range of error types attributable to weaknesses in natural-language understanding and in learning and using symbolic systems, and are affected by both compositional gaps and counter-intuitive symbolic names. The work offers a new lens and detailed analysis for improving these basic capabilities of LLMs in N2F.
Link: https://arxiv.org/abs/2501.14649
Authors: Ziyao Xu, Houfeng Wang
Affiliations: National Key Laboratory for Multimedia Information Processing; School of Computer Science, Peking University
Subjects: Computation and Language (cs.CL)
Comments: Accepted at NAACL 2025 main conference
Abstract:To achieve generalized and robust natural-to-formal language conversion (N2F), large language models (LLMs) need to have strong capabilities of decomposition and composition in N2F when faced with an unfamiliar formal language and be able to cope with compositional gaps and counter-intuitive symbolic names. To investigate whether LLMs have this set of basic capabilities in N2F, we propose the DEDC framework. This framework semi-automatically performs sample and task construction, allowing decoupled evaluation of the set of decomposition and composition capabilities of LLMs in N2F. Based on this framework, we evaluate and analyze the most advanced LLMs, and the main findings include that: (1) the LLMs are deficient in both decomposition and composition; (2) the LLMs show a wide coverage of error types that can be attributed to deficiencies in natural language understanding and the learning and use of symbolic systems; (3) compositional gaps and counter-intuitive symbolic names both affect the decomposition and composition of the LLMs. Our work provides a new perspective for investigating the basic capabilities of decomposition and composition of LLMs in N2F. The detailed analysis of deficiencies and attributions can help subsequent improvements of LLMs.
[NLP-10] Funzac at CoMeDi Shared Task: Modeling Annotator Disagreement from Word-In-Context Perspectives COLING2025
[Quick Read]: The paper studies annotator disagreement in Word-in-Context (WiC) tasks, examining the relationship between contextual meaning and disagreement as part of the CoMeDi shared task. WiC bridges sentence-level semantic representation and annotator judgment variability. The key contributions are three methods: (1) feature enrichment that extends contextual embedding representations with concatenation, element-wise differences, products, cosine similarity, and Euclidean and Manhattan distances; (2) transformation via Adapter blocks to obtain task-specific contextual representations; and (3) classifiers of varying complexity, including ensembles. Methods with enriched, task-specific features perform better; while falling short of the best system on subtask 1 (OGWiC), the approach is competitive with the official results on subtask 2 (DisWiC).
Link: https://arxiv.org/abs/2501.14617
Authors: Olufunke O. Sarumi, Charles Welch, Lucie Flek, Jörg Schlötterer
Affiliations: University of Marburg; McMaster University; University of Bonn; University of Mannheim
Subjects: Computation and Language (cs.CL)
Comments: Accepted to CoMeDi Shared Task at COLING 2025
Abstract:In this work, we evaluate annotator disagreement in Word-in-Context (WiC) tasks exploring the relationship between contextual meaning and disagreement as part of the CoMeDi shared task competition. While prior studies have modeled disagreement by analyzing annotator attributes with single-sentence inputs, this shared task incorporates WiC to bridge the gap between sentence-level semantic representation and annotator judgment variability. We describe three different methods that we developed for the shared task, including a feature enrichment approach that combines concatenation, element-wise differences, products, and cosine similarity, Euclidean and Manhattan distances to extend contextual embedding representations, a transformation by Adapter blocks to obtain task-specific representations of contextual embeddings, and classifiers of varying complexities, including ensembles. The comparison of our methods demonstrates improved performance for methods that include enriched and task-specific features. While the performance of our method falls short in comparison to the best system in subtask 1 (OGWiC), it is competitive to the official evaluation results in subtask 2 (DisWiC).
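The feature enrichment described in the abstract maps directly onto a few lines of numpy; the sketch below is illustrative, not the authors' exact feature set.

```python
import numpy as np

def enrich(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Extend a pair of contextual embeddings with difference, product,
    and distance features before classification."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    euclidean = np.linalg.norm(u - v)
    manhattan = np.abs(u - v).sum()
    return np.concatenate([u, v, u - v, u * v, [cos, euclidean, manhattan]])

u, v = np.random.randn(768), np.random.randn(768)
features = enrich(u, v)       # fed to classifiers / ensembles downstream
print(features.shape)         # (4 * 768 + 3,)
```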
[NLP-11] Idiom Detection in Sorani Kurdish Texts
[Quick Read]: The paper addresses idiom detection in Sorani Kurdish, an NLP task of identifying expressions whose meaning goes beyond the literal interpretation of the words. Despite progress in many languages, Kurdish remains under-studied here, even though idioms matter for tasks such as machine translation and sentiment analysis. The authors cast idiom detection as text classification with deep learning. The key contributions are a dataset of 10,580 sentences covering 101 Sorani Kurdish idioms in diverse contexts, and three models built on it: a KuBERT-based transformer sequence classifier, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM with an attention mechanism. The fine-tuned BERT performed best at nearly 99% accuracy, versus 96.5% for the RCNN and 80% for the BiLSTM, demonstrating the effectiveness of transformer architectures for low-resource languages like Kurdish and laying a foundation for Kurdish NLP.
Link: https://arxiv.org/abs/2501.14528
Authors: Skala Kamaran Omer, Hossein Hassani
Affiliations: University of Kurdistan Hewlêr
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, 8 figures, 7 tables
Abstract:Idiom detection using Natural Language Processing (NLP) is the computerized process of recognizing figurative expressions within a text that convey meanings beyond the literal interpretation of the words. While idiom detection has seen significant progress across various languages, the Kurdish language faces a considerable research gap in this area despite the importance of idioms in tasks like machine translation and sentiment analysis. This study addresses idiom detection in Sorani Kurdish by approaching it as a text classification task using deep learning techniques. To tackle this, we developed a dataset containing 10,580 sentences embedding 101 Sorani Kurdish idioms across diverse contexts. Using this dataset, we developed and evaluated three deep learning models: KuBERT-based transformer sequence classification, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism. The evaluations revealed that the transformer model, the fine-tuned BERT, consistently outperformed the others, achieving nearly 99% accuracy while the RCNN achieved 96.5% and the BiLSTM 80%. These results highlight the effectiveness of Transformer-based architectures in low-resource languages like Kurdish. This research provides a dataset, three optimized models, and insights into idiom detection, laying a foundation for advancing Kurdish NLP.
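Framing idiom detection as sequence classification follows the standard fine-tuning recipe; a minimal sketch is below, with a multilingual BERT checkpoint standing in for KuBERT (an assumption, not the paper's setup).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "bert-base-multilingual-cased"   # placeholder for KuBERT
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

batch = tok(["a sentence that may contain an idiom"],
            return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.softmax(-1))               # P(literal), P(idiomatic)
```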
[NLP-12] WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
[Quick Read]: The paper addresses the shortage of high-quality corpora for training multilingual models on low-resource languages. The authors build a systematic data-processing framework tailored to low-resource languages, covering data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. These steps significantly improve dataset quality and safety while preserving linguistic diversity. The resulting datasets for five languages are fully open-sourced for the research community.
Link: https://arxiv.org/abs/2501.14506
Authors: Jia Yu, Fei Yuan, Rui Min, Jing Yu, Pei Chu, Jiayang Li, Wei Li, Ruijie Zhang, Zhenxiang Li, Zhifei Ren, Dong Zheng, Wenjian Zhang, Yan Teng, Lingyu Meng, ZhenJiang Jin, Jiantao Qiu, ShaSha Wang, Zhongying Tu, Dahua Lin, Yu Wang, Yu Qiao, Yanfeng Wang, Conghui He
Affiliations: Shanghai Artificial Intelligence Laboratory
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. This framework encompasses key stages such as data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Through the implementation of this framework, we have significantly improved both the quality and security of the dataset, while maintaining its linguistic diversity. As of now, data for all five languages have been fully open-sourced. The dataset can be accessed at this https URL, and the GitHub repository is available at this https URL.
[NLP-13] Evaluating and Improving Graph to Text Generation with Large Language Models NAACL2025
[Quick Read]: The paper tackles LLMs' limited ability to interpret graph structures in graph-to-text generation. Despite their broad potential, LLMs struggle to plan over complex graphs, especially those with many triplets. The authors introduce PlanGTG, a new graph-to-text dataset annotated with two sub-tasks: reordering and attribution. Automatic and human evaluations show that PlanGTG markedly improves generated text quality in both few-shot learning and fine-tuning. The key lies in the PlanGTG dataset together with an effective diversity-difficulty-based few-shot sample selection method that optimizes prompting strategies.
Link: https://arxiv.org/abs/2501.14497
Authors: Jie He, Yijun Yang, Wanqiu Long, Deyi Xiong, Victor Gutierrez Basulto, Jeff Z. Pan
Affiliations: School of Informatics, University of Edinburgh, UK; College of Intelligence and Computing, Tianjin University, Tianjin, China; School of Computer Science and Informatics, Cardiff University, UK
Subjects: Computation and Language (cs.CL)
Comments: NAACL 2025
Abstract:Large language models (LLMs) have demonstrated immense potential across various tasks. However, research for exploring and improving the capabilities of LLMs in interpreting graph structures remains limited. To address this gap, we conduct a comprehensive evaluation of prompting current open-source LLMs on graph-to-text generation tasks. Although we explored the optimal prompting strategies and proposed a novel and effective diversity-difficulty-based few-shot sample selection method, we found that the improvements from tuning-free approaches were incremental, as LLMs struggle with planning on complex graphs, particularly those with a larger number of triplets. To further improve LLMs in planning with graph sequences and grounding in truth, we introduce a new graph-to-text dataset, PlanGTG, annotated with two sub-tasks: reordering and attribution. Through extensive automatic and human evaluations, we demonstrate significant improvements in the quality of generated text from both few-shot learning and fine-tuning perspectives using the PlanGTG dataset. Our study paves the way for new research directions in graph-to-text generation. PlanGTG datasets can be found in this https URL.
[NLP-14] RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques
[Quick Read]: The paper asks how to effectively evaluate the critique capabilities of large language models (LLMs). Because critique is open-ended, conventional evaluation struggles to measure it. The authors propose a new benchmark that takes a closed-loop approach: it judges the quality of corrections generated from critiques. The benchmark incorporates self-critique, cross-critique, and iterative critique, features crucial for separating advanced reasoning models from classical ones. Across eight challenging reasoning tasks, classical LLMs lag far behind the advanced reasoning model o1-mini in all critique scenarios, and in self-critique and iterative-critique settings they can even fall below their own baseline capabilities. The benchmark is intended as a valuable reference for future research.
Link: https://arxiv.org/abs/2501.14492
Authors: Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
Affiliations: Cranberry-Lemon University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at this https URL.
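The closed-loop idea — judge a critique by whether the correction it induces fixes the answer — can be sketched schematically. The three llm_* helpers below are hypothetical placeholders for calls to an actual model, not the benchmark's API.

```python
def llm_solve(problem: str) -> str:
    return "stub answer"       # placeholder: call an LLM here

def llm_critique(problem: str, answer: str) -> str:
    return "stub critique"     # placeholder: ask an LLM to critique the answer

def llm_correct(problem: str, answer: str, critique: str) -> str:
    return "stub revision"     # placeholder: ask an LLM to revise using the critique

def critique_score(problem: str, reference: str) -> float:
    answer = llm_solve(problem)
    critique = llm_critique(problem, answer)
    corrected = llm_correct(problem, answer, critique)
    # the critique gets credit only if the resulting correction is right
    return 1.0 if corrected.strip() == reference.strip() else 0.0
```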
[NLP-15] Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter
[Quick Read]: The paper studies how cross-lingual transfer can add training data for NLP tasks in low-resource settings, asking how to choose the cross-lingual data that best improves performance. The key is an analysis covering 266 languages from a wide variety of language families and three popular NLP tasks: POS tagging, dependency parsing, and topic classification. The findings show that the effect of linguistic similarity on transfer performance depends on several factors: the NLP task, the (mono- or multilingual) input representations, and the definition of linguistic similarity, offering new guidance for selecting cross-lingual data across languages and tasks.
Link: https://arxiv.org/abs/2501.14491
Authors: Verena Blaschke, Masha Fedzechkina, Maartje ter Hoeve
Affiliations: Center for Information and Language Processing (CIS), LMU Munich; Munich Center for Machine Learning; Apple
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Cross-lingual transfer is a popular approach to increase the amount of training data for NLP tasks in a low-resource context. However, the best strategy to decide which cross-lingual data to include is unclear. Prior research often focuses on a small set of languages from a few language families and/or a single task. It is still an open question how these findings extend to a wider variety of languages and tasks. In this work, we analyze cross-lingual transfer for 266 languages from a wide variety of language families. Moreover, we include three popular NLP tasks: POS tagging, dependency parsing, and topic classification. Our findings indicate that the effect of linguistic similarity on transfer performance depends on a range of factors: the NLP task, the (mono- or multilingual) input representations, and the definition of linguistic similarity.
[NLP-16] Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing
[Quick Read]: The paper targets the pervasive gender bias in large language models (LLMs) and proposes a systematic remedy; existing debiasing methods either lack a full understanding of the bias mechanism or damage core model capabilities. The authors introduce the CommonWords dataset to systematically evaluate gender bias, and their analysis reveals bias across models along with the neuron circuits behind it, including gender neurons and general neurons. Notably, editing even a few general neurons can disrupt overall capabilities due to hierarchical neuron interactions. Building on these insights, they propose an interpretable neuron-editing method combining logit-based and causal-based strategies to selectively target biased neurons. Experiments on five LLMs show the method reduces gender bias while preserving original capabilities, outperforming existing fine-tuning and editing approaches. The contributions are a new dataset, a detailed analysis of bias mechanisms, and a practical debiasing solution.
Link: https://arxiv.org/abs/2501.14457
Authors: Zeping Yu, Sophia Ananiadou
Affiliations: Department of Computer Science, National Centre for Text Mining, The University of Manchester
Subjects: Computation and Language (cs.CL)
Comments: preprint
Abstract:Large language models (LLMs) often exhibit gender bias, posing challenges for their safe deployment. Existing methods to mitigate bias lack a comprehensive understanding of its mechanisms or compromise the model’s core capabilities. To address these issues, we propose the CommonWords dataset, to systematically evaluate gender bias in LLMs. Our analysis reveals pervasive bias across models and identifies specific neuron circuits, including gender neurons and general neurons, responsible for this behavior. Notably, editing even a small number of general neurons can disrupt the model’s overall capabilities due to hierarchical neuron interactions. Based on these insights, we propose an interpretable neuron editing method that combines logit-based and causal-based strategies to selectively target biased neurons. Experiments on five LLMs demonstrate that our method effectively reduces gender bias while preserving the model’s original capabilities, outperforming existing fine-tuning and editing approaches. Our findings contribute a novel dataset, a detailed analysis of bias mechanisms, and a practical solution for mitigating gender bias in LLMs.
[NLP-17] Domaino1s: Guiding LLM Reasoning for Explainable Answers in High-Stakes Domains
[Quick Read]: The paper addresses the problem that LLMs in high-stakes domains (e.g., financial investment and legal QA) tend to produce brief answers without reasoning or explanation, limiting users' confidence in acting on them. The key is Domaino1s, which strengthens domain reasoning through supervised fine-tuning and tree search. The authors build the CoT-stock-2k and CoT-legal-2k datasets to fine-tune models that activate domain-specific reasoning steps, and propose Selective Tree Exploration to spontaneously explore the solution space and sample optimal reasoning paths. They also introduce PROOF-Score, a new metric for the explainability of domain models that complements traditional accuracy metrics with richer assessment dimensions. Experiments on stock investment recommendation and legal reasoning QA show leading performance and explainability.
Link: https://arxiv.org/abs/2501.14431
Authors: Xu Chu, Zhijie Tan, Hanlin Xue, Guanyu Wang, Tong Mo, Weiping Li
Affiliations: School of Software and Microelectronics, Peking University, Beijing, China
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) are widely applied to downstream domains. However, current LLMs for high-stakes domain tasks, such as financial investment and legal QA, typically generate brief answers without reasoning processes and explanations. This limits users’ confidence in making decisions based on their responses. While original CoT shows promise, it lacks self-correction mechanisms during reasoning. This work introduces Domaino1s, which enhances LLMs’ reasoning capabilities on domain tasks through supervised fine-tuning and tree search. We construct CoT-stock-2k and CoT-legal-2k datasets for fine-tuning models that activate domain-specific reasoning steps based on their judgment. Additionally, we propose Selective Tree Exploration to spontaneously explore solution spaces and sample optimal reasoning paths to improve performance. We also introduce PROOF-Score, a new metric for evaluating domain models’ explainability, complementing traditional accuracy metrics with richer assessment dimensions. Extensive experiments on stock investment recommendation and legal reasoning QA tasks demonstrate Domaino1s’s leading performance and explainability. Our code is available at this https URL.
[NLP-18] DRESSing Up LLM: Efficient Stylized Question-Answering via Style Subspace Editing ICLR2025
[Quick Read]: The paper addresses the shortcomings of prompting and fine-tuning for generating stylized LLM responses: they are either insufficient for complex style adaptation or computationally expensive, particularly for tasks like NPC creation or character role-playing. The proposed DRESS exploits the over-parameterized nature of LLMs to disentangle a style-relevant subspace within the model's representation space and performs representation editing there, minimizing the impact on the original semantics. By applying adaptive editing strengths, it dynamically adjusts the steering vectors in the style subspace to preserve both stylistic fidelity and semantic integrity. DRESS is a lightweight, train-free solution for flexible and effective style control, well suited to building stylized conversational agents.
Link: https://arxiv.org/abs/2501.14371
Authors: Xinyu Ma, Yifeng Xu, Yang Lin, Tianlong Wang, Xu Chu, Xin Gao, Junfeng Zhao, Yasha Wang
Affiliations: School of Computer Science, Peking University; Center on Frontiers of Computing Studies, Peking University; National Research and Engineering Center of Software Engineering, Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICLR 2025 Accepted
Abstract:We introduce DRESS, a novel approach for generating stylized large language model (LLM) responses through representation editing. Existing methods like prompting and fine-tuning are either insufficient for complex style adaptation or computationally expensive, particularly in tasks like NPC creation or character role-playing. Our approach leverages the over-parameterized nature of LLMs to disentangle a style-relevant subspace within the model’s representation space to conduct representation editing, ensuring a minimal impact on the original semantics. By applying adaptive editing strengths, we dynamically adjust the steering vectors in the style subspace to maintain both stylistic fidelity and semantic integrity. We develop two stylized QA benchmark datasets to validate the effectiveness of DRESS, and the results demonstrate significant improvements compared to baseline methods such as prompting and ITI. In short, DRESS is a lightweight, train-free solution for enhancing LLMs with flexible and effective style control, making it particularly useful for developing stylized conversational agents. Codes and benchmark datasets are available at this https URL.
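Representation editing with a steering vector reduces to adding a subspace-constrained offset to a hidden state; the sketch below is a simplification under assumed dimensions, with a random basis standing in for the learned style subspace.

```python
import torch

d, r = 4096, 8                              # hidden size and subspace rank (assumed)
V = torch.linalg.qr(torch.randn(d, r)).Q    # orthonormal basis for the style subspace
direction = torch.randn(r)                  # steering direction inside the subspace

def edit(h: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """h: (d,) hidden state at some layer; returns the style-edited state."""
    present = (V.T @ h).norm()              # how much style the state already carries
    strength = alpha / (1.0 + present)      # adaptive strength: edit less if already styled
    return h + strength * (V @ direction)   # components outside the subspace are untouched

h = torch.randn(d)
print((edit(h) - h).norm().item())          # magnitude of the applied style shift
```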
[NLP-19] Chain-of-Retrieval Augmented Generation
[Quick Read]: The paper addresses the limitation of conventional retrieval-augmented generation (RAG), which performs a single retrieval before generation and therefore handles complex queries poorly when that retrieval is imperfect, especially in multi-step reasoning and multi-hop QA. CoRAG (Chain-of-Retrieval Augmented Generation) lets the model reformulate the query step by step based on an evolving state, retrieving and reasoning repeatedly before producing the final answer. The key is using rejection sampling to automatically generate intermediate retrieval chains, augmenting existing RAG datasets that only provide the correct final answer; at test time, various decoding strategies control the length and number of sampled retrieval chains to scale test-time compute. CoRAG clearly outperforms strong baselines on multi-hop QA and sets a new state of the art across knowledge-intensive tasks on the KILT benchmark.
Link: https://arxiv.org/abs/2501.14342
Authors: Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei
Affiliations: Microsoft Corporation; Renmin University of China
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 18 pages
Abstract:This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results. In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state. To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer. At test time, we propose various decoding strategies to scale the model’s test-time compute by controlling the length and number of sampled retrieval chains. Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question answering tasks, where we observe more than 10 points improvement in EM score compared to strong baselines. On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.
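At a high level, chain-of-retrieval decoding is a retrieve-reason-reformulate loop with a step budget. The sketch below conveys the control flow only; retrieve() and llm() are stand-ins, not CoRAG's implementation, and the prompt format is invented for illustration.

```python
def retrieve(query: str) -> list[str]:
    return [f"passage about {query}"]        # stand-in retriever

def llm(prompt: str) -> str:
    return "FINAL: example answer"           # stand-in generator

def chain_of_retrieval(question: str, max_steps: int = 4) -> str:
    chain, query = [], question
    for _ in range(max_steps):
        passages = retrieve(query)
        chain.append((query, passages))
        out = llm(f"question={question}\nchain={chain}\n"
                  "Reply 'FINAL: <answer>' or 'QUERY: <follow-up question>'")
        if out.startswith("FINAL:"):
            return out[len("FINAL:"):].strip()
        query = out[len("QUERY:"):].strip()  # reformulated sub-query for the next hop
    return llm(f"Answer now. question={question}\nchain={chain}")
```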
[NLP-20] Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity
[Quick Read]: The paper examines the challenge of consistent model performance across domains, specifically how fine-tuning on LLM-generated data affects cross-domain generalization. Compared with fine-tuning on ground-truth data, LLM-generated data not only improves the target task but also reduces out-of-domain (OOD) degradation. The key finding is that high-perplexity tokens are less prevalent in LLM-generated sequences, which underlies the improved OOD robustness; masking high-perplexity tokens in ground-truth data achieves similar OOD preservation. Experiments across architectures and scales (Gemma2-2B, Mistral-7B, Llama3-8B) corroborate the findings, offering new insight for more robust fine-tuning strategies.
Link: https://arxiv.org/abs/2501.14315
Authors: Chao-Chung Wu, Zhi Rui Tam, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen
Affiliations: Appier AI Research; National Taiwan University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Maintaining consistent model performance across domains is a fundamental challenge in machine learning. While recent work has explored using LLM-generated data for fine-tuning, its impact on cross-domain generalization remains poorly understood. In this paper, we present a systematic analysis revealing that fine-tuning with LLM-generated data not only improves target task performance but also reduces out-of-domain (OOD) degradation compared to fine-tuning with ground truth data. Through analyzing the data sequence in tasks of various domains, we demonstrate that this enhanced OOD robustness stems from a reduced prevalence of high perplexity tokens in LLM-generated sequences. Following this hypothesis we showed that masking high perplexity tokens in ground truth training data also achieves similar OOD preservation comparable to using LLM-generated data. Extensive experiments across diverse model architectures and scales, including Gemma2-2B, Mistral-7B and Llama3-8B, corroborate the consistency of our findings. To the best of our knowledge, this work provides the first mechanistic explanation for the superior OOD robustness conferred by LLM-generated training data, offering valuable insights for developing more robust fine-tuning strategies.
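The masking experiment described above — drop high-perplexity tokens from the fine-tuning labels — can be sketched with a causal LM; the checkpoint and threshold below are assumptions (label -100 is ignored by the Hugging Face loss).

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")       # small stand-in model
lm = AutoModelForCausalLM.from_pretrained("gpt2")

text = "a ground-truth training example goes here"
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    logits = lm(ids).logits[:, :-1]               # token t predicts token t+1
    nll = torch.nn.functional.cross_entropy(
        logits.transpose(1, 2), ids[:, 1:], reduction="none")

labels = ids.clone()
labels[:, 0] = -100                 # first token has no prediction target
labels[:, 1:][nll > 4.0] = -100     # mask high-perplexity tokens (threshold assumed)
# `labels` would then replace `ids` as the fine-tuning targets.
```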
[NLP-21] Fast Think-on-Graph: Wider, Deeper and Faster Reasoning of Large Language Model on Knowledge Graph
[Quick Read]: The paper targets two weaknesses of existing Graph Retrieval Augmented Generation (GRAG): (1) simple paradigms often fail on complex problems because the correlations captured from knowledge graphs (KGs) are narrow and shallow; (2) methods tightly coupled with KGs are computationally expensive and slow when the graph is dense. Fast Think-on-Graph (FastToG) lets large language models (LLMs) think "community by community" within KGs, using community detection to capture deeper correlations and two-stage community pruning (coarse and fine) for faster retrieval. The authors also develop two Community-to-Text methods that render the graph structure of communities as text for better understanding by LLMs. Experiments show higher accuracy, faster reasoning, and better explainability than prior work.
Link: https://arxiv.org/abs/2501.14300
Authors: Xujian Liang, Zhaoquan Gu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:
Abstract:Graph Retrieval Augmented Generation (GRAG) is a novel paradigm that takes the naive RAG system a step further by integrating graph information, such as knowledge graphs (KGs), into large-scale language models (LLMs) to mitigate hallucination. However, existing GRAG still encounters limitations: 1) simple paradigms usually fail on complex problems due to the narrow and shallow correlations captured from KGs; 2) methods strongly coupled with KGs tend to incur high computation cost and be time-consuming if the graph is dense. In this paper, we propose Fast Think-on-Graph (FastToG), an innovative paradigm for enabling LLMs to think "community by community" within KGs. To do this, FastToG employs community detection for deeper correlation capture and two-stage community pruning (coarse and fine) for faster retrieval. Furthermore, we also develop two Community-to-Text methods to convert the graph structure of communities into textual form for better understanding by LLMs. Experimental results demonstrate the effectiveness of FastToG, showcasing higher accuracy, faster reasoning, and better explainability compared to the previous works.
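The community step can be pictured with off-the-shelf tools: detect communities in a KG neighborhood, then render each one as text for the LLM. A toy sketch follows; the graph and the naive edge-listing "graph2text" are assumptions, not FastToG's actual methods.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()                 # stand-in for a KG neighborhood
communities = greedy_modularity_communities(G)

def community_to_text(G: nx.Graph, nodes) -> str:
    # naive rendering: list the relations (edges) inside the community
    return "; ".join(f"{u} -- {v}" for u, v in G.subgraph(nodes).edges())

for i, c in enumerate(communities[:3]):
    print(f"community {i}: {community_to_text(G, c)}")
```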
[NLP-22] Examining Alignment of Large Language Models through Representative Heuristics: The Case of Political Stereotypes ICLR2025
[Quick Read]: The paper studies the alignment of large language models (LLMs) with human intentions and values on political issues, particularly how the models deviate from empirical positions, quantifying those deviations and identifying the conditions that cause them. Drawing on representativeness heuristics from cognitive science, it examines how LLMs exaggerate particular parties' positions, showing stronger stereotyping than humans: while LLMs can mimic parties' stances, they often inflate them beyond human respondents, overemphasizing representativeness. The authors propose prompt-based mitigation strategies that effectively reduce the influence of representativeness heuristics on LLM responses.
Link: https://arxiv.org/abs/2501.14294
Authors: Sullam Jeoung, Yubin Ge, Haohan Wang, Jana Diesner
Affiliations: University of Illinois at Urbana-Champaign; Amazon AWS; Technical University of Munich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICLR 2025
Abstract:Examining the alignment of large language models (LLMs) has become increasingly important, particularly when these systems fail to operate as intended. This study explores the challenge of aligning LLMs with human intentions and values, with specific focus on their political inclinations. Previous research has highlighted LLMs’ propensity to display political leanings, and their ability to mimic certain political parties’ stances on various issues. However, the extent and conditions under which LLMs deviate from empirical positions have not been thoroughly examined. To address this gap, our study systematically investigates the factors contributing to LLMs’ deviations from empirical positions on political issues, aiming to quantify these deviations and identify the conditions that cause them. Drawing on cognitive science findings related to representativeness heuristics – where individuals readily recall the representative attribute of a target group in a way that leads to exaggerated beliefs – we scrutinize LLM responses through this heuristics lens. We conduct experiments to determine how LLMs exhibit stereotypes by inflating judgments in favor of specific political parties. Our results indicate that while LLMs can mimic certain political parties’ positions, they often exaggerate these positions more than human respondents do. Notably, LLMs tend to overemphasize representativeness to a greater extent than humans. This study highlights the susceptibility of LLMs to representativeness heuristics, suggesting potential vulnerabilities to political stereotypes. We propose prompt-based mitigation strategies that demonstrate effectiveness in reducing the influence of representativeness in LLM responses.
[NLP-23] A Comprehensive Framework for Semantic Similarity Detection Using Transformer Architectures and Enhanced Ensemble Techniques
[Quick Read]: The paper tackles the difficulty of detecting AI-generated text in short-context documents, where limited context hampers accurate classification. The key is a new teacher-student model with domain adaptation and data augmentation: (1) the teacher combines DeBERTa-v3-large and Mamba-790m and learns semantic knowledge through domain-specific fine-tuning; (2) the student handles short-context text more efficiently; (3) a Mean Squared Error (MSE) loss guides the student's learning, improving both accuracy and efficiency; and (4) augmentation such as spelling correction and error injection increases robustness. Experiments show the approach beats baseline methods for real-time AI-generated text detection and other text classification tasks.
Link: https://arxiv.org/abs/2501.14288
Authors: Lifu Gao, Qi Zhang, Ziwei Liu
Affiliations: Cornell University; University of Chinese Academy of Sciences; University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Detecting AI-generated text, especially in short-context documents, is difficult because there is not enough context for accurate classification. This paper presents a new teacher-student model that uses domain adaptation and data augmentation to solve these problems. The teacher model, which combines DeBERTa-v3-large and Mamba-790m, learns semantic knowledge through domain-specific fine-tuning. The student model handles short-context text more efficiently. The system uses a Mean Squared Error (MSE) loss function to guide the student’s learning, improving both accuracy and efficiency. Also, data augmentation methods like spelling correction and error injection make the model more robust. Experimental results show that this approach works better than baseline methods, proving its usefulness for real-time AI-generated text detection and other text classification tasks.
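The teacher-student objective here is ordinary MSE distillation; the toy sketch below uses placeholder networks and dimensions, not the paper's DeBERTa/Mamba teacher.

```python
import torch
import torch.nn as nn

d = 768
teacher = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))
student = nn.Linear(d, 1)                  # smaller model for short-context text

x = torch.randn(32, d)                     # a batch of text embeddings
with torch.no_grad():
    target = teacher(x)                    # teacher's scores (frozen)
loss = nn.functional.mse_loss(student(x), target)
loss.backward()                            # gradients flow only into the student
```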
[NLP-24] Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation
[Quick Read]: The paper addresses the training and evaluation challenges LLMs face on Olympiad-level math: existing datasets are limited in size and quality, and current benchmarks are prone to contamination, making evaluations unreliable. The authors build an automated pipeline over the rich resources of the Art of Problem Solving (AoPS) forum to extract high-quality question-answer pairs, yielding AoPS-Instruct, a dataset of over 600,000 QA pairs; fine-tuning LLMs on it significantly improves their reasoning across benchmarks. They also introduce LiveAoPSBench, an evolving, timestamped evaluation set derived from the latest forum data that resists contamination and provides a reliable benchmark for assessing LLM performance. The key is leveraging AoPS's community-driven content to automatically produce large-scale, high-quality training and evaluation data for advanced mathematical reasoning.
Link: https://arxiv.org/abs/2501.14275
Authors: Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao
Affiliations: University of British Columbia; Vector Institute for AI; Canada CIFAR AI Chair; NSERC CRC Chair
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning, offering valuable insights into the capabilities and limitations of LLMs in this domain. Our benchmark and code are available at this https URL
[NLP-25] Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors
[Quick Read]: The paper addresses the safety and trustworthiness of LLMs in real-world use, in particular the lack of defenses against multi-turn attacks: existing multi-turn methods rely on static patterns or predefined logical chains and cannot capture the dynamic strategies adversaries use during attacks. The proposed Siren is a learning-based multi-turn attack framework that simulates real-world human jailbreak behaviors in three stages: (1) training-set construction using Turn-Level LLM feedback (Turn-MF), (2) post-training attackers with supervised fine-tuning (SFT) and direct preference optimization (DPO), and (3) interactions between the attacking and target LLMs. Siren reaches a 90% attack success rate (ASR) with LLaMA-3-8B against Gemini-1.5-Pro and 70% with Mistral-7B against GPT-4o, far exceeding single-turn baselines, while using fewer turns and decomposition strategies better aligned semantically with attack goals.
Link: https://arxiv.org/abs/2501.14250
Authors: Yi Zhao, Youzhi Zhang
Affiliations: Department of Computing, The Hong Kong Polytechnic University; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:Large language models (LLMs) are widely used in real-world applications, raising concerns about their safety and trustworthiness. While red-teaming with jailbreak prompts exposes the vulnerabilities of LLMs, current efforts focus primarily on single-turn attacks, overlooking the multi-turn strategies used by real-world adversaries. Existing multi-turn methods rely on static patterns or predefined logical chains, failing to account for the dynamic strategies during attacks. We propose Siren, a learning-based multi-turn attack framework designed to simulate real-world human jailbreak behaviors. Siren consists of three stages: (1) training set construction utilizing Turn-Level LLM feedback (Turn-MF), (2) post-training attackers with supervised fine-tuning (SFT) and direct preference optimization (DPO), and (3) interactions between the attacking and target LLMs. Experiments demonstrate that Siren achieves an attack success rate (ASR) of 90% with LLaMA-3-8B as the attacker against Gemini-1.5-Pro as the target model, and 70% with Mistral-7B against GPT-4o, significantly outperforming single-turn baselines. Moreover, Siren with a 7B-scale model achieves performance comparable to a multi-turn baseline that leverages GPT-4o as the attacker, while requiring fewer turns and employing decomposition strategies that are better semantically aligned with attack goals. We hope Siren inspires the development of stronger defenses against advanced multi-turn jailbreak attacks under realistic scenarios. Code is available at this https URL. Warning: This paper contains potentially harmful text.
[NLP-26] Humanity's Last Exam
[Quick Read]: The paper addresses the saturation of popular benchmarks: LLMs now exceed 90% accuracy on tests like MMLU, so those benchmarks no longer measure state-of-the-art capabilities informatively. Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed as the final closed-ended academic benchmark of its kind with broad subject coverage: 3,000 multiple-choice and short-answer questions across dozens of subjects including mathematics, the humanities, and the natural sciences, developed by subject-matter experts worldwide and suitable for automated grading. Every question has an unambiguous, easily verifiable answer that cannot be quickly found via internet retrieval. State-of-the-art LLMs show low accuracy and calibration on HLE, exposing a large gap between current models and expert humans on closed-ended academic questions; HLE is released publicly to inform research and policymaking with a clear understanding of model capabilities.
Link: https://arxiv.org/abs/2501.14249
Authors: Long Phan,Alice Gatti,Ziwen Han,Nathaniel Li,Josephina Hu,Hugh Zhang,Sean Shi,Michael Choi,Anish Agrawal,Arnav Chopra,Adam Khoja,Ryan Kim,Jason Hausenloy,Oliver Zhang,Mantas Mazeika,Daron Anderson,Tung Nguyen,Mobeen Mahmood,Fiona Feng,Steven Y. Feng,Haoran Zhao,Michael Yu,Varun Gangal,Chelsea Zou,Zihan Wang,Jessica P. Wang,Pawan Kumar,Oleksandr Pokutnyi,Robert Gerbicz,Serguei Popov,John-Clark Levin,Mstyslav Kazakov,Johannes Schmitt,Geoff Galgon,Alvaro Sanchez,Yongki Lee,Will Yeadon,Scott Sauers,Marc Roth,Chidozie Agu,Søren Riis,Fabian Giska,Saiteja Utpala,Zachary Giboney,Gashaw M. Goshu,Joan of Arc Xavier,Sarah-Jane Crowson,Mohinder Maheshbhai Naiya,Noah Burns,Lennart Finke,Zerui Cheng,Hyunwoo Park,Francesco Fournier-Facio,John Wydallis,Mark Nandor,Ankit Singh,Tim Gehrunger,Jiaqi Cai,Ben McCarty,Darling Duclosel,Jungbae Nam,Jennifer Zampese,Ryan G. Hoerr,Aras Bacho,Gautier Abou Loume,Abdallah Galal,Hangrui Cao,Alexis C Garretson,Damien Sileo,Qiuyu Ren,Doru Cojoc,Pavel Arkhipov,Usman Qazi,Lianghui Li,Sumeet Motwani,Christian Schroeder de Witt,Edwin Taylor,Johannes Veith,Eric Singer,Taylor D. Hartman,Paolo Rissone,Jaehyeok Jin,Jack Wei Lun Shi,Chris G. Willcocks,Joshua Robinson,Aleksandar Mikov,Ameya Prabhu,Longke Tang,Xavier Alapont,Justine Leon Uro,Kevin Zhou,Emily de Oliveira Santos,Andrey Pupasov Maksimov,Edward Vendrow,Kengo Zenitani,Julien Guillod,Yuqi Li,Joshua Vendrow,Vladyslav Kuchkin,Ng Ze-An
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 25 pages, 6 figures
Abstract:Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at this https URL.
[NLP-27] Multi-agent KTO: Reinforcing Strategic Interactions of Large Language Model in Language Game
[Quick Read]: The paper asks how to move toward Artificial General Intelligence (AGI) with AI agents that not only make strategic decisions but also communicate flexibly and meaningfully. Inspired by Wittgenstein's language-game theory, the key is Multi-agent Kahneman-Tversky's Optimization (MaKTO): agents learn through in-context interaction rather than traditional multi-stage frameworks that separate decision-making from language expression. MaKTO engages diverse models in extensive gameplay of Werewolf, a social deduction game testing language understanding, strategic interaction, and adaptability, to generate unpaired desirable and unacceptable responses, then applies KTO to refine the model's decision-making. In 9-player Werewolf, MaKTO achieves a 61% average win rate across models, beating GPT-4o and two-stage RL agents by relative improvements of 23.0% and 10.9%; it also wins 60% against expert players and shows only 49% detectability in Turing-style blind tests, demonstrating superior decision-making, strategic adaptation, and natural language generation in complex social deduction games.
Link: https://arxiv.org/abs/2501.14225
Authors: Rong Ye, Yongxin Zhang, Yikai Zhang, Haoyu Kuang, Zhongyu Wei, Peng Sun
Affiliations: Bytedance; Fudan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Preprint. Code and data will be available at this https URL
Abstract:Achieving Artificial General Intelligence (AGI) requires AI agents that can not only make strategic decisions but also engage in flexible and meaningful communication. Inspired by Wittgenstein’s language game theory in Philosophical Investigations, we propose that language agents can learn through in-context interaction rather than traditional multi-stage frameworks that separate decision-making from language expression. Using Werewolf, a social deduction game that tests language understanding, strategic interaction, and adaptability, we develop the Multi-agent Kahneman Tversky’s Optimization (MaKTO). MaKTO engages diverse models in extensive gameplay to generate unpaired desirable and unacceptable responses, then employs KTO to refine the model’s decision-making process. In 9-player Werewolf games, MaKTO achieves a 61% average win rate across various models, outperforming GPT-4o and two-stage RL agents by relative improvements of 23.0% and 10.9%, respectively. Notably, MaKTO also demonstrates human-like performance, winning 60% against expert players and showing only 49% detectability in Turing-style blind tests. These results showcase MaKTO’s superior decision-making, strategic adaptation, and natural language generation in complex social deduction games.
zh
[NLP-28] Test-Time Code-Switching for Cross-lingual Aspect Sentiment Triplet Extraction
【速读】: 该论文试图解决方面情感三元组提取(Aspect Sentiment Triplet Extraction, ASTE)任务中的跨语言迁移问题,特别是在低资源语言中的应用。当前的代码切换方法在术语边界检测和词典外问题上存在不足。论文提出了一种新颖的测试时代码切换(Test-Time Code-SWitching, TT-CSW)框架,旨在弥合双语训练阶段和单语测试阶段之间的差距。解决方案的关键在于:1)在训练阶段,基于双语代码切换的训练数据开发生成模型,能够为双语输入生成双语ASTE三元组;2)在测试阶段,采用基于对齐的代码切换技术进行测试时增强。通过在跨语言ASTE数据集上的广泛实验,验证了该方法的有效性,平均加权F1分数提高了3.7%,并且在多个语言数据集上超越了ChatGPT和GPT-4的表现。
链接: https://arxiv.org/abs/2501.14144
作者: Dongming Sheng,Kexin Han,Hao Li,Yan Zhang,Yucheng Huang,Jun Lang,Wenqiang Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Aspect Sentiment Triplet Extraction (ASTE) is a thriving research area with impressive outcomes being achieved on high-resource languages. However, the application of cross-lingual transfer to the ASTE task has been relatively unexplored, and current code-switching methods still suffer from term boundary detection issues and out-of-dictionary problems. In this study, we introduce a novel Test-Time Code-SWitching (TT-CSW) framework, which bridges the gap between the bilingual training phase and the monolingual test-time prediction. During training, a generative model is developed based on bilingual code-switched training data and can produce bilingual ASTE triplets for bilingual inputs. In the testing stage, we employ an alignment-based code-switching technique for test-time augmentation. Extensive experiments on cross-lingual ASTE datasets validate the effectiveness of our proposed method. We achieve an average improvement of 3.7% in terms of weighted-averaged F1 in four datasets with different languages. Additionally, we set a benchmark using ChatGPT and GPT-4, and demonstrate that even smaller generative models fine-tuned with our proposed TT-CSW framework surpass ChatGPT and GPT-4 by 14.2% and 5.0% respectively.
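下面用一个极简的Python片段示意"基于对齐的代码切换"在测试时如何构造双语输入(概念性示意,非论文实现;align_dict为假设的双语对齐词典,真实系统中的对齐通常由词对齐工具获得):

```python
from typing import Dict, List

def code_switch(tokens: List[str], align_dict: Dict[str, str]) -> List[str]:
    """把句中可对齐的词替换为另一语言的对应词, 生成代码切换版本。"""
    return [align_dict.get(tok, tok) for tok in tokens]

sentence = "the battery life is great".split()
align_dict = {"battery": "电池", "great": "很棒"}   # 假设性对齐词典
print(code_switch(sentence, align_dict))
# ['the', '电池', 'life', 'is', '很棒']
# 测试时可将原句与代码切换句分别送入双语生成模型抽取三元组, 再聚合两组预测
```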
zh
[NLP-29] Autonomous Structural Memory Manipulation for Large Language Models Using Hierarchical Embedding Augmentation
【速读】: 该论文旨在解决传统静态内存架构在处理复杂语言输入和多任务场景时面临的局限性,特别是在计算效率、上下文对齐和任务泛化能力方面的挑战。解决方案的关键在于引入了分层嵌入增强(hierarchical embedding augmentation)和自主结构内存操作(autonomous structural memory manipulation)。分层嵌入通过多级语义结构重新定义令牌表示,增强了模型对复杂语言输入的适应性,并促进了任务泛化。自主结构内存操作则通过动态内存重新分配机制,优先处理关键上下文特征,同时抑制不相关信息,从而实现了对长输入序列的高效处理。这些策略不仅提升了计算效率,还增强了模型在不同任务中的准确性和可解释性,特别是在需要复杂上下文理解或领域特定适应性的任务中表现尤为突出。
链接: https://arxiv.org/abs/2501.14119
作者: Derek Yotheringhay,Alistair Kirkland,Humphrey Kirkbride,Josiah Whitesteeple
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Transformative innovations in model architectures have introduced hierarchical embedding augmentation as a means to redefine the representation of tokens through multi-level semantic structures, offering enhanced adaptability to complex linguistic inputs. Autonomous structural memory manipulation further advances this paradigm through dynamic memory reallocation mechanisms that prioritize critical contextual features while suppressing less relevant information, enabling scalable and efficient performance across diverse tasks. Experimental results reveal substantial improvements in computational efficiency, with marked reductions in processing overhead for longer input sequences, achieved through memory reorganization strategies that adapt to evolving contextual requirements. Hierarchical embeddings not only improved contextual alignment but also facilitated task generalization by capturing relationships at varying semantic granularities, ensuring coherence across layers without introducing significant computational redundancies. Comparative analysis against baseline models demonstrated unique advantages in accuracy, efficiency, and interpretability, particularly in tasks requiring complex contextual understanding or domain-specific adaptability. The ability to dynamically adjust token representations and memory configurations contributed to the model’s robustness under varied and unpredictable input conditions. Applications benefiting from these advancements include multi-domain generalization, interactive systems, and scenarios involving real-time decision-making, where traditional static memory architectures often face limitations. The proposed methodology combines advanced embedding and memory management strategies into a cohesive framework that addresses scalability challenges while preserving task-specific relevance.
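论文未公开实现;下面仅以最小化的PyTorch草图示意摘要中提到的两个思路——"多级语义嵌入叠加"与"按重要性分数动态保留记忆槽位"(embed_tables、weights、scores等均为假设性占位):

```python
import torch
import torch.nn as nn

def hierarchical_embed(token_ids: torch.Tensor,
                       embed_tables: list,
                       weights: list) -> torch.Tensor:
    """同一 token 在多个语义层级的嵌入加权求和(示意)。"""
    return sum(w * table(token_ids) for table, w in zip(embed_tables, weights))

def keep_topk_memory(memory: torch.Tensor, scores: torch.Tensor, k: int):
    """按上下文重要性分数只保留 top-k 条记忆, 抑制无关信息(示意)。"""
    idx = scores.topk(k).indices
    return memory[idx]

tables = [nn.Embedding(100, 16) for _ in range(3)]     # 三个假设的语义层级
emb = hierarchical_embed(torch.tensor([1, 2, 3]), tables, [0.5, 0.3, 0.2])
mem = keep_topk_memory(torch.randn(10, 16), torch.rand(10), k=4)
```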
zh
[NLP-30] LeCoPCR: Legal Concept-guided Prior Case Retrieval for European Court of Human Rights cases NAACL2025
【速读】: 该论文试图解决在法律实践中,如何更有效地检索与查询案件相关的先例案件(Prior Case Retrieval, PCR)的问题。现有方法在确定查询案件的相关性时,往往忽略了潜在的语义意图。为此,论文提出了一种名为LeCoPCR的新方法,该方法通过从查询案件的事实中生成法律概念(legal concepts)来明确表达语义意图,并利用这些概念增强模型对相关性的理解。解决方案的关键在于:1)为克服法律概念标注数据的缺乏,采用弱监督方法从判决书的说理部分提取关键法律概念;2)使用行列式点过程(Determinantal Point Process, DPP)来平衡提取概念的质量和多样性,从而提升检索效果。实验结果表明,该方法在ECtHR-PCR数据集上显著提高了检索性能。
链接: https://arxiv.org/abs/2501.14114
作者: T.Y.S.S. Santosh,Isaac Misael Olguín Nolasco,Matthias Grabmair
机构: School of Computation, Information, and Technology, Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025
点击查看摘要
Abstract:Prior case retrieval (PCR) is crucial for legal practitioners to find relevant precedent cases given the facts of a query case. Existing approaches often overlook the underlying semantic intent in determining relevance with respect to the query case. In this work, we propose LeCoPCR, a novel approach that explicitly generates intents in the form of legal concepts from the facts of a given query case and then augments the query with these concepts to enhance the model's understanding of the semantic intent that dictates relevance. To overcome the unavailability of annotated legal concepts, we employ a weak supervision approach to extract key legal concepts from the reasoning section, using a Determinantal Point Process (DPP) to balance quality and diversity. Experimental results on the ECtHR-PCR dataset demonstrate the effectiveness of leveraging legal concepts and DPP-based key concept extraction.
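论文用DPP在概念的质量(相关性)与多样性之间取平衡;下面给出DPP选择常用的贪心MAP近似示意(非论文原实现;核矩阵取"质量×余弦相似度"的标准构造,emb与qual为假设输入)。下文[NLP-31] RELexED的示例摘要选择也采用了类似的DPP机制,可参照同一草图理解。

```python
import numpy as np

def greedy_dpp(emb: np.ndarray, qual: np.ndarray, k: int) -> list:
    """贪心最大化子集核矩阵的 log-det, 兼顾质量与多样性(示意)。"""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    L = np.outer(qual, qual) * (emb @ emb.T)        # DPP 核: 质量 x 相似度
    selected = []
    for _ in range(k):
        def logdet_with(i):
            idx = selected + [i]
            sub = L[np.ix_(idx, idx)] + 1e-9 * np.eye(len(idx))
            return np.linalg.slogdet(sub)[1]        # 数值稳定的 log-det
        rest = [i for i in range(len(qual)) if i not in selected]
        selected.append(max(rest, key=logdet_with))
    return selected

concepts = np.random.rand(20, 8)       # 20 个候选概念的向量(假设)
quality = np.random.rand(20)           # 各概念与查询的相关性分数(假设)
print(greedy_dpp(concepts, quality, k=5))
```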
zh
[NLP-31] RELexED: Retrieval-Enhanced Legal Summarization with Exemplar Diversity NAACL2025
【速读】: 该论文试图解决法律文本摘要生成任务中存在的主题偏离和写作风格不一致的问题。现有的方法通常仅依赖于源文档,导致生成的摘要内容偏离主题且风格不一致。为此,论文提出了RELexED框架,该框架通过检索增强的方式,利用示例摘要(exemplar summaries)和源文档共同指导模型生成摘要。RELexED的关键在于其两阶段示例选择策略,该策略采用行列式点过程(determinantal point process)来平衡示例与查询的相似性和示例之间的多样性,并通过影响函数(influence functions)计算得分。实验结果表明,RELexED在两种法律摘要数据集上显著优于不使用示例或仅依赖相似性选择示例的模型。
链接: https://arxiv.org/abs/2501.14113
作者: T.Y.S.S. Santosh,Chen Jia,Patrick Goroncy,Matthias Grabmair
机构: Technical University of Munich, Germany(德国慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025
点击查看摘要
Abstract:This paper addresses the task of legal summarization, which involves distilling complex legal documents into concise, coherent summaries. Current approaches often struggle with content theme deviation and inconsistent writing styles due to their reliance solely on source documents. We propose RELexED, a retrieval-augmented framework that utilizes exemplar summaries along with the source document to guide the model. RELexED employs a two-stage exemplar selection strategy, leveraging a determinantal point process to balance the trade-off between similarity of exemplars to the query and diversity among exemplars, with scores computed via influence functions. Experimental results on two legal summarization datasets demonstrate that RELexED significantly outperforms models that do not utilize exemplars and those that rely solely on similarity-based exemplar selection.
zh
[NLP-32] CoPERLex: Content Planning with Event-based Representations for Legal Case Summarization NAACL2025
【速读】: 该论文旨在解决法律专业人士在处理冗长判决书时面临的效率问题,特别是如何通过有效的摘要生成技术快速理解案件内容。为此,论文提出了一个名为CoPERLex的框架,其核心解决方案包括三个关键步骤:首先,通过内容选择(content selection)从判决书中识别出关键信息;其次,利用这些信息生成以事件为中心(event-centric)的中间计划,这些计划以主谓宾(Subject-Verb-Object)三元组的形式建模;最后,基于内容和结构化计划生成连贯的摘要。实验结果表明,结合内容选择和计划生成的方法在法律判决摘要任务中具有显著优势,尤其是事件中心的计划相较于传统的以实体为中心(entity-centric)的方法更能有效反映法律案件的叙事结构。
链接: https://arxiv.org/abs/2501.14112
作者: T.Y.S.S. Santosh,Youssef Farag,Matthias Grabmair
机构: Technical University of Munich, Germany(慕尼黑工业大学, 德国)
类目: Computation and Language (cs.CL)
备注: Accepted to NAACL 2025
点击查看摘要
Abstract:Legal professionals often struggle with lengthy judgments and require efficient summarization for quick comprehension. To address this challenge, we investigate the need for structured planning in legal case summarization, particularly through event-centric representations that reflect the narrative nature of legal case documents. We propose our framework, CoPERLex, which operates in three stages: first, it performs content selection to identify crucial information from the judgment; second, the selected content is utilized to generate intermediate plans through event-centric representations modeled as Subject-Verb-Object tuples; and finally, it generates coherent summaries based on both the content and the structured plan. Our experiments on four legal summarization datasets demonstrate the effectiveness of integrating content selection and planning components, highlighting the advantages of event-centric plans over traditional entity-centric approaches in the context of legal judgements.
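为说明"以主谓宾三元组建模事件"的中间计划形式,下面给出一个基于spaCy依存分析的极简SVO抽取示意(假设已安装spaCy及en_core_web_sm模型;论文的内容选择与计划生成流程远比此复杂,此处仅演示事件三元组这一表示本身):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_svo(text: str):
    """从依存树抽取 (主语, 谓词, 宾语) 事件三元组(示意)。"""
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ == "VERB":
                subj = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
                obj = [c for c in tok.children if c.dep_ == "dobj"]
                if subj and obj:
                    triples.append((subj[0].lemma_, tok.lemma_, obj[0].lemma_))
    return triples

print(extract_svo("The court dismissed the appeal. The judge upheld the ruling."))
# 期望输出类似: [('court', 'dismiss', 'appeal'), ('judge', 'uphold', 'ruling')]
```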
zh
[NLP-33] MedSlice: Fine-Tuned Large Language Models for Secure Clinical Note Sectioning
【速读】: 该论文试图解决从临床笔记中提取特定部分(如现病史、间隔病史、评估与计划)的自动化问题,这一问题由于笔记格式的多样性和手动分段的劳动密集型特性而具有挑战性。解决方案的关键在于使用开源的大型语言模型(LLMs)进行自动化分段,并通过微调这些模型来提高其性能。研究团队使用了一个包含487份进展笔记的精选数据集,对三个开源LLMs进行了微调,并将其与专有模型(如GPT-4o和GPT-4o mini)进行了比较。结果表明,微调后的Llama 3.1 8B模型在F1得分上优于GPT-4o(F1=0.92),并且在外部验证集上表现依然优异(F1=0.85)。这一解决方案在成本、性能和可访问性方面具有显著优势。
链接: https://arxiv.org/abs/2501.14105
作者: Joshua Davis,Thomas Sounack,Kate Sciacca,Jessie M Brain,Brigitte N Durieux,Nicole D Agaronnik,Charlotta Lindvall
机构: Dana-Farber Cancer Institute(丹娜-法伯癌症研究所); Harvard Medical School(哈佛医学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Our code is publicly available on github ( this https URL )
点击查看摘要
Abstract:Extracting sections from clinical notes is crucial for downstream analysis but is challenging due to variability in formatting and labor-intensive nature of manual sectioning. While proprietary large language models (LLMs) have shown promise, privacy concerns limit their accessibility. This study develops a pipeline for automated note sectioning using open-source LLMs, focusing on three sections: History of Present Illness, Interval History, and Assessment and Plan. We fine-tuned three open-source LLMs to extract sections using a curated dataset of 487 progress notes, comparing results relative to proprietary models (GPT-4o, GPT-4o mini). Internal and external validity were assessed via precision, recall and F1 score. Fine-tuned Llama 3.1 8B outperformed GPT-4o (F1=0.92). On the external validity test set, performance remained high (F1= 0.85). Fine-tuned open-source LLMs can surpass proprietary models in clinical note sectioning, offering advantages in cost, performance, and accessibility.
zh
[NLP-34] Communicating Activations Between Language Model Agents
【速读】: 该论文试图解决多语言模型(LM)之间通信时使用自然语言所带来的高推理成本和信息损失问题。自然语言通信不仅随着代理数量和消息数量的增加而迅速增加计算成本,而且在解码过程中会丢失大量可以从内部激活中获取的丰富信息。论文提出了一种简单的技术,即通过激活(activations)进行通信:具体来说,在LM B的中间层暂停计算,将其当前激活与另一个LM A的中间激活通过某种函数f结合,然后将f的输出传递到B的下一层并继续前向传播直至解码完成。这种方法在无需额外参数和数据的情况下扩展了LM在新任务上的能力,并显著节省了计算资源。实验结果表明,与自然语言通信相比,该方法在多个数据集上实现了高达27.0%的性能提升,同时仅需1/4的计算量,证明了激活作为LM间通信替代“语言”的优越性和鲁棒性。
链接: https://arxiv.org/abs/2501.14082
作者: Vignav Ramesh,Kenneth Li
机构: Kempner Institute for AI (肯普纳人工智能研究所); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Communication between multiple language model (LM) agents has been shown to scale up the reasoning ability of LMs. While natural language has been the dominant medium for inter-LM communication, it is not obvious this should be the standard: not only does natural language communication incur high inference costs that scale quickly with the number of both agents and messages, but also the decoding process abstracts away too much rich information that could be otherwise accessed from the internal activations. In this work, we propose a simple technique whereby LMs communicate via activations; concretely, we pause an LM B's computation at an intermediate layer, combine its current activation with another LM A's intermediate activation via some function f, then pass f's output into the next layer of B and continue the forward pass till decoding is complete. This approach scales up LMs on new tasks with zero additional parameters and data, and saves a substantial amount of compute over natural language communication. We test our method with various functional forms f on two experimental setups (multi-player coordination games and reasoning benchmarks) and find that it achieves up to 27.0% improvement over natural language communication across datasets with 1/4 the compute, illustrating the superiority and robustness of activations as an alternative "language" for communication between LMs.
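摘要描述的机制可以用前向钩子(forward hook)直观示意:在B的第k层把A的同层激活经函数f合并后继续前向传播。下面以HuggingFace GPT-2为概念性示例(层索引k、均值形式的f均为假设;论文实验了多种f的函数形式,且具体注入位置与模型结构相关):

```python
import torch
from transformers import GPT2Model

model_a = GPT2Model.from_pretrained("gpt2")
model_b = GPT2Model.from_pretrained("gpt2")
k = 6                                          # 假设在第 k 层交换激活
f = lambda h_b, h_a: 0.5 * (h_b + h_a)         # 假设 f 取均值

cache = {}

def save_a(module, args, output):
    cache["h_a"] = output[0]                   # 缓存 A 第 k 层隐状态

def merge_into_b(module, args, output):
    merged = f(output[0], cache["h_a"])        # 将 A 的激活合并进 B
    return (merged,) + output[1:]

model_a.h[k].register_forward_hook(save_a)
model_b.h[k].register_forward_hook(merge_into_b)

ids = torch.tensor([[464, 3290, 318, 922]])    # 任意示例 token id
with torch.no_grad():
    model_a(ids)                               # 先跑 A, 缓存第 k 层激活
    out = model_b(ids)                         # 再跑 B, 第 k 层注入后继续前向
```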
zh
[NLP-35] Enhancing Biomedical Relation Extraction with Directionality
【速读】: 该论文试图解决生物医学文献中关系提取(Relation Extraction)的两个关键问题:一是现有数据集(如BioRED)缺乏实体角色(entity roles)的方向性标注(directionality annotations),即无法区分关系中的主体(subject)和客体(object),这对于研究复杂的生物网络至关重要;二是如何通过多任务学习(multi-task learning)和软提示学习(soft-prompt learning)来联合识别关系、新发现以及实体角色。解决方案的关键在于对BioRED数据集进行了扩展,新增了10,864个方向性标注,并提出了一个新颖的多任务语言模型,该模型在两项基准测试任务中表现优于现有的先进大语言模型(如GPT-4和Llama-3)。
链接: https://arxiv.org/abs/2501.14079
作者: Po-Ting Lai,Chih-Hsuan Wei,Shubo Tian,Robert Leaman,Zhiyong Lu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Biological relation networks contain rich information for understanding the biological mechanisms behind the relationship of entities such as genes, proteins, diseases, and chemicals. The vast growth of biomedical literature poses significant challenges in updating the network knowledge. The recent Biomedical Relation Extraction Dataset (BioRED) provides valuable manual annotations, facilitating the development of machine-learning and pre-trained language model approaches for automatically identifying novel document-level (inter-sentence context) relationships. Nonetheless, its annotations lack directionality (subject/object) for the entity roles, essential for studying complex biological networks. Herein we annotate the entity roles of the relationships in the BioRED corpus and subsequently propose a novel multi-task language model with soft-prompt learning to jointly identify the relationship, novel findings, and entity roles. Our results include an enriched BioRED corpus with 10,864 directionality annotations. Moreover, our proposed method outperforms existing large language models such as the state-of-the-art GPT-4 and Llama-3 on two benchmarking tasks. Our source code and dataset are available at this https URL.
zh
[NLP-36] LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language
【速读】: 该论文探讨了大语言模型(LLMs)在现实世界应用中的潜在危害,特别是其在面对恶意请求时表现出的脆弱性。研究发现,许多先进的专有和开源LLMs(如GPT4o、GPT-4、LLama3-405B-Instruct等)在面对隐藏在科学语言背后的恶意请求时,其偏见和毒性显著增加。这些模型甚至可以被操纵生成虚假的科学论证,声称偏见是有益的,从而被恶意行为者用于系统性地“越狱”最强模型。论文的关键解决方案在于揭示这些模型在面对学术语言中的恶意请求时的脆弱性,并分析了影响这些脆弱性的多种因素,如提及作者姓名和发表场所会增加某些模型的说服力,且偏见分数会随着对话的进行而增加。研究呼吁在LLMs的训练过程中更谨慎地使用科学数据,以提高其安全性。
链接: https://arxiv.org/abs/2501.14073
作者: Yubin Ge,Neeraja Kirtane,Hao Peng,Dilek Hakkani-Tür
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: 15 pages
点击查看摘要
Abstract:As large language models (LLMs) have been deployed in various real-world settings, concerns about the harm they may propagate have grown. Various jailbreaking techniques have been developed to expose the vulnerabilities of these models and improve their safety. This work reveals that many state-of-the-art proprietary and open-source LLMs are vulnerable to malicious requests hidden behind scientific language. Specifically, our experiments with GPT4o, GPT4o-mini, GPT-4, LLama3-405B-Instruct, Llama3-70B-Instruct, Cohere, Gemini models on the StereoSet data demonstrate that, the models’ biases and toxicity substantially increase when prompted with requests that deliberately misinterpret social science and psychological studies as evidence supporting the benefits of stereotypical biases. Alarmingly, these models can also be manipulated to generate fabricated scientific arguments claiming that biases are beneficial, which can be used by ill-intended actors to systematically jailbreak even the strongest models like GPT. Our analysis studies various factors that contribute to the models’ vulnerabilities to malicious requests in academic language. Mentioning author names and venues enhances the persuasiveness of some models, and the bias scores can increase as dialogues progress. Our findings call for a more careful investigation on the use of scientific data in the training of LLMs.
zh
[NLP-37] Leveraging Large Language Models to Analyze Emotional and Contextual Drivers of Teen Substance Use in Online Discussions
【速读】: 该论文试图解决的问题是青少年在社交媒体上的自我表达中,与物质使用(substance use)相关的情感模式和背景因素。研究通过应用大语言模型(Large Language Models, LLMs)分析青少年的社交媒体帖子,揭示了与物质使用相关的情感(如悲伤、内疚、恐惧、喜悦)和背景因素(如家庭、同伴、学校)。研究的关键解决方案包括使用热图(heatmap)和机器学习分析,识别出与物质使用相关的帖子的关键预测因素。研究发现,负面情感如悲伤和内疚在物质使用背景下更为频繁,其中内疚起到保护作用,而羞耻感和同伴影响则增加了物质使用的风险。喜悦在非物质使用讨论中更为常见。此外,同伴影响与悲伤、恐惧和厌恶情感密切相关,而家庭和学校环境则与非物质使用相关。研究强调了解决情感脆弱性和背景影响的重要性,并建议通过家庭、学校和社区的协作干预来减少风险因素,促进青少年的健康发展。
链接: https://arxiv.org/abs/2501.14037
作者: Jianfeng Zhu,Ruoming Jin,Hailong Jiang,Yulan Wang,Xinyu Zhang,Karin G. Coifman
机构: 未知
类目: Computation and Language (cs.CL)
备注: 28 pages, 9 figures with an appendix
点击查看摘要
Abstract:Adolescence is a critical stage often linked to risky behaviors, including substance use, with significant developmental and public health implications. Social media provides a lens into adolescent self-expression, but interpreting emotional and contextual signals remains complex. This study applies Large Language Models (LLMs) to analyze adolescents’ social media posts, uncovering emotional patterns (e.g., sadness, guilt, fear, joy) and contextual factors (e.g., family, peers, school) related to substance use. Heatmap and machine learning analyses identified key predictors of substance use-related posts. Negative emotions like sadness and guilt were significantly more frequent in substance use contexts, with guilt acting as a protective factor, while shame and peer influence heightened substance use risk. Joy was more common in non-substance use discussions. Peer influence correlated strongly with sadness, fear, and disgust, while family and school environments aligned with non-substance use. Findings underscore the importance of addressing emotional vulnerabilities and contextual influences, suggesting that collaborative interventions involving families, schools, and communities can reduce risk factors and foster healthier adolescent development.
zh
[NLP-38] QuanTaxo: A Quantum Approach to Self-Supervised Taxonomy Expansion
【速读】: 该论文试图解决现有分类法(taxonomy)扩展方法在捕捉层次多义性(hierarchical polysemy)方面的不足。层次多义性指的是实体在分类法中的位置和上下文环境会影响其含义,而传统的词嵌入(word embeddings)方法难以有效捕捉这种复杂性。为了解决这一问题,论文提出了QuanTaxo,一种基于量子启发的分类法扩展框架。QuanTaxo通过在量子空间中编码实体表示,利用希尔伯特空间(Hilbert space)的原理捕捉实体之间的干涉效应,从而生成更丰富和细致的表示。实验结果表明,QuanTaxo在准确性、平均倒数排名(Mean Reciprocal Rank)和Wu-Palmer指标上显著优于传统的嵌入模型,证明了其在分类法扩展任务中的优越性。
链接: https://arxiv.org/abs/2501.14011
作者: Sahil Mishra,Avi Patni,Niladri Chatterjee,Tanmoy Chakraborty
机构: Dept. of EE, IIT Delhi, India (印度理工学院德里分校电气工程系); Dept. of Maths, IIT Delhi, India (印度理工学院德里分校数学系)
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:A taxonomy is a hierarchical graph containing knowledge to provide valuable insights for various web applications. Online retail organizations like Microsoft and Amazon utilize taxonomies to improve product recommendations and optimize advertisement by enhancing query interpretation. However, the manual construction of taxonomies requires significant human effort. As web content continues to expand at an unprecedented pace, existing taxonomies risk becoming outdated, struggling to incorporate new and emerging information effectively. As a consequence, there is a growing need for dynamic taxonomy expansion to keep them relevant and up-to-date. Existing taxonomy expansion methods often rely on classical word embeddings to represent entities. However, these embeddings fall short in capturing hierarchical polysemy, where an entity’s meaning can vary based on its position in the hierarchy and its surrounding context. To address this challenge, we introduce QuanTaxo, an innovative quantum-inspired framework for taxonomy expansion. QuanTaxo encodes entity representations in quantum space, effectively modeling hierarchical polysemy by leveraging the principles of Hilbert space to capture interference effects between entities, yielding richer and more nuanced representations. Comprehensive experiments on four real-world benchmark datasets show that QuanTaxo significantly outperforms classical embedding models, achieving substantial improvements of 18.45% in accuracy, 20.5% in Mean Reciprocal Rank, and 17.87% in Wu Palmer metrics across eight classical embedding-based baselines. We further highlight the superiority of QuanTaxo through extensive ablation and case studies.
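QuanTaxo的具体模型未在摘要中给出;下面仅用一个高度简化的示意说明"复数向量表示+幅值平方"如何引入带相位干涉的重叠度量(纯属示意性假设,并非QuanTaxo的实际实现):

```python
import numpy as np

def quantum_overlap(psi: np.ndarray, phi: np.ndarray) -> float:
    """两个归一化复向量的保真度 |<psi|phi>|^2, 相位差会产生干涉效应。"""
    psi = psi / np.linalg.norm(psi)
    phi = phi / np.linalg.norm(phi)
    return float(np.abs(np.vdot(psi, phi)) ** 2)

parent = np.array([1.0 + 0.0j, 0.0 + 0.5j, 0.2 + 0.0j])  # 假设的父节点态
child = np.array([0.9 + 0.0j, 0.0 + 0.4j, 0.1 + 0.0j])   # 假设的待挂载实体态
print(quantum_overlap(parent, child))   # 值越大, 候选父节点越匹配(示意)
```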
zh
[NLP-39] Advancing Math Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods and Training Stages ICLR2025
【速读】: 该论文试图解决大语言模型(LLMs)在数学推理能力提升方面面临的挑战,特别是在持续预训练(CPT)阶段效果不如监督微调(SFT)阶段显著的问题。论文的核心解决方案在于探索在预训练阶段使用问题解决数据(problem-solving data)而非通用数学语料库(general mathematical corpora)的策略。通过研究三个主要问题,论文发现:1)问题解决数据在CPT阶段能更有效地提升模型的数学推理能力;2)合成数据(synthetic data)的有效性取决于其生成方法,其中“导师放大合成法”(tutorship amplification synthesis method)表现最佳;3)尽管SFT有助于提升指令跟随能力,但在处理复杂多步问题解决数据时表现不如CPT。这些发现为优化LLMs的数学推理能力提供了重要指导,并最终开发了一个名为JiuZhang-8B的强大数学基础模型。
链接: https://arxiv.org/abs/2501.14002
作者: Zui Chen,Tianqiao Liu,Mi Tian,Qing Tong,Weiqi Luo,Zitao Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICLR 2025
点击查看摘要
Abstract:Advancements in LLMs have significantly expanded their capabilities across various domains. However, mathematical reasoning remains a challenging area, prompting the development of math-specific LLMs. These models typically follow a two-stage training paradigm: pre-training with math-related corpora and post-training with problem datasets for SFT. Despite these efforts, the improvements in mathematical reasoning achieved through continued pre-training (CPT) are often less significant compared to those obtained via SFT. This study addresses this discrepancy by exploring alternative strategies during the pre-training phase, focusing on the use of problem-solving data over general mathematical corpora. We investigate three primary research questions: (1) Can problem-solving data enhance the model’s mathematical reasoning capabilities more effectively than general mathematical corpora during CPT? (2) Are synthetic data from the same source equally effective, and which synthesis methods are most efficient? (3) How do the capabilities developed from the same problem-solving data differ between the CPT and SFT stages, and what factors contribute to these differences? Our findings indicate that problem-solving data significantly enhances the model’s mathematical capabilities compared to general mathematical corpora. We also identify effective data synthesis methods, demonstrating that the tutorship amplification synthesis method achieves the best performance. Furthermore, while SFT facilitates instruction-following abilities, it underperforms compared to CPT with the same data, which can be partially attributed to its poor learning capacity for hard multi-step problem-solving data. These insights provide valuable guidance for optimizing the mathematical reasoning capabilities of LLMs, culminating in our development of a powerful mathematical base model called JiuZhang-8B.
zh
[NLP-40] Framework for Progressive Knowledge Fusion in Large Language Models Through Structured Conceptual Redundancy Analysis
【速读】: 该论文试图解决大规模模型中潜在知识组织(latent knowledge organization)的挑战,特别是处理重叠表示(overlapping representations)和优化上下文准确性(contextual accuracy)的问题。这些挑战导致计算需求增加和任务特定结果效率低下。论文提出的解决方案关键在于通过高级聚类技术(advanced clustering techniques)和动态阈值(dynamic thresholding)来重构这些冗余,确保保留关键的语义关系(semantic relationships)同时去除不必要的重叠。这一方法显著提高了内存效率、推理速度,并增强了潜在知识簇的对齐和可解释性,同时降低了错误率和提升了对抗鲁棒性(adversarial robustness)。
链接: https://arxiv.org/abs/2501.13999
作者: Joseph Sakau,Evander Kozlowski,Roderick Thistledown,Basil Steinberger
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The organization of latent knowledge within large-scale models poses unique challenges when addressing overlapping representations and optimizing contextual accuracy. Conceptual redundancies embedded across layers often result in inefficiencies that affect both computational demands and task-specific outcomes. A framework was proposed to restructure these redundancies through advanced clustering techniques and dynamic thresholding, ensuring that critical semantic relationships are preserved while removing unnecessary overlaps. Evaluations revealed improved memory efficiency and faster inference times, alongside better alignment in latent knowledge clusters that enhanced interpretability. Improvements in error rates and adversarial robustness suggest that restructuring redundancies has broader implications for increasing model reliability across diverse applications. Comparative analyses highlighted reductions in resource consumption and notable gains in performance, particularly in translation and summarization tasks. Energy metrics demonstrated significant savings during training phases, further validating the practicality of the approach for real-world deployments. Representational fidelity was also enhanced, with latent space evaluations indicating better cluster alignment and higher semantic consistency. The methodology bridges a key gap in model optimization through directly addressing redundancies at the structural level. Its application opens avenues for scalable, efficient, and contextually aware systems that can adapt to complex, domain-specific tasks without compromising on performance.
zh
[NLP-41] CAPRAG : A Large Language Model Solution for Customer Service and Automatic Reporting using Vector and Graph Retrieval-Augmented Generation
【速读】: 该论文试图解决银行领域中新功能和服务引入时客户信息过载的问题,旨在通过基于大语言模型(LLMs)的金融聊天机器人提升用户体验。解决方案的关键在于提出了一种混合客户分析管道检索增强生成(CAPRAG)方法,该方法能够有效处理基于关系和上下文的查询,从而提升数字银行环境中的客户参与度。具体实现包括开发一个文本数据处理管道,并将其应用于两种主要框架:向量RAG(Vector RAG)和图RAG(Graph RAG)。通过这两种框架,处理后的数据被填充到向量和图数据库中,以实现高效检索。此外,论文还采用了Cypher查询组件来有效查询图数据库。用户提交的查询首先通过查询扩展模块进行扩展,然后从混合知识库(KB)中构建最终查询,最终由开源LLM生成响应。这一创新设计为国际银行的客户提供了在日益复杂的数字环境中更清晰和可访问的信息服务。
链接: https://arxiv.org/abs/2501.13993
作者: Hamza Landolsi,Kais Letaief,Nizar Taghouti,Ines Abdeljaoued-Tej
机构: INETUM Tunisia; University of Carthage, Engineering School of Statistics and Information Analysis (迦太基大学, 统计与信息分析工程学院); Laboratory of BioInformatics bioMathematics, and bioStatistics (LR24IPT09), Institut Pasteur de Tunis, University of Tunis El Manar (突尼斯巴斯德研究所, 突尼斯埃尔马纳尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 14 pages, 5 Figures, 3 Tables
点击查看摘要
Abstract:The introduction of new features and services in the banking sector often overwhelms customers, creating an opportunity for banks to enhance user experience through financial chatbots powered by large language models (LLMs). We initiated an AI agent designed to provide customers with relevant information about banking services and insights from annual reports. We proposed a hybrid Customer Analysis Pipeline Retrieval-Augmented Generation (CAPRAG) that effectively addresses both relationship-based and contextual queries, thereby improving customer engagement in the digital banking landscape. To implement this, we developed a processing pipeline to refine text data, which we utilized in two main frameworks: Vector RAG and Graph RAG. This dual approach enables us to populate both vector and graph databases with processed data for efficient retrieval. The Cypher query component is employed to effectively query the graph database. When a user submits a query, it is first expanded by a query expansion module before being routed to construct a final query from the hybrid Knowledge Base (KB). This final query is then sent to an open-source LLM for response generation. Overall, our innovative system, designed for international banks, serves banks' customers in an increasingly complex digital environment, enhancing the clarity and accessibility of information.
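下面用一个可运行的玩具示意CAPRAG式"向量+图"混合检索的控制流(向量检索用余弦相似度模拟,图检索用邻接字典代替真实的Cypher查询;文档、向量与图关系均为虚构的占位数据):

```python
import numpy as np

docs = {"fees": "Wire transfers cost 15 EUR.", "cards": "Debit cards are free."}
doc_vecs = {k: np.random.rand(8) for k in docs}   # 示意: 实际应为文本嵌入
graph = {"fees": ["cards"]}                        # 示意: 文档/实体间的图关系

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_retrieve(q_vec, top_k=1):
    hits = sorted(docs, key=lambda k: cosine(q_vec, doc_vecs[k]), reverse=True)[:top_k]
    neighbors = [n for h in hits for n in graph.get(h, [])]   # 图扩展(代替 Cypher)
    return [docs[k] for k in hits + neighbors]

context = hybrid_retrieve(np.random.rand(8))
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: What do transfers cost?"
print(prompt)   # 该提示随后交给开源 LLM 生成最终回答
```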
zh
[NLP-42] Comprehensive Modeling and Question Answering of Cancer Clinical Practice Guidelines using LLMs
【速读】: 该论文旨在解决如何有效捕捉和利用临床实践指南(Clinical Practice Guidelines, CPGs)中的医学知识,特别是其中的上下文及其关系,以支持医疗专业人员的治疗决策过程。现有的研究虽然已经利用这些指南创建了临床决策支持系统的规则库,但在直接捕捉CPGs中完整的医学知识方面仍存在不足。论文提出了一种方法,通过自动化提取和节点关系分类,创建了国家综合癌症网络(National Comprehensive Cancer Network, NCCN)癌症CPGs的上下文丰富的数字图形表示。关键解决方案包括使用大语言模型(Large Language Models, LLMs)进行节点分类,实现了零样本学习(zero-shot learning)和少样本学习(few-shot learning)的准确率分别为80.86%和88.47%。此外,论文还引入了一种方法,通过利用LLMs从指南知识库中提取相关子图,并结合子图路径和语义信息生成自然语言答案,从而减少LLMs在医学领域问答中可能出现的错误答案和幻觉问题,确保事实准确性。
链接: https://arxiv.org/abs/2501.13984
作者: Bhumika Gupta,Pralaypati Ta,Keerthi Ram,Mohanasankar Sivaprakasam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The updated recommendations on diagnostic procedures and treatment pathways for a medical condition are documented as graphical flows in Clinical Practice Guidelines (CPGs). For effective use of the CPGs in helping medical professionals in the treatment decision process, it is necessary to fully capture the guideline knowledge, particularly the contexts and their relationships in the graph. While several existing works have utilized these guidelines to create rule bases for Clinical Decision Support Systems, limited work has been done toward directly capturing the full medical knowledge contained in CPGs. This work proposes an approach to create a contextually enriched, faithful digital representation of National Comprehensive Cancer Network (NCCN) Cancer CPGs in the form of graphs using automated extraction and node relationship classification. We also implement semantic enrichment of the model by using Large Language Models (LLMs) for node classification, achieving an accuracy of 80.86% and 88.47% with zero-shot learning and few-shot learning, respectively. Additionally, we introduce a methodology for answering natural language questions with constraints to guideline text by leveraging LLMs to extract the relevant subgraph from the guideline knowledge base. By generating natural language answers based on subgraph paths and semantic information, we mitigate the risk of incorrect answers and hallucination associated with LLMs, ensuring factual accuracy in medical domain Question Answering.
zh
[NLP-43] AdEval: Alignment-based Dynamic Evaluation to Mitigate Data Contamination in Large Language Models
【速读】: 该论文试图解决大语言模型(LLMs)在评估过程中由于数据污染(data contamination)导致的性能高估问题。数据污染指的是模型在预训练阶段可能接触到评估数据集中的内容,从而在评估时表现出不真实的性能提升。为了解决这一问题,作者提出了AdEval(Alignment-based Dynamic Evaluation),一种动态数据评估方法。AdEval的关键在于通过提取关键知识点和核心概念,动态生成与静态数据核心概念对齐的问题,并结合在线搜索提供详细的知识点解释,从而生成高质量且具有知识支持的评估样本。此外,AdEval通过控制问题的数量和复杂度,实现动态对齐和灵活调整,确保生成的问题与静态数据的复杂度相匹配,并支持不同复杂度的评估需求。基于布鲁姆分类法(Bloom’s taxonomy),AdEval在六个认知层次(记忆、理解、应用、分析、评估和创造)上对LLMs进行多维评估。实验结果表明,AdEval有效减少了数据污染对评估结果的影响,提升了评估过程的公平性和可靠性。
链接: https://arxiv.org/abs/2501.13983
作者: Yang Fan
机构: Jinan University(暨南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) are pretrained on massive-scale corpora, the issue of data contamination has become increasingly severe, leading to potential overestimation of model performance during evaluation. To address this, we propose AdEval (Alignment-based Dynamic Evaluation), a dynamic data evaluation method aimed at mitigating the impact of data contamination on evaluation reliability. AdEval extracts key knowledge points and main ideas to align dynamically generated questions with static data’s core concepts. It also leverages online search to provide detailed explanations of related knowledge points, thereby creating high-quality evaluation samples with robust knowledge support. Furthermore, AdEval incorporates mechanisms to control the number and complexity of questions, enabling dynamic alignment and flexible adjustment. This ensures that the generated questions align with the complexity of static data while supporting varied complexity levels. Based on Bloom’s taxonomy, AdEval conducts a multi-dimensional evaluation of LLMs across six cognitive levels: remembering, understanding, applying, analyzing, evaluating, and creating. Experimental results on multiple datasets demonstrate that AdEval effectively reduces the impact of data contamination on evaluation outcomes, enhancing both the fairness and reliability of the evaluation process.
zh
[NLP-44] Chain of Grounded Objectives: Bridging Process and Goal-oriented Prompting for Code Generation
【速读】: 该论文试图解决现有大语言模型(LLMs)在代码生成中因采用顺序推理策略而导致的灵活性不足问题。这些策略虽然模仿了人类的逐步思考过程,但并不总是与编程语言的结构化特性相匹配,从而限制了生成代码的质量和适应性。论文提出的解决方案是“基于目标的链式推理”(Chain of Grounded Objectives, CGO),该方法通过在输入提示中嵌入功能目标(functional objectives)来增强代码生成。CGO的关键在于利用适当结构化的目标作为输入,并避免显式的顺序推理过程,从而更好地适应编程任务的结构化特性。实验结果表明,CGO能够有效提升代码生成的质量,克服了现有方法的局限性。
链接: https://arxiv.org/abs/2501.13978
作者: Sangyeop Yeo,Seung-won Hwang,Yu-Seung Ma
机构: Electronics and Telecommunications Research Institute(韩国电子通信研究院); Seoul National University(首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
点击查看摘要
Abstract:The use of Large Language Models (LLMs) for code generation has gained significant attention in recent years. Existing methods often aim to improve the quality of generated code by incorporating additional contextual information or guidance into input prompts. Many of these approaches adopt sequential reasoning strategies, mimicking human-like step-by-step thinking. However, such strategies may constrain flexibility, as they do not always align with the structured characteristics of programming languages. This paper introduces the Chain of Grounded Objectives (CGO), a method that embeds functional objectives into input prompts to enhance code generation. By leveraging appropriately structured objectives as input and avoiding explicit sequential procedures, CGO adapts effectively to the structured nature of programming tasks. Empirical evaluations demonstrate that CGO effectively enhances code generation, addressing limitations of existing approaches.
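CGO的核心是把功能目标以结构化形式直接写进提示,而非要求模型逐步推理;下面的模板与目标内容均为假设,仅示意提示的构造方式:

```python
def build_cgo_prompt(task: str, objectives: list) -> str:
    """把结构化的功能目标嵌入提示(示意性模板, 非论文官方提示词)。"""
    goals = "\n".join(f"- {g}" for g in objectives)
    return (f"Task: {task}\n"
            f"Functional objectives:\n{goals}\n"
            f"Write a function that satisfies all objectives.")

print(build_cgo_prompt(
    "Parse an ISO-8601 date string",
    ["return a datetime object",
     "raise ValueError on malformed input",
     "ignore surrounding whitespace"],
))
```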
zh
[NLP-45] Re-ranking Using Large Language Models for Mitigating Exposure to Harmful Content on Social Media Platforms
【速读】: 该论文旨在解决社交媒体平台上推荐算法(Recommendation Algorithms)在最大化用户参与度时,可能导致用户无意中接触到有害内容的问题。现有的内容审核方法依赖于大量人工标注数据训练的分类器(Classifiers),但这些方法在扩展性和应对新型有害内容方面存在困难。为解决这些问题,论文提出了一种基于大语言模型(Large Language Models, LLMs)的重新排序方法,该方法在零样本(Zero-shot)和少样本(Few-shot)设置下动态评估并重新排序内容序列,从而有效减少有害内容的暴露,且无需大量标注数据。论文还引入了两个新的评估指标,用于衡量重新排序在减少有害内容暴露方面的效果。通过在三个数据集、三个模型和三种配置下的实验,论文证明了基于LLM的方法显著优于现有的专有审核方法,提供了一种可扩展且适应性强的有害内容缓解方案。
链接: https://arxiv.org/abs/2501.13977
作者: Rajvardhan Oak,Muhammad Haroon,Claire Jo,Magdalena Wojcieszak,Anshuman Chhabra
机构: University of California, Davis(加州大学戴维斯分校); University of South Florida(南佛罗里达大学); Microsoft Corporation(微软公司), USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注: This paper is under peer review
点击查看摘要
Abstract:Social media platforms utilize Machine Learning (ML) and Artificial Intelligence (AI) powered recommendation algorithms to maximize user engagement, which can result in inadvertent exposure to harmful content. Current moderation efforts, reliant on classifiers trained with extensive human-annotated data, struggle with scalability and adapting to new forms of harm. To address these challenges, we propose a novel re-ranking approach using Large Language Models (LLMs) in zero-shot and few-shot settings. Our method dynamically assesses and re-ranks content sequences, effectively mitigating harmful content exposure without requiring extensive labeled data. Alongside traditional ranking metrics, we also introduce two new metrics to evaluate the effectiveness of re-ranking in reducing exposure to harmful content. Through experiments on three datasets, three models and across three configurations, we demonstrate that our LLM-based approach significantly outperforms existing proprietary moderation approaches, offering a scalable and adaptable solution for harm mitigation.
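重排思路的最小示意如下(harm_score为保证可运行的占位实现,真实系统中应由LLM在零样本/少样本提示下给出有害概率;内容示例均为虚构):

```python
def harm_score(text: str) -> float:
    """占位打分器: 真实系统中由 LLM 评估文本的有害概率。"""
    banned = ("violence", "self-harm", "dangerous challenge")
    return sum(w in text.lower() for w in banned) / len(banned)

def rerank(feed: list) -> list:
    """按有害分从低到高重排内容序列, 降低有害内容的曝光位次。"""
    return sorted(feed, key=harm_score)

feed = ["cooking tips", "extreme violence compilation", "cat videos"]
print(rerank(feed))
# ['cooking tips', 'cat videos', 'extreme violence compilation']
```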
zh
[NLP-46] Towards Safer Social Media Platforms: Scalable and Performant Few-Shot Harmful Content Moderation Using Large Language Models
【速读】: 该论文旨在解决社交媒体平台上有害内容(harmful content)的普遍存在及其对用户和社会带来的重大风险问题。当前的内容审核策略主要依赖于人工审核、监督分类器以及大量训练数据,但面临扩展性、主观性和有害内容动态变化(如暴力内容、危险挑战趋势等)的挑战。为解决这些问题,论文提出利用大语言模型(Large Language Models, LLMs)通过上下文学习(in-context learning)进行少样本动态内容审核(few-shot dynamic content moderation)。实验表明,该方法在识别有害内容方面优于现有的专有基线(如Perspective和OpenAI Moderation)以及先前的少样本学习方法。此外,论文还结合了视觉信息(视频缩略图)并评估了不同多模态技术对模型性能的提升。研究结果强调了基于LLM的方法在可扩展和动态有害内容审核中的显著优势。
链接: https://arxiv.org/abs/2501.13976
作者: Akash Bonagiri,Lucen Li,Rajvardhan Oak,Zeerak Babar,Magdalena Wojcieszak,Anshuman Chhabra
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注: This paper is in submission and under peer review
点击查看摘要
Abstract:The prevalence of harmful content on social media platforms poses significant risks to users and society, necessitating more effective and scalable content moderation strategies. Current approaches rely on human moderators, supervised classifiers, and large volumes of training data, and often struggle with scalability, subjectivity, and the dynamic nature of harmful content (e.g., violent content, dangerous challenge trends, etc.). To bridge these gaps, we utilize Large Language Models (LLMs) to undertake few-shot dynamic content moderation via in-context learning. Through extensive experiments on multiple LLMs, we demonstrate that our few-shot approaches can outperform existing proprietary baselines (Perspective and OpenAI Moderation) as well as prior state-of-the-art few-shot learning methods, in identifying harm. We also incorporate visual information (video thumbnails) and assess if different multimodal techniques improve model performance. Our results underscore the significant benefits of employing LLM based methods for scalable and dynamic harmful content moderation online.
zh
[NLP-47] Assisting Mathematical Formalization with A Learning-based Premise Retriever
【速读】: 该论文试图解决数学形式化过程中前提选择(premise selection)这一关键但具有挑战性的问题,特别是对于经验有限的用户而言。由于缺乏可用的形式化项目,现有的基于语言模型的方法往往面临数据稀缺的困境。论文提出了一种创新的方法,通过训练一个前提检索器来支持数学的形式化。其解决方案的关键在于使用BERT模型将证明状态(proof states)和前提(premises)嵌入到一个共享的潜在空间(shared latent space)中,并在对比学习(contrastive learning)框架下进行训练。此外,该方法还结合了领域特定的分词器(domain-specific tokenizer)和细粒度的相似度计算方法(fine-grained similarity computation method),并通过引入重排序模块(re-ranking module)进一步提升了性能。最终,论文还计划发布一个搜索引擎,使用户能够直接通过证明状态查询Mathlib定理,从而提高形式化过程的可访问性和效率。
链接: https://arxiv.org/abs/2501.13959
作者: Yicheng Tao,Haotian Liu,Shanwen Wang,Hongteng Xu
机构: Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院); School of Mathematics, Renmin University of China(中国人民大学数学学院); Beijing Key Laboratory of Big Data Management and Analysis Methods(北京市大数据管理与分析方法重点实验室); Innovation Laboratory of Mingli College, Renmin University of China(中国人民大学明理学院创新实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Premise selection is a crucial yet challenging step in mathematical formalization, especially for users with limited experience. Due to the lack of available formalization projects, existing approaches that leverage language models often suffer from data scarcity. In this work, we introduce an innovative method for training a premise retriever to support the formalization of mathematics. Our approach employs a BERT model to embed proof states and premises into a shared latent space. The retrieval model is trained within a contrastive learning framework and incorporates a domain-specific tokenizer along with a fine-grained similarity computation method. Experimental results show that our model is highly competitive compared to existing baselines, achieving strong performance while requiring fewer computational resources. Performance is further enhanced through the integration of a re-ranking module. To streamline the formalization process, we will release a search engine that enables users to query Mathlib theorems directly using proof states, significantly improving accessibility and efficiency. Codes are available at this https URL.
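论文在对比学习框架下训练"证明状态-前提"双塔检索器;下面给出常见的批内InfoNCE损失示意(温度系数与向量维度为假设,非论文官方代码):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(state_emb: torch.Tensor,
                              premise_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE 式批内对比损失: 对角线为(证明状态, 前提)正样本对(示意)。"""
    state_emb = F.normalize(state_emb, dim=-1)
    premise_emb = F.normalize(premise_emb, dim=-1)
    logits = state_emb @ premise_emb.T / temperature   # [B, B] 相似度矩阵
    labels = torch.arange(logits.size(0))              # 第 i 行对应第 i 个前提
    return F.cross_entropy(logits, labels)

loss = in_batch_contrastive_loss(torch.randn(4, 32), torch.randn(4, 32))
```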
zh
[NLP-48] A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)在专业领域应用中的挑战,特别是传统检索增强生成(RAG, Retrieval-Augmented Generation)系统在复杂查询理解、分布式知识整合和系统效率方面的局限性。论文提出的解决方案是图结构检索增强生成(GraphRAG),其关键创新包括:(1)图结构知识表示,显式捕捉实体关系和领域层次结构;(2)基于图的高效检索技术,支持多跳推理能力,确保上下文保留的知识检索;(3)结构感知的知识整合算法,利用检索到的知识实现准确且逻辑一致的生成。通过这些创新,GraphRAG能够更好地适应专业领域的复杂需求,提升LLMs在特定领域的应用效果。
链接: https://arxiv.org/abs/2501.13958
作者: Qinggang Zhang,Shengyuan Chen,Yuanchen Bei,Zheng Yuan,Huachi Zhou,Zijin Hong,Junnan Dong,Hao Chen,Yi Chang,Xiao Huang
机构: The Hong Kong Polytechnic University, Hong Kong SAR, China(香港理工大学); Jilin University, Changchun, China(吉林大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks, yet their application to specialized domains remains challenging due to the need for deep expertise. Retrieval-augmented generation (RAG) has emerged as a promising solution to customize LLMs for professional fields by seamlessly integrating external knowledge bases, enabling real-time access to domain-specific expertise during inference. Despite its potential, traditional RAG systems, based on flat text retrieval, face three critical challenges: (i) complex query understanding in professional contexts, (ii) difficulties in knowledge integration across distributed sources, and (iii) system efficiency bottlenecks at scale. This survey presents a systematic analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new paradigm that revolutionizes domain-specific LLM applications. GraphRAG addresses traditional RAG limitations through three key innovations: (i) graph-structured knowledge representation that explicitly captures entity relationships and domain hierarchies, (ii) efficient graph-based retrieval techniques that enable context-preserving knowledge retrieval with multihop reasoning ability, and (iii) structure-aware knowledge integration algorithms that leverage retrieved knowledge for accurate and logical coherent generation of LLMs. In this survey, we systematically analyze the technical foundations of GraphRAG and examine current implementations across various professional domains, identifying key technical challenges and promising research directions. All the related resources of GraphRAG, including research papers, open-source data, and projects, are collected for the community at this https URL.
zh
[NLP-49] Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)
【速读】: 该论文旨在解决医学教育中客观结构化临床考试(OSCE)评估的自动化问题,特别是针对医学生沟通技能的评分。传统的人工评分方法耗时且可能受到主观偏见的影响。为此,研究探索了利用大语言模型(LLMs)来自动化OSCE评估的潜力,具体采用了主面试评分量表(MIRS)作为评估工具。研究的关键在于比较了四种先进的LLMs(GPT-4o、Claude 3.5、Llama 3.1和Gemini 1.5 Pro)在不同提示技术(零样本、思维链、少样本和多步提示)下的表现,并通过与专家共识评分的对比,评估了模型在MIRS所有28个项目上的准确性。结果表明,LLMs在自动化评估中表现出一定的可行性,尤其是在思维链、少样本和多步提示技术的支持下,模型的表现得到了显著提升。这一研究为未来临床沟通技能的自动化评估奠定了基础。
链接: https://arxiv.org/abs/2501.13957
作者: Jadon Geathers,Yann Hicke,Colleen Chan,Niroop Rajashekar,Justin Sewell,Susannah Cornes,Rene Kizilcec,Dennis Shung
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures (+3 figures in supplementary appendix)
点击查看摘要
Abstract:Introduction. Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students’ communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias. This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS). Methods. We compared the performance of four state-of-the-art LLMs (GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro) in evaluating OSCE transcripts across all 28 items of the MIRS under the conditions of zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting. The models were benchmarked against a dataset of 10 OSCE cases with 174 expert consensus scores available. Model performance was measured using three accuracy metrics (exact, off-by-one, thresholded). Results. Averaging across all MIRS items and OSCE cases, LLMs performed with low exact accuracy (0.27 to 0.44), and moderate to high off-by-one accuracy (0.67 to 0.87) and thresholded accuracy (0.75 to 0.88). A zero temperature parameter ensured high intra-rater reliability (α = 0.98 for GPT-4o). CoT, few-shot, and multi-step techniques proved valuable when tailored to specific assessment items. The performance was consistent across MIRS items independent of encounter phases and communication domains. Conclusion. We demonstrated the feasibility of AI-assisted OSCE evaluation and provided benchmarking of multiple LLMs across multiple prompt techniques. Our work provides a baseline performance assessment for LLMs that lays a foundation for future research in automated assessment of clinical communication skills.
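摘要中的三种准确率口径可按如下方式计算(最小示意;二值化阈值取3为假设):

```python
def osce_accuracies(pred: list, gold: list, threshold: int = 3):
    """按 exact / off-by-one / thresholded 三种口径计算准确率(示意)。

    threshold 为假设值: 将评分二值化为"达标/未达标"后再比较。
    """
    n = len(gold)
    exact = sum(p == g for p, g in zip(pred, gold)) / n
    off_by_one = sum(abs(p - g) <= 1 for p, g in zip(pred, gold)) / n
    thresholded = sum((p >= threshold) == (g >= threshold)
                      for p, g in zip(pred, gold)) / n
    return exact, off_by_one, thresholded

print(osce_accuracies([3, 4, 2, 5], [3, 5, 1, 2]))
# (0.25, 0.75, 0.75)
```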
zh
[NLP-50] Zep: A Temporal Knowledge Graph Architecture for Agent Memory
【速读】: 该论文旨在解决现有基于大语言模型(LLM)的检索增强生成(RAG)框架在动态知识整合方面的局限性。现有框架主要局限于静态文档检索,而企业应用需要从多种来源(如持续对话和业务数据)动态整合知识。为此,论文提出了Zep,一种新型内存层服务,其核心组件Graphiti是一个时间感知的知识图谱引擎(temporally-aware knowledge graph engine),能够动态合成非结构化对话数据和结构化业务数据,同时保持历史关系。Zep在Deep Memory Retrieval (DMR) 基准测试中表现优异(94.8% vs 93.4%),并在更具挑战性的LongMemEval基准测试中进一步验证了其能力,特别是在跨会话信息合成和长期上下文维护等企业关键任务中表现出色,显著提升了准确性并降低了响应延迟。
链接: https://arxiv.org/abs/2501.13956
作者: Preston Rasmussen,Pavlo Paliychuk,Travis Beauvais,Jack Ryan,Daniel Chalef
机构: Zep AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 12 pages, 3 tables
点击查看摘要
Abstract:We introduce Zep, a novel memory layer service for AI agents that outperforms the current state-of-the-art system, MemGPT, in the Deep Memory Retrieval (DMR) benchmark. Additionally, Zep excels in more comprehensive and challenging evaluations than DMR that better reflect real-world enterprise use cases. While existing retrieval-augmented generation (RAG) frameworks for large language model (LLM)-based agents are limited to static document retrieval, enterprise applications demand dynamic knowledge integration from diverse sources including ongoing conversations and business data. Zep addresses this fundamental limitation through its core component Graphiti – a temporally-aware knowledge graph engine that dynamically synthesizes both unstructured conversational data and structured business data while maintaining historical relationships. In the DMR benchmark, which the MemGPT team established as their primary evaluation metric, Zep demonstrates superior performance (94.8% vs 93.4%). Beyond DMR, Zep’s capabilities are further validated through the more challenging LongMemEval benchmark, which better reflects enterprise use cases through complex temporal reasoning tasks. In this evaluation, Zep achieves substantial results with accuracy improvements of up to 18.5% while simultaneously reducing response latency by 90% compared to baseline implementations. These results are particularly pronounced in enterprise-critical tasks such as cross-session information synthesis and long-term context maintenance, demonstrating Zep’s effectiveness for deployment in real-world applications.
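时间感知知识图谱的关键是边携带有效期,使历史关系被保留而非覆盖;下面是数据模型层面的极简示意(字段设计为假设,并非Graphiti的真实模式):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TemporalEdge:
    src: str
    rel: str
    dst: str
    valid_from: float
    valid_to: Optional[float] = None      # None 表示当前仍然有效

edges = [
    TemporalEdge("alice", "works_at", "AcmeCo", valid_from=1.0, valid_to=5.0),
    TemporalEdge("alice", "works_at", "BetaInc", valid_from=5.0),
]

def facts_at(t: float):
    """返回在时间 t 仍然成立的事实, 支持对历史状态的时间推理。"""
    return [(e.src, e.rel, e.dst) for e in edges
            if e.valid_from <= t and (e.valid_to is None or t < e.valid_to)]

print(facts_at(2.0))   # [('alice', 'works_at', 'AcmeCo')]
print(facts_at(6.0))   # [('alice', 'works_at', 'BetaInc')]
```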
zh
[NLP-51] Guided Persona-based AI Surveys: Can we replicate personal mobility preferences at scale using LLM s?
【速读】: 该论文试图解决传统调查方法在成本高、效率低和可扩展性方面的局限性,特别是在研究德国个人移动偏好时。论文提出了一种利用大语言模型(LLMs)生成人工调查数据的新方法,通过引入“人物角色”(Personas)——即结合人口统计和行为属性的组合——来生成合成数据。该方法与其他五种合成调查方法进行了比较,这些方法在使用真实世界数据和方法复杂性方面有所不同。研究以德国2017年移动调查(MiD 2017)数据集为基准,评估了合成数据与真实世界模式的吻合度。结果表明,LLMs能够有效捕捉人口统计属性与偏好之间的复杂依赖关系,同时提供了探索假设情景的灵活性。这一方法为交通规划和社会科学研究提供了可扩展、成本效益高且保护隐私的数据生成途径。
链接: https://arxiv.org/abs/2501.13955
作者: Ioannis Tzachristas,Santhanakrishnan Narayanan,Constantinos Antoniou
机构: Chair of Transportation Systems Engineering, TUM School of Engineering and Design, TUM, Germany(德国慕尼黑工业大学工程与设计学院交通系统工程系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:This study explores the potential of Large Language Models (LLMs) to generate artificial surveys, with a focus on personal mobility preferences in Germany. By leveraging LLMs for synthetic data creation, we aim to address the limitations of traditional survey methods, such as high costs, inefficiency and scalability challenges. A novel approach incorporating “Personas” - combinations of demographic and behavioural attributes - is introduced and compared to five other synthetic survey methods, which vary in their use of real-world data and methodological complexity. The MiD 2017 dataset, a comprehensive mobility survey in Germany, serves as a benchmark to assess the alignment of synthetic data with real-world patterns. The results demonstrate that LLMs can effectively capture complex dependencies between demographic attributes and preferences while offering flexibility to explore hypothetical scenarios. This approach presents valuable opportunities for transportation planning and social science research, enabling scalable, cost-efficient and privacy-preserving data generation.
zh
[NLP-52] Chat3GPP: An Open-Source Retrieval-Augmented Generation Framework for 3GPP Documents
【速读】: 该论文旨在解决全球电信领域中3GPP(第三代合作伙伴计划)文档的复杂性和频繁更新带来的挑战,这些文档内容庞大且复杂,对工程师和研究人员的理解和应用造成了显著困难。为了解决这一问题,论文提出了Chat3GPP,一个专门为3GPP规范设计的开源检索增强生成(RAG)框架。该框架的关键在于结合了分块策略、混合检索和高效索引方法,能够在不进行领域特定微调的情况下,高效检索相关信息并生成准确的用户查询响应。Chat3GPP的灵活性和可扩展性使其不仅适用于3GPP标准,还具备适应其他技术标准的潜力。通过在两套电信特定数据集上的评估,Chat3GPP展示了其在协议生成和代码自动化等下游任务中的优越性能。
链接: https://arxiv.org/abs/2501.13954
作者: Long Huang,Ming Zhao,Limin Xiao,Xiujun Zhang,Jungang Hu
机构: Dept of Electronic Engineering, Tsinghua University (清华大学电子工程系); Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:The 3rd Generation Partnership Project (3GPP) documents is key standards in global telecommunications, while posing significant challenges for engineers and researchers in the telecommunications field due to the large volume and complexity of their contents as well as the frequent updates. Large language models (LLMs) have shown promise in natural language processing tasks, but their general-purpose nature limits their effectiveness in specific domains like telecommunications. To address this, we propose Chat3GPP, an open-source retrieval-augmented generation (RAG) framework tailored for 3GPP specifications. By combining chunking strategies, hybrid retrieval and efficient indexing methods, Chat3GPP can efficiently retrieve relevant information and generate accurate responses to user queries without requiring domain-specific fine-tuning, which is both flexible and scalable, offering significant potential for adapting to other technical standards beyond 3GPP. We evaluate Chat3GPP on two telecom-specific datasets and demonstrate its superior performance compared to existing methods, showcasing its potential for downstream tasks like protocol generation and code automation.
zh
[NLP-53] Redundancy Principles for MLLMs Benchmarks
【速读】: 该论文旨在解决多模态大语言模型(Multi-modality Large Language Models, MLLMs)评估中存在的冗余问题。随着MLLMs的快速迭代和领域需求的不断变化,每年产生的基准测试数量激增,导致显著的冗余现象。论文从三个关键角度分析了冗余问题:1)基准测试能力维度的冗余,2)测试问题数量的冗余,以及3)特定领域内跨基准测试的冗余。通过对数百个MLLMs在超过20个基准测试上的表现进行综合分析,论文试图量化现有MLLM评估中的冗余程度,并为未来MLLM基准测试的开发提供有价值的见解。解决方案的关键在于提出有针对性的原则,以构建有效的MLLM基准测试,并通过策略优化和解决冗余问题。
链接: https://arxiv.org/abs/2501.13953
作者: Zicheng Zhang,Xiangyu Zhao,Xinyu Fang,Chunyi Li,Xiaohong Liu,Xiongkuo Min,Haodong Duan,Kai Chen,Guangtao Zhai
机构: Shanghai AI Lab(上海人工智能实验室); Shanghai Jiao Tong University(上海交通大学); Zhejiang University(浙江大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through the comprehensive analysis over hundreds of MLLMs’ performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy lies in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.
zh
[NLP-54] The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility?
【速读】: 该论文试图解决大型语言模型(LLMs)在拒绝有害请求以确保安全性与满足合法请求以保持实用性之间的平衡问题。这一伦理与实用性的权衡是当前LLMs面临的关键挑战之一。论文提出了一种基于直接偏好优化(Direct Preference Optimization, DPO)的对齐框架,通过在化学领域的应用中验证其有效性。该框架的核心在于采用了一个GPT辅助的三阶段数据生成方案,生成了包含31.6k个三元组实例的化学问答数据集LibraChemQA。通过在数据生成过程中引入创新的平衡种子,该框架系统性地考虑了合法与非法的请求。此外,框架还引入了重述机制以实现高效的数据增强,从而提升模型的化学理解能力。实验结果表明,该框架在综合考虑安全性和实用性的整体性能上取得了显著提升,其生成的模型LibraChem在基准测试中分别领先于Claude-3、GPT-4o和LLaMA-3模型13.44%、7.16%和7.10%。
链接: https://arxiv.org/abs/2501.13952
作者: Yiyi Zhang,Xingyu Chen,Kexin Chen,Yuyang Du,Xilin Dang,Pheng-Ann Heng
机构: Department of Computer Science and Engineering, The Chinese University of Hong Kong (香港中文大学计算机科学与工程系); School of Mechanical Engineering, Shanghai Jiao Tong University (上海交通大学机械工程学院); Department of Information Engineering, The Chinese University of Hong Kong (香港中文大学信息工程系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent years have witnessed extensive efforts to enhance Large Language Models (LLMs) across various domains, alongside growing attention to their ethical implications. However, a critical challenge remains largely overlooked: LLMs must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance by addressing this ethical-utility trade-off, using chemical domain applications as a proof-of-concept. Our alignment pipeline starts with a GPT-assisted three-phase data generation scheme, in which we create LibraChemQA, a chemical question-answering dataset comprising 31.6k triplet instances. By incorporating an innovative balanced seed in the data generation process, our framework systematically considers both legitimate and illegitimate requests. The framework also introduces a rephrasing mechanism for efficient data augmentation that enhances the model’s chemical comprehension. We further develop a novel hybrid evaluation scheme with LLM judges for precise assessment of both safety and utility. Experimental results demonstrate our model’s substantial improvements in overall performance where both safety and utility are considered - our resulting model, LibraChem, outperforms leading LLMs including Claude-3, GPT-4o, and LLaMA-3 by margins of 13.44%, 7.16%, and 7.10% respectively on our released benchmark.
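该框架基于DPO做对齐;作为背景,下面给出标准DPO损失的最小PyTorch写法(批内张量为假设输入,与论文的化学领域数据构造无关):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen: torch.Tensor, pi_rejected: torch.Tensor,
             ref_chosen: torch.Tensor, ref_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """标准 DPO 损失: pi_*/ref_* 为策略/参考模型对偏好与被拒回复的对数概率。"""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

loss = dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
```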
zh
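上文 LibraChem 的对齐流程以 DPO 为核心目标函数。下面给出标准 DPO 损失的一个最小可运行示意(PyTorch,玩具数值;论文中的三阶段数据生成、平衡种子与混合评测并不包含在内):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # 策略模型与参考模型在(偏好, 非偏好)回答上的对数概率差之差
    logits = beta * ((logp_chosen - logp_rejected)
                     - (ref_logp_chosen - ref_logp_rejected))
    return -F.logsigmoid(logits).mean()

# 玩具数值:两条 (prompt, chosen, rejected) 三元组的序列对数概率
logp_c = torch.tensor([-12.3, -8.7])
logp_r = torch.tensor([-15.1, -9.9])
ref_c = torch.tensor([-13.0, -9.0])
ref_r = torch.tensor([-14.0, -9.5])
print(dpo_loss(logp_c, logp_r, ref_c, ref_r).item())
```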
[NLP-55] A Layered Multi-Expert Framework for Long-Context Mental Health Assessments
【速读】: 该论文试图解决大型语言模型(LLMs)在处理长篇幅心理健康评估时出现的幻觉(hallucinations)或推理不一致的问题。这些问题在涉及复杂、领域特定的上下文时尤为突出。论文提出的解决方案是堆叠多模型推理(Stacked Multi-Model Reasoning, SMMR),这是一种分层框架,通过结合多个LLMs和专门的小型模型作为平等的“专家”来协同工作。早期层次负责隔离和处理短小的离散子任务,而后期层次则通过更高级的长上下文模型对这些部分输出进行整合和优化。通过在DAIC-WOZ抑郁症筛查数据集和48个经过筛选的带有精神病诊断的案例研究上进行评估,SMMR在准确性、F1分数和PHQ-8误差减少方面均表现出优于单一模型的基线性能。该框架通过利用多样化的“第二意见”,有效减少了幻觉现象,捕捉了细微的临床差异,并提高了高风险心理健康评估的可靠性。研究结果强调了多专家框架在构建更可信赖的AI驱动筛查系统中的价值。
链接: https://arxiv.org/abs/2501.13951
作者: Jinwen Tang,Qiming Guo,Wenbo Sun,Yi Shang
机构: University of Missouri, Columbia, Missouri, USA (密苏里大学哥伦比亚分校); Texas A&M University-Corpus Christi, Corpus Christi, Texas, USA (德克萨斯A&M大学科珀斯克里斯蒂分校); Delft University of Technology, Delft, Netherlands (代尔夫特理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Long-form mental health assessments pose unique challenges for large language models (LLMs), which often exhibit hallucinations or inconsistent reasoning when handling extended, domain-specific contexts. We introduce Stacked Multi-Model Reasoning (SMMR), a layered framework that leverages multiple LLMs and specialized smaller models as coequal ‘experts’. Early layers isolate short, discrete subtasks, while later layers integrate and refine these partial outputs through more advanced long-context models. We evaluate SMMR on the DAIC-WOZ depression-screening dataset and 48 curated case studies with psychiatric diagnoses, demonstrating consistent improvements over single-model baselines in terms of accuracy, F1-score, and PHQ-8 error reduction. By harnessing diverse ‘second opinions’, SMMR mitigates hallucinations, captures subtle clinical nuances, and enhances reliability in high-stakes mental health assessments. Our findings underscore the value of multi-expert frameworks for more trustworthy AI-driven screening.
zh
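为直观理解"早期层拆解子任务、后期层长上下文整合"的分层思路,下面给出一个高度简化的流程骨架(纯占位实现:split_subtasks、small_expert、long_context_expert 均为假设的函数名,真实系统中对应具体模型调用,而非论文实现):

```python
# SMMR 分层思想的假设性示意:小专家先处理短子任务,长上下文模型再整合
def split_subtasks(transcript):
    # 假设:按空行切分为短子任务;真实系统可能按话题或问题切分
    return [p.strip() for p in transcript.split("\n\n") if p.strip()]

def small_expert(subtask):
    # 占位:此处应调用小型专用模型(如症状/情绪抽取器)
    return "[finding] " + subtask[:40]

def long_context_expert(partials):
    # 占位:此处应调用长上下文 LLM 做整合与精炼
    return "SUMMARY: " + " | ".join(partials)

transcript = ("Patient reports low mood for three weeks.\n\n"
              "Sleep is disturbed and appetite reduced.")
print(long_context_expert([small_expert(t) for t in split_subtasks(transcript)]))
```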
[NLP-56] Can OpenAI o1 Reason Well in Ophthalmology? A 6990-Question Head-to-Head Evaluation Study
【速读】: 该论文旨在评估OpenAI o1与其他大型语言模型(LLMs)在处理眼科特定问题时的性能和推理能力。研究使用了来自MedMCQA的6,990个眼科问题,对OpenAI o1和五个其他LLMs进行了评估。研究发现,OpenAI o1在准确性(0.88)和宏F1得分上表现最佳,但在基于文本生成指标的推理能力上排名第三。在不同子主题中,o1在“晶状体”和“青光眼”方面表现最佳,但在“角膜和外部疾病”、“玻璃体和视网膜”以及“眼整形和眼眶疾病”方面略逊于GPT-4o。子组分析显示,o1在具有较长真实解释的查询上表现更好。研究结果表明,o1的推理增强可能并未完全扩展到眼科领域,强调了在眼科等专业领域进行特定领域优化的必要性。
链接: https://arxiv.org/abs/2501.13949
作者: Sahana Srinivasan,Xuguang Ai,Minjie Zou,Ke Zou,Hyunjae Kim,Thaddaeus Wai Soon Lo,Krithi Pushpanathan,Yiming Kong,Anran Li,Maxwell Singer,Kai Jin,Fares Antaki,David Ziyou Chen,Dianbo Liu,Ron A. Adelman,Qingyu Chen,Yih Chung Tham
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 44 pages
点击查看摘要
Abstract:Question: What is the performance and reasoning ability of OpenAI o1 compared to other large language models in addressing ophthalmology-specific questions? Findings: This study evaluated OpenAI o1 and five LLMs using 6,990 ophthalmological questions from MedMCQA. O1 achieved the highest accuracy (0.88) and macro-F1 score but ranked third in reasoning capabilities based on text-generation metrics. Across subtopics, o1 ranked first in "Lens" and "Glaucoma" but second to GPT-4o in "Corneal and External Diseases", "Vitreous and Retina" and "Oculoplastic and Orbital Diseases". Subgroup analyses showed o1 performed better on queries with longer ground truth explanations. Meaning: O1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized fields like ophthalmology.
zh
[NLP-57] Longitudinal Abuse and Sentiment Analysis of Hollywood Movie Dialogues using LLMs
【速读】: 该论文旨在探讨好莱坞电影中滥用和暴力内容的普遍性及其随时间的变化趋势。研究通过使用大型语言模型(LLMs)对1950年至2024年间的好莱坞奥斯卡提名电影和票房大片对白进行纵向的情感分析和滥用内容检测。关键解决方案在于利用经过微调的LLMs对超过一千部电影的四种类型(genres)字幕进行分析,以揭示过去七十年间电影对白中情感和滥用内容的趋势和变化。研究结果表明,电影对白中的情感倾向多样化,滥用内容的检测也显示出显著波动,尤其是近几十年来滥用内容逐渐增加,反映了社会规范和监管政策的变化。
链接: https://arxiv.org/abs/2501.13948
作者: Rohitash Chandra,Guoxiang Ren,Group-H
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Over the past decades, there has been an increasing concern about the prevalence of abusive and violent content in Hollywood movies. This study uses Large Language Models (LLMs) to explore the longitudinal abuse and sentiment analysis of Hollywood Oscar and blockbuster movie dialogues from 1950 to 2024. By employing fine-tuned LLMs, we analyze subtitles for over a thousand movies categorised into four genres to examine the trends and shifts in emotional and abusive content over the past seven decades. Our findings reveal significant temporal changes in movie dialogues, which reflect broader social and cultural influences. Overall, the emotional tendencies in the films are diverse, and the detection of abusive content also exhibits significant fluctuations. The results show a gradual rise in abusive content in recent decades, reflecting social norms and regulatory policy changes. Genres such as thrillers still present a higher frequency of abusive content that emphasises the ongoing narrative role of violence and conflict. At the same time, underlying positive emotions such as humour and optimism remain prevalent in most of the movies. Furthermore, the gradual increase of abusive content in movie dialogues has been significant over the last two decades, where Oscar-nominated movies overtook the top ten blockbusters.
zh
[NLP-58] A Comprehensive Survey on Integrating Large Language Models with Knowledge-Based Methods
【速读】: 该论文旨在探讨如何将大语言模型(Large Language Models, LLMs)与结构化知识库系统(structured knowledge-based systems)进行有效整合,以提升人工智能的能力。论文通过综合文献综述,分析了LLMs与知识库之间的协同作用,重点关注实际应用中的技术、操作和伦理挑战。解决方案的关键在于结合LLMs的生成式语言理解能力与结构化知识库的精确知识表示,从而实现数据情境化(data contextualization)的改进、模型准确性的提升以及知识资源的更好利用。论文通过识别关键问题、评估现有解决方案,并提出了可操作的建议,为AI技术的进一步发展和实际应用提供了重要参考。
链接: https://arxiv.org/abs/2501.13947
作者: Lilian Some,Wenli Yang,Michael Bain,Byeong Kang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The rapid development of artificial intelligence has brought about substantial advancements in the field. One promising direction is the integration of Large Language Models (LLMs) with structured knowledge-based systems. This approach aims to enhance AI capabilities by combining the generative language understanding of LLMs with the precise knowledge representation of structured systems. This survey explores the synergy between LLMs and knowledge bases, focusing on real-world applications and addressing associated technical, operational, and ethical challenges. Through a comprehensive literature review, the study identifies critical issues and evaluates existing solutions. The paper highlights the benefits of integrating generative AI with knowledge bases, including improved data contextualization, enhanced model accuracy, and better utilization of knowledge resources. The findings provide a detailed overview of the current state of research, identify key gaps, and offer actionable recommendations. These insights contribute to advancing AI technologies and support their practical deployment across various sectors.
zh
[NLP-59] Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks
【速读】: 该论文试图解决生成式 AI 模型中存在的幻觉(hallucinations)问题,这一问题严重影响了 AI 系统的可信度和可靠性。论文的核心解决方案是通过协调多个专门的人工智能代理(Artificial Intelligent Agents),利用自然语言处理(NLP)技术实现代理之间的无缝交互,从而有效减少幻觉现象。具体而言,研究设计了一个多级代理处理流程:前端代理接收并处理数百个专门设计的诱导幻觉的提示(prompts),随后由第二级和第三级代理分别使用不同的大语言模型(large language models)和定制策略来检测未经验证的主张、加入明确的免责声明并澄清推测性内容。此外,研究还引入了一套新的关键绩效指标(KPIs),用于评估幻觉得分水平,并由第四级 AI 代理进行详细评估,确保幻觉相关行为的准确量化。整个系统的核心是基于 OVON(Open Voice Network)框架,通过结构化的 JSON 消息在代理之间传递上下文信息,确保每个代理能够评估幻觉的可能性并解释可疑内容的原因,从而在不丢失上下文的情况下优化文本输出。研究结果表明,通过多个专门代理的协同工作,结合基于 NLP 的代理框架,能够在幻觉缓解方面取得显著成效,进而增强 AI 系统的可信度。
链接: https://arxiv.org/abs/2501.13946
作者: Diego Gosmar,Deborah A. Dahl
机构: XCALLY; Linux Foundation AI & Data (Linux 基金会 AI 与数据); Conversational Technologies (对话技术)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 18 pages, 6 figures
点击查看摘要
Abstract:Hallucinations remain a significant challenge in current Generative AI models, undermining trust in AI systems and their reliability. This study investigates how orchestrating multiple specialized Artificial Intelligent Agents can help mitigate such hallucinations, with a focus on systems leveraging Natural Language Processing (NLP) to facilitate seamless agent interactions. To achieve this, we design a pipeline that introduces over three hundred prompts, purposefully crafted to induce hallucinations, into a front-end agent. The outputs are then systematically reviewed and refined by second- and third-level agents, each employing distinct large language models and tailored strategies to detect unverified claims, incorporate explicit disclaimers, and clarify speculative content. Additionally, we introduce a set of novel Key Performance Indicators (KPIs) specifically designed to evaluate hallucination score levels. A dedicated fourth-level AI agent is employed to evaluate these KPIs, providing detailed assessments and ensuring accurate quantification of shifts in hallucination-related behaviors. A core component of this investigation is the use of the OVON (Open Voice Network) framework, which relies on universal NLP-based interfaces to transfer contextual information among agents. Through structured JSON messages, each agent communicates its assessment of the hallucination likelihood and the reasons underlying questionable content, thereby enabling the subsequent stage to refine the text without losing context. The results demonstrate that employing multiple specialized agents capable of interoperating with each other through NLP-based agentic frameworks can yield promising outcomes in hallucination mitigation, ultimately bolstering trust within the AI community.
zh
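摘要提到各级代理通过结构化 JSON 消息传递幻觉评估与理由。下面是一条假设性的消息示例(字段名如 hallucination_likelihood、suggested_action 均为示意,并非 OVON 的官方 schema):

```python
import json

# 假设性的代理间消息:每级代理附上幻觉可能性评估与理由,
# 供下一级在不丢失上下文的情况下精炼文本
message = {
    "conversation_id": "demo-001",
    "sender": "agent-level-2",
    "utterance": "The drug was approved in 2019.",
    "assessment": {
        "hallucination_likelihood": 0.72,
        "reasons": ["无法核实审批年份", "缺少来源引用"],
        "suggested_action": "加入免责声明,并将该句标注为推测性内容",
    },
}
print(json.dumps(message, ensure_ascii=False, indent=2))
```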
[NLP-60] Self-Explanation in Social AI Agents
【速读】: 该论文试图解决社交AI助手在与社区成员互动时如何增强透明度和信任的问题。具体来说,社交AI助手需要能够解释其行为和决策过程,以便学习者能够理解其工作原理并建立信任。解决方案的关键在于使用自省(introspection)方法,通过构建一个功能模型(functional model)来捕捉AI助手的行为模式,并利用Chain of Thought(思维链)和ChatGPT生成自我解释。这种方法通过反思AI助手的自我模型,生成关于其功能和工作原理的解释,从而增强透明度和信任。论文还评估了这些自我解释的完整性和正确性,并报告了在实际课堂中的部署情况。
链接: https://arxiv.org/abs/2501.13945
作者: Rhea Basappa,Mustafa Tekman,Hong Lu,Benjamin Faught,Sandeep Kakar,Ashok K. Goel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Extended version of the paper published in International Conference on Intelligent Tutoring Systems, pages 351-360, 2024, Springer. Images corrected, and live deployment, ablation, and precision study results added
点击查看摘要
Abstract:Social AI agents interact with members of a community, thereby changing the behavior of the community. For example, in online learning, an AI social assistant may connect learners and thereby enhance social interaction. These social AI assistants too need to explain themselves in order to enhance transparency and trust with the learners. We present a method of self-explanation that uses introspection over a self-model of an AI social assistant. The self-model is captured as a functional model that specifies how the methods of the agent use knowledge to achieve its tasks. The process of generating self-explanations uses Chain of Thought to reflect on the self-model and ChatGPT to provide explanations about its functioning. We evaluate the self-explanation of the AI social assistant for completeness and correctness. We also report on its deployment in a live class.
zh
[NLP-61] Fanar: An Arabic-Centric Multimodal Generative AI Platform
【速读】: 该论文旨在解决阿拉伯语为中心的多模态生成式 AI(Generative AI)系统的开发问题,特别是针对语言、语音和图像生成任务。解决方案的关键在于 Fanar 平台,该平台的核心是 Fanar Star 和 Fanar Prime 两个高性能的阿拉伯语大语言模型(Large Language Models, LLMs)。Fanar Star 是一个 70 亿参数的模型,从头开始训练,使用了近 1 万亿个经过清理和去重的阿拉伯语、英语和代码 token。Fanar Prime 则是一个 90 亿参数的模型,基于 Gemma-2 9B 基础模型,并在相同的 1 万亿 token 数据集上进行了持续训练。这两个模型通过一个定制的编排器(orchestrator)透明地处理不同类型的提示。此外,Fanar 平台还提供了定制化的伊斯兰检索增强生成(Retrieval Augmented Generation, RAG)系统,用于处理宗教相关的提示,以及一个用于总结预训练数据截止日期后发生的最新事件的 Recency RAG 系统。平台还具备双语语音识别、语音和图像生成等认知能力,并提供了用于验证生成内容真实性的归因服务。Fanar 的设计、开发和实施由哈马德·本·哈利法大学的卡塔尔计算研究所(QCRI)完成,并得到了卡塔尔通信和信息技术部的支持,旨在推动主权 AI 技术的发展。
链接: https://arxiv.org/abs/2501.13944
作者: Fanar Team:Ummar Abbas,Mohammad Shahmeer Ahmad,Firoj Alam,Enes Altinisik,Ehsannedin Asgari,Yazan Boshmaf,Sabri Boughorbel,Sanjay Chawla,Shammur Chowdhury,Fahim Dalvi,Kareem Darwish,Nadir Durrani,Mohamed Elfeky,Ahmed Elmagarmid,Mohamed Eltabakh,Masoomali Fatehkia,Anastasios Fragkopoulos,Maram Hasanain,Majd Hawasly,Mus’ab Husaini,Soon-Gyo Jung,Ji Kim Lucas,Walid Magdy,Safa Messaoud,Abubakr Mohamed,Tasnim Mohiuddin,Basel Mousi,Hamdy Mubarak,Ahmad Musleh,Zan Naeem,Mourad Ouzzani,Dorde Popovic,Amin Sadeghi,Husrev Taha Sencar,Mohammed Shinoy,Omar Sinan,Yifan Zhang,Ahmed Ali,Yassine El Kheir,Xiaosong Ma,Chaoyi Ruan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We present Fanar, a platform for Arabic-centric multimodal generative AI systems, that supports language, speech and image generation tasks. At the heart of Fanar are Fanar Star and Fanar Prime, two highly capable Arabic Large Language Models (LLMs) that are best in the class on well established benchmarks for similar sized models. Fanar Star is a 7B (billion) parameter model that was trained from scratch on nearly 1 trillion clean and deduplicated Arabic, English and Code tokens. Fanar Prime is a 9B parameter model continually trained on the Gemma-2 9B base model on the same 1 trillion token set. Both models are concurrently deployed and designed to address different types of prompts transparently routed through a custom-built orchestrator. The Fanar platform provides many other capabilities including a customized Islamic Retrieval Augmented Generation (RAG) system for handling religious prompts, a Recency RAG for summarizing information about current or recent events that have occurred after the pre-training data cut-off date. The platform provides additional cognitive capabilities including in-house bilingual speech recognition that supports multiple Arabic dialects, voice and image generation that is fine-tuned to better reflect regional characteristics. Finally, Fanar provides an attribution service that can be used to verify the authenticity of fact based generated content. The design, development, and implementation of Fanar was entirely undertaken at Hamad Bin Khalifa University’s Qatar Computing Research Institute (QCRI) and was sponsored by Qatar’s Ministry of Communications and Information Technology to enable sovereign AI technology development.
zh
[NLP-62] Language Representation Favored Zero-Shot Cross-Domain Cognitive Diagnosis
【速读】: 该论文试图解决现有认知诊断模型(Cognitive Diagnosis Models, CDMs)在跨领域应用中的局限性问题。现有模型通常依赖于特定领域的ID嵌入(ID embeddings),导致在不同目标领域(如不同学科或教育平台)中无法直接应用,必须为每个领域训练特定模型。为解决这一问题,论文提出了基于语言表示的零样本跨领域认知诊断模型(Language Representation favored zero-shot Cross-domain Cognitive Diagnosis, LRCD)。其关键解决方案包括:首先,通过分析不同领域中学生、习题和概念的行为模式,并使用文本描述来刻画这些实体的特征;其次,利用先进的文本嵌入模块将这些描述转换为统一语言空间中的向量;最后,通过语言-认知映射器(language-cognitive mappers)学习从语言空间到认知诊断空间的映射,从而将这些特征与现有认知诊断模型高效集成。实验表明,LRCD在跨领域任务中表现出色,甚至在某些情况下能与在目标领域全量数据上训练的经典认知诊断模型相媲美。
链接: https://arxiv.org/abs/2501.13943
作者: Shuo Liu,Zihan Zhou,Yuanhao Liu,Jing Zhang,Hong Qian
机构: Shanghai Institute of AI Education, and School of Computer Science and Technology, East China Normal University (上海人工智能教育研究院,计算机科学与技术学院,华东师范大学); Department of Educational Psychology, Faculty of Education, East China Normal University (教育心理学系,教育学院,华东师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Cognitive diagnosis aims to infer students’ mastery levels based on their historical response logs. However, existing cognitive diagnosis models (CDMs), which rely on ID embeddings, often have to train specific models on specific domains. This limitation may hinder their directly practical application in various target domains, such as different subjects (e.g., Math, English and Physics) or different education platforms (e.g., ASSISTments, Junyi Academy and Khan Academy). To address this issue, this paper proposes the language representation favored zero-shot cross-domain cognitive diagnosis (LRCD). Specifically, LRCD first analyzes the behavior patterns of students, exercises and concepts in different domains, and then describes the profiles of students, exercises and concepts using textual descriptions. Via recent advanced text-embedding modules, these profiles can be transformed to vectors in the unified language space. Moreover, to address the discrepancy between the language space and the cognitive diagnosis space, we propose language-cognitive mappers in LRCD to learn the mapping from the former to the latter. Then, these profiles can be easily and efficiently integrated and trained with existing CDMs. Extensive experiments show that training LRCD on real-world datasets can achieve commendable zero-shot performance across different target domains, and in some cases, it can even achieve competitive performance with some classic CDMs trained on the full response data on target domains. Notably, we surprisingly find that LRCD can also provide interesting insights into the differences between various subjects (such as humanities and sciences) and sources (such as primary and secondary education).
zh
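语言-认知映射器的思路可以用一个很小的 PyTorch 模块来示意:输入是文本嵌入模型产出的画像向量,输出落在认知诊断空间(下面的网络结构与维度 768/64 均为假设,并非论文实现):

```python
import torch
import torch.nn as nn

# 语言-认知映射器的最小示意:文本嵌入空间 -> 认知诊断空间
class LanguageCognitiveMapper(nn.Module):
    def __init__(self, text_dim=768, cog_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, cog_dim))

    def forward(self, text_emb):
        return self.net(text_emb)

mapper = LanguageCognitiveMapper()
profile_emb = torch.randn(4, 768)   # 假设:文本嵌入模型对画像描述的输出
print(mapper(profile_emb).shape)    # torch.Size([4, 64]),可接入现有 CDM
```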
[NLP-63] GaussMark: A Practical Approach for Structural Watermarking of Language Models
【速读】: 该论文旨在解决大型语言模型(LLMs)生成的文本在应用中难以区分是否由人类撰写的问题,特别是在需要确保文本来源可信的场景中。现有的水印技术(watermarking techniques)在生成延迟、检测时间、文本质量下降或鲁棒性方面存在不足,主要原因是这些技术通常基于令牌级别(token-level)的水印,忽略了文本的固有结构。论文提出了一种新的水印方案——GaussMark,其关键创新在于通过在高斯独立性测试(Gaussian independence testing)的基础上,向LLM的权重中添加少量高斯噪声(Gaussian noise),从而在模型权重中嵌入结构性水印(structural watermark)。这种方法不仅实现简单高效,且具有形式化的统计保证,不会增加生成延迟,同时保持了模型质量。实验表明,GaussMark在面对插入、删除、替换和往返翻译等干扰时表现出较强的鲁棒性,且几乎不会影响模型性能。
链接: https://arxiv.org/abs/2501.13941
作者: Adam Block,Ayush Sekhari,Alexander Rakhlin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent advances in Large Language Models (LLMs) have led to significant improvements in natural language processing tasks, but their ability to generate human-quality text raises significant ethical and operational concerns in settings where it is important to recognize whether or not a given text was generated by a human. Thus, recent work has focused on developing techniques for watermarking LLM-generated text, i.e., introducing an almost imperceptible signal that allows a provider equipped with a secret key to determine if given text was generated by their model. Current watermarking techniques are often not practical due to concerns with generation latency, detection time, degradation in text quality, or robustness. Many of these drawbacks come from the focus on token-level watermarking, which ignores the inherent structure of text. In this work, we introduce a new scheme, GaussMark, that is simple and efficient to implement, has formal statistical guarantees on its efficacy, comes at no cost in generation latency, and embeds the watermark into the weights of the model itself, providing a structural watermark. Our approach is based on Gaussian independence testing and is motivated by recent empirical observations that minor additive corruptions to LLM weights can result in models of identical (or even improved) quality. We show that by adding a small amount of Gaussian noise to the weights of a given LLM, we can watermark the model in a way that is statistically detectable by a provider who retains the secret key. We provide formal statistical bounds on the validity and power of our procedure. Through an extensive suite of experiments, we demonstrate that GaussMark is reliable, efficient, and relatively robust to corruptions such as insertions, deletions, substitutions, and roundtrip translations and can be instantiated with essentially no loss in model quality.
zh
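下面用一个简化的权重空间示意说明"密钥种子高斯噪声 + 统计检验"的思路:嵌入时按密钥加噪,检测时用同一密钥重建噪声并计算相关性 z 统计量。注意这只是对统计直觉的演示,论文中的实际检测作用于模型生成的文本,而非直接读取权重:

```python
import torch

def embed_watermark(w, key, sigma=1e-2):
    # 用密钥作为随机种子生成高斯噪声并加到权重上
    g = torch.Generator().manual_seed(key)
    return w + sigma * torch.randn(w.shape, generator=g)

def detect_z(w, key, sigma=1e-2):
    g = torch.Generator().manual_seed(key)
    noise = sigma * torch.randn(w.shape, generator=g)
    # H0(无水印)下 <w, noise> 近似 N(0, ||noise||^2 * Var(w)),z 大则检出
    return ((w * noise).sum() / (noise.norm() * w.std())).item()

w = torch.randn(1024, 1024)
print(detect_z(w, key=42))                       # 接近 0
print(detect_z(embed_watermark(w, 42), key=42))  # 约 sigma * 1024 ≈ 10
```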
[NLP-64] Evaluating Computational Accuracy of Large Language Models in Numerical Reasoning Tasks for Healthcare Applications
【速读】: 该论文试图解决大型语言模型(LLMs)在医疗领域中的数值推理能力不足的问题,特别是在高风险的临床应用中。数值推理在医疗应用中至关重要,直接影响患者结果、治疗计划和资源分配。论文通过评估基于GPT-3架构的LLM在数值推理任务中的计算准确性,探讨了其在医疗环境中的表现。解决方案的关键包括:1)使用精心策划的包含1000个数值问题的数据集,涵盖剂量计算和实验室结果解释等真实场景;2)采用提示工程(prompt engineering)、事实核查管道(fact-checking pipelines)和正则化技术(regularization techniques)来提高模型的准确性和泛化能力;3)通过精度(precision)、召回率(recall)和F1分数(F1-score)等关键指标评估模型效能。研究结果表明,模型在简单数值任务中表现较好,但在多步推理任务中面临挑战,且事实核查管道的集成显著提高了准确性。该研究为开发可靠、可解释且与医疗情境相关的AI工具提供了重要见解。
链接: https://arxiv.org/abs/2501.13936
作者: Arjun R. Malghan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages, 1 figure, 2 tables
点击查看摘要
Abstract:Large Language Models (LLMs) have emerged as transformative tools in the healthcare sector, demonstrating remarkable capabilities in natural language understanding and generation. However, their proficiency in numerical reasoning, particularly in high-stakes domains like in clinical applications, remains underexplored. Numerical reasoning is critical in healthcare applications, influencing patient outcomes, treatment planning, and resource allocation. This study investigates the computational accuracy of LLMs in numerical reasoning tasks within healthcare contexts. Using a curated dataset of 1,000 numerical problems, encompassing real-world scenarios such as dosage calculations and lab result interpretations, the performance of a refined LLM based on the GPT-3 architecture was evaluated. The methodology includes prompt engineering, integration of fact-checking pipelines, and application of regularization techniques to enhance model accuracy and generalization. Key metrics such as precision, recall, and F1-score were utilized to assess the model’s efficacy. The results indicate an overall accuracy of 84.10%, with improved performance in straightforward numerical tasks and challenges in multi-step reasoning. The integration of a fact-checking pipeline improved accuracy by 11%, underscoring the importance of validation mechanisms. This research highlights the potential of LLMs in healthcare numerical reasoning and identifies avenues for further refinement to support critical decision-making in clinical environments. The findings aim to contribute to the development of reliable, interpretable, and contextually relevant AI tools for healthcare.
zh
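对数值型答案的判定通常需要容差而非严格相等。下面是一个最小示意:按相对误差容差统计准确率(1% 容差与示例题目均为假设,非论文原始评测脚本):

```python
# 数值型答案评测的最小示意:相对误差在容差内即判为正确
def is_correct(pred, gold, rel_tol=0.01):
    return abs(pred - gold) <= rel_tol * max(abs(gold), 1e-9)

cases = [
    ("Dose for 70 kg at 5 mg/kg?", 350.0, 350.0),   # (题目, 金标准, 预测)
    ("Estimated creatinine clearance?", 62.5, 60.0),
]
acc = sum(is_correct(p, g) for _, g, p in cases) / len(cases)
print(f"accuracy = {acc:.0%}")  # 50%
```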
计算机视觉
[CV-0] HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
【速读】:该论文试图解决现有驾驶世界模型(Driving World Models, DWMs)在自动驾驶场景中仅能进行场景生成而缺乏场景理解能力的问题。现有模型无法对驾驶环境进行解释和推理,限制了其在复杂驾驶场景中的应用。为此,论文提出了一种名为HERMES的统一驾驶世界模型,通过一个统一的框架将3D场景理解与未来场景演化(生成)无缝集成。解决方案的关键在于利用鸟瞰图(Bird’s-Eye View, BEV)表示来整合多视角空间信息,同时保持几何关系和交互的完整性。此外,HERMES引入了世界查询(world queries),通过大语言模型(Large Language Model, LLM)中的因果注意力机制将世界知识融入BEV特征中,从而增强了对场景的理解和生成能力。实验结果表明,HERMES在nuScenes和OmniDrive-nuScenes数据集上取得了最先进的性能,生成误差减少了32.4%,理解指标如CIDEr提升了8.0%。
链接: https://arxiv.org/abs/2501.14729
作者: Xin Zhou,Dingkang Liang,Sifan Tu,Xiwu Chen,Yikang Ding,Dingyuan Zhang,Feiyang Tan,Hengshuang Zhao,Xiang Bai
机构: Huazhong University of Science and Technology(华中科技大学); MEGVII Technology(旷视科技); Mach Drive(马赫驱动); The University of Hong Kong(香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress. The code will be available at this https URL
点击查看摘要
Abstract:Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird’s-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model (LLM), enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at this https URL.
zh
[CV-1] Relightable Full-Body Gaussian Codec Avatars
【速读】:该论文旨在解决可重光照全身虚拟角色(relightable full-body avatars)建模中的挑战,特别是在身体姿态变化导致的大变形及其对光照传输(light transport)影响的情况下。关键挑战在于身体姿态的变化会显著改变身体表面相对于光源的方向,从而导致局部光照传输函数的变化以及身体部分之间的遮挡引起的非局部变化。为解决这一问题,论文提出了将光照传输分解为局部和非局部效应的解决方案。局部外观变化通过可学习的区域谐波(zonal harmonics)来建模,这些谐波在身体关节运动时能够高效旋转,从而在局部坐标系中解耦局部辐射传输与身体关节运动。对于非局部外观变化,论文引入了一个阴影网络(shadow network),该网络基于预计算的入射辐照度预测阴影,从而促进身体部分之间非局部阴影的学习。此外,论文还采用了延迟着色(deferred shading)方法来建模镜面辐射传输,以更好地捕捉反射和高光效果(如眼睛的闪光)。通过这些方法,论文成功实现了在新型光照条件和未见姿态下具有优异泛化能力的可重光照全身虚拟角色建模。
链接: https://arxiv.org/abs/2501.14726
作者: Shaofei Wang,Tomas Simon,Igor Santesteban,Timur Bagautdinov,Junxuan Li,Vasu Agrawal,Fabian Prada,Shoou-I Yu,Pace Nalbone,Matt Gramlich,Roman Lubachersky,Chenglei Wu,Javier Romero,Jason Saragih,Michael Zollhoefer,Andreas Geiger,Siyu Tang,Shunsuke Saito
机构: ETH Zürich (苏黎世联邦理工学院); Codec Avatars Lab, Meta (Meta); University of Tübingen (蒂宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 14 pages, 9 figures. Project page: this https URL
点击查看摘要
Abstract:We propose Relightable Full-Body Gaussian Codec Avatars, a new approach for modeling relightable full-body avatars with fine-grained details including face and hands. The unique challenge for relighting full-body avatars lies in the large deformations caused by body articulation and the resulting impact on appearance caused by light transport. Changes in body pose can dramatically change the orientation of body surfaces with respect to lights, resulting in both local appearance changes due to changes in local light transport functions, as well as non-local changes due to occlusion between body parts. To address this, we decompose the light transport into local and non-local effects. Local appearance changes are modeled using learnable zonal harmonics for diffuse radiance transfer. Unlike spherical harmonics, zonal harmonics are highly efficient to rotate under articulation. This allows us to learn diffuse radiance transfer in a local coordinate frame, which disentangles the local radiance transfer from the articulation of the body. To account for non-local appearance changes, we introduce a shadow network that predicts shadows given precomputed incoming irradiance on a base mesh. This facilitates the learning of non-local shadowing between the body parts. Finally, we use a deferred shading approach to model specular radiance transfer and better capture reflections and highlights such as eye glints. We demonstrate that our approach successfully models both the local and non-local light transport required for relightable full-body avatars, with a superior generalization ability under novel illumination conditions and unseen poses.
zh
[CV-2] Approach to Designing CV Systems for Medical Applications: Data Architecture and AI
【速读】:该论文旨在解决传统眼底图像分析中依赖特定诊断预测的局限性问题。传统的筛查方法通常侧重于预测特定疾病,而本文提出的创新软件系统则通过全面分析眼底结构的正常和病理特征,模拟临床诊断过程,将最终决策权交还给医疗专业人员。该系统的关键解决方案在于其独特的架构设计,结合了最先进的深度学习方法和传统计算机视觉算法,提供了对眼底结构的全面和细致分析。通过这种自动化且增强的临床工作流程,系统不仅提升了眼底图像检查的客观性和效率,还为医疗应用的设计提供了一种新的方法论。
链接: https://arxiv.org/abs/2501.14689
作者: Dmitry Ryabtsev,Boris Vasilyev,Sergey Shershakov
机构: HSE University (俄罗斯高等经济大学); Utrecht University (乌得勒支大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures
点击查看摘要
Abstract:This paper introduces an innovative software system for fundus image analysis that deliberately diverges from the conventional screening approach, opting not to predict specific diagnoses. Instead, our methodology mimics the diagnostic process by thoroughly analyzing both normal and pathological features of fundus structures, leaving the ultimate decision-making authority in the hands of healthcare professionals. Our initiative addresses the need for objective clinical analysis and seeks to automate and enhance the clinical workflow of fundus image examination. The system, from its overarching architecture to the modular analysis design powered by artificial intelligence (AI) models, aligns seamlessly with ophthalmological practices. Our unique approach utilizes a combination of state-of-the-art deep learning methods and traditional computer vision algorithms to provide a comprehensive and nuanced analysis of fundus structures. We present a distinctive methodology for designing medical applications, using our system as an illustrative example. Comprehensive verification and validation results demonstrate the efficacy of our approach in revolutionizing fundus image analysis, with potential applications across various medical domains.
zh
[CV-3] Surface Vision Mamba: Leveraging Bidirectional State Space Model for Efficient Spherical Manifold Representation
【速读】:该论文旨在解决基于注意力机制(attention-based methods)的模型在处理球面(spherical surfaces)数据时存在的推理时间长和内存消耗高的问题,特别是在计算资源有限的情况下应用于大规模数据集时。为了解决这一问题,作者提出了一种无注意力机制的视觉曼巴(Vision Mamba, Vim)模型,并将其应用于球面数据,称为球面视觉曼巴(Surface Vision Mamba, SiM)。该模型通过将球面数据表示为从细分二十面体(icosphere)中提取的三角形面片序列来实现表面分块(surface patching)。SiM在多个神经发育表型回归任务中表现出色,相较于基于注意力的球面视觉变换器(Surface Vision Transformer, SiT),推理速度提高了4.8倍,内存消耗降低了91.7%。实验结果表明,SiM不仅性能优越,还能有效识别细微的认知发育模式。
链接: https://arxiv.org/abs/2501.14679
作者: Rongzhao He,Weihao Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Attention-based methods have demonstrated exceptional performance in modelling long-range dependencies on spherical cortical surfaces, surpassing traditional Geometric Deep Learning (GDL) models. However, their extensive inference time and high memory demands pose challenges for application to large datasets with limited computing resources. Inspired by the state space model in computer vision, we introduce the attention-free Vision Mamba (Vim) to spherical surfaces, presenting a domain-agnostic architecture for analyzing data on spherical manifolds. Our method achieves surface patching by representing spherical data as a sequence of triangular patches derived from a subdivided icosphere. The proposed Surface Vision Mamba (SiM) is evaluated on multiple neurodevelopmental phenotype regression tasks using cortical surface metrics from neonatal brains. Experimental results demonstrate that SiM outperforms both attention- and GDL-based methods, delivering 4.8 times faster inference and achieving 91.7% lower memory consumption compared to the Surface Vision Transformer (SiT) under the Ico-4 grid partitioning. Sensitivity analysis further underscores the potential of SiM to identify subtle cognitive developmental patterns. The code is available at this https URL.
zh
[CV-4] MatAnyone: Stable Video Matting with Consistent Memory Propagation
【速读】:该论文旨在解决无辅助(auxiliary-free)人类视频抠图方法在处理复杂或模糊背景时表现不佳的问题。为了解决这一问题,作者提出了MatAnyone框架,其关键解决方案包括以下几个方面:首先,基于记忆(memory-based)的范式,引入了一种通过区域自适应记忆融合(region-adaptive memory fusion)实现的一致性记忆传播模块,该模块能够自适应地整合前一帧的记忆信息,从而在核心区域保持语义稳定性,并在物体边界处保留细粒度细节。其次,作者提出了一个更大规模、高质量且多样化的视频抠图数据集,以支持鲁棒训练。此外,还引入了一种新颖的训练策略,有效利用大规模分割数据,进一步提升抠图的稳定性。通过这些新的网络设计、数据集和训练策略,MatAnyone在多样化的现实场景中实现了鲁棒且准确的视频抠图效果,超越了现有方法。
链接: https://arxiv.org/abs/2501.14677
作者: Peiqing Yang,Shangchen Zhou,Jixin Zhao,Qingyi Tao,Chen Change Loy
机构: S-Lab, Nanyang Technological University (南洋理工大学); SenseTime Research, Singapore (商汤科技研究院, 新加坡)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To address this, we propose MatAnyone, a robust framework tailored for target-assigned video matting. Specifically, building on a memory-based paradigm, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively integrates memory from the previous frame. This ensures semantic stability in core regions while preserving fine-grained details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, boosting matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust and accurate video matting results in diverse real-world scenarios, outperforming existing methods.
zh
[CV-5] Towards Unified Structured Light Optimization
【速读】:该论文旨在解决结构化光(Structured Light, SL)三维重建中投影模式优化的两个主要局限性:一是每个场景需要单独训练校准参数,二是优化仅限于特定类型的结构化光,限制了其应用范围。为解决这些问题,论文提出了一种统一的优化框架,能够适应不同的光照条件、物体类型和结构化光类型。该框架通过仅使用一张投影图像快速确定最佳投影模式。关键贡献包括一种新颖的全局匹配方法,用于实现投影仪与相机的精确对齐,以及一种新的投影补偿模型,配备光度调整模块,以减少由于色域裁剪引起的伪影。实验结果表明,该方法在各种物体、结构化光模式和光照条件下均表现出优越的解码精度,显著优于现有方法。
链接: https://arxiv.org/abs/2501.14659
作者: Tinglei Wan,Tonghua Su,Zhongjie Wang
机构: Harbin Institute of Technology(哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Structured light (SL) 3D reconstruction captures the precise surface shape of objects, providing high-accuracy 3D data essential for industrial inspection and robotic vision systems. However, current research on optimizing projection patterns in SL 3D reconstruction faces two main limitations: each scene requires separate training of calibration parameters, and optimization is restricted to specific types of SL, which restricts their application range. To tackle these limitations, we present a unified framework for SL optimization, adaptable to diverse lighting conditions, object types, and different types of SL. Our framework quickly determines the optimal projection pattern using only a single projected image. Key contributions include a novel global matching method for projectors, enabling precise projector-camera alignment with just one projected image, and a new projection compensation model with a photometric adjustment module to reduce artifacts from out-of-gamut clipping. Experimental results show our method achieves superior decoding accuracy across various objects, SL patterns, and lighting conditions, significantly outperforming previous methods.
zh
[CV-6] SyncAnimation: A Real-Time End-to-End Framework for Audio-Driven Human Pose and Talking Head Animation
【速读】:该论文旨在解决音频驱动的说话头像生成(talking avatar driven by audio)中的关键挑战,包括高计算成本、面部细节和真实感不足,以及面部表情与上半身运动在静默期间的一致性等问题。现有方法通常难以同时满足高实时性和视觉质量的要求。论文提出的解决方案SyncAnimation,首次基于神经辐射场(NeRF)技术,通过结合广义的音频到姿态匹配(audio-to-pose matching)和音频到表情同步(audio-to-expression synchronization),实现了稳定且实时的说话头像生成。其关键技术包括AudioPose Syncer和AudioEmotion Syncer,分别用于高精度的姿态和表情生成,逐步生成与音频同步的上半身、头部和唇形。此外,高同步人体渲染器(High-Synchronization Human Renderer)确保了头部与上半身的无缝整合,并实现了音频同步的唇形生成。
链接: https://arxiv.org/abs/2501.14646
作者: Yujian Liu,Shidang Xu,Jing Guo,Dingbin Wang,Zairan Wang,Xianfeng Tan,Xiaoli Liu
机构: AiShiWeiLai AI Research, Beijing, China (艾什未来 AI 研究); South China University of Technology, Guangzhou, China (华南理工大学); Beijing Institute of Technology, Beijing, China (北京理工大学); Beijing University of Posts and Telecommunications, Beijing, China (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
点击查看摘要
Abstract:Generating talking avatar driven by audio remains a significant challenge. Existing methods typically require high computational costs and often lack sufficient facial detail and realism, making them unsuitable for applications that demand high real-time performance and visual quality. Additionally, while some methods can synchronize lip movement, they still face issues with consistency between facial expressions and upper body movement, particularly during silent periods. In this paper, we introduce SyncAnimation, the first NeRF-based method that achieves audio-driven, stable, and real-time generation of speaking avatar by combining generalized audio-to-pose matching and audio-to-expression synchronization. By integrating AudioPose Syncer and AudioEmotion Syncer, SyncAnimation achieves high-precision poses and expression generation, progressively producing audio-synchronized upper body, head, and lip shapes. Furthermore, the High-Synchronization Human Renderer ensures seamless integration of the head and upper body, and achieves audio-sync lip. The project page can be found at this https URL
zh
[CV-7] ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
【速读】:该论文试图解决的是基于文本描述的视频目标分割(Referring Video Object Segmentation, RVOS)任务中,现有模型在处理复杂对象描述时表现不佳的问题。这是由于现有模型在视频-语言理解方面的能力有限。为解决这一问题,作者提出了ReferDINO模型,该模型继承了预训练的视觉-语言基础模型的强大理解能力,并进一步增强了时间理解和目标分割能力。解决方案的关键在于三个技术创新:1)对象一致性时间增强器(object-consistent temporal enhancer),利用预训练的对象-文本表示来增强时间理解和对象一致性;2)基于定位的可变形掩码解码器(grounding-guided deformable mask decoder),通过整合文本和定位条件生成精确的对象掩码;3)置信度感知的查询剪枝策略(confidence-aware query pruning strategy),在不影响性能的情况下显著提高对象解码效率。通过这些创新,ReferDINO在五个公开的RVOS基准测试中显著优于现有方法。
链接: https://arxiv.org/abs/2501.14607
作者: Tianming Liang,Kun-Yu Lin,Chaolei Tan,Jianguo Zhang,Wei-Shi Zheng,Jian-Fang Hu
机构: Sun Yat-sen University(中山大学); The University of Hong Kong(香港大学); The Hong Kong University of Science and Technology(香港科技大学); Southern University of Science and Technology(南方科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. Despite notable progress in recent years, current RVOS models remain struggle to handle complicated object descriptions due to their limited video-language understanding. To address this limitation, we present \textbfReferDINO, an end-to-end RVOS model that inherits strong vision-language understanding from the pretrained visual grounding foundation models, and is further endowed with effective temporal understanding and object segmentation capabilities. In ReferDINO, we contribute three technical innovations for effectively adapting the foundation models to RVOS: 1) an object-consistent temporal enhancer that capitalizes on the pretrained object-text representations to enhance temporal understanding and object consistency; 2) a grounding-guided deformable mask decoder that integrates text and grounding conditions to generate accurate object masks; 3) a confidence-aware query pruning strategy that significantly improves the object decoding efficiency without compromising performance. We conduct extensive experiments on five public RVOS benchmarks to demonstrate that our proposed ReferDINO outperforms state-of-the-art methods significantly. Project page: \urlthis https URL
zh
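置信度感知的查询剪枝本质上是按分数取 top-k。下面给出一个极简示意(keep_ratio=0.25 为假设值,真实模型中的置信度来自解码头的预测):

```python
import torch

# 按置信度只保留 top-k 个对象查询,再送入后续掩码解码
def prune_queries(queries, scores, keep_ratio=0.25):
    k = max(1, int(queries.size(0) * keep_ratio))
    idx = torch.topk(scores, k).indices
    return queries[idx], idx

queries = torch.randn(100, 256)   # (查询数, 特征维度)
scores = torch.rand(100)          # 每个查询的置信度
kept, idx = prune_queries(queries, scores)
print(kept.shape)                 # torch.Size([25, 256])
```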
[CV-8] 3DLabelProp: Geometric-Driven Domain Generalization for LiDAR Semantic Segmentation in Autonomous Driving
【速读】:该论文试图解决的是在自动驾驶领域中,LiDAR感知模型在面对训练和推理数据集之间存在显著领域偏移(domain shift)时的性能保持问题。这一问题尤其重要,因为自动驾驶模型需要具备鲁棒性,且训练成本较高。论文提出的解决方案关键是一种基于几何的方法,称为3DLabelProp,该方法利用了LiDAR传感器的序列结构,与文献中常见的学习方法不同。通过这种方法,论文在LiDAR语义分割(LiDAR Semantic Segmentation, LSS)任务中进行了广泛的实验,证明了其在七个数据集上的优越性能,超越了其他领域泛化方法。
链接: https://arxiv.org/abs/2501.14605
作者: Jules Sanchez,Jean-Emmanuel Deschaud,François Goulette
机构: Centre for Robotics, Mines Paris - PSL, PSL University, Paris, France (巴黎矿业学院 - PSL, PSL大学, 法国巴黎); U2IS, ENSTA Paris, Institut Polytechnique de Paris, Palaiseau, France (ENSTA巴黎, 巴黎综合理工学院, 法国帕莱索)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Domain generalization aims to find ways for deep learning models to maintain their performance despite significant domain shifts between training and inference datasets. This is particularly important for models that need to be robust or are costly to train. LiDAR perception in autonomous driving is impacted by both of these concerns, leading to the emergence of various approaches. This work addresses the challenge by proposing a geometry-based approach, leveraging the sequential structure of LiDAR sensors, which sets it apart from the learning-based methods commonly found in the literature. The proposed method, called 3DLabelProp, is applied on the task of LiDAR Semantic Segmentation (LSS). Through extensive experimentation on seven datasets, it is demonstrated to be a state-of-the-art approach, outperforming both naive and other domain generalization methods.
zh
[CV-9] Geometric Mean Improves Loss For Few-Shot Learning
【速读】:该论文试图解决小样本学习(Few-shot Learning, FSL)中的挑战,即在仅有少量标注样本的情况下,模型如何进行有效的判别分类。小样本学习要求模型能够在特征空间中学习到一个具有良好泛化能力的度量(metric),以便即使面对新类别的样本,也能通过少量标注样本构建有效的分类器。
解决方案的关键在于提出了一种基于几何平均(geometric mean)的新型小样本学习损失函数。与传统的基于算术平均(arithmetic mean)的softmax损失函数不同,该方法利用几何平均来聚合样本间的成对关系,从而增强跨类别的判别度量。该损失函数不仅形式简洁,而且通过理论分析揭示了其在学习特征度量方面的优势,特别是在小样本学习任务中表现出色。实验结果表明,该方法在少样本图像分类任务中具有竞争力。
链接: https://arxiv.org/abs/2501.14593
作者: Tong Wu,Takumi Kobayashi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Few-shot learning (FSL) is a challenging task in machine learning, demanding a model to render discriminative classification by using only a few labeled samples. In the literature of FSL, deep models are trained in a manner of metric learning to provide metric in a feature space which is well generalizable to classify samples of novel classes; in the space, even a few amount of labeled training examples can construct an effective classifier. In this paper, we propose a novel FSL loss based on \emphgeometric mean to embed discriminative metric into deep features. In contrast to the other losses such as utilizing arithmetic mean in softmax-based formulation, the proposed method leverages geometric mean to aggregate pair-wise relationships among samples for enhancing discriminative metric across class categories. The proposed loss is not only formulated in a simple form but also is thoroughly analyzed in theoretical ways to reveal its favorable characteristics which are favorable for learning feature metric in FSL. In the experiments on few-shot image classification tasks, the method produces competitive performance in comparison to the other losses.
zh
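摘要未给出损失的具体表达式;下面仅以一般性记号对比两种聚合方式(纯数学事实,非论文公式的复述):

```latex
\[
\mathrm{AM}(s_1,\dots,s_N) = \frac{1}{N}\sum_{i=1}^{N} s_i,
\qquad
\mathrm{GM}(s_1,\dots,s_N) = \Big(\prod_{i=1}^{N} s_i\Big)^{1/N}
  = \exp\!\Big(\frac{1}{N}\sum_{i=1}^{N}\log s_i\Big).
\]
```

由 AM-GM 不等式可知 GM ≤ AM,且任一 s_i 趋近 0 都会把 GM 拉向 0;因此用几何平均聚合成对相似度时,少数困难样本对会获得更大的梯度权重,促使跨类别的判别度量整体变好,而非被易分样本平均掉。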
[CV-10] Visual Localization via Semantic Structures in Autonomous Photovoltaic Power Plant Inspection
【速读】:该论文旨在解决利用配备热成像相机(thermal cameras)的无人机(UAVs)对光伏(PV)电站进行自动化巡检时的精确定位问题。自动化巡检的挑战在于需要无人机从最佳距离和视角捕获图像,以确保检测的准确性。论文提出了一种新颖的定位流程,该流程将光伏组件的检测与无人机导航直接集成,从而实现巡检过程中的精确定位。解决方案的关键在于通过视觉可识别的锚点(anchor points)进行初始关联,并利用目标跟踪(object tracking)来识别全局关联。此外,论文还提出了三种基于传统计算机视觉、深度学习及其融合的光伏组件视觉分割方法,并评估了这些方法在定位流程中的性能。通过使用定制的空中巡检数据集进行验证和评估,证明了这些方法的鲁棒性和实时导航的适用性。同时,论文还评估了光伏电站模型精度对定位方法的影响。
链接: https://arxiv.org/abs/2501.14587
作者: Viktor Kozák,Karel Košnar,Jan Chudoba,Miroslav Kulich,Libor Přeučil
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 47 pages, 22 figures
点击查看摘要
Abstract:Inspection systems utilizing unmanned aerial vehicles (UAVs) equipped with thermal cameras are increasingly popular for the maintenance of photovoltaic (PV) power plants. However, automation of the inspection task is a challenging problem as it requires precise navigation to capture images from optimal distances and viewing angles. This paper presents a novel localization pipeline that directly integrates PV module detection with UAV navigation, allowing precise positioning during inspection. Detections are used to identify the power plant structures in the image and associate these with the power plant model. We define visually recognizable anchor points for the initial association and use object tracking to discern global associations. We present three distinct methods for visual segmentation of PV modules based on traditional computer vision, deep learning, and their fusion, and we evaluate their performance in relation to the proposed localization pipeline. The presented methods were verified and evaluated using custom aerial inspection data sets, demonstrating their robustness and applicability for real-time navigation. Additionally, we evaluate the influence of the power plant model’s precision on the localization methods.
zh
[CV-11] Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding ICLR2025
【速读】:该论文试图解决在医学影像解读中,现有基于对比语言-图像预训练(CLIP)的方法通常将整个图像与放射报告进行对比,忽略了影像区域与报告句子之间的局部关联,从而可能影响模型性能和互操作性的问题。为解决这一问题,论文提出了一种细粒度的视觉-语言模型(fVLM),用于解剖级别的CT影像解读。其关键解决方案在于显式地将CT影像的解剖区域与放射报告中的对应描述进行匹配,并对每个解剖区域单独进行对比预训练。此外,针对细粒度对齐中存在的假阴性挑战(主要来自大量解剖级别的健康样本和相似的病变异常),论文提出通过识别正常和异常样本中的假阴性,并从患者级别到疾病感知配对进行对比学习的校准。实验结果表明,fVLM在54个主要疾病诊断任务中的零样本分类任务中,平均AUC达到81.3%,显著优于CLIP和监督学习方法。
链接: https://arxiv.org/abs/2501.14548
作者: Zhongyi Shui,Jianpeng Zhang,Weiwei Cao,Sinuo Wang,Ruizhe Guo,Le Lu,Lin Yang,Xianghua Ye,Tingbo Liang,Qi Zhang,Ling Zhang
机构: DAMO Academy, Alibaba Group(阿里巴巴集团达摩院); The First Affiliated Hospital of College of Medicine, Zhejiang University, China(浙江大学医学院附属第一医院); Zhejiang University, China(浙江大学); Westlake University, China(西湖大学); Hupan Lab, 310023, China(湖畔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICLR 2025
点击查看摘要
Abstract:Artificial intelligence (AI) shows great potential in assisting radiologists to improve the efficiency and accuracy of medical image interpretation and diagnosis. However, a versatile AI model requires large-scale data and comprehensive annotations, which are often impractical in medical settings. Recent studies leverage radiology reports as a naturally high-quality supervision for medical images, using contrastive language-image pre-training (CLIP) to develop language-informed models for radiological image interpretation. Nonetheless, these approaches typically contrast entire images with reports, neglecting the local associations between imaging regions and report sentences, which may undermine model performance and interoperability. In this paper, we propose a fine-grained vision-language model (fVLM) for anatomy-level CT image interpretation. Specifically, we explicitly match anatomical regions of CT images with corresponding descriptions in radiology reports and perform contrastive pre-training for each anatomy individually. Fine-grained alignment, however, faces considerable false-negative challenges, mainly from the abundance of anatomy-level healthy samples and similarly diseased abnormalities. To tackle this issue, we propose identifying false negatives of both normal and abnormal samples and calibrating contrastive learning from patient-level to disease-aware pairing. We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients, and conducted a comprehensive evaluation of 54 major and important disease diagnosis tasks across 15 main anatomies. Experimental results demonstrate the substantial potential of fVLM in versatile medical image interpretation. In the zero-shot classification task, we achieved an average AUC of 81.3% on 54 diagnosis tasks, surpassing CLIP and supervised methods by 12.9% and 8.0%, respectively.
zh
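逐解剖区域单独做对比预训练,每个区域内部仍是标准的对称 InfoNCE。下面给出一个最小示意(特征维度与批大小为假设;论文的解剖配对构建与假阴性校准未包含在内):

```python
import torch
import torch.nn.functional as F

# 标准对称 InfoNCE:同一解剖区域的图像特征与报告句子特征互为正样本
def info_nce(img_feat, txt_feat, t=0.07):
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / t
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

liver_img = torch.randn(8, 512)  # 假设:某解剖区域(如肝脏)的图像特征
liver_txt = torch.randn(8, 512)  # 假设:报告中对应句子的文本特征
print(info_nce(liver_img, liver_txt).item())
```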
[CV-12] Leveraging ChatGPT's Multimodal Vision Capabilities to Rank Satellite Images by Poverty Level: Advancing Tools for Social Science Research
【速读】:该论文探讨了如何利用具有视觉能力的大语言模型(Large Language Models, LLMs)分析卫星图像,以实现村级贫困预测。尽管LLMs最初设计用于自然语言理解,但其在多模态任务(如地理空间分析)中的适应性为数据驱动研究开辟了新领域。论文的关键解决方案在于利用视觉增强的LLMs,评估其从卫星图像中提供可解释、可扩展且可靠的贫困洞察的能力。通过成对比较方法,研究证明ChatGPT能够根据贫困水平对卫星图像进行排序,其准确性与领域专家相当。这一发现不仅展示了LLMs在社会经济研究中的潜力,也为其在贫困评估工作流程中的整合奠定了基础,为大规模、低成本的贫困监测提供了新途径。
链接: https://arxiv.org/abs/2501.14546
作者: Hamid Sarmadi,Ola Hall,Thorsteinn Rögnvaldsson,Mattias Ohlsson
机构: Center for Applied Intelligent Systems Research (CAISR), Halmstad University (哈尔姆斯塔德大学); Department of Human Geography, Lund University (隆德大学); Center for Environmental and Climate Science, Lund University (隆德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper investigates the novel application of Large Language Models (LLMs) with vision capabilities to analyze satellite imagery for village-level poverty prediction. Although LLMs were originally designed for natural language understanding, their adaptability to multimodal tasks, including geospatial analysis, has opened new frontiers in data-driven research. By leveraging advancements in vision-enabled LLMs, we assess their ability to provide interpretable, scalable, and reliable insights into human poverty from satellite images. Using a pairwise comparison approach, we demonstrate that ChatGPT can rank satellite images based on poverty levels with accuracy comparable to domain experts. These findings highlight both the promise and the limitations of LLMs in socioeconomic research, providing a foundation for their integration into poverty assessment workflows. This study contributes to the ongoing exploration of unconventional data sources for welfare analysis and opens pathways for cost-effective, large-scale poverty monitoring.
zh
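成对比较得到的是局部偏序,需要再汇总成全序。下面用最简单的胜场计数来示意这一步(compare() 为占位函数,真实流程中是一次带图像输入的 ChatGPT 调用):

```python
from itertools import combinations

def compare(img_a, img_b):
    # 占位:应返回"看起来更贫困"的图像 id;此处仅用字典序代替
    return min(img_a, img_b)

images = ["village_03", "village_01", "village_02"]
wins = {img: 0 for img in images}
for a, b in combinations(images, 2):
    wins[compare(a, b)] += 1

print(sorted(images, key=wins.get, reverse=True))
# 胜场多者在前;若需更稳健,可再拟合 Bradley-Terry 等成对比较模型
```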
[CV-13] Rethinking Encoder-Decoder Flow Through Shared Structures
【速读】:该论文试图解决密集预测任务(dense prediction tasks)中解码器架构相对简单的问题。尽管编码器架构的复杂性不断增加,但解码器仍然主要依赖于逐个解码中间特征图的独立模块。论文提出了一种称为“banks”的共享结构,每个解码块都可以利用这些结构来在解码过程中提供额外的上下文信息。通过重采样(resampling)和特征融合(feature fusion)的方式应用这些结构,论文展示了在自然图像和合成图像上进行深度估计时,基于transformer的最先进架构的性能得到了提升。解决方案的关键在于引入了共享的banks结构,增强了解码过程中的上下文信息,从而提高了模型的性能。
链接: https://arxiv.org/abs/2501.14535
作者: Frederik Laboyrie,Mehmet Kerim Yucel,Albert Saa-Garriga
机构: Samsung R&D Institute UK (SRUK)(三星研发英国研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Dense prediction tasks have enjoyed a growing complexity of encoder architectures, decoders, however, have remained largely the same. They rely on individual blocks decoding intermediate feature maps sequentially. We introduce banks, shared structures that are used by each decoding block to provide additional context in the decoding process. These structures, through applying them via resampling and feature fusion, improve performance on depth estimation for state-of-the-art transformer-based architectures on natural and synthetic images whilst training on large-scale datasets.
zh
[CV-14] Trick-GS: A Balanced Bag of Tricks for Efficient Gaussian Splatting ICASSP’25
【速读】:该论文旨在解决基于高斯泼溅(Gaussian Splatting, GS)的三维重建方法在计算资源受限设备(如智能手机)上的应用难题。传统GS方法虽然具有快速训练、推理速度和高精度重建的优势,但其重建结果通常包含数百万个高斯分布,导致其在计算资源受限的设备上难以高效运行。为此,论文提出了Trick-GS方法,其关键解决方案包括:(1)采用渐进式训练策略,结合分辨率、噪声和高斯尺度的调整;(2)通过学习剪枝和掩码技术,根据重要性对基元和球谐函数(SH)频带进行筛选;(3)引入加速的GS训练框架。实验结果表明,Trick-GS在三个数据集上实现了高达2倍的训练速度提升、40倍的磁盘空间缩减以及2倍的渲染速度提升,同时保持了与原始GS相当的精度。
链接: https://arxiv.org/abs/2501.14534
作者: Anil Armagan,Albert Saà-Garriga,Bruno Manganelli,Mateusz Nowak,Mehmet Kerim Yucel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICASSP’25
点击查看摘要
Abstract:Gaussian splatting (GS) for 3D reconstruction has become quite popular due to their fast training, inference speeds and high quality reconstruction. However, GS-based reconstructions generally consist of millions of Gaussians, which makes them hard to use on computationally constrained devices such as smartphones. In this paper, we first propose a principled analysis of advances in efficient GS methods. Then, we propose Trick-GS, which is a careful combination of several strategies including (1) progressive training with resolution, noise and Gaussian scales, (2) learning to prune and mask primitives and SH bands by their significance, and (3) accelerated GS training framework. Trick-GS takes a large step towards resource-constrained GS, where faster run-time, smaller and faster-convergence of models is of paramount concern. Our results on three datasets show that Trick-GS achieves up to 2x faster training, 40x smaller disk size and 2x faster rendering speed compared to vanilla GS, while having comparable accuracy.
zh
[CV-15] CheapNVS: Real-Time On-Device Narrow-Baseline Novel View Synthesis ICASSP2025
【速读】:该论文试图解决单视角新视角合成(Single-view Novel View Synthesis, NVS)这一具有挑战性的问题,由于其不适定性(ill-posed nature),传统方法通常需要大量计算资源和复杂的模型才能取得显著效果。论文提出了一种名为CheapNVS的全端到端解决方案,其关键在于采用了一种新颖且高效的多编码器/解码器设计,并通过多阶段训练实现。CheapNVS首先利用轻量级可学习模块近似复杂的3D图像扭曲(3D image warping),这些模块基于目标视角的相机姿态嵌入(camera pose embeddings)进行条件化处理;随后,并行地对遮挡区域进行修复(inpainting),从而显著提升性能。通过在Open Images数据集子集上的训练,CheapNVS在性能上超越了现有最先进方法,同时速度提升了10倍,内存消耗减少了6%,并且能够在移动设备上实时运行,达到超过30 FPS的帧率。
链接: https://arxiv.org/abs/2501.14533
作者: Konstantinos Georgiadis,Mehmet Kerim Yucel,Albert Saa-Garriga
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICASSP 2025
点击查看摘要
Abstract:Single-view novel view synthesis (NVS) is a notorious problem due to its ill-posed nature, and often requires large, computationally expensive approaches to produce tangible results. In this paper, we propose CheapNVS: a fully end-to-end approach for narrow baseline single-view NVS based on a novel, efficient multiple encoder/decoder design trained in a multi-stage fashion. CheapNVS first approximates the laborious 3D image warping with lightweight learnable modules that are conditioned on the camera pose embeddings of the target view, and then performs inpainting on the occluded regions in parallel to achieve significant performance gains. Once trained on a subset of Open Images dataset, CheapNVS outperforms the state-of-the-art despite being 10 times faster and consuming 6% less memory. Furthermore, CheapNVS runs comfortably in real-time on mobile devices, reaching over 30 FPS on a Samsung Tab 9+.
zh
[CV-16] Training-Free Style and Content Transfer by Leveraging U-Net Skip Connections in Stable Diffusion 2.*
【速读】:该论文试图解决扩散模型(diffusion models)在图像生成过程中内部潜在表示(latent representations)理解不足的问题。尽管近年来在图像生成方面取得了显著进展,但扩散模型的内部工作机制,尤其是其潜在表示的具体作用,仍然不够清晰。现有研究主要关注Stable Diffusion的U-Net架构中的瓶颈层(h-space)或利用交叉注意力(cross-attention)、自注意力(self-attention)或解码层(decoding layers)。本文提出的模型SkipInject则利用了U-Net的跳跃连接(skip connections),并通过深入分析发现,第三编码器块传递的残差连接(residual connections)携带了重建图像的大部分空间信息,能够将内容与风格分离。通过在该块中注入表示,SkipInject能够实现基于文本的编辑、精确修改和风格迁移(style transfer)。实验表明,该方法在内容对齐和结构保持的权衡上优于现有的最先进风格迁移和图像编辑方法。
链接: https://arxiv.org/abs/2501.14524
作者: Ludovica Schaerf,Andrea Alfarano,Fabrizio Silvestri,Leonardo Impett
机构: Max Planck Society (马克斯·普朗克学会); University of Zurich (苏黎世大学); Sapienza University of Rome (罗马大学); University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Despite significant recent advances in image generation with diffusion models, their internal latent representations remain poorly understood. Existing works focus on the bottleneck layer (h-space) of Stable Diffusion’s U-Net or leverage the cross-attention, self-attention, or decoding layers. Our model, SkipInject takes advantage of U-Net’s skip connections. We conduct thorough analyses on the role of the skip connections and find that the residual connections passed by the third encoder block carry most of the spatial information of the reconstructed image, splitting the content from the style. We show that injecting the representations from this block can be used for text-based editing, precise modifications, and style transfer. We compare our methods state-of-the-art style transfer and image editing methods and demonstrate that our method obtains the best content alignment and optimal structural preservation tradeoff.
zh
[CV-17] PARASIDE: An Automatic Paranasal Sinus Segmentation and Structure Analysis Tool for MRI
【速读】:该论文试图解决慢性鼻窦炎(Chronic Rhinosinusitis, CRS)在临床评估中由于主观性而难以准确评估的问题。CRS是一种常见且持续的鼻窦炎症,影响5-12%的普通人群,显著影响患者的生活质量。为了解决这一问题,作者引入了PARASIDE,一种自动化工具,用于在T1 MRI中分割上颌窦(sinus maxillaris)、额窦(sinus frontalis)、蝶窦(sinus sphenodalis)和筛窦(sinus ethmoidalis)的气腔和软组织体积。通过这种分割,可以量化以往仅能通过手动和主观方式观察到的特征关系。该工具的关键在于其能够自动分割16个鼻部结构,并计算医学相关特征,如Lund-Mackay评分,从而提供客观的量化数据,帮助临床医生更准确地评估CRS。
Link: https://arxiv.org/abs/2501.14514
Authors: Hendrik Möller, Lukas Krautschick, Matan Atad, Robert Graf, Chia-Jung Busch, Achim Beule, Christian Scharf, Lars Kaderali, Bjoern Menze, Daniel Rueckert, Jan Kirschke, Fabian Schwitzing
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Chronic rhinosinusitis (CRS) is a common and persistent sinus inflammation that affects 5-12% of the general population. It significantly impacts quality of life and is often difficult to assess due to its subjective nature in clinical evaluation. We introduce PARASIDE, an automatic tool for segmenting air and soft tissue volumes of the structures of the sinus maxillaris, frontalis, sphenodalis and ethmoidalis in T1 MRI. By utilizing that segmentation, we can quantify feature relations that have been observed only manually and subjectively before. We performed an exemplary study and showed both volume and intensity relations between structures and radiology reports. While the soft tissue segmentation is good, the automated annotations of the air volumes are excellent. The average intensity over air structures are consistently below those of the soft tissues, close to perfect separability. Healthy subjects exhibit lower soft tissue volumes and lower intensities. Our developed system is the first automated whole nasal segmentation of 16 structures, and capable of calculating medical relevant features such as the Lund-Mackay score.
[CV-18] Deep-BrownConrady: Prediction of Camera Calibration and Distortion Parameters Using Deep Learning and Synthetic Data
Quick Read: This paper addresses the challenge of predicting camera calibration and distortion parameters from a single image. Traditional calibration requires multiple images of a calibration object from different orientations, which publicly available datasets rarely provide. The key contributions are: (1) showing that a deep learning model based on the ResNet architecture, trained on a mix of real and synthetic images, can accurately predict camera and lens parameters, specifically the calibration parameters of the Brown-Conrady lens model, from a single image; and (2) building a comprehensive synthetic dataset with the AILiveSim simulation platform that varies focal length and lens distortion parameters, combined with a small set of real images for training. The approach is valuable for applications such as autonomous driving, robotics, and augmented reality.
Link: https://arxiv.org/abs/2501.14510
Authors: Faiz Muhammad Chaudhry, Jarno Ralli, Jerome Leudet, Fahad Sohrab, Farhad Pakdaman, Pierre Corbani, Moncef Gabbouj
Institutions: AILiveSim Ltd.; Faculty of Information Technology and Communication Sciences, Tampere University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:This research addresses the challenge of camera calibration and distortion parameter prediction from a single image using deep learning models. The main contributions of this work are: (1) demonstrating that a deep learning model, trained on a mix of real and synthetic images, can accurately predict camera and lens parameters from a single image, and (2) developing a comprehensive synthetic dataset using the AILiveSim simulation platform. This dataset includes variations in focal length and lens distortion parameters, providing a robust foundation for model training and testing. The training process predominantly relied on these synthetic images, complemented by a small subset of real images, to explore how well models trained on synthetic data can perform calibration tasks on real-world images. Traditional calibration methods require multiple images of a calibration object from various orientations, which is often not feasible due to the lack of such images in publicly available datasets. A deep learning network based on the ResNet architecture was trained on this synthetic dataset to predict camera calibration parameters following the Brown-Conrady lens model. The ResNet architecture, adapted for regression tasks, is capable of predicting continuous values essential for accurate camera calibration in applications such as autonomous driving, robotics, and augmented reality. Keywords: Camera calibration, distortion, synthetic data, deep learning, residual networks (ResNet), AILiveSim, horizontal field-of-view, principal point, Brown-Conrady Model.
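The Brown-Conrady model the network regresses is a standard distortion model; a NumPy sketch of the forward distortion on normalized image coordinates follows (parameter values are arbitrary, for illustration only):

```python
import numpy as np

def brown_conrady(x, y, k1, k2, k3, p1, p2):
    """Apply Brown-Conrady radial + tangential distortion to normalized
    image coordinates (x, y)."""
    r2 = x**2 + y**2
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    x_d = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x**2)
    y_d = y * radial + p1 * (r2 + 2 * y**2) + 2 * p2 * x * y
    return x_d, y_d

x, y = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
xd, yd = brown_conrady(x, y, k1=-0.2, k2=0.05, k3=0.0, p1=1e-3, p2=1e-3)
```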
[CV-19] LiDAR-Based Vehicle Detection and Tracking for Autonomous Racing
Quick Read: This paper addresses vehicle detection and tracking in autonomous racing, where competitive interactions between multiple racecars create challenging and potentially dangerous scenarios. At speeds above 275 km/h, accurate, low-latency perception is critical for overtaking maneuvers and for reacting to hazardous situations. The core of the solution is a LiDAR-based perception stack comprising a novel fast Point Cloud Segmentation technique, a dedicated Vehicle Pose Estimation method, and a variable-step Multi-Target Tracking algorithm. Together these components deliver the performance, robustness, and computational efficiency required for autonomous racing and enable fully autonomous overtaking maneuvers.
Link: https://arxiv.org/abs/2501.14502
Authors: Marcello Cellina, Matteo Corno, Sergio Matteo Savaresi
Institutions: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages
Abstract:Autonomous racing provides a controlled environment for testing the software and hardware of autonomous vehicles operating at their performance limits. Competitive interactions between multiple autonomous racecars however introduce challenging and potentially dangerous scenarios. Accurate and consistent vehicle detection and tracking is crucial for overtaking maneuvers, and low-latency sensor processing is essential to respond quickly to hazardous situations. This paper presents the LiDAR-based perception algorithms deployed on Team PoliMOVE’s autonomous racecar, which won multiple competitions in the Indy Autonomous Challenge series. Our Vehicle Detection and Tracking pipeline is composed of a novel fast Point Cloud Segmentation technique and a specific Vehicle Pose Estimation methodology, together with a variable-step Multi-Target Tracking algorithm. Experimental results demonstrate the algorithm’s performance, robustness, computational efficiency, and suitability for autonomous racing applications, enabling fully autonomous overtaking maneuvers at velocities exceeding 275 km/h.
[CV-20] A Note on Implementation Errors in Recent Adaptive Attacks Against Multi-Resolution Self-Ensembles
Quick Read: This note documents an implementation issue in recent adaptive attacks against the multi-resolution self-ensemble defense. Specifically, in the attack implementation of Zhang et al. (2024), adversarial perturbations exceeded the standard bound of L_\infty = 8/255, reaching magnitudes of up to L_\infty = 160/255, 20 times larger than intended. The key point is that attacks must be constrained within the proper bound when evaluating a defense: once properly bounded, the multi-resolution self-ensemble retains non-trivial robustness. The analysis also surfaces an intriguing finding: properly bounded adaptive attacks against strong multi-resolution self-ensembles often align with human perception, suggesting that how we measure adversarial robustness deserves reconsideration.
Link: https://arxiv.org/abs/2501.14496
Authors: Stanislav Fort
Institutions: Google DeepMind; Independent Researcher
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 4 pages, 2 figures, technical note addressing an issue in arXiv:2411.14834v1
Abstract:This note documents an implementation issue in recent adaptive attacks (Zhang et al. [2024]) against the multi-resolution self-ensemble defense (Fort and Lakshminarayanan [2024]). The implementation allowed adversarial perturbations to exceed the standard L_\infty = 8/255 bound by up to a factor of 20 \times , reaching magnitudes of up to L_\infty = 160/255 . When attacks are properly constrained within the intended bounds, the defense maintains non-trivial robustness. Beyond highlighting the importance of careful validation in adversarial machine learning research, our analysis reveals an intriguing finding: properly bounded adaptive attacks against strong multi-resolution self-ensembles often align with human perception, suggesting the need to reconsider how we measure adversarial robustness.
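The corrective step amounts to projecting perturbations back into the intended L-infinity ball before measuring robustness; a minimal PyTorch version of that projection:

```python
import torch

def project_linf(x_adv, x_clean, eps=8/255):
    """Clamp the perturbation into the intended L-inf ball, then clamp the
    result back into the valid image range [0, 1]."""
    delta = torch.clamp(x_adv - x_clean, -eps, eps)
    return torch.clamp(x_clean + delta, 0.0, 1.0)
```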
[CV-21] BILLNET: A Binarized Conv3D-LSTM Network with Logic-gated residual architecture for hardware-efficient video inference
Quick Read: This paper addresses the large memory and compute requirements of LSTM and 3D-convolution (Conv3D) models in video applications. The authors propose BILLNET, a compact binarized Conv3D-LSTM architecture compatible with highly resource-constrained hardware. The key ingredients are: first, factorizing the costly standard Conv3D into two pointwise convolutions with a grouped convolution in between to cut computational complexity; second, binarizing weights and activations via a MUX-OR-gated residual architecture; and finally, a multi-stage training strategy that enables fully quantized LSTM layers. Results on the Jester dataset show that, compared with existing resource-efficient Conv3D models, BILLNET maintains high accuracy under an extremely low memory and computation budget.
Link: https://arxiv.org/abs/2501.14495
Authors: Van Thien Nguyen, William Guicquero, Gilles Sicard
Institutions: Smart Integrated Circuits for Imaging Laboratory, CEA-LETI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
Comments: Published at IEEE SiPS 2022
Abstract:Long Short-Term Memory (LSTM) and 3D convolution (Conv3D) show impressive results for many video-based applications but require large memory and intensive computing. Motivated by recent works on hardware-algorithmic co-design towards efficient inference, we propose a compact binarized Conv3D-LSTM model architecture called BILLNET, compatible with a highly resource-constrained hardware. Firstly, BILLNET proposes to factorize the costly standard Conv3D by two pointwise convolutions with a grouped convolution in-between. Secondly, BILLNET enables binarized weights and activations via a MUX-OR-gated residual architecture. Finally, to efficiently train BILLNET, we propose a multi-stage training strategy enabling to fully quantize LSTM layers. Results on Jester dataset show that our method can obtain high accuracy with extremely low memory and computational budgets compared to existing Conv3D resource-efficient models.
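A PyTorch sketch of the Conv3D factorization (channel sizes and the group count are illustrative; the binarization and MUX-OR gating from the paper are not shown):

```python
import torch
import torch.nn as nn

def factorized_conv3d(cin, cout, mid=16, k=3, groups=4):
    """Pointwise -> grouped k*k*k -> pointwise stand-in for a standard
    Conv3d(cin, cout, k), using far fewer weights."""
    return nn.Sequential(
        nn.Conv3d(cin, mid, kernel_size=1),
        nn.Conv3d(mid, mid, kernel_size=k, padding=k // 2, groups=groups),
        nn.Conv3d(mid, cout, kernel_size=1),
    )

block = factorized_conv3d(16, 32)
y = block(torch.rand(1, 16, 8, 32, 32))   # (B, C, T, H, W) video tensor
print(y.shape)                            # torch.Size([1, 32, 8, 32, 32])
```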
[CV-22] Triple Path Enhanced Neural Architecture Search for Multimodal Fake News Detection ICASSP2024
Quick Read: This paper addresses two main challenges in multimodal fake news detection: under-performing fusion of multimodal news information caused by solidified model architectures, and weak generalization to fake news in which some modalities are missing. The authors propose MUSE, a novel and flexible triple-path enhanced neural architecture search model. Its key design comprises two dynamic paths for detecting partial-modality fake news and one static path for mining potential multimodal correlations. Experiments show that MUSE achieves stable performance improvements over the baselines.
Link: https://arxiv.org/abs/2501.14455
Authors: Bo Xu, Qiujie Xie, Jiahui Zhou, Linlin Zong
Institutions: School of Computer Science and Technology, Dalian University of Technology; School of Computer Science, Fudan University; School of Software, Dalian University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted into the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)
Abstract:Multimodal fake news detection has become one of the most crucial issues on social media platforms. Although existing methods have achieved advanced performance, two main challenges persist: (1) Under-performed multimodal news information fusion due to model architecture solidification, and (2) weak generalization ability on partial-modality contained fake news. To meet these challenges, we propose a novel and flexible triple path enhanced neural architecture search model MUSE. MUSE includes two dynamic paths for detecting partial-modality contained fake news and a static path for exploiting potential multimodal correlations. Experimental results show that MUSE achieves stable performance improvement over the baselines.
[CV-23] Optimizing Human Pose Estimation Through Focused Human and Joint Regions
Quick Read: This paper tackles two problems in video pose estimation: existing methods learn motion cues from all pixels and are therefore easily misled by background changes or the movements of other people, and current Transformer-based estimators, while strong at global modeling, struggle with local context perception and precise positional identification. Three key components are proposed: (1) a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, progressively zooming in on the target body and keypoints while masking out unimportant regions; (2) a novel deformable cross-attention mechanism with a bidirectional separation strategy that adaptively aggregates spatial and temporal motion cues from constrained surrounding context; and (3) a mathematical formulation of the deformable cross attention that constrains the model to attend only to regions centered on the target person. The method achieves state-of-the-art results on three large-scale benchmarks; notably, it reaches 84.8 mAP on the challenging wrist joint on PoseTrack2017, clearly surpassing the 81.5 mAP of the current state of the art.
Link: https://arxiv.org/abs/2501.14439
Authors: Yingying Jiao, Zhigang Wang, Zhenguang Liu, Shaojing Fan, Sifan Wu, Zheqi Wu, Zhuoyue Xu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, as well as surveillance. However, accurate video pose estimation remains an open challenge. One aspect that has been overlooked so far is that existing methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people. Additionally, while the current Transformer-based pose estimation methods have demonstrated impressive performance with global modeling, they struggle with local context perception and precise positional identification. In this paper, we try to tackle these challenges from three aspects: (1) We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions. (2) We further introduce a novel deformable cross attention mechanism and a bidirectional separation strategy to adaptively aggregate spatial and temporal motion clues from constrained surrounding contexts. (3) We mathematically formulate the deformable cross attention, constraining that the model focuses solely on the regions centered at the target person body. Empirically, our method achieves state-of-the-art performance on three large-scale benchmark datasets. A remarkable highlight is that our method achieves an 84.8 mean Average Precision (mAP) on the challenging wrist joint, which significantly outperforms the 81.5 mAP achieved by the current state-of-the-art method on the PoseTrack2017 dataset.
[CV-24] Context-CrackNet: A Context-Aware Framework for Precise Segmentation of Tiny Cracks in Pavement images
Quick Read: This paper addresses the precise detection and segmentation of pavement distresses, particularly tiny cracks, which is critical for early intervention and preventive maintenance of transportation infrastructure. Manual inspection is labor-intensive and inconsistent, while existing deep learning models fall short in fine-grained segmentation and computational efficiency. The proposed Context-CrackNet is a novel encoder-decoder architecture whose core innovations are a Region-Focused Enhancement Module (RFEM) and a Context-Aware Global Module (CAGM): RFEM strengthens the model's ability to capture fine-grained local details, while CAGM improves its understanding of global contextual dependencies. With the two modules working together, Context-CrackNet consistently outperforms nine state-of-the-art segmentation frameworks on ten publicly available crack segmentation datasets, achieving superior mIoU and Dice scores while maintaining competitive inference efficiency.
Link: https://arxiv.org/abs/2501.14413
Authors: Blessing Agyei Kyem, Joshua Kofi Asamoah, Armstrong Aboah
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The accurate detection and segmentation of pavement distresses, particularly tiny and small cracks, are critical for early intervention and preventive maintenance in transportation infrastructure. Traditional manual inspection methods are labor-intensive and inconsistent, while existing deep learning models struggle with fine-grained segmentation and computational efficiency. To address these challenges, this study proposes Context-CrackNet, a novel encoder-decoder architecture featuring the Region-Focused Enhancement Module (RFEM) and Context-Aware Global Module (CAGM). These innovations enhance the model’s ability to capture fine-grained local details and global contextual dependencies, respectively. Context-CrackNet was rigorously evaluated on ten publicly available crack segmentation datasets, covering diverse pavement distress scenarios. The model consistently outperformed 9 state-of-the-art segmentation frameworks, achieving superior performance metrics such as mIoU and Dice score, while maintaining competitive inference efficiency. Ablation studies confirmed the complementary roles of RFEM and CAGM, with notable improvements in mIoU and Dice score when both modules were integrated. Additionally, the model’s balance of precision and computational efficiency highlights its potential for real-time deployment in large-scale pavement monitoring systems.
[CV-25] Kolmogorov Arnold Neural Interpolator for Downscaling and Correcting Meteorological Fields from In-Situ Observations
Quick Read: This paper addresses the challenge of obtaining accurate weather forecasts at station locations, which stems from systematic biases between the multi-scale, continuous nature of the atmosphere and its discrete, gridded representation. Previous work models gridded meteorological data and neglects the off-grid, continuous character of atmospheric states, leaving these biases unresolved. The proposed Kolmogorov Arnold Neural Interpolator (KANI) redefines meteorological fields as continuous neural functions derived from discrete grids; grounded in the Kolmogorov Arnold theorem, it captures the continuity of atmospheric states and systematically corrects biases using sparse in-situ observations. KANI also introduces a zero-shot downscaling capability guided by high-resolution topographic textures, without requiring high-resolution meteorological fields for supervision. Across three sub-regions of the continental United States, KANI improves accuracy by 40.28% for temperature and 67.41% for wind speed, a significant gain over traditional interpolation. The key idea is the continuous neural representation of meteorological variables, which transcends the limitations of conventional grid-based representations.
Link: https://arxiv.org/abs/2501.14404
Authors: Zili Liu, Hao Chen, Lei Bai, Wenyuan Li, Zhengxia Zou, Zhenwei Shi
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Obtaining accurate weather forecasts at station locations is a critical challenge due to systematic biases arising from the mismatch between multi-scale, continuous atmospheric characteristics and their discrete, gridded representations. Previous works have primarily focused on modeling gridded meteorological data, inherently neglecting the off-grid, continuous nature of atmospheric states and leaving such biases unresolved. To address this, we propose the Kolmogorov Arnold Neural Interpolator (KANI), a novel framework that redefines meteorological field representation as continuous neural functions derived from discretized grids. Grounded in the Kolmogorov Arnold theorem, KANI captures the inherent continuity of atmospheric states and leverages sparse in-situ observations to correct these biases systematically. Furthermore, KANI introduces an innovative zero-shot downscaling capability, guided by high-resolution topographic textures without requiring high-resolution meteorological fields for supervision. Experimental results across three sub-regions of the continental United States indicate that KANI achieves an accuracy improvement of 40.28% for temperature and 67.41% for wind speed, highlighting its significant improvement over traditional interpolation methods. This enables continuous neural representation of meteorological variables through neural networks, transcending the limitations of conventional grid-based representations.
[CV-26] CVOCSemRPL: Class-Variance Optimized Clustering Semantic Information Injection and Restricted Pseudo Labeling based Improved Semi-Supervised Few-Shot Learning
Quick Read: This paper addresses the limits on model performance in semi-supervised few-shot learning, where labeled samples are scarce even though abundant, cheaply obtained unlabeled samples are available. Existing cluster-based methods generate pseudo-labels for the unlabeled samples, but poor representation quality can lead to incorrect labeling and degrade few-shot performance. The key elements of the proposed approach are: (1) class-variance optimized clustering to improve the clustering of labeled and unlabeled samples; (2) a restricted pseudo-labeling approach that optimizes clustering-based pseudo-label generation; and (3) semantic information injection to further boost semi-supervised few-shot performance. Experiments show the approach significantly outperforms recent state-of-the-art methods on the benchmark datasets.
Link: https://arxiv.org/abs/2501.14401
Authors: Rhythm Baghel, Souvik Maji, Pratik Mazumder
Institutions: Indian Institute of Technology Jodhpur, India
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Few-shot learning has been extensively explored to address problems where the amount of labeled samples is very limited for some classes. In the semi-supervised few-shot learning setting, substantial quantities of unlabeled samples are available. Such unlabeled samples are generally cheaper to obtain and can be used to improve the few-shot learning performance of the model. Some of the recent methods for this setting rely on clustering to generate pseudo-labels for the unlabeled samples. Since the quality of the representation learned by the model heavily influences the effectiveness of clustering, this might also lead to incorrect labeling of the unlabeled samples and consequently lead to a drop in the few-shot learning performance. We propose an approach for semi-supervised few-shot learning that performs a class-variance optimized clustering in order to improve the effectiveness of clustering the labeled and unlabeled samples in this setting. It also optimizes the clustering-based pseudo-labeling process using a restricted pseudo-labeling approach and performs semantic information injection in order to improve the semi-supervised few-shot learning performance of the model. We experimentally demonstrate that our proposed approach significantly outperforms recent state-of-the-art methods on the benchmark datasets.
[CV-27] Low-rank Prompt Interaction for Continual Vision-Language Retrieval
【速读】:该论文试图解决多模态任务中的持续学习(continual learning)问题,特别是现有研究大多忽视了显式的跨模态(cross-modal)和跨任务(cross-task)交互。为解决这一问题,论文创新性地提出了低秩提示交互(Low-rank Prompt Interaction, LPI)方法。其关键解决方案包括两个方面:首先,针对跨模态交互,论文通过多模态相关性模块(multi-modal correlation modules)增强Transformer层的跨模态关联,并采用低秩交互增强分解(low-rank interaction-augmented decomposition)来避免内存爆炸,同时通过共享和分离共同特定的低秩因子(common-specific low-rank factors)来提升跨模态关联。其次,针对跨任务交互,论文通过视觉分析发现不同任务在语义距离上存在明显差异,因此在提示学习(prompt learning)过程中引入了基于任务语义距离的显式任务对比约束(task contrastive constraints),以增强任务间的区分性。实验结果表明,该方法在引入少量参数的情况下显著提升了检索任务的性能,验证了其有效性。
Link: https://arxiv.org/abs/2501.14369
Authors: Weicai Yan, Ye Wang, Wang Lin, Zirun Guo, Zhou Zhao, Tao Jin
Institutions: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Research on continual learning in multi-modal tasks has been receiving increasing attention. However, most existing work overlooks the explicit cross-modal and cross-task interactions. In this paper, we innovatively propose the Low-rank Prompt Interaction (LPI) to address this general problem of multi-modal understanding, which considers both cross-modal and cross-task interactions. Specifically, as for the former, we employ multi-modal correlation modules for corresponding Transformer layers. Considering that the training parameters scale to the number of layers and tasks, we propose low-rank interaction-augmented decomposition to avoid memory explosion while enhancing the cross-modal association through sharing and separating common-specific low-rank factors. In addition, due to the multi-modal semantic differences carried by the low-rank initialization, we adopt hierarchical low-rank contrastive learning to ensure training robustness. As for the latter, we initially employ a visual analysis and identify that different tasks have clear distinctions in proximity. Therefore, we introduce explicit task contrastive constraints in the prompt learning process based on task semantic distances. Experiments on two retrieval tasks show performance improvements with the introduction of a minimal number of parameters, demonstrating the effectiveness of our method. Code is available at this https URL.
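A minimal sketch of a common-specific low-rank prompt parameterization in PyTorch (the shapes, rank, and additive combination are assumptions for illustration, not the paper's exact design):

```python
import torch
import torch.nn as nn

class LowRankPrompt(nn.Module):
    """Prompt = shared low-rank factor pair plus a task-specific pair,
    echoing the common-specific low-rank decomposition idea."""
    def __init__(self, length=8, dim=512, rank=4, num_tasks=5):
        super().__init__()
        self.U_shared = nn.Parameter(torch.randn(length, rank) * 0.02)
        self.V_shared = nn.Parameter(torch.randn(rank, dim) * 0.02)
        self.U_task = nn.Parameter(torch.randn(num_tasks, length, rank) * 0.02)
        self.V_task = nn.Parameter(torch.randn(num_tasks, rank, dim) * 0.02)

    def forward(self, task_id):
        return (self.U_shared @ self.V_shared
                + self.U_task[task_id] @ self.V_task[task_id])

prompt = LowRankPrompt()(task_id=2)   # (8, 512) prompt tokens for task 2
```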
[CV-28] Causal-Inspired Multitask Learning for Video-Based Human Pose Estimation
Quick Read: This paper addresses video-based human pose estimation, where existing methods improve spatio-temporal modeling through architecture design and optimization strategies but overlook the causal relationships among joints, leading to poor estimation in challenging scenes. The proposed causal-inspired multitask learning framework has two stages. In the first stage, two self-supervised auxiliary tasks endow the model with causal spatio-temporal modeling ability, enabling it to infer challenging keypoints from observed keypoint information and making it robust to difficult scenes. In the second stage, arguing that not all feature tokens contribute equally, a Token Causal Importance Selection module distinguishes causal (keypoint-relevant) tokens from non-causal ones (e.g., background and objects), and a non-causal token clustering module merges similar non-causal tokens, improving interpretability and robustness. Experiments show the method outperforms state-of-the-art approaches on three large-scale benchmark datasets.
Link: https://arxiv.org/abs/2501.14356
Authors: Haipeng Chen, Sifan Wu, Zhigang Wang, Yifang Yin, Yingying Jiao, Yingda Lyu, Zhenguang Liu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures
Abstract:Video-based human pose estimation has long been a fundamental yet challenging problem in computer vision. Previous studies focus on spatio-temporal modeling through the enhancement of architecture design and optimization strategies. However, they overlook the causal relationships in the joints, leading to models that may be overly tailored and thus estimate poorly to challenging scenes. Therefore, adequate causal reasoning capability, coupled with good interpretability of model, are both indispensable and prerequisite for achieving reliable results. In this paper, we pioneer a causal perspective on pose estimation and introduce a causal-inspired multitask learning framework, consisting of two stages. In the first stage, we try to endow the model with causal spatio-temporal modeling ability by introducing two self-supervision auxiliary tasks. Specifically, these auxiliary tasks enable the network to infer challenging keypoints based on observed keypoint information, thereby imbuing causal reasoning capabilities into the model and making it robust to challenging scenes. In the second stage, we argue that not all feature tokens contribute equally to pose estimation. Prioritizing causal (keypoint-relevant) tokens is crucial to achieve reliable results, which could improve the interpretability of the model. To this end, we propose a Token Causal Importance Selection module to identify the causal tokens and non-causal tokens (e.g., background and objects). Additionally, non-causal tokens could provide potentially beneficial cues but may be redundant. We further introduce a non-causal tokens clustering module to merge the similar non-causal tokens. Extensive experiments show that our method outperforms state-of-the-art methods on three large-scale benchmark datasets.
[CV-29] Correlation-Based Band Selection for Hyperspectral Image Classification
Quick Read: This paper addresses the large data volume and strong correlation between adjacent bands in hyperspectral image processing. Hyperspectral images carry spectral information across many bands, but because neighboring bands are highly correlated, applications typically need only a small subset. The proposed correlation-based band selection computes average correlation coefficients between bands to analyze their relationships and selects a subset with a threshold-based method. The key is to retain bands with low inter-band dependency, ensuring the selected bands provide diverse, non-redundant information. On the Pavia University (PA) and Salinas Valley (SA) benchmarks, the method performs competitively with other standard band selection approaches for image classification.
Link: https://arxiv.org/abs/2501.14338
Authors: Dibyabha Deb, Ujjwal Verma
Institutions: Manipal Institute of Technology Bengaluru; Manipal Academy of Higher Education, Manipal, India; Manipal Institute of Technology; Manipal Academy of Higher Education, Manipal, India
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 5 pages, 1 figure
Abstract:Hyperspectral images offer extensive spectral information about ground objects across multiple spectral bands. However, the large volume of data can pose challenges during processing. Typically, adjacent bands in hyperspectral data are highly correlated, leading to the use of only a few selected bands for various applications. In this work, we present a correlation-based band selection approach for hyperspectral image classification. Our approach calculates the average correlation between bands using correlation coefficients to identify the relationships among different bands. Afterward, we select a subset of bands by analyzing the average correlation and applying a threshold-based method. This allows us to isolate and retain bands that exhibit lower inter-band dependencies, ensuring that the selected bands provide diverse and non-redundant information. We evaluate our proposed approach on two standard benchmark datasets: Pavia University (PA) and Salinas Valley (SA), focusing on image classification tasks. The experimental results demonstrate that our method performs competitively with other standard band selection approaches.
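The selection rule is easy to prototype; here is a NumPy sketch, assuming (as one plausible reading of the threshold rule) that bands whose average absolute correlation with the others falls below the threshold are kept:

```python
import numpy as np

def select_bands(cube, threshold=0.95):
    """cube: (H, W, B) hyperspectral image. Keep bands whose average
    absolute correlation with the other bands is below `threshold`."""
    H, W, B = cube.shape
    flat = cube.reshape(-1, B).T               # (B, H*W): one row per band
    corr = np.abs(np.corrcoef(flat))           # (B, B) correlation matrix
    avg = (corr.sum(axis=1) - 1.0) / (B - 1)   # exclude self-correlation
    return np.where(avg < threshold)[0]

cube = np.random.rand(64, 64, 100)
print(select_bands(cube, threshold=0.95))      # indices of retained bands
```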
[CV-30] Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video ICLR2025
Quick Read: This paper addresses the performance degradation of existing models under noisy real-world conditions for ego-motion estimation and photorealistic 3D reconstruction. Existing models are typically evaluated under sanitized, noise-free conditions and fail to cope with the dynamic motion, sensor imperfections, and synchronization perturbations of real environments. Three core contributions are made: first, a scalable noisy data synthesis pipeline that generates diverse datasets simulating complex motion, sensor imperfections, and synchronization errors; second, Robust-Ego3D, a benchmark built with this pipeline and rigorously designed to expose noise-induced performance degradation; and third, Correspondence-guided Gaussian Splatting (CorrGS), a novel test-time adaptation method that progressively refines an internal clean 3D representation by aligning noisy observations with RGB-D frames rendered from the clean 3D map, improving geometric alignment and appearance restoration through visual correspondence. Experiments show CorrGS consistently outperforms prior state-of-the-art methods, especially under rapid motion and dynamic illumination.
Link: https://arxiv.org/abs/2501.14319
Authors: Xiaohao Xu, Tianyi Zhang, Shibo Zhao, Xiang Li, Sibo Wang, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-Roberson, Sebastian Scherer, Xiaonan Huang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by ICLR 2025; 92 pages; Project Repo: this https URL . arXiv admin note: substantial text overlap with arXiv:2406.16850
Abstract:We aim to redefine robust ego-motion estimation and photorealistic 3D reconstruction by addressing a critical limitation: the reliance on noise-free data in existing models. While such sanitized conditions simplify evaluation, they fail to capture the unpredictable, noisy complexities of real-world environments. Dynamic motion, sensor imperfections, and synchronization perturbations lead to sharp performance declines when these models are deployed in practice, revealing an urgent need for frameworks that embrace and excel under real-world noise. To bridge this gap, we tackle three core challenges: scalable data generation, comprehensive benchmarking, and model robustness enhancement. First, we introduce a scalable noisy data synthesis pipeline that generates diverse datasets simulating complex motion, sensor imperfections, and synchronization errors. Second, we leverage this pipeline to create Robust-Ego3D, a benchmark rigorously designed to expose noise-induced performance degradation, highlighting the limitations of current learning-based methods in ego-motion accuracy and 3D reconstruction quality. Third, we propose Correspondence-guided Gaussian Splatting (CorrGS), a novel test-time adaptation method that progressively refines an internal clean 3D representation by aligning noisy observations with rendered RGB-D frames from clean 3D map, enhancing geometric alignment and appearance restoration through visual correspondence. Extensive experiments on synthetic and real-world data demonstrate that CorrGS consistently outperforms prior state-of-the-art methods, particularly in scenarios involving rapid motion and dynamic illumination.
[CV-31] Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation
Quick Read: This paper addresses the shortcomings of current automatic mesh generation: intermediate representations lack the continuous surface quality inherent to meshes, converted outputs are dense and suboptimal, and existing autoregressive approaches are constrained in face count, scalability, and structural fidelity. The proposed Nautilus is a locality-aware autoencoder for artist-like mesh generation that exploits the local properties of manifold meshes for structural fidelity and efficient representation. Its key innovation is a novel tokenization algorithm that preserves face proximity relationships and compresses sequence length through locally shared vertices and edges, enabling meshes with up to 5,000 faces. A Dual-stream Point Conditioner further provides multi-scale geometric guidance, capturing fine-grained geometric features to ensure global consistency and local structural fidelity. Extensive experiments show Nautilus significantly outperforms state-of-the-art methods in both fidelity and scalability.
Link: https://arxiv.org/abs/2501.14317
Authors: Yuxuan Wang, Xuanyu Yi, Haohan Weng, Qingshan Xu, Xiaokang Wei, Xianghui Yang, Chunchao Guo, Long Chen, Hanwang Zhang
Institutions: Nanyang Technological University; Tencent Hunyuan; The Hong Kong Polytechnic University; Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages
Abstract:Triangle meshes are fundamental to 3D applications, enabling efficient modification and rasterization while maintaining compatibility with standard rendering pipelines. However, current automatic mesh generation methods typically rely on intermediate representations that lack the continuous surface quality inherent to meshes. Converting these representations into meshes produces dense, suboptimal outputs. Although recent autoregressive approaches demonstrate promise in directly modeling mesh vertices and faces, they are constrained by the limitation in face count, scalability, and structural fidelity. To address these challenges, we propose Nautilus, a locality-aware autoencoder for artist-like mesh generation that leverages the local properties of manifold meshes to achieve structural fidelity and efficient representation. Our approach introduces a novel tokenization algorithm that preserves face proximity relationships and compresses sequence length through locally shared vertices and edges, enabling the generation of meshes with an unprecedented scale of up to 5,000 faces. Furthermore, we develop a Dual-stream Point Conditioner that provides multi-scale geometric guidance, ensuring global consistency and local structural fidelity by capturing fine-grained geometric features. Extensive experiments demonstrate that Nautilus significantly outperforms state-of-the-art methods in both fidelity and scalability.
[CV-32] PAID: A Framework of Product-Centric Advertising Image Design
Quick Read: This paper aims to automate advertising image design on e-commerce platforms to cut labor costs. Traditional methods take a background image as input and then predict the layout of marketing taglines, which restricts layout flexibility because the background content is fixed. The proposed Product-Centric Advertising Image Design (PAID) framework takes the product foreground image, required taglines, and target size as input and creates an ad image automatically through four sequential stages, each handled by a dedicated expert model: prompt generation, layout generation, background image generation, and graphics rendering. A visual language model (VLM) based prompt generator produces a product-matching background prompt; the layout model jointly predicts text and image layout from the background prompt, product, and taglines for the best visual harmony; and an SDXL-based layout-controlled inpainting model generates an aesthetic background. By reordering the design stages, PAID escapes the layout restrictions imposed by fixed background content and produces more visually pleasing advertising images than previous methods.
Link: https://arxiv.org/abs/2501.14316
Authors: Hongyu Chen, Min Zhou, Jing Jiang, Jiale Chen, Yang Lu, Bo Xiao, Tiezheng Ge, Bo Zheng
Institutions: Alibaba Group; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In E-commerce platforms, a full advertising image is composed of a background image and marketing taglines. Automatic ad image design reduces human costs and plays a crucial role. For the convenience of users, a novel automatic framework named Product-Centric Advertising Image Design (PAID) is proposed in this work. PAID takes the product foreground image, required taglines, and target size as input and creates an ad image automatically. PAID consists of four sequential stages: prompt generation, layout generation, background image generation, and graphics rendering. Different expert models are trained to conduct these sub-tasks. A visual language model (VLM) based prompt generation model is leveraged to produce a product-matching background prompt. The layout generation model jointly predicts text and image layout according to the background prompt, product, and taglines to achieve the best harmony. An SDXL-based layout-controlled inpainting model is trained to generate an aesthetic background image. Previous ad image design methods take a background image as input and then predict the layout of taglines, which limits the spatial layout due to fixed image content. Innovatively, our PAID adjusts the stages to produce an unrestricted layout. To complete the PAID framework, we created two high-quality datasets, PITA and PIL. Extensive experimental results show that PAID creates more visually pleasing advertising images than previous methods.
[CV-33] BrainGuard: Privacy-Preserving Multisubject Image Reconstructions from Brain Activities AAAI2025
Quick Read: This paper addresses two major problems in reconstructing perceived images from multisubject fMRI data: individual variability and data privacy. Early methods trained a separate model per subject, ignoring valuable cross-subject commonalities, while existing multisubject methods struggle with data privacy and with managing individual variability. BrainGuard is a privacy-preserving collaborative training framework whose key is a global-local architecture: each subject's local model is trained on local data and works in tandem with a shared global model that captures and exploits cross-subject patterns. This design eliminates the need to aggregate fMRI data across subjects, preserving privacy. To handle the complexity of fMRI data, BrainGuard further adopts a hybrid synchronization strategy that lets local models dynamically incorporate parameters from the global model. With this design, BrainGuard both protects sensitive brain data and improves image reconstruction accuracy.
Link: https://arxiv.org/abs/2501.14309
Authors: Zhibo Tian, Ruijie Quan, Fan Ma, Kun Zhan, Yi Yang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025 oral
Abstract:Reconstructing perceived images from human brain activity forms a crucial link between human and machine learning through Brain-Computer Interfaces. Early methods primarily focused on training separate models for each individual to account for individual variability in brain activity, overlooking valuable cross-subject commonalities. Recent advancements have explored multisubject methods, but these approaches face significant challenges, particularly in data privacy and effectively managing individual variability. To overcome these challenges, we introduce BrainGuard, a privacy-preserving collaborative training framework designed to enhance image reconstruction from multisubject fMRI data while safeguarding individual privacy. BrainGuard employs a collaborative global-local architecture where individual models are trained on each subject’s local data and operate in conjunction with a shared global model that captures and leverages cross-subject patterns. This architecture eliminates the need to aggregate fMRI data across subjects, thereby ensuring privacy preservation. To tackle the complexity of fMRI data, BrainGuard integrates a hybrid synchronization strategy, enabling individual models to dynamically incorporate parameters from the global model. By establishing a secure and collaborative training environment, BrainGuard not only protects sensitive brain data but also improves image reconstruction accuracy. Extensive experiments demonstrate that BrainGuard sets a new benchmark in both high-level and low-level metrics, advancing the state-of-the-art in brain decoding through its innovative design.
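A rough sketch of the global-local pattern in PyTorch (plain parameter averaging plus blending stands in for BrainGuard's hybrid synchronization, which is adaptive; floating-point parameters are assumed):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def average_params(models):
    """FedAvg-style aggregation over per-subject models (parameters only)."""
    names = [n for n, _ in models[0].named_parameters()]
    return {n: torch.stack([dict(m.named_parameters())[n] for m in models]).mean(0)
            for n in names}

@torch.no_grad()
def mix_with_global(local_model, global_params, alpha=0.5):
    """Blend a subject's local weights toward the shared global model."""
    for n, p in local_model.named_parameters():
        p.mul_(1 - alpha).add_(alpha * global_params[n])

subjects = [nn.Linear(4, 2) for _ in range(3)]   # one local model per subject
global_params = average_params(subjects)          # only parameters move, not fMRI data
for m in subjects:
    mix_with_global(m, global_params, alpha=0.5)
```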
[CV-34] Learning Primitive Relations for Compositional Zero-Shot Learning ICASSP2025
Quick Read: This paper addresses Compositional Zero-Shot Learning (CZSL), which identifies unseen state-object compositions using knowledge learned from seen ones; existing approaches typically predict states and objects independently and ignore their relationships. The proposed Learning Primitive Relations (LPR) framework probabilistically captures the relationships between states and objects. Its key is a cross-attention mechanism that lets the model account for state-object dependencies and thereby infer the likelihood of unseen compositions. LPR outperforms state-of-the-art methods on all three CZSL benchmarks in both closed-world and open-world settings, and qualitative analysis shows how it leverages state-object relationships to predict unseen compositions.
Link: https://arxiv.org/abs/2501.14308
Authors: Insu Lee, Jiseob Kim, Kyuhong Shim, Byonghyo Shim
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICASSP 2025
Abstract:Compositional Zero-Shot Learning (CZSL) aims to identify unseen state-object compositions by leveraging knowledge learned from seen compositions. Existing approaches often independently predict states and objects, overlooking their relationships. In this paper, we propose a novel framework, learning primitive relations (LPR), designed to probabilistically capture the relationships between states and objects. By employing the cross-attention mechanism, LPR considers the dependencies between states and objects, enabling the model to infer the likelihood of unseen compositions. Experimental results demonstrate that LPR outperforms state-of-the-art methods on all three CZSL benchmark datasets in both closed-world and open-world settings. Through qualitative analysis, we show that LPR leverages state-object relationships for unseen composition prediction.
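The cross-attention at the heart of this idea can be sketched directly with PyTorch primitives (the dimensions and the direction of attention, states attending over objects, are illustrative assumptions):

```python
import torch
import torch.nn as nn

d = 64
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
state_emb = torch.rand(2, 10, d)   # batch of 10 candidate state embeddings
object_emb = torch.rand(2, 15, d)  # batch of 15 candidate object embeddings

# Each state attends over the objects, so its representation becomes
# conditioned on the objects it could plausibly compose with.
state_ctx, attn_w = cross_attn(query=state_emb, key=object_emb, value=object_emb)
print(state_ctx.shape, attn_w.shape)  # torch.Size([2, 10, 64]) torch.Size([2, 10, 15])
```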
[CV-35] Additive Manufacturing Processes Protocol Prediction by Artificial Intelligence using X-ray Computed Tomography data
Quick Read: This paper addresses process-parameter optimization in Additive Manufacturing (AM) to improve part quality. Conventional practice tunes parameters iteratively, whereas the proposed approach sets them non-iteratively and without human intervention, using Artificial Intelligence (AI) to fully automate parameter selection. The key is an AI-based image segmentation step in the decision-making stage that uses quality-inspected training data from a Non-Destructive Testing (NDT) method to train an Artificial Neural Network (ANN) for automatic selection of optimal process parameters. Compared with classical thresholding, the AI model is markedly more accurate (99.3% vs. 83.44%), and the pipeline is verified by classical optimization and mechanical testing.
Link: https://arxiv.org/abs/2501.14306
Authors: Sunita Khod, Akshay Dvivedi, Mayank Goswami
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
Comments: 21 pages, 21 figures, 5 tables
Abstract:The quality of the part fabricated from the Additive Manufacturing (AM) process depends upon the process parameters used, and therefore, optimization is required for apt quality. A methodology is proposed to set these parameters non-iteratively without human intervention. It utilizes Artificial Intelligence (AI) to fully automate the process, with the capability to self-train any apt AI model by further assimilating the training data. This study includes three commercially available 3D printers for soft material printing based on the Material Extrusion (MEX) AM process. The samples are 3D printed for six different AM process parameters obtained by varying layer height and nozzle speed. The novelty of the methodology lies in incorporating an AI-based image segmentation step in the decision-making stage that uses quality-inspected training data from the Non-Destructive Testing (NDT) method. The performance of the trained AI model is compared with two software tools based on the classical thresholding method. The AI-based Artificial Neural Network (ANN) model is trained on NDT-assessed and AI-segmented data to automate the selection of optimized process parameters. The AI-based model is 99.3% accurate, while the best available commercial classical image method is 83.44% accurate. The best value of overall R for training the ANN is 0.82. The MEX process gives a 22.06% porosity error relative to the design. Two AI models trained on NDT data and integrated into a series pipeline for optimal process parameters are proposed and verified by classical optimization and mechanical testing methods.
[CV-36] TD-RD: A Top-Down Benchmark with Real-Time Framework for Road Damage Detection
Quick Read: This paper addresses the relative neglect of road damage detection research despite its importance for infrastructure maintenance and road safety, noting that existing datasets leave this domain under-explored. It introduces the Top Down Road Damage Detection Dataset (TDRD), a benchmark tailored to road damage detection that offers a complementary top-down perspective and covers three primary damage categories: cracks, potholes, and patches. The dataset contains 7,088 high-resolution images with 12,882 annotated damage instances. The paper also presents TDYOLOV10, a new real-time object detection framework designed for the unique challenges of TDRD; comparisons with state-of-the-art models show competitive baseline results. By releasing TDRD, the authors aim to accelerate research in this critical area.
Link: https://arxiv.org/abs/2501.14302
Authors: Xi Xiao, Zhengji Li, Wentao Wang, Jiacheng Xie, Houjie Lin, Swalpa Kumar Roy, Tianyang Wang, Min Xu
Institutions: University of Alabama at Birmingham; Alipurduar Government Engineering and Management College; MBZUAI; Carnegie Mellon University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Object detection has witnessed remarkable advancements over the past decade, largely driven by breakthroughs in deep learning and the proliferation of large scale datasets. However, the domain of road damage detection remains relatively under-explored, despite its critical significance for applications such as infrastructure maintenance and road safety. This paper addresses this gap by introducing a novel top down benchmark that offers a complementary perspective to existing datasets, specifically tailored for road damage detection. Our proposed Top Down Road Damage Detection Dataset (TDRD) includes three primary categories of road damage: cracks, potholes, and patches, captured from a top down viewpoint. The dataset consists of 7,088 high resolution images, encompassing 12,882 annotated instances of road damage. Additionally, we present a novel real time object detection framework, TDYOLOV10, designed to handle the unique challenges posed by the TDRD dataset. Comparative studies with state of the art models demonstrate competitive baseline results. By releasing TDRD, we aim to accelerate research in this crucial area. A sample of the dataset will be made publicly available upon the paper’s acceptance.
[CV-37] Dense-SfM: Structure from Motion with Dense Consistent Matching
Quick Read: This paper addresses the limits of sparse keypoint matching in traditional Structure from Motion (SfM), which caps both accuracy and point density, especially in texture-less areas. Dense-SfM integrates dense matching with a Gaussian Splatting (GS) based track extension that yields more consistent, longer feature tracks, improving reconstruction accuracy and density. To refine accuracy further, Dense-SfM is equipped with a multi-view kernelized matching module built on Transformer and Gaussian Process architectures for robust track refinement across views. Evaluations on the ETH3D and Texture-Poor SfM datasets show significant gains in accuracy and density over state-of-the-art methods.
Link: https://arxiv.org/abs/2501.14277
Authors: JongMin Lee, Sungjoo Yoo
Institutions: Seoul National University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present Dense-SfM, a novel Structure from Motion (SfM) framework designed for dense and accurate 3D reconstruction from multi-view images. Sparse keypoint matching, which traditional SfM methods often rely on, limits both accuracy and point density, especially in texture-less areas. Dense-SfM addresses this limitation by integrating dense matching with a Gaussian Splatting (GS) based track extension which gives more consistent, longer feature tracks. To further improve reconstruction accuracy, Dense-SfM is equipped with a multi-view kernelized matching module leveraging transformer and Gaussian Process architectures, for robust track refinement across multi-views. Evaluations on the ETH3D and Texture-Poor SfM datasets show that Dense-SfM offers significant improvements in accuracy and density over state-of-the-art methods.
[CV-38] Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models
Quick Read: This paper addresses visual information loss from fixed-resolution processing in Large Vision-Language Models (LVLMs) when handling high-resolution images. Existing sub-image partitioning methods treat all sub-images uniformly, which yields suboptimal image understanding. The proposed Global Semantic-guided Weight Allocator (GSWA) dynamically assigns weights to sub-images according to their semantic relevance to the whole image, i.e., their relative information density, emulating human visual attention so the model focuses on the most informative regions. Integrating GSWA into the InternVL2-2B framework yields SleighVL, a lightweight yet high-performing model. Experiments show SleighVL outperforms models of comparable size and stays competitive with larger ones, pointing toward more efficient, context-aware high-resolution image processing in LVLMs.
Link: https://arxiv.org/abs/2501.14276
Authors: Yuxuan Liang, Xu Li, Xiaolei Chen, Haotian Chen, Yi Zheng, Chenghang Lai, Bin Li, Xiangyang Xue
Institutions: School of Computer Science, Fudan University; Shanghai Key Laboratory of Intelligent Information Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures and tables
Abstract:As the demand for high-resolution image processing in Large Vision-Language Models (LVLMs) grows, sub-image partitioning has become a popular approach for mitigating visual information loss associated with fixed-resolution processing. However, existing partitioning methods uniformly process sub-images, resulting in suboptimal image understanding. In this work, we reveal that the sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model’s visual understanding ability. Therefore, we propose the Global Semantic-guided Weight Allocator (GSWA) module, which dynamically allocates weights to sub-images based on their relative information density, emulating human visual attention mechanisms. This approach enables the model to focus on more informative regions, overcoming the limitations of uniform treatment. We integrate GSWA into the InternVL2-2B framework to create SleighVL, a lightweight yet high-performing model. Extensive experiments demonstrate that SleighVL outperforms models with comparable parameters and remains competitive with larger models. Our work provides a promising direction for more efficient and contextually aware high-resolution image processing in LVLMs, advancing multimodal system development.
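A minimal sketch of semantic-relevance weighting (cosine similarity plus softmax is an assumption here; the paper's allocator is a learned module):

```python
import torch
import torch.nn.functional as F

def subimage_weights(global_emb, sub_embs, temperature=0.1):
    """Weight each sub-image by how similar its embedding is to the
    whole-image embedding, normalized with a softmax."""
    sims = F.cosine_similarity(sub_embs, global_emb.unsqueeze(0), dim=-1)
    return F.softmax(sims / temperature, dim=0)

w = subimage_weights(torch.rand(256), torch.rand(9, 256))  # 9 sub-images
print(w.sum())  # tensor(1.), a proper weight distribution
```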
[CV-39] Bayesian Neural Networks for One-to-Many Mapping in Image Enhancement
Quick Read: This paper addresses the one-to-many mapping problem in image enhancement: under dynamic photography conditions such as varying illumination, a degraded image can correspond to multiple plausible target images. The proposed Bayesian Enhancement Model (BEM) incorporates Bayesian Neural Networks (BNNs) to capture data uncertainty and produce diverse outputs. The key is a two-stage design: Stage I uses a BNN to model the one-to-many mapping in a low-dimensional space, and Stage II uses a Deterministic Neural Network (DNN) to refine fine-grained image details. A dynamic Momentum Prior is introduced to accelerate BNN training and convergence. Experiments on multiple low-light and underwater image enhancement benchmarks show the method outperforms deterministic models.
Link: https://arxiv.org/abs/2501.14265
Authors: Guoxi Huang, Nantheera Anantrasirichai, Fei Ye, Zipeng Qi, RuiRui Lin, Qirui Yang, David Bull
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In image enhancement tasks, such as low-light and underwater image enhancement, a degraded image can correspond to multiple plausible target images due to dynamic photography conditions, such as variations in illumination. This naturally results in a one-to-many mapping challenge. To address this, we propose a Bayesian Enhancement Model (BEM) that incorporates Bayesian Neural Networks (BNNs) to capture data uncertainty and produce diverse outputs. To achieve real-time inference, we introduce a two-stage approach: Stage I employs a BNN to model the one-to-many mappings in the low-dimensional space, while Stage II refines fine-grained image details using a Deterministic Neural Network (DNN). To accelerate BNN training and convergence, we introduce a dynamic Momentum Prior. Extensive experiments on multiple low-light and underwater image enhancement benchmarks demonstrate the superiority of our method over deterministic models.
[CV-40] Point-LN: A Lightweight Framework for Efficient Point Cloud Classification Using Non-Parametric Positional Encoding
Quick Read: This paper targets efficiency and computational cost in 3D point cloud classification, where existing methods are often too heavy for real-time use in resource-constrained settings. Point-LN is a lightweight framework that combines non-parametric components, namely Farthest Point Sampling (FPS), k-Nearest Neighbors (k-NN), and non-learnable positional encoding, with a streamlined learnable classifier, significantly improving accuracy while keeping the parameter footprint minimal. This hybrid architecture ensures low computational cost and fast inference, making it well suited to real-time and resource-constrained applications. Evaluations on ModelNet40 and ScanObjectNN show performance competitive with state-of-the-art methods at exceptional efficiency, demonstrating robustness and scalability across diverse point cloud classification tasks.
Link: https://arxiv.org/abs/2501.14238
Authors: Marzieh Mohammadi, Amir Salarpour, Pedram MohajerAnsari
Institutions: Sirjan University of Technology, Iran; Clemson University, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: This paper has been accepted for presentation at the 29th International Computer Conference, Computer Society of Iran (CSICC) 2025
Abstract:We introduce Point-LN, a novel lightweight framework engineered for efficient 3D point cloud classification. Point-LN integrates essential non-parametric components-such as Farthest Point Sampling (FPS), k-Nearest Neighbors (k-NN), and non-learnable positional encoding-with a streamlined learnable classifier that significantly enhances classification accuracy while maintaining a minimal parameter footprint. This hybrid architecture ensures low computational costs and rapid inference speeds, making Point-LN ideal for real-time and resource-constrained applications. Comprehensive evaluations on benchmark datasets, including ModelNet40 and ScanObjectNN, demonstrate that Point-LN achieves competitive performance compared to state-of-the-art methods, all while offering exceptional efficiency. These results establish Point-LN as a robust and scalable solution for diverse point cloud classification tasks, highlighting its potential for widespread adoption in various computer vision applications.
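The non-parametric front end is straightforward to reproduce; a NumPy sketch of FPS and k-NN grouping (a brute-force version for clarity, not an optimized implementation):

```python
import numpy as np

def farthest_point_sampling(pts, m):
    """Greedy FPS: pts (N, 3) -> indices of m well-spread points."""
    n = pts.shape[0]
    idx = np.zeros(m, dtype=int)
    dist = np.full(n, np.inf)
    idx[0] = np.random.randint(n)
    for i in range(1, m):
        dist = np.minimum(dist, np.linalg.norm(pts - pts[idx[i - 1]], axis=1))
        idx[i] = int(dist.argmax())           # farthest from all picked so far
    return idx

def knn(pts, queries, k):
    """Indices of the k nearest neighbors of each query point."""
    d = np.linalg.norm(queries[:, None, :] - pts[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

cloud = np.random.rand(1024, 3)
centers = cloud[farthest_point_sampling(cloud, 64)]
groups = knn(cloud, centers, k=16)            # (64, 16) neighborhood indices
```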
[CV-41] Micro-macro Wavelet-based Gaussian Splatting for 3D Reconstruction from Unconstrained Images AAAI2025
Quick Read: This paper addresses the challenges of 3D reconstruction from unconstrained image collections, which suffer from varying appearances and transient occlusions. Micro-macro Wavelet-based Gaussian Splatting (MW-GS) enhances reconstruction by disentangling scene representations into global, refined, and intrinsic components. Two key innovations drive the method: Micro-macro Projection, which lets Gaussian points capture details from multi-scale feature maps with enhanced diversity, and Wavelet-based Sampling, which uses frequency-domain information to refine feature representations and markedly improve the modeling of scene appearance. A Hierarchical Residual Fusion Network seamlessly integrates these features. Extensive experiments show MW-GS delivers state-of-the-art rendering performance, surpassing existing methods.
Link: https://arxiv.org/abs/2501.14231
Authors: Yihui Li, Chengxin Lv, Hongyu Yang, Di Huang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures, accepted by AAAI 2025
Abstract:3D reconstruction from unconstrained image collections presents substantial challenges due to varying appearances and transient occlusions. In this paper, we introduce Micro-macro Wavelet-based Gaussian Splatting (MW-GS), a novel approach designed to enhance 3D reconstruction by disentangling scene representations into global, refined, and intrinsic components. The proposed method features two key innovations: Micro-macro Projection, which allows Gaussian points to capture details from feature maps across multiple scales with enhanced diversity; and Wavelet-based Sampling, which leverages frequency domain information to refine feature representations and significantly improve the modeling of scene appearances. Additionally, we incorporate a Hierarchical Residual Fusion Network to seamlessly integrate these features. Extensive experiments demonstrate that MW-GS delivers state-of-the-art rendering performance, surpassing existing methods.
[CV-42] GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm
Quick Read: This paper addresses the robustness of deep learning models to adversarial attacks, specifically how to generate high-quality adversarial examples in the black-box setting. White-box attacks are effective but depend on access to model gradients, limiting their practicality in many real-world scenarios. GreedyPixel is a novel pixel-wise greedy algorithm that crafts adversarial examples using only query feedback from the target model. Its key idea is to build a pixel priority map by ranking gradients obtained from a surrogate model and perturb pixels one at a time in that order, achieving attack success rates comparable to white-box methods without gradient access. GreedyPixel surpasses existing black-box attacks in success rate, computation time, and the imperceptibility of its perturbations.
Link: https://arxiv.org/abs/2501.14230
Authors: Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, Isao Echizen
Institutions: National Institute of Informatics, Japan; Academia Sinica, Taiwan; The University of Melbourne, Australia; The University of Tokyo, Japan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:A critical requirement for deep learning models is ensuring their robustness against adversarial attacks. These attacks commonly introduce noticeable perturbations, compromising the visual fidelity of adversarial examples. Another key challenge is that while white-box algorithms can generate effective adversarial perturbations, they require access to the model gradients, limiting their practicality in many real-world scenarios. Existing attack mechanisms struggle to achieve similar efficacy without access to these gradients. In this paper, we introduce GreedyPixel, a novel pixel-wise greedy algorithm designed to generate high-quality adversarial examples using only query-based feedback from the target model. GreedyPixel improves computational efficiency in what is typically a brute-force process by perturbing individual pixels in sequence, guided by a pixel-wise priority map. This priority map is constructed by ranking gradients obtained from a surrogate model, providing a structured path for perturbation. Our results demonstrate that GreedyPixel achieves attack success rates comparable to white-box methods without the need for gradient information, and surpasses existing algorithms in black-box settings, offering higher success rates, reduced computational time, and imperceptible perturbations. These findings underscore the advantages of GreedyPixel in terms of attack efficacy, time efficiency, and visual quality.
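A simplified single-image sketch of the greedy loop (the real method is more refined, e.g., per-channel perturbations and careful budgeting; `target` and `surrogate` are assumed to be classifiers returning logits, and each pixel costs a few target-model queries):

```python
import torch
import torch.nn.functional as F

def greedy_pixel_attack(target, surrogate, x, y, eps=8/255, budget=200):
    """x: (1,3,H,W) in [0,1]; y: (1,) true label. Perturb pixels one at a
    time in surrogate-gradient priority order, keeping whichever +/- eps
    change most increases the target model's loss (query feedback only)."""
    xg = x.clone().requires_grad_(True)
    F.cross_entropy(surrogate(xg), y).backward()
    order = xg.grad.abs().sum(dim=1).flatten().argsort(descending=True)
    W = x.shape[-1]
    adv = x.clone()
    with torch.no_grad():
        for p in order[:budget].tolist():
            i, j = divmod(p, W)
            best, best_loss = adv, F.cross_entropy(target(adv), y)
            for s in (-eps, eps):
                cand = adv.clone()
                cand[0, :, i, j] = (x[0, :, i, j] + s).clamp(0, 1)
                loss = F.cross_entropy(target(cand), y)
                if loss > best_loss:
                    best, best_loss = cand, loss
            adv = best
    return adv
```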
[CV-43] Detection and Classification of Acute Lymphoblastic Leukemia Utilizing Deep Transfer Learning
Quick Read: This paper addresses the difficulty of early leukemia diagnosis. Leukemia begins with a DNA mutation in a single cell that triggers the overproduction of immature white blood cells, crowding out healthy blood cell production; conventional diagnosis is time-consuming and complex. A deep learning approach is proposed to diagnose leukemia across four stages: Benign, Early, Pre, and Pro. The key is two Convolutional Neural Network (CNN) models: MobileNetV2 with an altered head, and a custom multi-layer convolutional network. The study uses the publicly available "Acute Lymphoblastic Leukemia (ALL) Image Dataset" and applies the Synthetic Minority Oversampling Technique (SMOTE) to augment and balance the training data. The custom model reaches 98.6% accuracy, while MobileNetV2 attains a higher 99.69%, indicating real-world potential.
Link: https://arxiv.org/abs/2501.14228
Authors: Md. Abu Ahnaf Mollick, Md. Mahfujur Rahman, D.M. Asadujjaman, Abdullah Tamim, Nosin Anjum Dristi, Md. Takbir Hossen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 4 pages, 4 figures, Submitted to UCICS
Abstract:A mutation in the DNA of a single cell that compromises its function initiates leukemia, leading to the overproduction of immature white blood cells that encroach upon the space required for the generation of healthy blood cells. Leukemia is treatable if identified in its initial stages. However, its diagnosis is both arduous and time consuming. This study proposes a novel approach for diagnosing leukemia across four stages (Benign, Early, Pre, and Pro) using deep learning techniques. We employed two Convolutional Neural Network (CNN) models: MobileNetV2 with an altered head, and a custom model. The custom model consists of multiple convolutional layers, each paired with a corresponding max pooling layer. We utilized MobileNetV2 with ImageNet weights, adjusting the head to integrate the final classification layer. The dataset used is the publicly available “Acute Lymphoblastic Leukemia (ALL) Image Dataset”, and we applied the Synthetic Minority Oversampling Technique (SMOTE) to augment and balance the training data. The custom model achieved an accuracy of 98.6%, while MobileNetV2 attained a superior accuracy of 99.69%. The pretrained model showed promising results, indicating an increased likelihood of real-world application.
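A sketch of the altered-head setup with torchvision, plus the SMOTE call from imbalanced-learn (the exact head layers are an assumption, and `X_train_flat`/`y_train` are hypothetical arrays; SMOTE operates on flat feature vectors, so images must be reshaped to (n_samples, n_features) first):

```python
import torch.nn as nn
from torchvision import models

# MobileNetV2 with ImageNet weights; replace the head for 4 leukemia stages.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
model.classifier = nn.Sequential(
    nn.Dropout(p=0.2),
    nn.Linear(model.last_channel, 4),   # Benign, Early, Pre, Pro
)

# Class balancing with SMOTE (requires the imbalanced-learn package):
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE().fit_resample(X_train_flat, y_train)
```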
[CV-44] PuzzleGPT : Emulating Human Puzzle-Solving Ability for Time and Location Prediction NAACL2025
Quick Read: This paper tackles time and location prediction from images, a task that demands complex, human-like puzzle-solving over diverse clues. The authors formalize this ability into core skills and implement them as modules in an expert pipeline called PuzzleGPT: a perceiver that identifies visual clues, a reasoner that deduces prediction candidates, a combiner that combinatorially merges information from different clues, a web retriever that fetches external knowledge when the task cannot be solved locally, and a noise filter for robustness. The resulting zero-shot, interpretable, and robust approach sets state-of-the-art performance on two datasets, TARA and WikiTilo, outperforming large VLMs such as BLIP-2, InstructBLIP, LLaVA, and even GPT-4V, as well as automatically generated reasoning pipelines like VisProg, by at least 32% and 38% respectively, and it even rivals or surpasses fine-tuned models.
Link: https://arxiv.org/abs/2501.14210
Authors: Hammad Ayyubi, Xuande Feng, Junzhang Liu, Xudong Lin, Zhecan Wang, Shih-Fu Chang
Institutions: Columbia University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: NAACL 2025 Findings
Abstract:The task of predicting time and location from images is challenging and requires complex human-like puzzle-solving ability over different clues. In this work, we formalize this ability into core skills and implement them using different modules in an expert pipeline called PuzzleGPT. PuzzleGPT consists of a perceiver to identify visual clues, a reasoner to deduce prediction candidates, a combiner to combinatorially combine information from different clues, a web retriever to get external knowledge if the task can’t be solved locally, and a noise filter for robustness. This results in a zero-shot, interpretable, and robust approach that records state-of-the-art performance on two datasets – TARA and WikiTilo. PuzzleGPT outperforms large VLMs such as BLIP-2, InstructBLIP, LLaVA, and even GPT-4V, as well as automatically generated reasoning pipelines like VisProg, by at least 32% and 38%, respectively. It even rivals or surpasses finetuned models.
[CV-45] You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations
【Quick Read】: This paper targets the challenges of bimanual robotic manipulation, notably dual-arm spatio-temporal coordination and high-dimensional action spaces. Prior work relies on pre-defined action taxonomies or direct teleoperation to alleviate or circumvent these issues, often sacrificing simplicity, versatility, and scalability. The key innovation of YOTO (You Only Teach Once) is to learn from human demonstration videos: it extracts and injects patterns of bimanual actions from as little as a single binocular observation and teaches dual robot arms a variety of complex tasks. Furthermore, based on keyframe motion trajectories, YOTO rapidly generates diverse training demonstrations with variations in manipulated objects and their locations. These data are used to learn a customized bimanual diffusion policy (BiDP) across diverse scenes. Experiments show that YOTO performs impressively in imitating five intricate long-horizon bimanual tasks, generalizes strongly under different visual and spatial conditions, and outperforms existing visuomotor imitation learning methods in accuracy and efficiency.
Link: https://arxiv.org/abs/2501.14208
Authors: Huayi Zhou,Ruixiang Wang,Yunxin Tai,Yueci Deng,Guiliang Liu,Kui Jia
Affiliations: The Chinese University of Hong Kong, Shenzhen; Harbin Institute of Technology, Weihai; DexForce, Shenzhen
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: under review
Abstract:Bimanual robotic manipulation is a long-standing challenge of embodied intelligence due to its characteristics of dual-arm spatial-temporal coordination and high-dimensional action spaces. Previous studies rely on pre-defined action taxonomies or direct teleoperation to alleviate or circumvent these issues, often making them lack simplicity, versatility and scalability. Differently, we believe that the most effective and efficient way for teaching bimanual manipulation is learning from human demonstrated videos, where rich features such as spatial-temporal positions, dynamic postures, interaction states and dexterous transitions are available almost for free. In this work, we propose the YOTO (You Only Teach Once), which can extract and then inject patterns of bimanual actions from as few as a single binocular observation of hand movements, and teach dual robot arms various complex tasks. Furthermore, based on keyframes-based motion trajectories, we devise a subtle solution for rapidly generating training demonstrations with diverse variations of manipulated objects and their locations. These data can then be used to learn a customized bimanual diffusion policy (BiDP) across diverse scenes. In experiments, YOTO achieves impressive performance in mimicking 5 intricate long-horizon bimanual tasks, possesses strong generalization under different visual and spatial conditions, and outperforms existing visuomotor imitation learning methods in accuracy and efficiency. Our project link is this https URL.
[CV-46] Dynamic Token Reduction during Generation for Vision Language Models
【Quick Read】: This paper addresses the quadratic complexity that vision-language models (VLMs) face in decoder attention mechanisms and autoregressive generation. Existing methods such as FASTV and VTW achieve notable reductions of redundant visual tokens, but they prune tokens within a single forward pass and do not systematically analyze visual-token redundancy over the whole generation process. The paper proposes Dynamic Rate (DyRate), a strategy that progressively adjusts the compression rate during generation. An analysis of attention distributions shows that the importance of visual tokens decreases as generation proceeds, motivating a more aggressive compression rate. By integrating a lightweight predictor based on the attention distribution, the method flexibly adjusts the pruning rate. Experiments demonstrate that the approach reduces computational demands while maintaining response quality.
Link: https://arxiv.org/abs/2501.14204
Authors: Xiaoyu Liang,Chaofeng Guan,Jiaying Lu,Huiyao Chen,Huan Wang,Haoji Hu
Affiliations: Zhejiang University; Harbin Institute of Technology; Westlake University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Vision-Language Models (VLMs) have achieved notable success in multimodal tasks but face practical limitations due to the quadratic complexity of decoder attention mechanisms and autoregressive generation. Existing methods like FASTV and VTW have achieved notable results in reducing redundant visual tokens, but these approaches focus on pruning tokens in a single forward pass without systematically analyzing the redundancy of visual tokens throughout the entire generation process. In this paper, we introduce a dynamic pruning strategy tailored for VLMs, named Dynamic Rate (DyRate), which progressively adjusts the compression rate during generation. Our analysis of the distribution of attention reveals that the importance of visual tokens decreases throughout the generation process, inspiring us to adopt a more aggressive compression rate. By integrating a lightweight predictor based on attention distribution, our approach enables flexible adjustment of pruning rates based on the attention distribution. Our experimental results demonstrate that our method not only reduces computational demands but also maintains the quality of responses.
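As a concrete illustration of attention-guided pruning with a step-dependent keep rate, the sketch below scores visual tokens by the attention they receive and keeps only the top fraction, with that fraction shrinking as generation proceeds. This is a generic reconstruction of the idea, not DyRate itself; the linear schedule stands in for the paper's learned rate predictor.

```python
import torch

def keep_rate_schedule(step, total_steps, start=0.9, end=0.3):
    # Placeholder for DyRate's learned predictor: linearly decay the
    # fraction of visual tokens kept as generation proceeds.
    t = min(step / max(total_steps - 1, 1), 1.0)
    return start + (end - start) * t

def prune_visual_tokens(tokens, attn_received, keep_rate):
    """tokens: (B, N, D) visual tokens; attn_received: (B, N) mean
    attention each visual token receives from the text tokens."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_rate))
    idx = attn_received.topk(k, dim=1).indices.sort(dim=1).values  # keep original order
    return tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))
```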
[CV-47] VideoShield: Regulating Diffusion-based Video Generation Models via Watermarking ICLR
【Quick Read】: This paper addresses content control for video generation models such as text-to-video (T2V) and image-to-video (I2V) models, and specifically how to embed watermarks during generation without hurting video quality. Traditional post-processing methods embed watermarks frame by frame and often degrade quality. The paper proposes VideoShield, a watermarking framework designed for diffusion-based video generation whose key idea is to embed the watermark directly during generation, with no additional training: watermark bits are mapped to template bits, which are used to generate watermarked noise during the denoising process. The framework also introduces tamper localization, detecting modifications both temporally (across frames) and spatially (within individual frames). Via DDIM inversion, a video can be reversed to its original watermarked noise, enabling watermark extraction and tamper detection. Experiments across multiple video models (both T2V and I2V) show that the method extracts watermarks and detects tampering effectively without compromising video quality, and that the approach also applies to image generation models.
Link: https://arxiv.org/abs/2501.14195
Authors: Runyi Hu,Jie Zhang,Yiming Li,Jiwei Li,Qing Guo,Han Qiu,Tianwei Zhang
Affiliations: Nanyang Technological University; CFAR and IHPC, A*STAR, Singapore; Zhejiang University; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Learning Representations (ICLR) 2025
Abstract:Artificial Intelligence Generated Content (AIGC) has advanced significantly, particularly with the development of video generation models such as text-to-video (T2V) models and image-to-video (I2V) models. However, like other AIGC types, video generation requires robust content control. A common approach is to embed watermarks, but most research has focused on images, with limited attention given to videos. Traditional methods, which embed watermarks frame-by-frame in a post-processing manner, often degrade video quality. In this paper, we propose VideoShield, a novel watermarking framework specifically designed for popular diffusion-based video generation models. Unlike post-processing methods, VideoShield embeds watermarks directly during video generation, eliminating the need for additional training. To ensure video integrity, we introduce a tamper localization feature that can detect changes both temporally (across frames) and spatially (within individual frames). Our method maps watermark bits to template bits, which are then used to generate watermarked noise during the denoising process. Using DDIM Inversion, we can reverse the video to its original watermarked noise, enabling straightforward watermark extraction. Additionally, template bits allow precise detection for potential temporal and spatial modification. Extensive experiments across various video models (both T2V and I2V models) demonstrate that our method effectively extracts watermarks and detects tamper without compromising video quality. Furthermore, we show that this approach is applicable to image generation models, enabling tamper detection in generated images as well. Codes and models are available at this https URL.
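The following toy sketch illustrates the general bits-to-noise idea: watermark bits constrain the sign of blocks of the initial Gaussian noise, and once that noise is approximately recovered (e.g., via DDIM inversion) the bits are read back from the signs. It is a deliberately simplified stand-in for VideoShield's template-bit scheme; the block size and majority-vote decoder are assumptions.

```python
import torch

def embed_bits_in_noise(bits, shape, block=64):
    # bits: (M,) tensor of {0,1}; each bit fixes the sign of one block
    # of `block` noise values (a toy version of template bits).
    noise = torch.randn(shape).flatten()
    for i, b in enumerate(bits):
        seg = noise[i * block:(i + 1) * block].abs()
        noise[i * block:(i + 1) * block] = seg if b == 1 else -seg
    return noise.reshape(shape)

def extract_bits(recovered_noise, num_bits, block=64):
    flat = recovered_noise.flatten()
    return torch.tensor([
        int(flat[i * block:(i + 1) * block].sum() > 0)  # majority of signs
        for i in range(num_bits)
    ])
```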
[CV-48] ENTER: Event Based Interpretable Reasoning for VideoQA
【Quick Read】: This paper addresses the limited interpretability and robustness of existing video question answering (VideoQA) systems. Existing interpretable VideoQA systems are typically top-down, disregarding low-level visual information when generating reasoning plans, which makes them brittle and hard to interpret, while bottom-up approaches produce responses from visual data but lack interpretability. The proposed ENTER system converts videos into graph representations based on event graphs, where video events form the nodes and event-event relationships (temporal, causal, hierarchical) form the edges. This structured representation offers three key benefits: 1) interpretable VideoQA via generated code that parses the event graph; 2) incorporation of contextual visual information into the reasoning process (code generation) through the event graph; 3) robust VideoQA via hierarchical iterative updates of the event graph. Results on NExT-QA, IntentQA, and EgoSchema show that the method outperforms existing top-down approaches, is competitive with bottom-up ones, and offers superior interpretability and explainability in the reasoning process.
Link: https://arxiv.org/abs/2501.14194
Authors: Hammad Ayyubi,Junzhang Liu,Ali Asgarov,Zaber Ibn Abdul Hakim,Najibul Haque Sarker,Zhecan Wang,Chia-Wei Tang,Hani Alomari,Md. Atabuzzaman,Xudong Lin,Naveen Reddy Dyava,Shih-Fu Chang,Chris Thomas
Affiliations: Columbia University; Virginia Tech
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:In this paper, we present ENTER, an interpretable Video Question Answering (VideoQA) system based on event graphs. Event graphs convert videos into graphical representations, where video events form the nodes and event-event relationships (temporal/causal/hierarchical) form the edges. This structured representation offers many benefits: 1) Interpretable VideoQA via generated code that parses event-graph; 2) Incorporation of contextual visual information in the reasoning process (code generation) via event graphs; 3) Robust VideoQA via Hierarchical Iterative Update of the event graphs. Existing interpretable VideoQA systems are often top-down, disregarding low-level visual information in the reasoning plan generation, and are brittle. While bottom-up approaches produce responses from visual data, they lack interpretability. Experimental results on NExT-QA, IntentQA, and EgoSchema demonstrate that not only does our method outperform existing top-down approaches while obtaining competitive performance against bottom-up approaches, but more importantly, offers superior interpretability and explainability in the reasoning process.
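A minimal sketch of the event-graph representation described above: events as nodes, typed edges for temporal/causal/hierarchical relations, and a small query helper of the kind that generated code could call. The function names and example events are illustrative assumptions, not the paper's API.

```python
import networkx as nx

def build_event_graph():
    g = nx.MultiDiGraph()  # allows several typed edges between two events
    g.add_node("e1", description="person opens fridge")
    g.add_node("e2", description="person takes out milk")
    g.add_edge("e1", "e2", relation="temporal")  # e1 happens before e2
    g.add_edge("e1", "e2", relation="causal")    # e1 enables e2
    return g

def events_related_to(g, event, relation):
    # Successors of `event` connected by an edge of the given type.
    return [v for _, v, data in g.out_edges(event, data=True)
            if data.get("relation") == relation]

g = build_event_graph()
print(events_related_to(g, "e1", "causal"))  # ['e2']
```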
[CV-49] High-Precision Fabric Defect Detection via Adaptive Shape Convolutions and Large Kernel Spatial Modeling
【Quick Read】: This paper addresses the challenges of fabric defect detection in the textile industry, where traditional methods suffer from slow inference, limited accuracy, and inadequate recognition rates, particularly for intricate or subtle defects. To overcome these limitations, the paper proposes Fab-ASLKS, a fabric defect detection framework built on the YOLOv8s architecture with two key modules: (1) the Adaptive Shape Convolution Module (ASCM), which leverages adaptive shape convolution within the Neck to enhance feature fusion and extends the capability of the standard C2f structure for better efficiency; and (2) the Large Kernel Shift Convolution Module (LKSCM), which emulates large-kernel effects within the Backbone to extract superior spatial information. Together, the modules optimize feature extraction and information integration across the network. Experiments on the Tianchi fabric defect detection dataset show that Fab-ASLKS improves mAP@50 by 5% over the baseline, demonstrating high precision and efficiency.
Link: https://arxiv.org/abs/2501.14190
Authors: Shuai Wang,Yang Xu,Hui Zheng,Baotian Li
Affiliations: School of Information Engineering, Shandong Youth University of Political Science
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 9 figures
Abstract:Detecting fabric defects in the textile industry remains a challenging task due to the diverse and complex nature of defect patterns. Traditional methods often suffer from slow inference speeds, limited accuracy, and inadequate recognition rates, particularly in scenarios involving intricate or subtle defects. To overcome these limitations, we introduce Fab-ASLKS, an advanced fabric defect detection framework built upon the YOLOv8s architecture. Fab-ASLKS incorporates two key modules: (1) the Adaptive Shape Convolution Module (ASCM), which leverages adaptive shape convolution within the Neck to enhance feature fusion and improve efficiency by extending the capabilities of the standard C2f structure, and (2) the Large Kernel Shift Convolution Module (LKSCM), designed to emulate large kernel effects within the Backbone, enabling superior spatial information extraction. These modules collaboratively optimize feature extraction and information integration across the network. Extensive experiments conducted on the Tianchi fabric defect detection dataset demonstrate that Fab-ASLKS achieves a 5% improvement in mAP@50 over the baseline, showcasing its capability to deliver high precision and efficiency.
[CV-50] Post-hoc Spurious Correlation Neutralization with Single-Weight Fictitious Class Unlearning
【Quick Read】: This paper addresses the tendency of neural network training to exploit features that are spuriously correlated with the target labels as shortcuts for prediction. Such spurious features can cause incorrect predictions once the model is deployed. Existing methods typically suppress spurious correlations during training, which not only adds training cost but also has limited practical utility, since spurious correlations are usually discovered only after deployment. The paper instead proposes a post-hoc method that neutralizes the impact of spurious features in a controllable manner. Its key idea is to conceptualize spurious features as fictitious sub-classes within the original classes and to eliminate them with a precise class removal technique. This technique modifies only a single weight, yielding reliable predictions with negligible performance loss. Experiments show that editing just one weight matches or exceeds the performance of state-of-the-art methods.
Link: https://arxiv.org/abs/2501.14182
Authors: Shahin Hakemi,Naveed Akhtar,Ghulam Mubashar Hassan,Ajmal Mian
Affiliations: The University of Western Australia; The University of Melbourne
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Neural network training tends to exploit the simplest features as shortcuts to greedily minimize training loss. However, some of these features might be spuriously correlated with the target labels, leading to incorrect predictions by the model. Several methods have been proposed to address this issue. Focusing on suppressing the spurious correlations with model training, they not only incur additional training cost, but also have limited practical utility as the model misbehavior due to spurious relations is usually discovered after its deployment. It is also often overlooked that spuriousness is a subjective notion. Hence, the precise questions that must be investigated are: to what degree a feature is spurious, and how we can proportionally distract the model’s attention from it for reliable prediction. To this end, we propose a method that enables post-hoc neutralization of spurious feature impact, controllable to an arbitrary degree. We conceptualize spurious features as fictitious sub-classes within the original classes, which can be eliminated by a class removal scheme. We then propose a unique precise class removal technique that employs a single-weight modification, which entails negligible performance compromise for the remaining classes. We perform extensive experiments, demonstrating that by editing just a single weight in a post-hoc manner, our method achieves highly competitive, or better performance against the state-of-the-art methods.
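As a hedged illustration of controllable single-weight editing, the sketch below attenuates one classifier weight linking a class logit to a feature judged spurious, with a user-chosen strength. The linear-head assumption and the way the feature index is found are simplifications; the paper's actual procedure removes a fictitious sub-class rather than a raw feature.

```python
import torch

@torch.no_grad()
def neutralize_spurious_weight(linear_head, class_idx, feat_idx, alpha=1.0):
    """Scale down a single weight W[class, feature] of a linear classifier.

    alpha = 0 leaves the model unchanged; alpha = 1 removes the feature's
    contribution to that class entirely (the controllable degree of
    neutralization discussed above).
    """
    linear_head.weight[class_idx, feat_idx] *= (1.0 - alpha)

# Usage on a toy head: 10 classes over 512-d features (hypothetical indices).
head = torch.nn.Linear(512, 10)
neutralize_spurious_weight(head, class_idx=3, feat_idx=42, alpha=0.7)
```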
[CV-51] Dreamweaver: Learning Compositional World Representations from Pixels
【Quick Read】: This paper addresses the problem of decomposing videos into compositional concepts and generating unseen, recomposed future scenes without auxiliary data such as text, masks, or bounding boxes. Humans innately decompose their perceptions of the world into objects and their attributes (such as colors, shapes, and movement patterns), a cognitive process that lets us imagine novel futures by recombining familiar concepts. Replicating this ability in artificial intelligence systems has proven challenging, particularly when modeling videos as compositional concepts and generating unseen future scenes.
The proposed solution is Dreamweaver, a neural architecture designed to discover hierarchical and compositional representations from raw videos and to generate compositional future simulations. Its key component is a novel Recurrent Block-Slot Unit (RBSU) that decomposes videos into their constituent objects and attributes. In addition, Dreamweaver uses a multi-future-frame prediction objective to capture disentangled representations of dynamic as well as static concepts more effectively. In experiments, the model outperforms state-of-the-art world-modeling baselines when evaluated under the DCI framework across multiple datasets, and its modularized concept representations enable compositional imagination, generating novel videos by recombining attributes from different objects.
Link: https://arxiv.org/abs/2501.14174
Authors: Junyeob Baek,Yi-Fu Wu,Gautam Singh,Sungjin Ahn
Affiliations: KAIST; Rutgers University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Humans have an innate ability to decompose their perceptions of the world into objects and their attributes, such as colors, shapes, and movement patterns. This cognitive process enables us to imagine novel futures by recombining familiar concepts. However, replicating this ability in artificial intelligence systems has proven challenging, particularly when it comes to modeling videos into compositional concepts and generating unseen, recomposed futures without relying on auxiliary data, such as text, masks, or bounding boxes. In this paper, we propose Dreamweaver, a neural architecture designed to discover hierarchical and compositional representations from raw videos and generate compositional future simulations. Our approach leverages a novel Recurrent Block-Slot Unit (RBSU) to decompose videos into their constituent objects and attributes. In addition, Dreamweaver uses a multi-future-frame prediction objective to capture disentangled representations for dynamic concepts more effectively as well as static concepts. In experiments, we demonstrate our model outperforms current state-of-the-art baselines for world modeling when evaluated under the DCI framework across multiple datasets. Furthermore, we show how the modularized concept representations of our model enable compositional imagination, allowing the generation of novel videos by recombining attributes from different objects.
[CV-52] UltraLightSqueezeNet: A Deep Learning Architecture for Malaria Classification with up to 54x fewer trainable parameters for resource constrained devices
【Quick Read】: This paper targets malaria detection with lightweight deep learning models in resource-constrained environments. The study selects SqueezeNet1.1, a later and 2.4x more computationally efficient version of SqueezeNet1.0, as the base architecture, and proposes three ultra-lightweight variants (Variant 1, Variant 2, and Variant 3) that are even more compact than SqueezeNet1.1 (eight fire modules). The variants are evaluated on malaria blood-cell classification to find the model that best preserves accuracy while markedly reducing computational overhead. Trained and evaluated on the NIH Malaria dataset, SqueezeNet1.1 performs best across all metrics with 97.12% classification accuracy, while Variant 3 (four fire modules) offers a competitive alternative at nearly identical accuracy (96.55%) with a 6x reduction in computational overhead. Variant 2 (two fire modules) and Variant 1 (one fire module) perform slightly below Variant 3, reducing computational overhead by 28x and trainable parameters by 54x, respectively, compared to SqueezeNet1.1. These findings show that the SqueezeNet1.1 variants provide a flexible trade-off between resource constraints and performance for malaria detection.
Link: https://arxiv.org/abs/2501.14172
Authors: Suresh Babu Nettur,Shanthi Karpurapu,Unnati Nettur,Likhit Sagar Gajja,Sravanthy Myneni,Akhil Dusi,Lalithya Posham
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Lightweight deep learning approaches for malaria detection have gained attention for their potential to enhance diagnostics in resource constrained environments. For our study, we selected SqueezeNet1.1 as it is one of the most popular lightweight architectures. SqueezeNet1.1 is a later version of SqueezeNet1.0 and is 2.4 times more computationally efficient than the original model. We proposed and implemented three ultra-lightweight architecture variants to SqueezeNet1.1 architecture, namely Variant 1 (one fire module), Variant 2 (two fire modules), and Variant 3 (four fire modules), which are even more compact than SqueezeNetV1.1 (eight fire modules). These models were implemented to evaluate the best performing variant that achieves superior computational efficiency without sacrificing accuracy in malaria blood cell classification. The models were trained and evaluated using the NIH Malaria dataset. We assessed each model’s performance based on metrics including accuracy, recall, precision, F1-score, and Area Under the Curve (AUC). The results show that the SqueezeNet1.1 model achieves the highest performance across all metrics, with a classification accuracy of 97.12%. Variant 3 (four fire modules) offers a competitive alternative, delivering almost identical results (accuracy 96.55%) with a 6x reduction in computational overhead compared to SqueezeNet1.1. Variant 2 and Variant 1 perform slightly lower than Variant 3, with Variant 2 (two fire modules) reducing computational overhead by 28x, and Variant 1 (one fire module) achieving a 54x reduction in trainable parameters compared to SqueezeNet1.1. These findings demonstrate that our SqueezeNet1.1 architecture variants provide a flexible approach to malaria detection, enabling the selection of a variant that balances resource constraints and performance.
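A small sketch of the building block being counted above: the classic SqueezeNet fire module (a 1x1 squeeze followed by parallel 1x1 and 3x3 expands), stacked N times to mimic the variants. The channel widths, pooling placement, and classifier are illustrative assumptions; only the fire-module structure follows the SqueezeNet design.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet fire module: 1x1 squeeze, then parallel 1x1/3x3 expands."""
    def __init__(self, in_ch, squeeze, expand):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze, 1), nn.ReLU(inplace=True))
        self.e1 = nn.Sequential(nn.Conv2d(squeeze, expand, 1), nn.ReLU(inplace=True))
        self.e3 = nn.Sequential(nn.Conv2d(squeeze, expand, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.squeeze(x)
        return torch.cat([self.e1(x), self.e3(x)], dim=1)

def make_variant(num_fire_modules, num_classes=2):
    layers = [nn.Conv2d(3, 64, 3, stride=2), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2)]
    ch = 64
    for _ in range(num_fire_modules):          # 1, 2, or 4 in the paper's variants
        layers.append(Fire(ch, squeeze=16, expand=64))
        ch = 128                               # 64 + 64 expand channels
    layers += [nn.Conv2d(ch, num_classes, 1), nn.AdaptiveAvgPool2d(1), nn.Flatten()]
    return nn.Sequential(*layers)
```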
[CV-53] Enhancing Multimodal Entity Linking with Jaccard Distance-based Conditional Contrastive Learning and Contextual Visual Augmentation
【Quick Read】: This paper addresses a weakness of contrastive learning in multimodal entity linking (MEL): existing methods use the rest of the batch as negative samples without careful consideration, so models risk latching onto easy features and overlooking the essential details that make entities unique. The paper proposes JD-CCL (Jaccard Distance-based Conditional Contrastive Learning), which leverages meta-information to select negative samples with similar attributes, making the linking task more challenging and robust. In addition, to address variations within the visual modality between mentions and entities, it introduces CVaCPT (Contextual Visual-aid Controllable Patch Transform), which enhances visual representations by incorporating multi-view synthetic images and contextual textual representations to scale and shift patch representations. Experiments on benchmark MEL datasets demonstrate the strong effectiveness of the approach.
Link: https://arxiv.org/abs/2501.14166
Authors: Cong-Duy Nguyen,Xiaobao Wu,Thong Nguyen,Shuai Zhao,Khoi Le,Viet-Anh Nguyen,Feng Yichao,Anh Tuan Luu
Affiliations: Nanyang Technological University, Singapore; National University of Singapore, Singapore; VinAI Research, Vietnam
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Previous research on multimodal entity linking (MEL) has primarily employed contrastive learning as the primary objective. However, using the rest of the batch as negative samples without careful consideration, these studies risk leveraging easy features and potentially overlook essential details that make entities unique. In this work, we propose JD-CCL (Jaccard Distance-based Conditional Contrastive Learning), a novel approach designed to enhance the ability to match multimodal entity linking models. JD-CCL leverages meta-information to select negative samples with similar attributes, making the linking task more challenging and robust. Additionally, to address the limitations caused by the variations within the visual modality among mentions and entities, we introduce a novel method, CVaCPT (Contextual Visual-aid Controllable Patch Transform). It enhances visual representations by incorporating multi-view synthetic images and contextual textual representations to scale and shift patch representations. Experimental results on benchmark MEL datasets demonstrate the strong effectiveness of our approach.
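A minimal sketch of Jaccard-distance-based hard-negative selection over attribute sets, the core idea behind JD-CCL's negative sampling: candidates whose attributes overlap most with the anchor entity (smallest Jaccard distance) make the hardest negatives. The attribute encoding and number of negatives are assumptions.

```python
def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    return 1.0 - (len(a & b) / len(union)) if union else 0.0

def select_hard_negatives(anchor_attrs, candidates, k=4):
    """candidates: dict entity_id -> set of attribute strings.
    Entities most similar in attributes (lowest distance) are the
    hardest negatives."""
    ranked = sorted(candidates.items(),
                    key=lambda kv: jaccard_distance(anchor_attrs, kv[1]))
    return [eid for eid, _ in ranked[:k]]

# Toy usage with hypothetical attribute sets.
cands = {"Paris_TX": {"city", "USA"}, "Paris": {"city", "France", "capital"}}
print(select_hard_negatives({"city", "France"}, cands, k=1))  # ['Paris']
```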
[CV-54] Advancing MRI Reconstruction: A Systematic Review of Deep Learning and Compressed Sensing Integration
【Quick Read】: This paper addresses the long acquisition times of magnetic resonance imaging (MRI), which can cause patient discomfort and motion artifacts and limit real-time applications. Various strategies have been explored, including parallel imaging and compressed sensing (CS), which accelerate imaging by reducing the amount of acquired data. Recently, deep learning (DL) has emerged as a powerful tool for MRI reconstruction and has been combined with parallel imaging and CS principles to achieve faster and more accurate reconstructions. The key contribution is a systematic review and categorization of DL-based MRI reconstruction methods, including end-to-end approaches, unrolled optimization, and federated learning, highlighting their potential benefits. The review also summarizes key results and trends, covering quantitative metrics, datasets, acceleration factors, and the progress of research interest over time, and discusses future directions.
Link: https://arxiv.org/abs/2501.14158
Authors: Mojtaba Safari,Zach Eidex,Chih-Wei Chang,Richard L.J. Qiu,Xiaofeng Yang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Comments:
Abstract:Magnetic resonance imaging (MRI) is a non-invasive imaging modality and provides comprehensive anatomical and functional insights into the human body. However, its long acquisition times can lead to patient discomfort, motion artifacts, and limiting real-time applications. To address these challenges, strategies such as parallel imaging have been applied, which utilize multiple receiver coils to speed up the data acquisition process. Additionally, compressed sensing (CS) is a method that facilitates image reconstruction from sparse data, significantly reducing image acquisition time by minimizing the amount of data collection needed. Recently, deep learning (DL) has emerged as a powerful tool for improving MRI reconstruction. It has been integrated with parallel imaging and CS principles to achieve faster and more accurate MRI reconstructions. This review comprehensively examines DL-based techniques for MRI reconstruction. We categorize and discuss various DL-based methods, including end-to-end approaches, unrolled optimization, and federated learning, highlighting their potential benefits. Our systematic review highlights significant contributions and underscores the potential of DL in MRI reconstruction. Additionally, we summarize key results and trends in DL-based MRI reconstruction, including quantitative metrics, the dataset, acceleration factors, and the progress of and research interest in DL techniques over time. Finally, we discuss potential future directions and the importance of DL-based MRI reconstruction in advancing medical imaging. To facilitate further research in this area, we provide a GitHub repository that includes up-to-date DL-based MRI reconstruction publications and public datasets: this https URL.
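For orientation, the unrolled-optimization methods surveyed here typically build on the standard compressed-sensing reconstruction problem, where A is the undersampled acquisition operator, y the measured k-space data, and Psi a sparsifying transform. This is the textbook formulation, stated for context rather than taken from the paper:

```latex
\hat{x} \;=\; \arg\min_{x}\; \tfrac{1}{2}\,\lVert \mathcal{A}x - y \rVert_2^2
\;+\; \lambda\,\lVert \Psi x \rVert_1
```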
[CV-55] Effective Defect Detection Using Instance Segmentation for NDI AAAI2025
【Quick Read】: This paper addresses the difficulty of identifying defects in ultrasonic testing, a common non-destructive inspection (NDI) method in aerospace manufacturing, where the complexity and size of ultrasonic scans make identification through visual inspection or machine learning models challenging. The key solution is instance segmentation: two models, based on Mask-RCNN (Detectron 2) and YOLO 11 respectively, identify defects in ultrasonic scan images of composite panels representative of real aerospace components. The study also introduces a simple statistical pre-processing technique that reduces the need for custom-tailored pre-processing, significantly cutting data pre-processing time, inspection time, and overall cost, and demonstrating the feasibility and effectiveness of instance segmentation in the NDI pipeline.
Link: https://arxiv.org/abs/2501.14149
Authors: Ashiqur Rahman,Venkata Devesh Reddy Seethi,Austin Yunker,Zachary Kral,Rajkumar Kettimuthu,Hamed Alhoori
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 2 tables. Published at AI2ASE 2025 workshop at AAAI2025. Accepted publication is available at this https URL
Abstract:Ultrasonic testing is a common Non-Destructive Inspection (NDI) method used in aerospace manufacturing. However, the complexity and size of the ultrasonic scans make it challenging to identify defects through visual inspection or machine learning models. Using computer vision techniques to identify defects from ultrasonic scans is an evolving research area. In this study, we used instance segmentation to identify the presence of defects in the ultrasonic scan images of composite panels that are representative of real components manufactured in aerospace. We used two models based on Mask-RCNN (Detectron 2) and YOLO 11 respectively. Additionally, we implemented a simple statistical pre-processing technique that reduces the burden of requiring custom-tailored pre-processing techniques. Our study demonstrates the feasibility and effectiveness of using instance segmentation in the NDI pipeline by significantly reducing data pre-processing time, inspection time, and overall costs.
[CV-56] SelfPrompt: Confidence-Aware Semi-Supervised Tuning for Robust Vision-Language Model Adaptation
【Quick Read】: This paper targets the degraded pseudo-label quality caused by miscalibrated vision-language models (VLMs) and the accumulation of noisy pseudo-labels when tuning VLMs in a semi-supervised setting. The proposed SelfPrompt addresses these issues with two key components: 1) a cluster-guided pseudo-labelling method that improves pseudo-label accuracy; and 2) a confidence-aware semi-supervised learning module that combines supervised and weakly-supervised learning to maximize the utilization of unlabelled data. The paper further studies an active semi-supervised learning setting, where the labelled set is strategically selected, and proposes a weakly-supervised sampling technique that picks a diverse and representative labelled set to make the best use of a limited labelling budget. Evaluations across 13 datasets show that SelfPrompt significantly surpasses prior methods, with average improvements of 6.23% in standard semi-supervised learning, 6.25% in active semi-supervised learning, and 4.9% in base-to-novel generalization under a 2-shot setup, and it generalizes excellently in single-shot settings with an average improvement of 11.78%.
Link: https://arxiv.org/abs/2501.14148
Authors: Shuvendu Roy,Ali Etemad
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present SelfPrompt, a novel prompt-tuning approach for vision-language models (VLMs) in a semi-supervised learning setup. Existing methods for tuning VLMs in semi-supervised setups struggle with the negative impact of the miscalibrated VLMs on pseudo-labelling, and the accumulation of noisy pseudo-labels. SelfPrompt addresses these challenges by introducing a cluster-guided pseudo-labelling method that improves pseudo-label accuracy, and a confidence-aware semi-supervised learning module that maximizes the utilization of unlabelled data by combining supervised learning and weakly-supervised learning. Additionally, we investigate our method in an active semi-supervised learning setup, where the labelled set is strategically selected to ensure the best utilization of a limited labelling budget. To this end, we propose a weakly-supervised sampling technique that selects a diverse and representative labelled set, which can be seamlessly integrated into existing methods to enhance their performance. We conduct extensive evaluations across 13 datasets, significantly surpassing state-of-the-art performances with average improvements of 6.23% in standard semi-supervised learning, 6.25% in active semi-supervised learning, and 4.9% in base-to-novel generalization, using a 2-shot setup. Furthermore, SelfPrompt shows excellent generalization in single-shot settings, achieving an average improvement of 11.78%.
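A hedged sketch of cluster-guided pseudo-labelling as described above: cluster unlabelled features, give every sample the majority pseudo-label of its cluster, and keep only samples from sufficiently pure clusters. The clustering algorithm, purity threshold, and majority rule are illustrative choices, not necessarily SelfPrompt's.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_guided_pseudo_labels(feats, vlm_preds, n_clusters=10, purity=0.6):
    """feats: (N, D) image features; vlm_preds: (N,) array of zero-shot
    VLM labels. Returns (indices, labels) for samples whose cluster
    agrees strongly on one label."""
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    keep_idx, keep_lab = [], []
    for c in range(n_clusters):
        members = np.where(cluster_ids == c)[0]
        if len(members) == 0:
            continue
        label, count = Counter(vlm_preds[members]).most_common(1)[0]
        if count / len(members) >= purity:          # cluster is pure enough
            keep_idx.extend(members.tolist())
            keep_lab.extend([label] * len(members))
    return np.array(keep_idx), np.array(keep_lab)
```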
[CV-57] Reinforcement Learning Platform for Adversarial Black-box Attacks with Custom Distortion Filters AAAI
【Quick Read】: This paper addresses black-box adversarial attacks, covering both untargeted and targeted settings. The proposed solution is RLAB, a reinforcement learning platform that lets users choose among various distortion filters to craft adversarial examples. The key novelty is a dual-action method: at each step the agent explores the input image to identify sensitive regions where distortion should be added, while removing noise that has little impact on the target model. This dual action leads to faster and more efficient convergence of the attack. The platform can also measure the robustness of image classification models against specific distortion types, and retraining models with the generated adversarial samples significantly improves robustness on benchmark datasets. RLAB outperforms state-of-the-art methods in the average number of queries needed to cause misclassification, advancing trustworthiness with a positive social impact.
Link: https://arxiv.org/abs/2501.14122
Authors: Soumyendu Sarkar,Ashwin Ramesh Babu,Sajad Mousavi,Vineet Gundecha,Sahand Ghorbanpour,Avisek Naug,Ricardo Luna Gutierrez,Antonio Guillen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review for 2025 AAAI Conference on Artificial Intelligence Proceedings
Abstract:We present a Reinforcement Learning Platform for Adversarial Black-box untargeted and targeted attacks, RLAB, that allows users to select from various distortion filters to create adversarial examples. The platform uses a Reinforcement Learning agent to add minimum distortion to input images while still causing misclassification by the target model. The agent uses a novel dual-action method to explore the input image at each step to identify sensitive regions for adding distortions while removing noises that have less impact on the target model. This dual action leads to faster and more efficient convergence of the attack. The platform can also be used to measure the robustness of image classification models against specific distortion types. Also, retraining the model with adversarial samples significantly improved robustness when evaluated on benchmark datasets. The proposed platform outperforms state-of-the-art methods in terms of the average number of queries required to cause misclassification. This advances trustworthiness with a positive social impact.
[CV-58] StreamingRAG : Real-time Contextual Retrieval and Generation Framework
【Quick Read】: This paper tackles the challenge of extracting real-time insights from multimodal data streams in domains such as healthcare, intelligent transportation, and satellite remote sensing. Among existing approaches, multimodal large language models (MM-LLMs) are limited by high computational demands and a narrow knowledge scope, while traditional retrieval-augmented generation (RAG) systems compensate for model knowledge limitations but preprocess data too slowly for real-time analysis. The proposed StreamingRAG is a RAG framework designed for streaming data whose core is a dynamically evolving knowledge graph that captures scene-object-entity relationships in real time. Using MM-LLMs, the knowledge graph attains temporal-aware scene representations and enables timely responses to specific events or user queries. StreamingRAG achieves 5-6x higher throughput for real-time analysis, better contextual accuracy via the temporal knowledge graph, and 2-3x lower resource consumption through lightweight models.
Link: https://arxiv.org/abs/2501.14101
Authors: Murugan Sankaradas,Ravi K.Rajendran,Srimat T.Chakradhar
Affiliations: NEC Laboratories America
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and Presented at AI4Sys, HPDC 2024
Abstract:Extracting real-time insights from multi-modal data streams from various domains such as healthcare, intelligent transportation, and satellite remote sensing remains a challenge. High computational demands and limited knowledge scope restrict the applicability of Multi-Modal Large Language Models (MM-LLMs) on these data streams. Traditional Retrieval-Augmented Generation (RAG) systems address knowledge limitations of these models, but suffer from slow preprocessing, making them unsuitable for real-time analysis. We propose StreamingRAG, a novel RAG framework designed for streaming data. StreamingRAG constructs evolving knowledge graphs capturing scene-object-entity relationships in real-time. The knowledge graph achieves temporal-aware scene representations using MM-LLMs and enables timely responses for specific events or user queries. StreamingRAG addresses limitations in existing methods, achieving significant improvements in real-time analysis (5-6x faster throughput), contextual accuracy (through a temporal knowledge graph), and reduced resource consumption (using lightweight models by 2-3x).
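A minimal sketch of the kind of evolving, time-stamped knowledge graph described above: scene-object-entity triples accumulate per frame, and a query returns only recent relations. The triple schema and recency window are assumptions for illustration.

```python
import time
from collections import deque

class EvolvingKG:
    def __init__(self, max_triples=10000):
        self.triples = deque(maxlen=max_triples)  # bounded memory for streams

    def update(self, subject, relation, obj):
        # e.g. ("car_17", "entering", "intersection_3")
        self.triples.append((time.time(), subject, relation, obj))

    def query_recent(self, subject, window_s=30.0):
        now = time.time()
        return [(s, r, o) for t, s, r, o in self.triples
                if s == subject and now - t <= window_s]

kg = EvolvingKG()
kg.update("car_17", "entering", "intersection_3")
print(kg.query_recent("car_17"))
```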
[CV-59] Expanding on the BRIAR Dataset: A Comprehensive Whole Body Biometric Recognition Resource at Extreme Distances and Real-World Scenarios (Collections 1-4) CVPR
【Quick Read】: This paper addresses the challenges biometric recognition faces in non-conventional settings, particularly identification at extreme distances or from cameras mounted on buildings or UAVs. While biometric algorithms and operational systems have advanced rapidly in accuracy and robustness, the technology still struggles considerably in such scenarios. The key contribution is an extension of the largest dataset currently focused on these operational challenges, together with a detailed description of its composition and the methodologies used for collection, curation, and annotation. By providing richer and more diverse data, the work aims to improve the applicability and accuracy of biometric technology in complex environments.
Link: https://arxiv.org/abs/2501.14070
Authors: Gavin Jager,David Cornett III,Gavin Glenn,Deniz Aykac,Christi Johnson,Robert Zhang,Ryan Shivers,David Bolme,Laura Davies,Scott Dolvin,Nell Barber,Joel Brogan,Nick Burchfield,Carl Dukes,Andrew Duncan,Regina Ferrell,Austin Garrett,Jim Goddard,Jairus Hines,Bart Murphy,Sean Pharris,Brandon Stockwell,Leanne Thompson,Matthew Yohe
Affiliations: Oak Ridge National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 11 figures, 2 tables, submitted to CVPR
Abstract:The state-of-the-art in biometric recognition algorithms and operational systems has advanced quickly in recent years providing high accuracy and robustness in more challenging collection environments and consumer applications. However, the technology still suffers greatly when applied to non-conventional settings such as those seen when performing identification at extreme distances or from elevated cameras on buildings or mounted to UAVs. This paper summarizes an extension to the largest dataset currently focused on addressing these operational challenges, and describes its composition as well as methodologies of collection, curation, and annotation.
[CV-60] Prior Knowledge Injection into Deep Learning Models Predicting Gene Expression from Whole Slide Images
【Quick Read】: This paper asks how molecular information can be predicted from whole slide images (WSIs) with deep learning, as a cost-effective replacement for or complement to tumor sequencing. Current methods show promise but lack the accuracy and robustness to fully replace direct sequencing. The key solution is a model-agnostic framework that injects prior knowledge of gene-gene interactions into deep learning architectures, improving the accuracy and robustness of gene expression prediction. In a breast cancer case study, the strategy yields an average increase of 983 significant genes (out of 25,761) across all 18 experiments, with 14 experiments generalizing to an increase on an independent dataset. The results indicate broad potential for prior-knowledge injection to improve gene expression prediction from WSIs across a wide range of architectures.
Link: https://arxiv.org/abs/2501.14056
Authors: Max Hallemeesch,Marija Pizurica,Paloma Rabaey,Olivier Gevaert,Thomas Demeester,Kathleen Marchal
Affiliations: 1. Ghent University; 2. Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Cancer diagnosis and prognosis primarily depend on clinical parameters such as age and tumor grade, and are increasingly complemented by molecular data, such as gene expression, from tumor sequencing. However, sequencing is costly and delays oncology workflows. Recent advances in Deep Learning allow to predict molecular information from morphological features within Whole Slide Images (WSIs), offering a cost-effective proxy of the molecular markers. While promising, current methods lack the robustness to fully replace direct sequencing. Here we aim to improve existing methods by introducing a model-agnostic framework that allows to inject prior knowledge on gene-gene interactions into Deep Learning architectures, thereby increasing accuracy and robustness. We design the framework to be generic and flexibly adaptable to a wide range of architectures. In a case study on breast cancer, our strategy leads to an average increase of 983 significant genes (out of 25,761) across all 18 experiments, with 14 generalizing to an increase on an independent dataset. Our findings reveal a high potential for injection of prior knowledge to increase gene expression prediction performance from WSIs across a wide range of architectures.
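One generic way to inject a gene-gene interaction prior into a prediction head, shown below, is a graph-Laplacian penalty that encourages interacting genes to receive correlated predictions. This is a common formulation offered as a plausible sketch; the paper's framework is model-agnostic and its exact injection mechanism may differ.

```python
import torch

def laplacian_prior_loss(pred, adj):
    """pred: (B, G) predicted expression for G genes;
    adj: (G, G) 0/1 symmetric gene-gene interaction matrix."""
    deg = torch.diag(adj.sum(dim=1))
    lap = deg - adj                       # graph Laplacian L = D - A
    # sum_b p_b^T L p_b penalizes disagreement between interacting genes.
    return torch.einsum("bg,gh,bh->", pred, lap, pred) / pred.shape[0]

def total_loss(pred, target, adj, lam=1e-3):
    return torch.nn.functional.mse_loss(pred, target) + lam * laplacian_prior_loss(pred, adj)
```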
[CV-61] Revisiting CLIP: Efficient Alignment of 3D MRI and Tabular Data using Domain-Specific Foundation Models
【Quick Read】: This paper addresses the challenges multimodal models face in the medical domain: common CLIP-based approaches require large amounts of samples and do not natively support 3D or tabular data, both of which are crucial in medicine. The key idea is to revisit CLIP-style alignment by training a domain-specific 3D foundation model as the image encoder, showing that aligning 3D MRI with tabular data is feasible with only 62 MRI scans. The approach is enabled by a simple embedding accumulation strategy required for training in 3D, which scales the number of negative pairs across batches to stabilize training. The paper thoroughly evaluates design choices, including the backbone and loss functions, and validates the method on zero-shot classification and image-retrieval tasks. While zero-shot image retrieval remains challenging, the zero-shot classification results demonstrate that the approach can meaningfully align 3D MRI representations with tabular data.
Link: https://arxiv.org/abs/2501.14051
Authors: Jakob Krogh Petersen,Valdemar Licht,Mads Nielsen,Asbjørn Munk
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 2 figures. To be published in ISBI 2025
Abstract:Multi-modal models require aligned, shared embedding spaces. However, common CLIP-based approaches need large amounts of samples and do not natively support 3D or tabular data, both of which are crucial in the medical domain. To address these issues, we revisit CLIP-style alignment by training a domain-specific 3D foundation model as an image encoder and demonstrate that modality alignment is feasible with only 62 MRI scans. Our approach is enabled by a simple embedding accumulation strategy required for training in 3D, which scales the amount of negative pairs across batches in order to stabilize training. We perform a thorough evaluation of various design choices, including the choice of backbone and loss functions, and evaluate the proposed methodology on zero-shot classification and image-retrieval tasks. While zero-shot image-retrieval remains challenging, zero-shot classification results demonstrate that the proposed approach can meaningfully align the representations of 3D MRI with tabular data.
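A hedged sketch of cross-batch embedding accumulation for contrastive alignment: a small FIFO buffer of past embeddings enlarges the set of negatives in the InfoNCE denominator, which is one straightforward reading of the strategy described above. The queue size, temperature, and single-direction loss are assumptions.

```python
import torch
import torch.nn.functional as F

class EmbeddingQueue:
    def __init__(self, dim, size=2048):
        self.buf = torch.zeros(0, dim)
        self.size = size

    def add(self, emb):
        self.buf = torch.cat([self.buf, emb.detach().cpu()])[-self.size:]

def accumulated_infonce(img_emb, tab_emb, queue, tau=0.07):
    img = F.normalize(img_emb, dim=1)
    tab = F.normalize(tab_emb, dim=1)
    negs = F.normalize(queue.buf.to(img.device), dim=1) if len(queue.buf) else tab[:0]
    # Positives on the diagonal; queued embeddings extend the negatives.
    logits = img @ torch.cat([tab, negs]).T / tau
    loss = F.cross_entropy(logits, torch.arange(len(img), device=img.device))
    queue.add(tab)
    return loss
```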
[CV-62] SIDDA: SInkhorn Dynamic Domain Adaptation for Image Classification with Equivariant Neural Networks
【Quick Read】: This paper addresses the poor generalization of modern neural networks (NNs) under covariate shift, where the training and test data distributions differ while the conditional distribution of classification labels is unchanged. In this setting, NN generalization reduces to learning more domain-invariant features. The proposed solution is SIDDA, an out-of-the-box domain adaptation (DA) training algorithm built on the Sinkhorn divergence that achieves effective domain alignment with minimal hyperparameter tuning and computational overhead. SIDDA is compatible with a variety of NN architectures and works particularly well with equivariant neural networks (ENNs), improving both classification accuracy and model calibration. Experiments show that SIDDA improves classification accuracy on unlabeled target data by up to roughly 40% and yields over an order of magnitude improvement in ECE and Brier score on both source and target data.
Link: https://arxiv.org/abs/2501.14048
Authors: Sneh Pandya,Purvik Patel,Brian D. Nord,Mike Walmsley,Aleksandra Ćiprijanović
Affiliations: Department of Physics, Northeastern University; NSF AI Institute for Artificial Intelligence & Fundamental Interactions (IAIFI); Fermi National Accelerator Laboratory; Khoury College of Computer Science, Northeastern University; Department of Astronomy and Astrophysics, University of Chicago; Kavli Institute for Cosmological Physics, University of Chicago; Dunlap Institute for Astronomy & Astrophysics, University of Toronto; Jodrell Bank Centre for Astrophysics, Department of Physics & Astronomy, University of Manchester; NSF-Simons AI Institute for the Sky (SkAI)
Subjects: Machine Learning (cs.LG); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 5 figures, 4 tables. Code available at: this https URL
Abstract:Modern neural networks (NNs) often do not generalize well in the presence of a “covariate shift”; that is, in situations where the training and test data distributions differ, but the conditional distribution of classification labels remains unchanged. In such cases, NN generalization can be reduced to a problem of learning more domain-invariant features. Domain adaptation (DA) methods include a range of techniques aimed at achieving this; however, these methods have struggled with the need for extensive hyperparameter tuning, which then incurs significant computational costs. In this work, we introduce SIDDA, an out-of-the-box DA training algorithm built upon the Sinkhorn divergence, that can achieve effective domain alignment with minimal hyperparameter tuning and computational overhead. We demonstrate the efficacy of our method on multiple simulated and real datasets of varying complexity, including simple shapes, handwritten digits, and real astronomical observations. SIDDA is compatible with a variety of NN architectures, and it works particularly well in improving classification accuracy and model calibration when paired with equivariant neural networks (ENNs). We find that SIDDA enhances the generalization capabilities of NNs, achieving up to an approximately 40% improvement in classification accuracy on unlabeled target data. We also study the efficacy of DA on ENNs with respect to the varying group orders of the dihedral group D_N, and find that the model performance improves as the degree of equivariance increases. Finally, we find that SIDDA enhances model calibration on both source and target data, achieving over an order of magnitude improvement in the ECE and Brier score. SIDDA’s versatility, combined with its automated approach to domain alignment, has the potential to advance multi-dataset studies by enabling the development of highly generalizable models.
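A hedged sketch of adding a Sinkhorn-divergence alignment term to a classification loss, using the geomloss package's debiased Sinkhorn implementation. The blur and weighting values are illustrative, and SIDDA's automated scheduling of these quantities is not reproduced here.

```python
import torch
import torch.nn.functional as F
from geomloss import SamplesLoss

sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)  # debiased Sinkhorn divergence

def sidda_style_loss(model, x_src, y_src, x_tgt, lam=1.0):
    feat_src, logits_src = model(x_src)   # assumes model returns (features, logits)
    feat_tgt, _ = model(x_tgt)            # unlabeled target batch
    task = F.cross_entropy(logits_src, y_src)
    align = sinkhorn(feat_src, feat_tgt)  # pull the two domains together in feature space
    return task + lam * align
```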
[CV-63] LLM -guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps BMVC2024
【Quick Read】: This paper targets precise instance-level control of image attributes in text-to-image synthesis. Existing methods offer some control through fine-tuning or auxiliary information, but they are limited in flexibility and accuracy. The proposed pipeline leverages Large Language Models (LLMs), open-vocabulary detectors, cross-attention maps, and intermediate activations of the diffusion U-Net for instance-level image manipulation. It detects the objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks, and by incorporating cross-attention maps it keeps manipulated images coherent while controlling object positions. The key point is that these components combine to give precise instance-level control without fine-tuning or auxiliary information such as masks or bounding boxes.
Link: https://arxiv.org/abs/2501.14046
Authors: Andrey Palaev,Adil Khan,Syed M. Ahsan Kazmi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at BMVC 2024
Abstract:The advancement of text-to-image synthesis has introduced powerful generative models capable of creating realistic images from textual prompts. However, precise control over image attributes remains challenging, especially at the instance level. While existing methods offer some control through fine-tuning or auxiliary information, they often face limitations in flexibility and accuracy. To address these challenges, we propose a pipeline leveraging Large Language Models (LLMs), open-vocabulary detectors, cross-attention maps and intermediate activations of diffusion U-Net for instance-level image manipulation. Our method detects objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks. By incorporating cross-attention maps, our approach ensures coherence in manipulated images while controlling object positions. Our method enables precise manipulations at the instance level without fine-tuning or auxiliary information such as masks or bounding boxes. Code is available at this https URL
[CV-64] Implicit Neural Surface Deformation with Explicit Velocity Fields ICLR2025
【Quick Read】: This paper addresses the unsupervised joint prediction of time-varying neural implicit surfaces and the deformations between pairs of point clouds. Existing approaches typically require supervision of intermediate shapes, whereas the proposed method needs none and handles both rigid and non-rigid deformations. The key idea is to model point motion with an explicit velocity field and to deform the time-varying implicit field directly via a modified level-set equation. In a compact formulation, this equation drives an iso-surface evolution with Eikonal constraints, preserving the integrity of the signed distance field. By imposing a smooth, volume-preserving constraint on the velocity field, the method recovers physically plausible intermediate shapes. Experiments show it significantly outperforms existing works in both quality and efficiency.
Link: https://arxiv.org/abs/2501.14038
Authors: Lu Sang,Zehranaz Canfes,Dongliang Cao,Florian Bernard,Daniel Cremers
Affiliations: Technical University of Munich; Munich Center of Machine Learning; University of Bonn
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025, 10 pages
Abstract:In this work, we introduce the first unsupervised method that simultaneously predicts time-varying neural implicit surfaces and deformations between pairs of point clouds. We propose to model the point movement using an explicit velocity field and directly deform a time-varying implicit field using the modified level-set equation. This equation utilizes an iso-surface evolution with Eikonal constraints in a compact formulation, ensuring the integrity of the signed distance field. By applying a smooth, volume-preserving constraint to the velocity field, our method successfully recovers physically plausible intermediate shapes. Our method is able to handle both rigid and non-rigid deformations without any intermediate shape supervision. Our experimental results demonstrate that our method significantly outperforms existing works, delivering superior results in both quality and efficiency.
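For orientation, the classical level-set advection equation that the "modified level-set equation" above builds on transports an implicit field phi by a velocity field v, with the Eikonal condition keeping phi a signed distance field. This is the textbook form, not the paper's exact modified formulation:

```latex
\frac{\partial \phi}{\partial t} + v \cdot \nabla \phi = 0,
\qquad \lVert \nabla \phi \rVert = 1 \ \ \text{(Eikonal constraint)}
```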
[CV-65] INDIGO: A Unified INN-Guided Probabilistic Diffusion Algorithm for Blind and Non-Blind Image Restoration
【Quick Read】: This paper addresses two main limitations of image restoration (IR) with generative diffusion models: 1) most non-blind methods require an analytical expression of the degradation model to guide the sampling process; 2) existing blind methods rely on families of pre-defined degradation models to train their deep networks. These issues limit the flexibility of such methods and their ability to handle real-world degradation tasks.
The proposed solution is a novel invertible neural network (INN)-guided probabilistic diffusion algorithm for non-blind and blind image restoration, named INDIGO and BlindINDIGO respectively, which combines the perfect-reconstruction property of INNs with the strong generative capability of pre-trained diffusion models. Concretely, the forward process of the INN is trained to simulate an arbitrary degradation process, and its inverse yields an intermediate image that guides the reverse diffusion sampling through a gradient step. An initialization strategy further improves performance and inference speed. Experiments show the algorithm obtains competitive results, both quantitatively and visually, against recent leading methods on synthetic and real-world low-quality images.
Link: https://arxiv.org/abs/2501.14014
Authors: Di You,Pier Luigi Dragotti
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by IEEE Journal of Selected Topics in Signal Processing (JSTSP)
Abstract:Generative diffusion models are becoming one of the most popular prior in image restoration (IR) tasks due to their remarkable ability to generate realistic natural images. Despite achieving satisfactory results, IR methods based on diffusion models present several limitations. First of all, most non-blind approaches require an analytical expression of the degradation model to guide the sampling process. Secondly, most existing blind approaches rely on families of pre-defined degradation models for training their deep networks. The above issues limit the flexibility of these approaches and so their ability to handle real-world degradation tasks. In this paper, we propose a novel INN-guided probabilistic diffusion algorithm for non-blind and blind image restoration, namely INDIGO and BlindINDIGO, which combines the merits of the perfect reconstruction property of invertible neural networks (INN) with the strong generative capabilities of pre-trained diffusion models. Specifically, we train the forward process of the INN to simulate an arbitrary degradation process and use the inverse to obtain an intermediate image that we use to guide the reverse diffusion sampling process through a gradient step. We also introduce an initialization strategy, to further improve the performance and inference speed of our algorithm. Experiments demonstrate that our algorithm obtains competitive results compared with recently leading methods both quantitatively and visually on synthetic and real-world low-quality images.
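A hedged sketch of gradient-step guidance inside a DDPM/DDIM-style reverse loop, the mechanism described above: estimate the clean image from the current noisy sample, compare it with the INN-derived guide image, and nudge the sample down the gradient. The schedules, step size, and the guide image itself are placeholders.

```python
import torch

def guided_reverse_step(x_t, eps_model, alpha_bar_t, x_guide, eta=0.1):
    """One guidance update: move x_t so its predicted clean image x0_hat
    approaches x_guide (e.g. obtained via the INN inverse). alpha_bar_t is
    the cumulative noise-schedule product at step t (scalar tensor)."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t)                                # predicted noise
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    loss = ((x0_hat - x_guide) ** 2).mean()
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_t - eta * grad).detach()                  # then apply the usual DDIM step
```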
[CV-66] Device-aware Optical Adversarial Attack for a Portable Projector-camera System
【Quick Read】: This paper addresses physical adversarial attacks against deep-learning-based face recognition (FR) systems, and in particular the limitations of existing projector-camera-based adversarial light attacks in practical FR setups. The key is to incorporate device-aware adaptations, such as resolution-aware and color-aware adjustments, into the digital attack algorithm, mitigating the degradation from the digital to the physical domain. Experimental validation shows the efficacy of the algorithm against real and spoof adversaries, achieving high physical similarity scores on FR models and state-of-the-art commercial systems: on average there is only a 14% reduction in scores from digital to physical attacks, with high attack success rates in both white-box and black-box scenarios.
Link: https://arxiv.org/abs/2501.14005
Authors: Ning Jiang (1 and 2), Yanhong Liu (2), Dingheng Zeng (2), Yue Feng (2), Weihong Deng (2), Ying Li (1) ((1) School of Software & Microelectronics, Peking University, Beijing, China; (2) Mashang Consumer Finance Co., Ltd., Chongqing, China)
Affiliations: (1) School of Software & Microelectronics, Peking University, Beijing, China; (2) Mashang Consumer Finance Co., Ltd., Chongqing, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep-learning-based face recognition (FR) systems are susceptible to adversarial examples in both digital and physical domains. Physical attacks present a greater threat to deployed systems as adversaries can easily access the input channel, allowing them to provide malicious inputs to impersonate a victim. This paper addresses the limitations of existing projector-camera-based adversarial light attacks in practical FR setups. By incorporating device-aware adaptations into the digital attack algorithm, such as resolution-aware and color-aware adjustments, we mitigate the degradation from digital to physical domains. Experimental validation showcases the efficacy of our proposed algorithm against real and spoof adversaries, achieving high physical similarity scores in FR models and state-of-the-art commercial systems. On average, there is only a 14% reduction in scores from digital to physical attacks, with high attack success rate in both white- and black-box scenarios.
[CV-67] ME-CPT: Multi-Task Enhanced Cross-Temporal Point Transformer for Urban 3D Change Detection
【Quick Read】: This paper targets three main challenges in semantic change detection for urban areas using multi-temporal airborne laser scanning (ALS) point clouds: (1) accurately modeling the spatial relationships between cross-temporal point clouds so that change features can be extracted effectively; (2) class imbalance among change samples, which hurts the distinguishability of semantic features; and (3) the lack of real-world datasets for 3D semantic change detection. The proposed Multi-task Enhanced Cross-temporal Point Transformer (ME-CPT) network establishes spatiotemporal correspondences between point clouds of different epochs and uses attention mechanisms to jointly extract semantic change features, facilitating information exchange and change comparison. A semantic segmentation task and a multi-task training strategy further enhance feature distinguishability and reduce the impact of class imbalance across change types. The paper also releases a 22.5 km^2 3D semantic change detection dataset offering diverse scenes for comprehensive evaluation. Experiments on multiple datasets show ME-CPT outperforms existing state-of-the-art methods.
Link: https://arxiv.org/abs/2501.14004
Authors: Luqi Zhang,Haiping Wang,Chong Liu,Zhen Dong,Bisheng Yang
Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:The point clouds collected by the Airborne Laser Scanning (ALS) system provide accurate 3D information of urban land covers. By utilizing multi-temporal ALS point clouds, semantic changes in urban area can be captured, demonstrating significant potential in urban planning, emergency management, and infrastructure maintenance. Existing 3D change detection methods struggle to efficiently extract multi-class semantic information and change features, still facing the following challenges: (1) the difficulty of accurately modeling cross-temporal point clouds spatial relationships for effective change feature extraction; (2) class imbalance of change samples which hinders distinguishability of semantic features; (3) the lack of real-world datasets for 3D semantic change detection. To resolve these challenges, we propose the Multi-task Enhanced Cross-temporal Point Transformer (ME-CPT) network. ME-CPT establishes spatiotemporal correspondences between point cloud across different epochs and employs attention mechanisms to jointly extract semantic change features, facilitating information exchange and change comparison. Additionally, we incorporate a semantic segmentation task and through the multi-task training strategy, further enhance the distinguishability of semantic features, reducing the impact of class imbalance in change types. Moreover, we release a 22.5 km^2 3D semantic change detection dataset, offering diverse scenes for comprehensive evaluation. Experiments on multiple datasets show that the proposed ME-CPT achieves superior performance compared to existing state-of-the-art methods. The source code and dataset will be released upon acceptance at this https URL.
[CV-68] Enhancing kelp forest detection in remote sensing images using crowdsourced labels with Mixed Vision Transformers and ConvNeXt segmentation models
【Quick Read】: This paper addresses fast, accurate detection of kelp canopy from Landsat satellite imagery. Kelp forests, as foundation species, are vital to marine ecosystems, providing essential food and habitat for numerous organisms. The key solution combines crowdsourced labels with advanced AI models, specifically Mixed Vision Transformers (MIT) and ConvNeXt models, to build an efficient kelp canopy detection pipeline; the approach ranked third in a machine learning competition and performed consistently well on local validation and both public and private leaderboards. Training the models on various image sizes significantly improved the accuracy of the ensemble; U-Net emerged as the best segmentation architecture, with UpperNet also contributing to the final ensemble. Key Landsat bands such as ShortWave InfraRed (SWIR1) and Near-InfraRed (NIR) were crucial, and altitude data was used in postprocessing to eliminate false positives on land. The methodology achieved a high detection rate, correctly identifying about three out of four kelp-canopy pixels while keeping false positives low. Despite the medium resolution of Landsat satellites, their extensive historical coverage makes them effective for studying kelp forests, and the work underscores the potential of combining machine learning models with crowdsourced data for effective and scalable environmental monitoring.
Link: https://arxiv.org/abs/2501.14001
Authors: Ioannis Nasios
Affiliations: Taylor & Francis, 4 Park Square, Milton Park, Abingdon, UK; Institut für Informatik, Albert-Ludwigs-Universität, Freiburg, Germany
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Kelp forests, as foundation species, are vital to marine ecosystems, providing essential food and habitat for numerous organisms. This study explores the integration of crowdsourced labels with advanced artificial intelligence models to develop a fast and accurate kelp canopy detection pipeline using Landsat images. Building on the success of a machine learning competition, where this approach ranked third and performed consistently well on both local validation and public and private leaderboards, the research highlights the effectiveness of combining Mixed Vision Transformers (MIT) with ConvNeXt models. Training these models on various image sizes significantly enhanced the accuracy of the ensemble results. U-Net emerged as the best segmentation architecture, with UpperNet also contributing to the final ensemble. Key Landsat bands, such as ShortWave InfraRed (SWIR1) and Near-InfraRed (NIR), were crucial while altitude data was used in postprocessing to eliminate false positives on land. The methodology achieved a high detection rate, accurately identifying about three out of four pixels containing kelp canopy while keeping false positives low. Despite the medium resolution of Landsat satellites, their extensive historical coverage makes them effective for studying kelp forests. This work also underscores the potential of combining machine learning models with crowdsourced data for effective and scalable environmental monitoring. All running code for training all models and inference can be found at this https URL.
[CV-69] Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction
【Quick Read】: This paper aims to improve robots' understanding of human communication in social settings through lip reading, a skill that is particularly useful in noisy environments such as caregiving and customer service. The study generates a Persian lip-reading dataset and integrates Persian lip-reading technology into the Surena-V humanoid robot to enhance its speech recognition capability. Two complementary methods are explored: an indirect method that tracks facial landmarks, especially around the lips, to infer speech content, and a direct method that processes raw video data with convolutional neural networks (CNNs) and long short-term memory (LSTM) networks for action and speech recognition. The best-performing model, an LSTM, reached 89% accuracy and has been deployed on the Surena-V robot for real-time human-robot interaction, highlighting the effectiveness of these methods in settings where verbal communication is limited.
Link: https://arxiv.org/abs/2501.13996
Authors: Ali Farshian Abbasi,Aghil Yousefi-Koma,Soheil Dehghani Firouzabadi,Parisa Rashidi,Alireza Naeini
Affiliations: Center of Advanced Systems and Technologies (CAST), School of Mechanical Engineering, University of Tehran
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Lip reading is vital for robots in social settings, improving their ability to understand human communication. This skill allows them to communicate more easily in crowded environments, especially in caregiving and customer service roles. Generating a Persian Lip-reading dataset, this study integrates Persian lip-reading technology into the Surena-V humanoid robot to improve its speech recognition capabilities. Two complementary methods are explored, an indirect method using facial landmark tracking and a direct method leveraging convolutional neural networks (CNNs) and long short-term memory (LSTM) networks. The indirect method focuses on tracking key facial landmarks, especially around the lips, to infer movements, while the direct method processes raw video data for action and speech recognition. The best-performing model, LSTM, achieved 89% accuracy and has been successfully implemented into the Surena-V robot for real-time human-robot interaction. The study highlights the effectiveness of these methods, particularly in environments where verbal communication is limited.
[CV-70] CSAOT: Cooperative Multi-Agent System for Active Object Tracking
【Quick Read】: This paper addresses the challenges active object tracking (AOT) faces in dynamic and complex environments, where single-agent systems are limited in information gathering and processing and thus make suboptimal decisions, and where existing multi-agent approaches rely on external auxiliary agents that require additional devices and raise costs. The proposed Collaborative System for Active Object Tracking (CSAOT) leverages multi-agent deep reinforcement learning (MADRL) and a Mixture of Experts (MoE) framework so that multiple agents can cooperate on a single device, improving tracking performance while reducing cost. The method strengthens robustness against occlusions and rapid motion and optimizes camera movements to extend tracking duration. CSAOT's effectiveness is validated on various interactive maps with dynamic and stationary obstacles.
Link: https://arxiv.org/abs/2501.13994
Authors: Hy Nguyen,Bao Pham,Hung Du,Srikanth Thudumu,Rajesh Vasa,Kon Mouzakis
Affiliations: Deakin University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Object Tracking is essential for many computer vision applications, such as autonomous navigation, surveillance, and robotics. Unlike Passive Object Tracking (POT), which relies on static camera viewpoints to detect and track objects across consecutive frames, Active Object Tracking (AOT) requires a controller agent to actively adjust its viewpoint to maintain visual contact with a moving target in complex environments. Existing AOT solutions are predominantly single-agent-based, which struggle in dynamic and complex scenarios due to limited information gathering and processing capabilities, often resulting in suboptimal decision-making. Alleviating these limitations necessitates the development of a multi-agent system where different agents perform distinct roles and collaborate to enhance learning and robustness in dynamic and complex environments. Although some multi-agent approaches exist for AOT, they typically rely on external auxiliary agents, which require additional devices, making them costly. In contrast, we introduce the Collaborative System for Active Object Tracking (CSAOT), a method that leverages multi-agent deep reinforcement learning (MADRL) and a Mixture of Experts (MoE) framework to enable multiple agents to operate on a single device, thereby improving tracking performance and reducing costs. Our approach enhances robustness against occlusions and rapid motion while optimizing camera movements to extend tracking duration. We validated the effectiveness of CSAOT on various interactive maps with dynamic and stationary obstacles.
[CV-71] CGI: Identifying Conditional Generative Models with Example Images
【Quick Read】: This paper addresses the difficulty users face in efficiently finding the generative model that best fits their needs in model hubs. Existing hubs assume basic text matching suffices for model search, but because models sit at different levels of abstraction and exist in large numbers, manually reviewing model descriptions and example images to pick the most suitable one is hard. The proposed Conditional Generative Model Identification (CGI) instead identifies the most suitable model from user-provided example images rather than requiring manual review of many models. Concretely, the paper proposes Prompt-Based Model Identification (PMI), which can adequately describe model functionality and precisely match user requirements with model specifications. To evaluate PMI and promote related research, the paper provides a benchmark of 65 models and 9,100 identification tasks. Extensive experiments and human evaluation show PMI is effective: for instance, 92% of models are correctly identified, with significantly better FID scores, when four example images are provided.
Link: https://arxiv.org/abs/2501.13991
Authors: Zhi Zhou,Hao-Zhe Tan,Peng-Xiao Song,Lan-Zhe Guo
Affiliations: Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
点击查看摘要
Abstract:Generative models have achieved remarkable performance recently, and thus model hubs have emerged. Existing model hubs typically assume basic text matching is sufficient to search for models. However, in reality, due to different abstractions and the large number of models in model hubs, it is not easy for users to review model descriptions and example images, choosing which model best meets their needs. Therefore, it is necessary to describe model functionality wisely so that future users can efficiently search for the most suitable model for their needs. Efforts to address this issue remain limited. In this paper, we propose Conditional Generative Model Identification (CGI), which aims to provide an effective way to identify the most suitable model using user-provided example images rather than requiring users to manually review a large number of models with example images. To address this problem, we propose the PromptBased Model Identification (PMI) , which can adequately describe model functionality and precisely match requirements with specifications. To evaluate PMI approach and promote related research, we provide a benchmark comprising 65 models and 9100 identification tasks. Extensive experimental and human evaluation results demonstrate that PMI is effective. For instance, 92% of models are correctly identified with significantly better FID scores when four example images are provided.
[CV-72] MCRL4OR: Multimodal Contrastive Representation Learning for Off-Road Environmental Perception
【Quick Read】: This paper addresses the challenge of environmental perception in unstructured off-road settings: because off-road environments are inherently unstructured, manually annotating large-scale off-road driving datasets is difficult, which limits supervised learning. The paper proposes Multimodal Contrastive Representation Learning for Off-Road environmental perception (MCRL4OR), which jointly learns encoders for visual images, locomotion states, and control actions within a contrastive learning framework. The key is to align locomotion states with the fused features of visual images and control actions, reflecting the causal relation that the inertial locomotion state results from taking a certain control action under the terrain conditions currently perceived by the visual sensors (a minimal sketch of this alignment loss follows the abstract below). MCRL4OR is pre-trained on a large-scale off-road driving dataset, and the learned multimodal representations are transferred to various downstream perception tasks, where they show clear advantages in off-road driving scenarios.
Link: https://arxiv.org/abs/2501.13988
Authors: Yi Yang, Zhang Zhang, Liang Wang
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Github repository: this https URL
Abstract:Most studies on environmental perception for autonomous vehicles (AVs) focus on urban traffic environments, where the objects/stuff to be perceived are mainly from man-made scenes and scalable datasets with dense annotations can be used to train supervised learning models. By contrast, it is hard to densely annotate a large-scale off-road driving dataset manually due to the inherently unstructured nature of off-road environments. In this paper, we propose a Multimodal Contrastive Representation Learning approach for Off-Road environmental perception, namely MCRL4OR. This approach aims to jointly learn three encoders for processing visual images, locomotion states, and control actions by aligning the locomotion states with the fused features of visual images and control actions within a contrastive learning framework. The causation behind this alignment strategy is that the inertial locomotion state is the result of taking a certain control action under the current landform/terrain condition perceived by visual sensors. In experiments, we pre-train the MCRL4OR with a large-scale off-road driving dataset and adopt the learned multimodal representations for various downstream perception tasks in off-road driving scenarios. The superior performance in downstream tasks demonstrates the advantages of the pre-trained multimodal representations. The code can be found at this https URL.
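To make the alignment objective above concrete, here is a minimal Python sketch of an InfoNCE-style loss that pulls each locomotion-state embedding toward the fused visual+action features of the same timestep. The function name, the simple additive fusion, and the toy dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mcrl4or_alignment_loss(locomotion_emb, visual_emb, action_emb, temperature=0.07):
    """InfoNCE-style loss aligning locomotion states with fused
    visual+action features (sketch; the paper's fusion module may differ)."""
    fused = F.normalize(visual_emb + action_emb, dim=-1)   # naive additive fusion
    loco = F.normalize(locomotion_emb, dim=-1)
    logits = loco @ fused.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(loco.size(0), device=loco.device)
    # symmetric InfoNCE: matched (state, fused-feature) pairs are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy usage with random embeddings
B, D = 8, 128
loss = mcrl4or_alignment_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```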
[CV-73] Pilot: Building the Federated Multimodal Instruction Tuning Framework
【Quick Read】: This paper targets the challenges of collaboratively fine-tuning multimodal large language models (MLLMs) on distributed devices, specifically the federated multimodal instruction tuning task (FedMIT) over different types of multimodal instruction data. To solve this new task it proposes a federated multimodal instruction tuning framework, Pilot. The key is a two-stage "adapter on adapter" mechanism integrated into the connector between the vision encoder and the LLM: stage one extracts task-specific and client-specific features from visual information; stage two builds a cross-task Mixture-of-Adapters (CT-MoA) module for cross-task interaction, so each client captures personalized information from local data and task-related multimodal information while also learning general knowledge from other tasks. In addition, an adaptive parameter aggregation strategy based on Euclidean distance is introduced for the text training parameters, maximizing the positive effects of aggregation while effectively suppressing negative ones (a minimal sketch of distance-weighted aggregation follows the abstract below). The framework collaboratively exploits distributed data from different local clients to learn cross-task knowledge without being affected by task heterogeneity during instruction tuning.
Link: https://arxiv.org/abs/2501.13985
Authors: Baochen Xiong, Xiaoshan Yang, Yaguang Song, Yaowei Wang, Changsheng Xu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this paper, we explore a novel federated multimodal instruction tuning task(FedMIT), which is significant for collaboratively fine-tuning MLLMs on different types of multimodal instruction data on distributed devices. To solve the new task, we propose a federated multimodal instruction tuning framework(Pilot). Our framework integrates two stages of “adapter on adapter” into the connector of the vision encoder and the LLM. In stage 1, we extract task-specific features and client-specific features from visual information. In stage 2, we build the cross-task Mixture-of-Adapters(CT-MoA) module to perform cross-task interaction. Each client can not only capture personalized information of local data and learn task-related multimodal information, but also learn general knowledge from other tasks. In addition, we introduce an adaptive parameter aggregation strategy for text training parameters, which optimizes parameter aggregation by calculating weights based on the euclidean distance between parameters, so that parameter aggregation can benefit from positive effects to the greatest extent while effectively reducing negative effects. Our framework can collaboratively exploit distributed data from different local clients to learn cross-task knowledge without being affected by the task heterogeneity during instruction tuning. The effectiveness of our method is verified in two different cross-task scenarios.
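The adaptive aggregation idea can be sketched in a few lines. The variant below weights each client's flattened text-parameter vector by its Euclidean distance to the client mean (closer clients get larger weights), which is one plausible reading of the strategy rather than the paper's exact formula; the temperature `sigma` is an assumption.

```python
import torch

def aggregate_text_params(client_params, sigma=1.0):
    """Euclidean-distance-based adaptive aggregation (sketch).
    client_params: list of 1-D tensors, one flattened parameter vector per client.
    Clients closer to the mean get larger weights, damping outliers."""
    stacked = torch.stack(client_params)            # (K, P)
    mean = stacked.mean(dim=0, keepdim=True)
    dists = torch.norm(stacked - mean, dim=1)       # Euclidean distance to the mean
    weights = torch.softmax(-dists / sigma, dim=0)  # smaller distance -> larger weight
    return (weights.unsqueeze(1) * stacked).sum(dim=0)

params = [torch.randn(100) for _ in range(5)]       # 5 hypothetical clients
global_params = aggregate_text_params(params)
```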
[CV-74] Attribute-based Visual Reprogramming for Image Classification with CLIP
【Quick Read】: This paper addresses a weakness of existing visual reprogramming (VR) methods when applied to vision-language models such as CLIP: by relying on fixed text templates and ground-truth class labels, they ignore the rich information and diverse attribute-guided textual representations CLIP can exploit, which can lead to misclassified samples and capped performance.
The key of the solution is Attribute-based Visual Reprogramming (AttrVR), which optimizes the reprogramming patterns with Descriptive Attributes (DesAttrs) and Distinctive Attributes (DistAttrs): DesAttrs describe features common across classes, while DistAttrs describe features unique to each class. Since images of the same class may reflect different attributes after VR, AttrVR iteratively refines the patterns using the k-nearest DesAttrs and DistAttrs of each image sample, enabling more dynamic, sample-specific optimization (a minimal sketch of this k-nearest attribute selection follows the abstract below). Theoretical analysis shows AttrVR reduces intra-class variance and increases inter-class separation, and experiments show superior performance across 12 downstream tasks for both ViT-based and ResNet-based CLIP.
Link: https://arxiv.org/abs/2501.13982
Authors: Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
Affiliations: The University of Melbourne; Singapore University of Technology and Design
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used in vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Besides, as images of the same class may reflect different attributes after VR, AttrVR iteratively refines patterns using the k-nearest DesAttrs and DistAttrs for each image sample, enabling more dynamic and sample-specific optimization. Theoretically, AttrVR is shown to reduce intra-class variance and increase inter-class separation. Empirically, it achieves superior performance in 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective integration of VR from unimodal vision models into vision-language models. Our code is available at this https URL.
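A rough sketch of the sample-specific attribute selection: for each image embedding, pick the k nearest attribute text embeddings and average them into a per-sample target that a reprogramming-pattern loss could then use. All names and shapes here are hypothetical, and the actual AttrVR objective may combine DesAttrs and DistAttrs differently.

```python
import torch
import torch.nn.functional as F

def knn_attribute_targets(img_emb, attr_embs, k=3):
    """Pick the k attribute embeddings nearest to each image embedding and
    average them into a per-sample text target (AttrVR-style sketch).
    attr_embs: (A, D) CLIP text embeddings of DesAttrs or DistAttrs."""
    img = F.normalize(img_emb, dim=-1)
    attrs = F.normalize(attr_embs, dim=-1)
    sims = img @ attrs.t()                      # (B, A) cosine similarities
    topk = sims.topk(k, dim=-1).indices         # indices of the k nearest attributes
    return attrs[topk].mean(dim=1)              # (B, D) sample-specific targets

targets = knn_attribute_targets(torch.randn(4, 512), torch.randn(20, 512))
```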
[CV-75] Enhanced PEC-YOLO for Detecting Improper Safety Gear Wearing Among Power Line Workers
【Quick Read】: This paper targets the high risk of improperly worn safety gear in complex power-line environments, where target occlusion and large variance are common. It proposes an enhanced PEC-YOLO detection algorithm with three key innovations: (1) PConv (Partial Convolution) and the EMA (Efficient Multi-scale Attention) mechanism improve feature-extraction efficiency and reduce model complexity; (2) the CPCA attention mechanism is integrated into the SPPF module, sharpening the model's focus on critical information and its accuracy under challenging conditions; (3) a BiFPN (Bi-directional Feature Pyramid Network) neck architecture optimizes the use of low- and high-level features via adaptive fusion and context-aware mechanisms. Experiments show PEC-YOLO improves detection accuracy by 2.7% over YOLOv8s while cutting model parameters by 42.58%, and it outperforms other models in detection speed under identical conditions, meeting the accuracy requirements for safety-gear detection on construction sites.
Link: https://arxiv.org/abs/2501.13981
Authors: Chen Zuguo, Kuang Aowei, Huang Yi, Jin Jie
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:To address the high risks associated with improper use of safety gear in complex power line environments, where target occlusion and large variance are prevalent, this paper proposes an enhanced PEC-YOLO object detection algorithm. The method integrates deep perception with multi-scale feature fusion, utilizing PConv and EMA attention mechanisms to enhance feature extraction efficiency and minimize model complexity. The CPCA attention mechanism is incorporated into the SPPF module, improving the model’s ability to focus on critical information and enhance detection accuracy, particularly in challenging conditions. Furthermore, the introduction of the BiFPN neck architecture optimizes the utilization of low-level and high-level features, enhancing feature representation through adaptive fusion and context-aware mechanism. Experimental results demonstrate that the proposed PEC-YOLO achieves a 2.7% improvement in detection accuracy compared to YOLOv8s, while reducing model parameters by 42.58%. Under identical conditions, PEC-YOLO outperforms other models in detection speed, meeting the stringent accuracy requirements for safety gear detection in construction sites. This study contributes to the development of efficient and accurate intelligent monitoring systems for ensuring worker safety in hazardous environments.
[CV-76] 3DGS2: Near Second-order Converging 3D Gaussian Splatting SIGGRAPH2025
【Quick Read】: This paper tackles the slow convergence of 3D Gaussian Splatting (3DGS) training. Standard 3DGS training uses stochastic gradient descent (SGD), which converges at most linearly, so training takes tens of minutes even with GPU acceleration. The paper proposes a (near) second-order convergent training algorithm that exploits two properties of 3DGS. First, the attributes of a Gaussian kernel contribute independently to the image-space loss, which permits isolated, local optimization per kernel attribute: small Newton systems are constructed analytically for each parameter group and solved efficiently on GPU threads, achieving Newton-like convergence per training image without a global Hessian (a minimal sketch of a per-group Newton step follows the abstract below). Second, kernels exhibit sparse, structured coupling across input images, which lets spatial information be used to reduce overshoot during stochastic training. The method converges an order of magnitude faster than standard GPU-based 3DGS training, requiring over 10x fewer iterations while matching or surpassing the quality of SGD-based 3DGS reconstructions.
Link: https://arxiv.org/abs/2501.13975
Authors: Lei Lan, Tianjia Shao, Zixuan Lu, Yu Zhang, Chenfanfu Jiang, Yin Yang
Affiliations: University of Utah; Zhejiang University; University of California, Los Angeles
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 11 pages, submitted to SIGGRAPH 2025
Abstract:3D Gaussian Splatting (3DGS) has emerged as a mainstream solution for novel view synthesis and 3D reconstruction. By explicitly encoding a 3D scene using a collection of Gaussian kernels, 3DGS achieves high-quality rendering with superior efficiency. As a learning-based approach, 3DGS training has typically been handled with the standard stochastic gradient descent (SGD) method, which offers at most linear convergence. Consequently, training often requires tens of minutes, even with GPU acceleration. This paper introduces a (near) second-order convergent training algorithm for 3DGS, leveraging its unique properties. Our approach is inspired by two key observations. First, the attributes of a Gaussian kernel contribute independently to the image-space loss, which endorses isolated and local optimization algorithms. We exploit this by splitting the optimization at the level of individual kernel attributes, analytically constructing small-size Newton systems for each parameter group, and efficiently solving these systems on GPU threads. This achieves Newton-like convergence per training image without relying on the global Hessian. Second, kernels exhibit sparse and structured coupling across input images. This property allows us to effectively utilize spatial information to mitigate overshoot during stochastic training. Our method converges an order of magnitude faster than standard GPU-based 3DGS training, requiring over 10x fewer iterations while maintaining or surpassing the quality of SGD-based 3DGS reconstructions.
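A minimal sketch of a damped Newton update for one small, isolated parameter group, the core numerical ingredient described above. The toy quadratic loss stands in for the image-space loss restricted to one kernel's attributes, and the damping constant is an assumption.

```python
import torch
from torch.autograd.functional import hessian

def newton_step_per_group(loss_fn, params, damping=1e-3):
    """One damped Newton update for a small, isolated parameter group
    (e.g., one Gaussian kernel's opacity/scale/position). The Hessian is
    tiny and dense, so solving H * delta = g directly is cheap."""
    g = torch.autograd.grad(loss_fn(params), params)[0]   # gradient of the group
    H = hessian(loss_fn, params)                           # small dense Hessian
    H = H + damping * torch.eye(H.shape[0])                # Levenberg-style damping
    delta = torch.linalg.solve(H, g)                       # Newton direction
    return params - delta

p = torch.randn(3, requires_grad=True)
toy_loss = lambda x: ((x - torch.tensor([1.0, -2.0, 0.5])) ** 2).sum()
p_new = newton_step_per_group(toy_loss, p)   # lands near the optimum in one step
```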
[CV-77] A Spatio-temporal Graph Network Allowing Incomplete Trajectory Input for Pedestrian Trajectory Prediction
【Quick Read】: This paper addresses a limitation of conventional pedestrian trajectory prediction algorithms: they require the input historical trajectory to be complete, so if a pedestrian was unobservable in any past frame its history is incomplete and its future trajectory cannot be predicted. To remove this limitation, the paper proposes STGN-IT, a spatio-temporal graph network allowing incomplete trajectory input. The key is a spatio-temporal graph with an additional encoding method to represent pedestrians' historical trajectories and observation states. STGN-IT also introduces static obstacles that may affect future trajectories as graph nodes to further improve accuracy, and applies a clustering algorithm when constructing the graph. Experiments on public datasets show STGN-IT outperforms state-of-the-art algorithms.
Link: https://arxiv.org/abs/2501.13973
Authors: Juncen Long, Gianluca Bardaro, Simone Mentasti, Matteo Matteucci
Affiliations: Politecnico di Milano
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
Abstract:Pedestrian trajectory prediction is important in the research of mobile robot navigation in environments with pedestrians. Most pedestrian trajectory prediction algorithms require the input historical trajectories to be complete. If a pedestrian is unobservable in any frame in the past, then its historical trajectory become incomplete, the algorithm will not predict its future trajectory. To address this limitation, we propose the STGN-IT, a spatio-temporal graph network allowing incomplete trajectory input, which can predict the future trajectories of pedestrians with incomplete historical trajectories. STGN-IT uses the spatio-temporal graph with an additional encoding method to represent the historical trajectories and observation states of pedestrians. Moreover, STGN-IT introduces static obstacles in the environment that may affect the future trajectories as nodes to further improve the prediction accuracy. A clustering algorithm is also applied in the construction of spatio-temporal graphs. Experiments on public datasets show that STGN-IT outperforms state of the art algorithms on these metrics.
[CV-78] GS-LiDAR: Generating Realistic LiDAR Point Clouds with Panoramic Gaussian Splatting
【Quick Read】: This paper addresses two problems in LiDAR novel view synthesis (NVS): existing methods typically rely on neural radiance fields (NeRF) as the 3D representation, incurring high training and rendering cost; and NeRF and its variants are designed for symmetrical scenes, handling the static and dynamic elements of driving scenes poorly. The proposed GS-LiDAR framework instead uses 2D Gaussian primitives with periodic vibration properties, which can precisely reconstruct both static and dynamic geometry in driving scenarios (a minimal sketch of such a periodically vibrating primitive follows the abstract below). It further introduces a novel panoramic rendering technique with explicit ray-splat intersection guided by panoramic LiDAR supervision, and incorporates intensity and ray-drop spherical harmonic (SH) coefficients into the primitives to improve the realism of rendered point clouds. Experiments on KITTI-360 and nuScenes show the method is superior in quantitative metrics, visual quality, and training and rendering efficiency.
Link: https://arxiv.org/abs/2501.13971
Authors: Junzhe Jiang, Chun Gu, Yurui Chen, Li Zhang
Affiliations: School of Data Science, Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
Comments:
Abstract:LiDAR novel view synthesis (NVS) has emerged as a novel task within LiDAR simulation, offering valuable simulated point cloud data from novel viewpoints to aid in autonomous driving systems. However, existing LiDAR NVS methods typically rely on neural radiance fields (NeRF) as their 3D representation, which incurs significant computational costs in both training and rendering. Moreover, NeRF and its variants are designed for symmetrical scenes, making them ill-suited for driving scenarios. To address these challenges, we propose GS-LiDAR, a novel framework for generating realistic LiDAR point clouds with panoramic Gaussian splatting. Our approach employs 2D Gaussian primitives with periodic vibration properties, allowing for precise geometric reconstruction of both static and dynamic elements in driving scenarios. We further introduce a novel panoramic rendering technique with explicit ray-splat intersection, guided by panoramic LiDAR supervision. By incorporating intensity and ray-drop spherical harmonic (SH) coefficients into the Gaussian primitives, we enhance the realism of the rendered point clouds. Extensive experiments on KITTI-360 and nuScenes demonstrate the superiority of our method in terms of quantitative metrics, visual quality, as well as training and rendering efficiency.
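One plausible reading of a "periodically vibrating" primitive is a static mean position plus a sinusoidal offset: a near-zero amplitude yields static geometry, while a non-zero amplitude represents dynamic elements. The parametrization below is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def kernel_position(t, mu, amplitude, period, phase):
    """Position of a periodically vibrating 2D Gaussian primitive at time t
    (sketch): static mean plus sinusoidal offset, so one primitive can
    represent static (amplitude ~ 0) or dynamic scene content."""
    return mu + amplitude * np.sin(2.0 * np.pi * t / period + phase)

mu = np.array([10.0, 2.0])     # static mean position (hypothetical units)
amp = np.array([0.5, 0.0])     # per-axis vibration amplitude
pos = kernel_position(t=0.3, mu=mu, amplitude=amp, period=2.0, phase=0.0)
```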
[CV-79] InsTex: Indoor Scenes Stylized Texture Synthesis
【Quick Read】: This paper targets the challenges of generating high-quality textures for 3D scenes in applications such as interior design, gaming, and augmented/virtual reality (AR/VR). Current approaches such as 2D diffusion models adapted for 3D texturing suffer from long processing times and visual artifacts, while 3D-data-driven approaches often generalize poorly. The proposed InsTex is a two-stage architecture that produces high-quality, style-consistent textures for 3D indoor scenes. Its key is a coarse-to-fine pipeline built on depth-to-image diffusion priors: a pre-trained 2D diffusion model first generates multi-view images, and the textures are then refined for consistency. The method supports both textual and visual prompts and achieves state-of-the-art visual quality and quantitative metrics, demonstrating its effectiveness across various 3D texturing applications.
Link: https://arxiv.org/abs/2501.13969
Authors: Yunfan Zhang, Zhiwei Xiong, Zhiqi Shen, Guosheng Lin, Hao Wang, Nicolas Vun
Affiliations: College of Computing and Data Science, Nanyang Technological University, Singapore; Alibaba-NTU Singapore Joint Research Institute, Nanyang Technological University, Singapore; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:
Abstract:Generating high-quality textures for 3D scenes is crucial for applications in interior design, gaming, and augmented/virtual reality (AR/VR). Although recent advancements in 3D generative models have enhanced content creation, significant challenges remain in achieving broad generalization and maintaining style consistency across multiple viewpoints. Current methods, such as 2D diffusion models adapted for 3D texturing, suffer from lengthy processing times and visual artifacts, while approaches driven by 3D data often fail to generalize effectively. To overcome these challenges, we introduce InsTex, a two-stage architecture designed to generate high-quality, style-consistent textures for 3D indoor scenes. InsTex utilizes depth-to-image diffusion priors in a coarse-to-fine pipeline, first generating multi-view images with a pre-trained 2D diffusion model and subsequently refining the textures for consistency. Our method supports both textual and visual prompts, achieving state-of-the-art results in visual quality and quantitative metrics, and demonstrates its effectiveness across various 3D texturing applications.
[CV-80] riplet Synthesis For Enhancing Composed Image Retrieval via Counterfactual Image Generation
【Quick Read】: This paper addresses the heavy manual annotation needed to build high-quality training datasets, which is especially acute for Composed Image Retrieval (CIR): training traditionally requires hand-labeled triplets of (reference image, modification text, target image), which is time-consuming and labor-intensive. The proposed solution is a triplet synthesis method based on counterfactual image generation. Its key is to control visual feature modifications via counterfactual generation, producing diverse training triplets automatically without manual intervention. This enables larger, more expressive datasets and in turn improves CIR model performance.
Link: https://arxiv.org/abs/2501.13968
Authors: Kenta Uesugi, Naoki Saito, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 4 pages, 4 figures
Abstract:Composed Image Retrieval (CIR) provides an effective way to manage and access large-scale visual data. Construction of the CIR model utilizes triplets that consist of a reference image, modification text describing desired changes, and a target image that reflects these changes. For effectively training CIR models, extensive manual annotation to construct high-quality training datasets, which can be time-consuming and labor-intensive, is required. To deal with this problem, this paper proposes a novel triplet synthesis method by leveraging counterfactual image generation. By controlling visual feature modifications via counterfactual image generation, our approach automatically generates diverse training triplets without any manual intervention. This approach facilitates the creation of larger and more expressive datasets, leading to the improvement of CIR model’s performance.
[CV-81] FedDAG: Federated Domain Adversarial Generation Towards Generalizable Medical Image Analysis
【Quick Read】: This paper addresses model generalization in federated domain generalization: training a global model from multiple source domains that generalizes to unseen target domains. Existing methods mainly share and recombine local domain-specific attributes to increase data diversity and simulate potential domain shifts, but this may be insufficient for the out-of-distribution nature of global data. The key of the proposed Federated Domain Adversarial Generation (FedDAG) framework is to simulate domain shift by adversarially generating novel domains different from both local and global source domains. Concretely, FedDAG generates novel-style images by maximizing the instance-level feature discrepancy between original and generated images, and trains a generalizable task model by minimizing that discrepancy (a minimal sketch of this pair of objectives follows the abstract below). Furthermore, since data isolation and heterogeneity across clients make their generalization contributions to the global model unbalanced, FedDAG hierarchically aggregates local models at the within-client and across-client levels, using the sharpness concept to assess each client model's generalization contribution and thereby further improve the global model's generalization.
Link: https://arxiv.org/abs/2501.13967
Authors: Haoxuan Che, Yifei Wu, Haibo Jin, Yong Xia, Hao Chen
Affiliations: Hong Kong University of Science and Technology; Northwestern Polytechnical University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Federated domain generalization aims to train a global model from multiple source domains and ensure its generalization ability to unseen target domains. Due to the target domain being with unknown domain shifts, attempting to approximate these gaps by source domains may be the key to improving model generalization capability. Existing works mainly focus on sharing and recombining local domain-specific attributes to increase data diversity and simulate potential domain shifts. However, these methods may be insufficient since only the local attribute recombination can be hard to touch the out-of-distribution of global data. In this paper, we propose a simple-yet-efficient framework named Federated Domain Adversarial Generation (FedDAG). It aims to simulate the domain shift and improve the model generalization by adversarially generating novel domains different from local and global source domains. Specifically, it generates novel-style images by maximizing the instance-level feature discrepancy between original and generated images and trains a generalizable task model by minimizing their feature discrepancy. Further, we observed that FedDAG could cause different performance improvements for local models. It may be due to inherent data isolation and heterogeneity among clients, exacerbating the imbalance in their generalization contributions to the global model. Ignoring this imbalance can lead the global model’s generalization ability to be sub-optimal, further limiting the novel domain generation procedure. Thus, to mitigate this imbalance, FedDAG hierarchically aggregates local models at the within-client and across-client levels by using the sharpness concept to evaluate client model generalization contributions. Extensive experiments across four medical benchmarks demonstrate FedDAG’s ability to enhance generalization in federated medical scenarios.
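The two adversarial objectives can be sketched as follows, assuming precomputed instance-level features: the generator maximizes the original-vs-generated feature discrepancy while the task model minimizes it alongside the usual classification loss. The discrepancy measure (MSE here) and the absence of loss weights are assumptions.

```python
import torch
import torch.nn.functional as F

def feddag_losses(feat_orig, feat_gen, task_logits, labels):
    """FedDAG-style objectives (sketch). The generator is trained to
    MAXIMIZE the instance-level feature discrepancy between original and
    generated images; the task model MINIMIZES it plus the task loss."""
    discrepancy = F.mse_loss(feat_gen, feat_orig)
    gen_loss = -discrepancy                                   # adversarial: push styles apart
    task_loss = F.cross_entropy(task_logits, labels) + discrepancy
    return gen_loss, task_loss

g_loss, t_loss = feddag_losses(torch.randn(4, 256), torch.randn(4, 256),
                               torch.randn(4, 3), torch.tensor([0, 1, 2, 0]))
```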
[CV-82] Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble
【Quick Read】: This paper asks whether Vision-Language Models (VLMs) can automatically evaluate the quality, usability, and safety of AR-generated scenes. The key of the study is to assess three state-of-the-art commercial VLMs (GPT, Gemini, and Claude) on identifying and describing AR scenes, using DiverseAR, the first AR dataset designed specifically to evaluate VLMs' ability to analyze virtual content across a wide range of AR scene complexities. The results show VLMs are generally capable of perceiving and describing AR scenes, reaching a True Positive Rate (TPR) of up to 93% for perception and 71% for description; they excel at obvious virtual objects but struggle with seamlessly integrated content, such as a virtual pot with realistic shadows. The study identifies key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility, underscoring the potential of VLMs as tools for evaluating the quality of AR experiences.
Link: https://arxiv.org/abs/2501.13964
Authors: Lin Duan, Yanming Xiu, Maria Gorlatova
Affiliations: Department of Electrical and Computer Engineering, Duke University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 6 pages
Abstract:Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs – GPT, Gemini, and Claude – in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs’ ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle when faced with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
[CV-83] Procedural Generation of 3D Maize Plant Architecture from LIDAR Data
【Quick Read】: This paper presents a robust framework for generating procedural 3D models of maize (Zea mays) plants from LiDAR point clouds, offering a scalable alternative to traditional field-based phenotyping. The key is to combine Non-Uniform Rational B-Spline (NURBS) surface modeling with Particle Swarm Optimization (PSO) in a two-stage optimization strategy: first, PSO produces an approximate NURBS surface by optimizing its control points to align with the LiDAR data, providing a reliable starting point (a minimal PSO sketch follows the abstract below); second, the differentiable programming framework NURBS-Diff precisely refines the initial surface, capturing fine leaf detail. This hierarchical strategy markedly improves the quality and fidelity of the reconstructed surfaces, enabling accurate 3D reconstruction of maize leaves across diverse genotypes and the extraction of complex traits such as phyllotaxy.
Link: https://arxiv.org/abs/2501.13963
Authors: Mozhgan Hadadi, Mehdi Saraeian, Jackson Godbersen, Talukder Jubery, Yawei Li, Lakshmi Attigala, Aditya Balu, Soumik Sarkar, Patrick S. Schnable, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Affiliations: Iowa State University; University of California, Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:This study introduces a robust framework for generating procedural 3D models of maize (Zea mays) plants from LiDAR point cloud data, offering a scalable alternative to traditional field-based phenotyping. Our framework leverages Non-Uniform Rational B-Spline (NURBS) surfaces to model the leaves of maize plants, combining Particle Swarm Optimization (PSO) for an initial approximation of the surface and a differentiable programming framework for precise refinement of the surface to fit the point cloud data. In the first optimization phase, PSO generates an approximate NURBS surface by optimizing its control points, aligning the surface with the LiDAR data, and providing a reliable starting point for refinement. The second phase uses NURBS-Diff, a differentiable programming framework, to enhance the accuracy of the initial fit by refining the surface geometry and capturing intricate leaf details. Our results demonstrate that, while PSO establishes a robust initial fit, the integration of differentiable NURBS significantly improves the overall quality and fidelity of the reconstructed surface. This hierarchical optimization strategy enables accurate 3D reconstruction of maize leaves across diverse genotypes, facilitating the subsequent extraction of complex traits like phyllotaxy. We demonstrate our approach on diverse genotypes of field-grown maize plants. All our codes are open-source to democratize these phenotyping approaches.
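A generic particle swarm optimizer, sketched below, is all stage one needs conceptually: the particles are flattened NURBS control points and the objective would be the surface-to-point-cloud error. The inertia and acceleration constants are textbook defaults, not values from the paper, and the toy objective is a stand-in.

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard particle swarm optimization (sketch). In the paper's first
    stage, `objective` would measure the distance between the NURBS surface
    defined by the control points and the LiDAR points."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))   # particle positions (flattened control points)
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([objective(p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        f = np.array([objective(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest

# toy objective standing in for the surface-to-point-cloud fitting error
best = pso(lambda p: np.sum((p - 0.5) ** 2), dim=12)
```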
[CV-84] A Fast Scalable and Robust Deep Learning-based Iterative Reconstruction Framework for Accelerated Industrial Cone-beam X-ray Computed Tomography
【Quick Read】: This paper targets the challenges industrial cone-beam X-ray computed tomography (XCT) faces when reconstructing dense, thick metal parts, notably noise and streak artifacts, where traditional methods struggle to obtain high-quality 3D reconstructions. The proposed deep-neural-network-based iterative algorithm integrates a CNN trained for artifact reduction as a prior model, with automated regularization parameter selection, tailored for large-scale industrial cone-beam XCT data. It achieves high-quality 3D reconstruction in only a few iterations and generalizes well to out-of-distribution scans acquired under diverse conditions (a minimal plug-and-play-style sketch of such an iteration follows the abstract below). The key is combining deep learning with classical iterative reconstruction, handling heavy noise and streak artifacts and surpassing state-of-the-art supervised-learning methods trained on the same data.
Link: https://arxiv.org/abs/2501.13961
Authors: Aniket Pramanik, Obaidullah Rahman, Singanallur V. Venkatakrishnan, Amirkoushyar Ziabari
Affiliations: Oak Ridge National Laboratory; UT-Battelle, LLC; US Department of Energy; Office of Energy Efficiency and Renewable Energy; Advanced Materials & Manufacturing Technologies Office; Technology Commercialization Fund
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Cone-beam X-ray Computed Tomography (XCT) with large detectors and corresponding large-scale 3D reconstruction plays a pivotal role in micron-scale characterization of materials and parts across various industries. In this work, we present a novel deep neural network-based iterative algorithm that integrates an artifact reduction-trained CNN as a prior model with automated regularization parameter selection, tailored for large-scale industrial cone-beam XCT data. Our method achieves high-quality 3D reconstructions even for extremely dense thick metal parts - which traditionally pose challenges to industrial CT images - in just a few iterations. Furthermore, we show the generalizability of our approach to out-of-distribution scans obtained under diverse scanning conditions. Our method effectively handles significant noise and streak artifacts, surpassing state-of-the-art supervised learning methods trained on the same data.
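A plug-and-play-style sketch of such an iteration: alternate a data-consistency gradient step with an application of the artifact-reduction CNN acting as the prior. The blending weight `lam` stands in for the automatically selected regularization parameter; the matrix forward operator and all names are hypothetical simplifications of a cone-beam projector.

```python
import numpy as np

def pnp_reconstruct(A, y, cnn_prior, n_iters=10, step=0.1, lam=0.5):
    """Plug-and-play style iterative reconstruction (sketch): gradient step
    on the data fit, then the trained CNN as a denoising/artifact prior."""
    x = A.T @ y                                   # simple back-projection initialization
    for _ in range(n_iters):
        x = x - step * A.T @ (A @ x - y)          # data-consistency gradient step
        x = (1 - lam) * x + lam * cnn_prior(x)    # prior step (regularization)
    return x

A = np.random.randn(50, 20)                       # toy forward (projection) operator
y = A @ np.random.randn(20)                       # toy measurements
x_hat = pnp_reconstruct(A, y, cnn_prior=lambda v: v)  # identity prior as placeholder
```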
[CV-85] DEFEND: A Large-scale 1M Dataset and Foundation Model for Tobacco Addiction Prevention
【Quick Read】: This paper addresses the gap between the rapid innovation of tobacco advertising on social media and lagging traditional surveillance: the lack of large-scale, comprehensive datasets and sophisticated monitoring systems has kept public-health oversight from keeping pace with the industry. Two key contributions are proposed. First, Tobacco-1M, a comprehensive dataset of one million tobacco product images with hierarchical labels spanning 75 product categories. Second, DEFEND, a foundation model for tobacco product understanding that integrates a Feature Enhancement Module for rich multimodal representation learning, a Local-Global Visual Coherence mechanism for fine-grained feature discrimination, and an Enhanced Image-Text Alignment strategy for precise product characterization. DEFEND achieves 83.1% accuracy on product classification and 73.8% on visual question answering, clearly outperforming existing methods, and shows robust zero-shot learning with 45.6% accuracy on novel product categories. These tools give regulators and public-health researchers powerful means of monitoring emerging tobacco products and marketing strategies.
Link: https://arxiv.org/abs/2501.13950
Authors: Naga VS Raviteja Chappa, Matthew Shepard, Connor McCurtain, Charlotte McCormick, Page Daniel Dobbs, Khoa Luu
Affiliations: Dept. of EECS, University of Arkansas; Center for Public Health and Technology, University of Arkansas
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures, 5 tables
Abstract:While tobacco advertising innovates at unprecedented speed, traditional surveillance methods remain frozen in time, especially in the context of social media. The lack of large-scale, comprehensive datasets and sophisticated monitoring systems has created a widening gap between industry advancement and public health oversight. This paper addresses this critical challenge by introducing Tobacco-1M, a comprehensive dataset of one million tobacco product images with hierarchical labels spanning 75 product categories, and DEFEND, a novel foundation model for tobacco product understanding. Our approach integrates a Feature Enhancement Module for rich multimodal representation learning, a Local-Global Visual Coherence mechanism for detailed feature discrimination, and an Enhanced Image-Text Alignment strategy for precise product characterization. Experimental results demonstrate DEFEND’s superior performance, achieving 83.1% accuracy in product classification and 73.8% in visual question-answering tasks, outperforming existing methods by significant margins. Moreover, the model exhibits robust zero-shot learning capabilities with 45.6% accuracy on novel product categories. This work provides regulatory bodies and public health researchers with powerful tools for monitoring emerging tobacco products and marketing strategies, potentially revolutionizing approaches to tobacco control and public health surveillance.
[CV-86] Advanced deep architecture pruning using single filter performance
【Quick Read】: This paper aims to reduce the computational complexity, energy consumption, and latency of deep neural network (DNN) inference by pruning network parameters and structure. The key is Applied Filter Cluster Connections (AFCC), which builds on a method that quantitatively measures the performance of each single filter in every layer of a deep architecture, and a recently presented comprehensive mechanism of how deep learning works. AFCC is demonstrated on VGG-11 and EfficientNet-B0 trained on CIFAR-100, where its high pruning rates outperform other techniques at the same pruning magnitude (a minimal per-filter pruning sketch follows the abstract below). The technique is also broadened to single-node performance and heavy pruning of fully connected layers, suggesting it could considerably reduce the complexity of over-parameterized AI tasks.
Link: https://arxiv.org/abs/2501.12880
Authors: Yarden Tzach, Yuval Meir, Ronit D. Gross, Ofek Tevet, Ella Koresh, Ido Kanter
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 4 figures
Abstract:Pruning the parameters and structure of neural networks reduces the computational complexity, energy consumption, and latency during inference. Recently, a novel underlying mechanism for successful deep learning (DL) was presented based on a method that quantitatively measures the single filter performance in each layer of a DL architecture, and a new comprehensive mechanism of how deep learning works was presented. Herein, we demonstrate how this understanding paves the path to highly dilute the convolutional layers of deep architectures without affecting their overall accuracy using applied filter cluster connections (AFCC). AFCC is exemplified on VGG-11 and EfficientNet-B0 architectures trained on CIFAR-100, and its high pruning outperforms other techniques using the same pruning magnitude. Additionally, this technique is broadened to single nodal performance and highly pruning of fully connected layers, suggesting a possible implementation to considerably reduce the complexity of over-parameterized AI tasks.
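A minimal sketch of pruning a convolutional layer by per-filter performance scores. How the single-filter scores are computed is the paper's contribution and is abstracted away here as an input; the keep ratio and random scores are placeholders.

```python
import torch

def prune_filters_by_performance(conv_weight, filter_scores, keep_ratio=0.5):
    """Keep only the best-scoring filters of a conv layer (sketch).
    `filter_scores` would come from a single-filter performance measure
    (e.g., per-filter class accuracy); here it is any score per filter."""
    n_keep = max(1, int(conv_weight.shape[0] * keep_ratio))
    keep = torch.topk(filter_scores, n_keep).indices.sort().values
    return conv_weight[keep], keep                # pruned weights + surviving indices

w = torch.randn(64, 3, 3, 3)                      # (out_channels, in_channels, kH, kW)
scores = torch.rand(64)                           # hypothetical per-filter performance
w_pruned, kept = prune_filters_by_performance(w, scores)
```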
[CV-87] Enhanced Confocal Laser Scanning Microscopy with Adaptive Physics Informed Deep Autoencoders
【Quick Read】: This paper targets common limitations of confocal laser scanning microscopy (CLSM), including diffraction-limited resolution, noise, and undersampling under low-laser-power conditions. It proposes a physics-informed deep learning framework whose key is to build the optical system's point spread function (PSF) and common CLSM degradation mechanisms (photon shot noise, dark-current noise, motion blur, speckle noise, and undersampling) directly into the model architecture. Using convolutional and transposed-convolutional layers, the model reconstructs high-fidelity images from heavily noise-corrupted inputs, and, drawing on advances in compressed sensing, significantly reduces data-acquisition requirements without sacrificing resolution. Experiments on simulated CLSM images of lipid droplets, neuronal networks, and fibrillar systems show the network outperforms traditional deconvolution algorithms such as Richardson-Lucy and non-negative least squares, as well as Total Variation regularization, Wiener filtering, and wavelet denoising, at recovering fine structural detail, and performs reliably under low-light and sparse-sampling conditions, holding promise for live-cell imaging, dynamic biological studies, and high-throughput material characterization.
Link: https://arxiv.org/abs/2501.14709
Authors: Zaheer Ahmad, Junaid Shabeer, Usman Saleem, Tahir Qadeer, Abdul Sami, Zahira El Khalidi, Saad Mehmood
Affiliations: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:We present a physics-informed deep learning framework to address common limitations in Confocal Laser Scanning Microscopy (CLSM), such as diffraction limited resolution, noise, and undersampling due to low laser power conditions. The optical system’s point spread function (PSF) and common CLSM image degradation mechanisms namely photon shot noise, dark current noise, motion blur, speckle noise, and undersampling were modeled and were directly included into model architecture. The model reconstructs high fidelity images from heavily noisy inputs by using convolutional and transposed convolutional layers. Following the advances in compressed sensing, our approach significantly reduces data acquisition requirements without compromising image resolution. The proposed method was extensively evaluated on simulated CLSM images of diverse structures, including lipid droplets, neuronal networks, and fibrillar systems. Comparisons with traditional deconvolution algorithms such as Richardson-Lucy (RL), non-negative least squares (NNLS), and other methods like Total Variation (TV) regularization, Wiener filtering, and Wavelet denoising demonstrate the superiority of the network in restoring fine structural details with high fidelity. Assessment metrics like Structural Similarity Index (SSIM) and Peak Signal to Noise Ratio (PSNR), underlines that the AdaptivePhysicsAutoencoder achieved robust image enhancement across diverse CLSM conditions, helping faster acquisition, reduced photodamage, and reliable performance in low light and sparse sampling scenarios holding promise for applications in live cell imaging, dynamic biological studies, and high throughput material characterization.
[CV-88] Stroke classification using Virtual Hybrid Edge Detection from in silico electrical impedance tomography data
【Quick Read】: This paper aims to improve the accuracy of EIT-based stroke classification, particularly under noise. Conventional approaches feed raw EIT voltage data to neural networks, which degrades in the presence of noise. The key of the solution is to use noise-robust Virtual Hybrid Edge Detection (VHED) functions as network inputs, and to test them with models of high detail and mathematical realism rather than the highly simplified, idealized models of prior work. Virtual patients are created from a physically detailed 2D head model with statistically realistic conductivity distributions, afflicted with hemorrhagic or ischemic strokes of various shapes and sizes; noisy electrode data simulated with the realistic Complete Electrode Model (CEM) are processed into VHED functions. Results show that stroke classification can be performed with high accuracy from such 2D EIT data, and that under noise VHED functions significantly outperform raw voltage data as network inputs.
Link: https://arxiv.org/abs/2501.14704
Authors: Juan Pablo Agnelli, Fernando S. Moura, Siiri Rautio, Melody Alsaker, Rashmi Murthy, Matti Lassas, Samuli Siltanen
Affiliations: FaMAF, National University of Córdoba and CIEM, National Scientific and Technical Research Council (CONICET), Argentina; Engineering, Modeling and Applied Social Sciences Center, Federal University of ABC, São Paulo, Brazil; Department of Mathematics and Statistics, University of Helsinki, Finland; Department of Mathematics, Gonzaga University, Spokane, USA; Department of Mathematics, Bangalore University, India
Subjects: Analysis of PDEs (math.AP); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments: 21 pages, 5 figures
Abstract:Electrical impedance tomography (EIT) is a non-invasive imaging method for recovering the internal conductivity of a physical body from electric boundary measurements. EIT combined with machine learning has shown promise for the classification of strokes. However, most previous works have used raw EIT voltage data as network inputs. We build upon a recent development which suggested the use of special noise-robust Virtual Hybrid Edge Detection (VHED) functions as network inputs, although that work used only highly simplified and mathematically ideal models. In this work we strengthen the case for the use of EIT, and VHED functions especially, for stroke classification. We design models with high detail and mathematical realism to test the use of VHED functions as inputs. Virtual patients are created using a physically detailed 2D head model which includes features known to create challenges in real-world imaging scenarios. Conductivity values are drawn from statistically realistic distributions, and phantoms are afflicted with either hemorrhagic or ischemic strokes of various shapes and sizes. Simulated noisy EIT electrode data, generated using the realistic Complete Electrode Model (CEM) as opposed to the mathematically ideal continuum model, is processed to obtain VHED functions. We compare the use of VHED functions as inputs against the alternative paradigm of using raw EIT voltages. Our results show that (i) stroke classification can be performed with high accuracy using 2D EIT data from physically detailed and mathematically realistic models, and (ii) in the presence of noise, VHED functions outperform raw data as network inputs.
[CV-89] Rethinking Foundation Models for Medical Image Classification through a Benchmark Study on MedMNIST
【Quick Read】: This paper addresses model selection for medical image classification as more and more foundation models are released. It benchmarks foundation models on the MedMNIST dataset, covering architectures from convolutional networks to Transformer-based models, and evaluates each with both end-to-end fine-tuning and linear probing (a minimal linear-probing sketch follows the abstract below), also experimenting with different image sizes and training-set sizes. The results demonstrate the significant potential of these pre-trained models when transferred to medical image classification, and the analysis yields preliminary but useful insights and conclusions.
Link: https://arxiv.org/abs/2501.14685
Authors: Fuping Wu, Bartlomiej W. Papiez
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: submitted to MIDL2025
Abstract:Foundation models are widely employed in medical image analysis, due to their high adaptability and generalizability for downstream tasks. With the increasing number of foundation models being released, model selection has become an important issue. In this work, we study the capabilities of foundation models in medical image classification tasks by conducting a benchmark study on the MedMNIST dataset. Specifically, we adopt various foundation models ranging from convolutional to Transformer-based models and implement both end-to-end training and linear probing for all classification tasks. The results demonstrate the significant potential of these pre-trained models when transferred for medical image classification. We further conduct experiments with different image sizes and various sizes of training data. By analyzing all the results, we provide preliminary, yet useful insights and conclusions on this topic.
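Linear probing itself is simple to sketch: freeze the backbone, extract features once, and fit a linear classifier on top. The random arrays below stand in for frozen-backbone embeddings of MedMNIST images; the feature dimension and class count are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Linear probing (sketch): train only a linear classifier on features
    extracted by a frozen foundation model, then report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

# toy features standing in for frozen-backbone embeddings
Xtr, Xte = np.random.randn(200, 512), np.random.randn(50, 512)
ytr, yte = np.random.randint(0, 2, 200), np.random.randint(0, 2, 50)
acc = linear_probe(Xtr, ytr, Xte, yte)
```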
[CV-90] Improved Vessel Segmentation with Symmetric Rotation-Equivariant U-Net
【Quick Read】: This paper addresses the fact that existing CNN-based automated segmentation methods in medical image analysis neglect useful equivariant properties of images, in particular rotational and reflection equivariance. This can reduce performance and produce inconsistent predictions, especially in applications such as vessel segmentation where no explicit orientation exists. Existing equivariant learning approaches mitigate these issues but substantially increase learning cost, model size, or both.
The proposed solution applies an efficient symmetric rotation-equivariant convolutional (SRE-Conv) kernel implementation to the U-Net architecture, learning rotation- and reflection-equivariant features while dramatically shrinking the model (a minimal kernel-symmetrization sketch follows the abstract below). Validated on retina vessel fundus imaging, the SRE U-Net not only clearly surpasses a standard U-Net on rotated images but also outperforms existing equivariant learning methods, with fewer trainable parameters and lower memory cost.
Link: https://arxiv.org/abs/2501.14592
Authors: Jiazhen Zhang, Yuexi Du, Nicha C. Dvornek, John A. Onofrey
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by IEEE ISBI 2025
Abstract:Automated segmentation plays a pivotal role in medical image analysis and computer-assisted interventions. Despite the promising performance of existing methods based on convolutional neural networks (CNNs), they neglect useful equivariant properties for images, such as rotational and reflection equivariance. This limitation can decrease performance and lead to inconsistent predictions, especially in applications like vessel segmentation where explicit orientation is absent. While existing equivariant learning approaches attempt to mitigate these issues, they substantially increase learning cost, model size, or both. To overcome these challenges, we propose a novel application of an efficient symmetric rotation-equivariant (SRE) convolutional (SRE-Conv) kernel implementation to the U-Net architecture, to learn rotation and reflection-equivariant features, while also reducing the model size dramatically. We validate the effectiveness of our method through improved segmentation performance on retina vessel fundus imaging. Our proposed SRE U-Net not only significantly surpasses standard U-Net in handling rotated images, but also outperforms existing equivariant learning methods and does so with a reduced number of trainable parameters and smaller memory cost. The code is available at this https URL.
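One way to obtain convolutions equivariant to 90-degree rotations and reflections is to average each kernel over the eight symmetries of the dihedral group D4, as sketched below. The paper's SRE-Conv parametrization (and the mechanism behind its parameter savings) may differ, so treat this as an illustrative assumption.

```python
import torch

def symmetrize_kernel(w):
    """Project a conv kernel onto the rotation/reflection-symmetric subspace
    by averaging over the 8 elements of the dihedral group D4 (sketch).
    Convolving with such a kernel commutes with 90-degree rotations/flips."""
    rots = [torch.rot90(w, k, dims=(-2, -1)) for k in range(4)]
    flips = [torch.flip(r, dims=(-1,)) for r in rots]
    return torch.stack(rots + flips).mean(dim=0)

w = torch.randn(16, 8, 5, 5)       # (out_channels, in_channels, kH, kW)
w_sym = symmetrize_kernel(w)       # symmetric kernel for an equivariant conv layer
```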
[CV-91] Scene Understanding Enabled Semantic Communication with Open Channel Coding
【Quick Read】: This paper addresses the limitations of traditional semantic communication in sixth-generation (6G) networks: static coding strategies, poor generalization, and reliance on task-specific knowledge bases, all of which restrict adaptability and flexibility. The proposed OpenSC system combines scene understanding, Large Language Models (LLMs), and open channel coding. Its key is to draw on publicly available shared knowledge for dynamic, adaptive encoding, reducing dependence on static task-specific data and improving adaptability across tasks and environments. The system also uses scene graphs for structured semantic encoding, capturing object relationships and context to improve tasks such as Visual Question Answering (VQA), and selectively encodes key semantic elements to minimize redundancy and improve transmission efficiency. Experiments show significant gains in both semantic understanding and efficiency, advancing adaptive, generalizable semantic communication for 6G networks.
Link: https://arxiv.org/abs/2501.14520
Authors: Zhe Xiang, Fei Yu, Quan Deng, Yuandi Li, Zhiguo Wan
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:As communication systems transition from symbol transmission to conveying meaningful information, sixth-generation (6G) networks emphasize semantic communication. This approach prioritizes high-level semantic information, improving robustness and reducing redundancy across modalities like text, speech, and images. However, traditional semantic communication faces limitations, including static coding strategies, poor generalization, and reliance on task-specific knowledge bases that hinder adaptability. To overcome these challenges, we propose a novel system combining scene understanding, Large Language Models (LLMs), and open channel coding, named \textbfOpenSC. Traditional systems rely on fixed domain-specific knowledge bases, limiting their ability to generalize. Our open channel coding approach leverages shared, publicly available knowledge, enabling flexible, adaptive encoding. This dynamic system reduces reliance on static task-specific data, enhancing adaptability across diverse tasks and environments. Additionally, we use scene graphs for structured semantic encoding, capturing object relationships and context to improve tasks like Visual Question Answering (VQA). Our approach selectively encodes key semantic elements, minimizing redundancy and improving transmission efficiency. Experimental results show significant improvements in both semantic understanding and efficiency, advancing the potential of adaptive, generalizable semantic communication in 6G networks.
[CV-92] Registration of Longitudinal Liver Examinations for Tumor Progress Assessment
【Quick Read】: This paper addresses the clinical challenge of assessing cancer progression in liver CT scans, particularly misalignment between scans taken at different times. Such misalignment can stem from non-rigid deformations and the appearance or disappearance of pathologies, so existing registration methods based mainly on intrinsic image features may distort tumor regions, biasing progression assessment and diagnosis. The proposed registration method relies solely on geometrical and anatomical information from liver segmentation to align longitudinal liver images for aided diagnosis, producing smoother deformations while preserving the tumor burden (the total volume of tissue considered tumor). Trained on 317 patients and tested on 53, it outperforms other registration techniques, with qualitative results emphasizing how smooth deformations help preserve tumor appearance.
Link: https://arxiv.org/abs/2501.14483
Authors: Walid Yassine, Martin Charachon, Céline Hudelot, Roberto Ardon
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:
Abstract:Assessing cancer progression in liver CT scans is a clinical challenge, requiring a comparison of scans at different times for the same patient. Practitioners must identify existing tumors, compare them with prior exams, identify new tumors, and evaluate overall disease evolution. This process is particularly complex in liver examinations due to misalignment between exams caused by several factors. Indeed, longitudinal liver examinations can undergo different non-pathological and pathological changes due to non-rigid deformations, the appearance or disappearance of pathologies, and other variations. In such cases, existing registration approaches, mainly based on intrinsic features may distort tumor regions, biasing the tumor progress evaluation step and the corresponding diagnosis. This work proposes a registration method based only on geometrical and anatomical information from liver segmentation, aimed at aligning longitudinal liver images for aided diagnosis. The proposed method is trained and tested on longitudinal liver CT scans, with 317 patients for training and 53 for testing. Our experimental results support our claims by showing that our method is better than other registration techniques by providing a smoother deformation while preserving the tumor burden (total volume of tissues considered as tumor) within the volume. Qualitative results emphasize the importance of smooth deformations in preserving tumor appearance.
[CV-93] ECTIL: Label-efficient Computational Tumour Infiltrating Lymphocyte (TIL) assessment in breast cancer: Multicentre validation in 2340 patients with breast cancer
【Quick Read】: This paper addresses the assessment of tumour-infiltrating lymphocyte (TILs) levels in patients with (triple-negative) breast cancer, a prognostic factor whose existing computational TIL assessment (CTA) models depend on many detailed pathologist annotations, making them time-consuming and labor-intensive. The proposed ECTIL (label-Efficient Computational stromal TIL assessment model) is a fundamentally simpler deep-learning CTA that trains in about ten minutes on a hundredfold fewer annotations: morphological features are extracted from whole slide images (WSIs) with a pathology foundation model, and ECTIL directly regresses the TILs score from these features (a minimal regression-head sketch follows the abstract below). Across five heterogeneous external cohorts, ECTIL trained on only a few hundred samples is concordant with the pathologist (r=0.54-0.74, AUROC=0.80-0.94), and training on all slides of five cohorts improves held-out results (r=0.69, AUROC=0.85). In multivariable Cox regression, every 10% increase in ECTIL score is independently associated with improved overall survival (HR 0.86, p<0.01), similar to the pathologist score (HR 0.87, p<0.001). Such a CTA could pre-screen patients for, e.g., immunotherapy trial inclusion, or assist clinicians in the diagnostic work-up of patients with breast cancer.
Link: https://arxiv.org/abs/2501.14379
Authors: Yoni Schirris, Rosie Voorthuis, Mark Opdam, Marte Liefaard, Gabe S Sonke, Gwen Dackus, Vincent de Jong, Yuwei Wang, Annelot Van Rossum, Tessa G Steenbruggen, Lars C Steggink, Liesbeth G.E. de Vries, Marc van de Vijver, Roberto Salgado, Efstratios Gavves, Paul J van Diest, Sabine C Linn, Jonas Teuwen, Renee Menezes, Marleen Kok, Hugo Horlings
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review. 54 pages including supplementary materials, 2 main tables, 3 main figures, 14 supplementary figures, 4 supplementary tables
Abstract:The level of tumour-infiltrating lymphocytes (TILs) is a prognostic factor for patients with (triple-negative) breast cancer (BC). Computational TIL assessment (CTA) has the potential to assist pathologists in this labour-intensive task, but current CTA models rely heavily on many detailed annotations. We propose and validate a fundamentally simpler deep learning based CTA that can be trained in only ten minutes on hundredfold fewer pathologist annotations. We collected whole slide images (WSIs) with TILs scores and clinical data of 2,340 patients with BC from six cohorts including three randomised clinical trials. Morphological features were extracted from whole slide images (WSIs) using a pathology foundation model. Our label-efficient Computational stromal TIL assessment model (ECTIL) directly regresses the TILs score from these features. ECTIL trained on only a few hundred samples (ECTIL-TCGA) showed concordance with the pathologist over five heterogeneous external cohorts (r=0.54-0.74, AUROC=0.80-0.94). Training on all slides of five cohorts (ECTIL-combined) improved results on a held-out test set (r=0.69, AUROC=0.85). Multivariable Cox regression analyses indicated that every 10% increase of ECTIL scores was associated with improved overall survival independent of clinicopathological variables (HR 0.86, p<0.01), similar to the pathologist score (HR 0.87, p<0.001). We demonstrate that ECTIL is highly concordant with an expert pathologist and obtains a similar hazard ratio. ECTIL has a fundamentally simpler design than existing methods and can be trained on orders of magnitude fewer annotations. Such a CTA may be used to pre-screen patients for, e.g., immunotherapy clinical trial inclusion, or as a tool to assist clinicians in the diagnostic work-up of patients with BC. Our model is available under an open source licence (this https URL).
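A minimal sketch of the ECTIL idea, regressing one TILs score from foundation-model tile features. Mean pooling, the MLP head, and the feature dimension are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class TILsRegressor(nn.Module):
    """ECTIL-style head (sketch): pool tile features from a pathology
    foundation model, then regress a single stromal TILs score."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, tile_feats):                # (n_tiles, feat_dim) for one slide
        slide_emb = tile_feats.mean(dim=0)        # simple slide-level mean pooling
        return torch.sigmoid(self.head(slide_emb)) * 100.0   # TILs score in [0, 100]

model = TILsRegressor()
score = model(torch.randn(500, 768))              # 500 hypothetical tile embeddings
```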
[CV-94] Automatic detection and prediction of nAMD activity change in retinal OCT using Siamese networks and Wasserstein Distance for ordinality MICCAI2024
【Quick Read】: This paper addresses disease-activity detection and progression prediction for neovascular age-related macular degeneration (nAMD), which are critical for timely drug administration and improved patient outcomes. For the two tasks of the public MARIO Challenge at MICCAI 2024, it proposes deep learning models that predict changes in nAMD severity from longitudinal retinal OCT volumes. The key is twofold: first, a Vision Transformer (ViT)-based Siamese network detects severity change by comparing scan embeddings of a patient from different time points; second, for forecasting the change after 3 months, an Earth Mover (Wasserstein) Distance-based loss is exploited, for the first time, to harness the ordinal relation among the severity-change classes (a minimal EMD-loss sketch follows the abstract below). Both models ranked high on the preliminary leaderboard, suggesting their predictions could support nAMD treatment management.
Link: https://arxiv.org/abs/2501.14323
Authors: Taha Emre, Teresa Araújo, Marzieh Oghbaie, Dmitrii Lachinov, Guilherme Aresta, Hrvoje Bogunović
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Solution to the MICCAI 2024 MARIO Challenge. First 3 authors contributed equally. Models can be found at this https URL
Abstract:Neovascular age-related macular degeneration (nAMD) is a leading cause of vision loss among older adults, where disease activity detection and progression prediction are critical for nAMD management in terms of timely drug administration and improving patient outcomes. Recent advancements in deep learning offer a promising solution for predicting changes in AMD from optical coherence tomography (OCT) retinal volumes. In this work, we proposed deep learning models for the two tasks of the public MARIO Challenge at MICCAI 2024, designed to detect and forecast changes in nAMD severity with longitudinal retinal OCT. For the first task, we employ a Vision Transformer (ViT) based Siamese Network to detect changes in AMD severity by comparing scan embeddings of a patient from different time points. To train a model to forecast the change after 3 months, we exploit, for the first time, an Earth Mover (Wasserstein) Distance-based loss to harness the ordinal relation within the severity change classes. Both models ranked high on the preliminary leaderboard, demonstrating that their predictive capabilities could facilitate nAMD treatment management.
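The ordinal loss can be sketched as a squared Earth Mover's Distance between the predicted and target class distributions, computed via their CDFs; comparing CDFs penalizes predictions far from the true class more heavily, which encodes the ordinal structure. The class count and mean reduction are assumptions.

```python
import torch
import torch.nn.functional as F

def emd_loss(logits, targets, n_classes=3):
    """Squared Earth Mover's Distance between predicted and target class
    distributions (sketch). CDF differences grow with ordinal distance,
    unlike plain cross-entropy, which treats all wrong classes equally."""
    p = F.softmax(logits, dim=-1)
    t = F.one_hot(targets, n_classes).float()
    cdf_p, cdf_t = p.cumsum(dim=-1), t.cumsum(dim=-1)
    return ((cdf_p - cdf_t) ** 2).sum(dim=-1).mean()

loss = emd_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)))
```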
[CV-95] Snapshot multi-spectral imaging through defocusing and a Fourier imager network
【Quick Read】: This paper addresses the need for extra spectral filters or customized components in conventional multi-spectral imaging systems, which add complexity and cost. It proposes a snapshot multi-spectral imaging approach using only a standard monochrome image sensor. The key is to exploit wavelength-dependent defocusing (chromatic aberration) as a natural physical encoding of multi-spectral information, and to decode it rapidly with a deep-learning-based multi-spectral Fourier Imager Network (mFIN). Experiments with six illumination bands show 92.98% accuracy in predicting the input illumination channels and robust multi-spectral image reconstruction on various test objects. The framework could serve applications in biomedicine, industrial quality control, and agriculture, among others.
Link: https://arxiv.org/abs/2501.14287
Authors: Xilin Yang, Michael John Fanous, Hanlong Chen, Ryan Lee, Paloma Casteleiro Costa, Yuhang Li, Luzhe Huang, Yijie Zhang, Aydogan Ozcan
Affiliations: Unknown
Subjects: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Comments: 22 Pages, 7 Figures
Abstract:Multi-spectral imaging, which simultaneously captures the spatial and spectral information of a scene, is widely used across diverse fields, including remote sensing, biomedical imaging, and agricultural monitoring. Here, we introduce a snapshot multi-spectral imaging approach employing a standard monochrome image sensor with no additional spectral filters or customized components. Our system leverages the inherent chromatic aberration of wavelength-dependent defocusing as a natural source of physical encoding of multi-spectral information; this encoded image information is rapidly decoded via a deep learning-based multi-spectral Fourier Imager Network (mFIN). We experimentally tested our method with six illumination bands and demonstrated an overall accuracy of 92.98% for predicting the illumination channels at the input and achieved a robust multi-spectral image reconstruction on various test objects. This deep learning-powered framework achieves high-quality multi-spectral image reconstruction using snapshot image acquisition with a monochrome image sensor and could be useful for applications in biomedicine, industrial quality control, and agriculture, among others.
[CV-96] Deep Learning-Powered Classification of Thoracic Diseases in Chest X-Rays
【Quick Read】: This paper addresses the challenges chest X-rays pose for diagnosing respiratory diseases such as pneumonia, tuberculosis, and COVID-19: overlapping visual features, variable image quality, severe class imbalance, and the complexity of medical images, all of which hinder automated analysis. The key of the solution is deep learning: pre-trained models (AlexNet, ResNet, and InceptionNet) are fine-tuned via transfer learning, and focal loss is incorporated to handle class imbalance (a minimal focal-loss sketch follows the abstract below), yielding significant performance gains. Grad-CAM visualizations further improve interpretability by revealing the clinically relevant regions driving predictions. The InceptionV3 model, for instance, achieves a 28% improvement in AUC and a 15% increase in F1-Score, highlighting deep learning's potential to improve diagnostic workflows and support clinical decision-making.
Link: https://arxiv.org/abs/2501.14279
Authors: Yiming Lei, Michael Nguyen, Tzu Chia Liu, Hyounkyun Oh
Affiliations: Institution1; Institution2
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Chest X-rays play a pivotal role in diagnosing respiratory diseases such as pneumonia, tuberculosis, and COVID-19, which are prevalent and present unique diagnostic challenges due to overlapping visual features and variability in image quality. Severe class imbalance and the complexity of medical images hinder automated analysis. This study leverages deep learning techniques, including transfer learning on pre-trained models (AlexNet, ResNet, and InceptionNet), to enhance disease detection and classification. By fine-tuning these models and incorporating focal loss to address class imbalance, significant performance improvements were achieved. Grad-CAM visualizations further enhance model interpretability, providing insights into clinically relevant regions influencing predictions. The InceptionV3 model, for instance, achieved a 28% improvement in AUC and a 15% increase in F1-Score. These findings highlight the potential of deep learning to improve diagnostic workflows and support clinical decision-making.
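Focal loss, the class-imbalance ingredient above, in a few lines. The gamma and alpha values are the commonly used defaults from the focal loss literature, and the 14-class toy shapes are hypothetical, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-class focal loss (sketch): down-weights easy, well-classified
    examples so training focuses on rare and hard classes."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                            # probability of the true class
    return (alpha * (1.0 - pt) ** gamma * ce).mean()

loss = focal_loss(torch.randn(16, 14), torch.randint(0, 14, (16,)))
```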
[CV-97] CDI: Blind Image Restoration Fidelity Evaluation based on Consistency with Degraded Image
【Quick Read】: This paper addresses the evaluation of Blind Image Restoration (BIR) methods: existing full-reference IQA methods often rate restored images of high perceptual quality poorly. The paper reassesses the Solution Non-Uniqueness and Degradation Indeterminacy issues of BIR and argues for a BIR-specific IQA system. The key is that, instead of directly comparing a restored image with a reference image, fidelity is evaluated by computing Consistency with the Degraded Image (CDI). Specifically, a wavelet-domain Reference Guided CDI algorithm measures consistency with a degraded image for various degradation types (downsampling, blur, noise, JPEG, and complex combinations) without requiring knowledge of the degradation parameters; a Reference Agnostic CDI further enables BIR fidelity evaluation without reference images. To validate the rationality of CDI, a new Degraded Images Switch Display Comparison Dataset (DISDCD) is created for subjective evaluation of BIR fidelity; experiments on it verify that CDI is markedly superior to common full-reference IQA methods for BIR fidelity evaluation.
Link: https://arxiv.org/abs/2501.14264
Authors: Xiaojun Tang, Jingru Wang, Guangwei Huang, Guannan Chen, Rui Zheng, Lian Huai, Yuyu Liu, Xingqun Jiang
Affiliations: BOE Technology Group Co., Ltd
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in Blind Image Restoration (BIR) methods, based on Generative Adversarial Networks and Diffusion Models, have significantly improved visual quality. However, they present significant challenges for Image Quality Assessment (IQA), as the existing Full-Reference IQA methods often rate images with high perceptual quality poorly. In this paper, we reassess the Solution Non-Uniqueness and Degradation Indeterminacy issues of BIR, and propose constructing a specific BIR IQA system. Instead of directly comparing a restored image with a reference image, the BIR IQA evaluates fidelity by calculating the Consistency with Degraded Image (CDI). Specifically, we propose a wavelet domain Reference Guided CDI algorithm, which can acquire the consistency with a degraded image for various degradation types without requiring knowledge of degradation parameters. The supported degradation types include downsampling, blur, noise, JPEG compression, and complex combined degradations. In addition, we propose a Reference Agnostic CDI, enabling BIR fidelity evaluation without reference images. Finally, in order to validate the rationality of CDI, we create a new Degraded Images Switch Display Comparison Dataset (DISDCD) for subjective evaluation of BIR fidelity. Experiments conducted on DISDCD verify that CDI is markedly superior to common Full Reference IQA methods for BIR fidelity evaluation. The source code and the DISDCD dataset will be publicly available shortly.
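下面是“在小波域衡量恢复图像与退化图像一致性”这一思路的简化示意(假设两幅图像尺寸一致,小波基任选 haar),并非论文 Reference Guided / Reference Agnostic CDI 算法的复现:

```python
# 简化示意:小波低频子带上的一致性度量(非论文 CDI 算法本身)
import numpy as np
import pywt

def wavelet_consistency(restored, degraded, wavelet="haar"):
    cA_r, _ = pywt.dwt2(restored, wavelet)  # 恢复图像的低频子带
    cA_d, _ = pywt.dwt2(degraded, wavelet)  # 退化图像的低频子带
    # 此处假设两图同尺寸;实际 BIR 中还需按退化类型对齐分辨率与尺度
    return -float(np.mean((cA_r - cA_d) ** 2))  # 数值越大表示越一致
```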
zh
[CV-98] Sparse Mixture-of-Experts for Non-Uniform Noise Reduction in MRI Images WACV
【速读】:该论文旨在解决磁共振成像(MRI)图像中的噪声问题,特别是非均匀噪声对图像质量的负面影响。传统去噪方法通常假设噪声分布均匀,难以有效处理MRI图像中常见的非均匀噪声。论文提出了一种基于稀疏专家混合框架(sparse mixture-of-experts framework)的新方法,通过多个专门针对不同图像区域噪声特性的去噪卷积神经网络(denoising convolutional neural network)进行精细调优,从而实现对MRI图像的高效去噪。该方法的优势在于其能够显著提升图像质量,同时保留解剖结构,并在合成和真实脑部MRI数据集上表现出优于现有技术的性能,且具有良好的泛化能力和适应性。
链接: https://arxiv.org/abs/2501.14198
作者: Zeyun Deng,Joseph Campbell
机构: Purdue University(普渡大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the WACV Workshop on Image Quality
点击查看摘要
Abstract:Magnetic Resonance Imaging (MRI) is an essential diagnostic tool in clinical settings, but its utility is often hindered by noise artifacts introduced during the imaging process. Denoising is critical for enhancing image quality while preserving anatomical structures. However, traditional denoising methods, which often assume uniform noise distributions, struggle to handle the non-uniform noise commonly present in MRI images. In this paper, we introduce a novel approach leveraging a sparse mixture-of-experts framework for MRI image denoising. Each expert is a specialized denoising convolutional neural network fine-tuned to target specific noise characteristics associated with different image regions. Our method demonstrates superior performance over state-of-the-art denoising techniques on both synthetic and real-world brain MRI datasets. Furthermore, we show that it generalizes effectively to unseen datasets, highlighting its robustness and adaptability.
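以下是稀疏专家混合去噪器的结构示意:门控网络为每个输入做 top-1 路由,只激活一个专家。为便于阅读,每个“专家”仅用单层卷积占位,且路由粒度简化为整幅图像而非论文中的图像区域:

```python
# 结构示意:top-1 稀疏路由的专家混合去噪器(专家用单层卷积占位)
import torch
import torch.nn as nn

class SparseMoEDenoiser(nn.Module):
    def __init__(self, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv2d(1, 1, 3, padding=1) for _ in range(num_experts)]
        )
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1, num_experts)
        )

    def forward(self, x):  # x: (B, 1, H, W)
        idx = self.gate(x).argmax(dim=-1)  # 每幅图只激活得分最高的一个专家
        out = [self.experts[int(i)](x[j : j + 1]) for j, i in enumerate(idx)]
        return torch.cat(out, dim=0)
```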
zh
[CV-99] Fully Guided Neural Schrödinger Bridge for Brain MR Image Synthesis
【速读】:该论文试图解决多模态脑部 MRI(Magnetic Resonance Imaging)成像中由于时间和成本限制难以获取所有模态的问题。传统方法主要分为配对(paired)和非配对(unpaired)两种,其中配对方法性能优越但难以获取大规模配对数据,而非配对方法虽然便于数据收集,但在保留关键图像特征(如肿瘤)方面表现不佳。为解决这些局限性,论文提出了一种基于神经薛定谔桥(Neural Schrödinger Bridges)的新框架——完全引导薛定谔桥(Fully Guided Schrödinger Bridges, FGSB)。FGSB 的关键在于利用少量配对数据实现稳定且高质量的缺失模态生成,并通过结合真实标签或特定区域的分割网络,显著减少数据需求的同时保留关键区域特征。该框架包含两个阶段:生成阶段通过融合生成图像、配对参考图像和高斯噪声,采用迭代优化避免模式崩溃并提升生成质量;训练阶段则学习从生成图像到目标模态的映射。实验表明,FGSB 仅需两个受试者的数据即可达到与大规模数据集训练方法相当的生成性能,且在结合病灶信息时显著提升了关键病灶特征的保留能力。
链接: https://arxiv.org/abs/2501.14171
作者: Hanyeol Yang,Sunggyu Kim,Yongseon Yoo,Jong-min Lee
机构: Hanyang University (汉阳大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 4 figures
点击查看摘要
Abstract:Multi-modal brain MRI provides essential complementary information for clinical diagnosis. However, acquiring all modalities is often challenging due to time and cost constraints. To address this, various methods have been proposed to generate missing modalities from available ones. Traditional approaches can be broadly categorized into two main types: paired and unpaired methods. While paired methods offer superior performance, obtaining large-scale paired datasets is challenging in real-world scenarios. Conversely, unpaired methods facilitate large-scale data collection but struggle to preserve critical image features, such as tumors. In this paper, we propose Fully Guided Schrödinger Bridges (FGSB), a novel framework based on Neural Schrödinger Bridges, to overcome these limitations. FGSB achieves stable, high-quality generation of missing modalities using minimal paired data. Furthermore, when provided with ground truth or a segmentation network for specific regions, FGSB can generate missing modalities while preserving these critical areas with reduced data requirements. Our proposed model consists of two consecutive phases. 1) Generation Phase: Fuses a generated image, a paired reference image, and Gaussian noise, employing iterative refinement to mitigate issues such as mode collapse and improve generation quality; 2) Training Phase: Learns the mapping from the generated image to the target modality. Experiments demonstrate that FGSB achieves comparable generation performance to methods trained on large datasets, while using data from only two subjects. Moreover, the utilization of lesion information with FGSB significantly enhances its ability to preserve crucial lesion features.
zh
[CV-100] Efficient 2D CT Foundation Model for Contrast Phase Classification
【速读】:该论文旨在利用2D基础模型(2D Foundation Model)开发一种对领域偏移具有鲁棒性的相位分类器,以解决医学影像中对比剂增强CT图像的相位分类问题。解决方案的关键在于使用2D基础模型从2D CT切片中生成嵌入(embeddings),并在VinDr Multiphase数据集上进行训练,随后在WAW-TACE数据集上进行外部验证。与传统的3D监督模型相比,该2D模型不仅训练速度更快,且在性能上表现相当或更好,同时对领域偏移表现出更强的鲁棒性。研究结果表明,该模型在非增强、动脉期和静脉期分类任务中表现出色,尤其在非增强和动脉期分类中AUROC和F1分数均较高,尽管静脉期分类由于标签不匹配问题表现稍逊。该模型的鲁棒性使其在自动化悬挂协议(hanging protocols)和AI算法的临床部署数据编排中具有潜在应用价值。
链接: https://arxiv.org/abs/2501.14066
作者: Benjamin Hou,Tejas Sudharshan Mathai,Pritam Mukherjee,Xinya Wang,Ronald M. Summers,Zhiyong Lu
机构: Division of Intramural Research, National Library of Medicine (国家医学图书馆院内研究部); National Center for Biotechnology Information, National Library of Medicine (国家医学图书馆国家生物技术信息中心); Imaging Biomarkers and Computer Aided Diagnosis Lab, Clinical Center (临床中心影像生物标志物与计算机辅助诊断实验室); National Institutes of Health, Bethesda, MD, USA (美国国立卫生研究院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Purpose: The purpose of this study is to harness the efficiency of a 2D foundation model to develop a robust phase classifier that is resilient to domain shifts. Materials and Methods: This retrospective study utilized three public datasets from separate institutions. A 2D foundation model was trained on the DeepLesion dataset (mean age: 51.2, s.d.: 17.6; 2398 males) to generate embeddings from 2D CT slices for downstream contrast phase classification. The classifier was trained on the VinDr Multiphase dataset and externally validated on the WAW-TACE dataset. The 2D model was also compared to three 3D supervised models. Results: On the VinDr dataset (146 male, 63 female, 56 unidentified), the model achieved near-perfect AUROC scores and F1 scores of 99.2%, 94.2%, and 93.1% for non-contrast, arterial, and venous phases, respectively. The 'Other' category scored lower (F1: 73.4%) due to combining multiple contrast phases into one class. On the WAW-TACE dataset (mean age: 66.1, s.d.: 10.0; 185 males), the model showed strong performance with AUROCs of 91.0% and 85.6%, and F1 scores of 87.3% and 74.1% for non-contrast and arterial phases. Venous phase performance was lower, with AUROC and F1 scores of 81.7% and 70.2% respectively, due to label mismatches. Compared to 3D supervised models, the approach trained faster, performed as well or better, and showed greater robustness to domain shifts. Conclusion: The robustness of the 2D Foundation model may be potentially useful for automation of hanging protocols and data orchestration for clinical deployment of AI algorithms.
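摘要描述的下游用法可以概括为“冻结的 2D 基础模型产生切片嵌入,再训练轻量分类器预测时相”。下面是该流程后半段的示意(假设嵌入已以数组形式给出;基础模型本身不在此复现):

```python
# 示意:在给定切片嵌入上训练轻量对比剂时相分类器(嵌入来源为假设)
import numpy as np
from sklearn.linear_model import LogisticRegression

def classify_phases(train_emb, train_labels, test_emb):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_labels)  # 标签如:非增强 / 动脉期 / 静脉期 / 其他
    return clf.predict_proba(test_emb)  # 每个测试切片属于各时相的概率
```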
zh
[CV-101] Leveraging Multiphase CT for Quality Enhancement of Portal Venous CT: Utility for Pancreas Segmentation
【速读】:该论文试图解决多相CT(Multiphase CT)扫描中由于低辐射剂量、不同扫描仪以及运动和金属伪影等因素导致的图像质量问题。具体而言,研究旨在通过利用多个低质量CT相(如非增强CT、动脉期和门静脉期)来提升某一相(如门静脉期)的图像质量,从而改善下游任务(如胰腺分割)的效果。解决方案的关键在于开发了一种三维渐进融合和非局部(3D Progressive Fusion and Non-Local, PFNL)网络,该网络通过融合多个低质量CT相来增强门静脉期的图像质量。实验结果表明,该方法在胰腺分割任务中比低质量CT扫描提升了3%的准确性。这是首次利用多相CT进行扫描质量增强并改善胰腺分割的研究。
链接: https://arxiv.org/abs/2501.14013
作者: Xinya Wang,Tejas Sudharshan Mathai,Boah Kim,Ronald M. Summers
机构: Radiology and Imaging Sciences, National Institutes of Health Clinical Center (放射学和影像科学, 美国国立卫生研究院临床中心)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: ISBI 2025
点击查看摘要
Abstract:Multiphase CT studies are routinely obtained in clinical practice for diagnosis and management of various diseases, such as cancer. However, the CT studies can be acquired with low radiation doses, different scanners, and are frequently affected by motion and metal artifacts. Prior approaches have targeted the quality improvement of one specific CT phase (e.g., non-contrast CT). In this work, we hypothesized that leveraging multiple CT phases for the quality enhancement of one phase may prove advantageous for downstream tasks, such as segmentation. A 3D progressive fusion and non-local (PFNL) network was developed. It was trained with three degraded (low-quality) phases (non-contrast, arterial, and portal venous) to enhance the quality of the portal venous phase. Then, the effect of scan quality enhancement was evaluated using a proxy task of pancreas segmentation, which is useful for tracking pancreatic cancer. The proposed approach improved the pancreas segmentation by 3% over the corresponding low-quality CT scan. To the best of our knowledge, we are the first to harness multiphase CT for scan quality enhancement and improved pancreas segmentation.
zh
[CV-102] Synthetic CT image generation from CBCT: A Systematic Review
【速读】:该论文旨在解决如何利用深度学习技术从锥形束CT(CBCT)数据生成合成CT(sCT)图像,以提升放射肿瘤学中的治疗规划和患者治疗效果。关键解决方案在于采用多种深度学习架构,包括卷积神经网络(CNNs)、生成对抗网络(GANs)、Transformer模型和扩散模型(diffusion models),这些方法能够有效地生成与金标准计划CT(pCT)相媲美的sCT图像。通过评估指标如平均绝对误差(MAE)、均方根误差(RMSE)、峰值信噪比(PSNR)和结构相似性指数(SSIM),研究证实了sCT图像在放射治疗中的潜在应用价值。此外,论文还探讨了临床应用中面临的挑战,如视野(FOV)差异和临床工作流程的整合,并提出了未来研究和标准化的建议。总体而言,该研究强调了sCT在个性化治疗规划和自适应放射治疗中的重要作用,有望改善肿瘤治疗的效果和患者护理。
链接: https://arxiv.org/abs/2501.13972
作者: Alzahra Altalib,Scott McGregor,Chunhui Li,Alessandro Perelli
机构: School of Science and Engineering, Center of Medical Engineering and Technology, University of Dundee (邓迪大学科学与工程学院, 医学工程与技术中心); Library & Learning Centre, University of Dundee (邓迪大学图书馆与学习中心)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 14 Figures, Accepted in the IEEE Transactions on Radiation and Plasma Medical Sciences
点击查看摘要
Abstract:The generation of synthetic CT (sCT) images from cone-beam CT (CBCT) data using deep learning methodologies represents a significant advancement in radiation oncology. This systematic review, following PRISMA guidelines and using the PICO model, comprehensively evaluates the literature from 2014 to 2024 on the generation of sCT images for radiation therapy planning in oncology. A total of 35 relevant studies were identified and analyzed, revealing the prevalence of deep learning approaches in the generation of sCT. This review comprehensively covers synthetic CT generation based on CBCT and proton-based studies. Some of the commonly employed architectures explored are convolutional neural networks (CNNs), generative adversarial networks (GANs), transformers, and diffusion models. Evaluation metrics including mean absolute error (MAE), root mean square error (RMSE), peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) consistently demonstrate the comparability of sCT images with gold-standard planning CTs (pCT), indicating their potential to improve treatment precision and patient outcomes. Challenges such as field-of-view (FOV) disparities and integration into clinical workflows are discussed, along with recommendations for future research and standardization efforts. In general, the findings underscore the promising role of sCT-based approaches in personalized treatment planning and adaptive radiation therapy, with potential implications for improved oncology treatment delivery and patient care.
zh
[CV-103] Patch-Based and Non-Patch-Based inputs Comparison into Deep Neural Models: Application for the Segmentation of Retinal Diseases on Optical Coherence Tomography Volumes
【速读】:该论文旨在解决视网膜疾病(如年龄相关性黄斑变性,AMD)在医学图像中的自动分割问题,特别是通过深度学习模型提高分割精度。论文的核心问题在于如何通过改进输入数据的方式来提升模型性能。研究表明,传统的2D图像处理方法在定位液体体积疾病时存在精度不足的问题。为此,论文提出了一种基于重叠图像块(Patch-Based)的输入方法,相较于直接输入完整图像(NonPatch-Based),该方法显著提升了模型的分割性能。具体而言,使用重叠图像块的深度学习模型在Dice相似系数(DSC)指标上达到了0.88,远高于非图像块方法的0.71。这一结果表明,通过优化输入数据的方式,深度学习模型能够超越人类在医学图像分割中的表现,为视网膜疾病的自动分析提供了更高效的工具。
链接: https://arxiv.org/abs/2501.13970
作者: Khaled Al-Saih,Fares Al-Shargie,Mohammed Isam Al-hiyali,Reham Alhejaili
机构: Laboratoire LIMOS, CNRS UMR 6158, Université Clermont-Auvergne (法国克莱蒙-奥弗涅大学); The School of Health Professions, Department of Rehabilitation and Movement Sciences, Rutgers University (罗格斯大学); Medical Instruments Technology Engineering Department, AL Mansour University College (曼苏尔大学学院); Department of Computer Science and Artificial Intelligence, College of Computer Science and Engineering, University of Jeddah (吉达大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, 1 figure, 2 tables, submitted to 15th IEEE Symposium on Computer Applications & Industrial Electronics
点击查看摘要
Abstract:Worldwide, sight loss is commonly caused by retinal diseases, with age-related macular degeneration (AMD) being a notable condition that affects elderly patients. Approximately 170 million people worldwide have been diagnosed with AMD, a figure anticipated to rise to 288 million by 2040. For visualizing retinal layers, optical coherence tomography (OCT) offers the most compelling non-invasive method. Frequent patient visits have increased the demand for automated analysis of retinal diseases, and deep learning networks have shown promising results in both image- and pixel-level 2D scan classification. However, when relying solely on 2D data, accuracy may be impaired, especially when localizing fluid volume diseases. The goal of automatic techniques is to outperform humans in recognizing illnesses in medical data. In order to further understand the benefit of deep learning models, we studied the effects of the input size. The dice similarity coefficient (DSC) metric showed a human performance score of 0.71 for segmenting various retinal diseases. The deep models surpassed human performance, establishing a new level of accuracy in segmenting diseases on medical images. Moreover, feeding overlapping patches enhanced the performance of the deep models compared to feeding the full image: the highest DSC for a patch-based model was 0.88, compared to 0.71 for the same model with non-patch-based inputs on SRF fluid segmentation. The objective of this article is to show a fair comparison between deep learning models in relation to the input (patch-based vs. non-patch-based).
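摘要比较的“重叠图像块输入 vs. 整图输入”中,前者可以用如下滑窗方式得到(patch 大小与步长为假设的示例值):

```python
# 示意:按固定步长抽取相互重叠的图像块(尺寸为假设值)
import numpy as np

def extract_patches(img, patch=64, stride=32):
    h, w = img.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(img[y : y + patch, x : x + patch])
    return np.stack(patches)  # (N, patch, patch, ...)
```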
zh
[CV-104] LiCAR: pseudo-RGB LiDAR image for CAR segmentation
【速读】:该论文旨在解决基于LiDAR传感器数据的汽车实例分割问题。随着计算资源的进步,越来越多的神经网络被用于图像检测和分割,但这些方法通常以RGB二维图像作为输入。LiDAR传感器通过多层扫描生成类似于低分辨率RGB相机获取的图像。为此,论文提出了一种新的数据集,用于在伪RGB图像中进行汽车分割。该数据集将LiDAR传感器提供的信息(如反射率、近红外和信号强度)组合成球面范围图像(Spherical Range Image, SRI),并将其输入到实例分割神经网络中。通过使用YOLO-v8 large模型,该方案在边界框(Bounding Box, BB)和掩码精度上分别达到了88%和81.5%的精度。此外,论文还应用了跟踪器来在视频流中跟踪每个分割出的汽车实例,并在实际实验中表现出色。解决方案的关键在于将LiDAR数据转换为伪RGB图像,并利用高效的神经网络模型进行实例分割和跟踪。
链接: https://arxiv.org/abs/2501.13960
作者: Ignacio de Loyola Páez-Ubieta,Edison P. Velasco-Sánchez,Santiago T. Puente
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: This is a preprint version of the work accepted at 5th International Conference on Robotics, Computer Vision and Intelligent Systems (ROBOVIS 2025)
点击查看摘要
Abstract:With the advancement of computing resources, an increasing number of Neural Networks (NNs) for image detection and segmentation are appearing. However, these methods usually accept as input an RGB 2D image. On the other hand, Light Detection And Ranging (LiDAR) sensors with many layers provide images that are similar to those obtained from a traditional low-resolution RGB camera. Following this principle, a new dataset for segmenting cars in pseudo-RGB images has been generated. This dataset combines the information given by the LiDAR sensor into a Spherical Range Image (SRI), concretely the reflectivity, near infrared and signal intensity 2D images. These images are then fed into instance segmentation NNs, which segment the cars that appear in them, achieving a Bounding Box (BB) and mask precision of 88% and 81.5% respectively with You Only Look Once (YOLO)-v8 large. On top of this segmentation NN, trackers have been applied to follow each segmented car instance along a video feed, with strong performance in real-world experiments.
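伪 RGB 图像的构造思路可用下面的示意说明:把 LiDAR 的反射率、近红外与信号强度三幅 2D 图像分别归一化后堆叠成三通道图像,再交给 YOLOv8 这类以 RGB 为输入的分割网络;归一化方式为假设,非论文原实现:

```python
# 示意:三幅 LiDAR 通道图堆叠为伪 RGB(归一化方式为假设)
import numpy as np

def to_pseudo_rgb(reflectivity, near_ir, signal):
    def norm(c):  # 各通道线性拉伸到 0-255 的 8 位图像
        c = c.astype(np.float32)
        return (255 * (c - c.min()) / (np.ptp(c) + 1e-8)).astype(np.uint8)
    return np.dstack([norm(reflectivity), norm(near_ir), norm(signal)])
```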
zh
人工智能
[AI-0] An Attentive Graph Agent for Topology-Adaptive Cyber Defence
链接: https://arxiv.org/abs/2501.14700
作者: Ilya Orson Sandoval,Isaac Symes Thompson,Vasilios Mavroudis,Chris Hicks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:As cyber threats grow increasingly sophisticated, reinforcement learning is emerging as a promising technique to create intelligent, self-improving defensive systems. However, most existing autonomous defensive agents have overlooked the inherent graph structure of computer networks subject to cyber attacks, potentially missing critical information. To address this gap, we developed a custom version of the Cyber Operations Research Gym (CybORG) environment that encodes the observable network state as a directed graph, utilizing realistic and interpretable low-level features, such as the number of open ports and unexpectedly detected connections. We leverage a Graph Attention Network (GAT) architecture to process node, edge, and global features, and modify its output to be compatible with policy gradient methods in reinforcement learning. GAT policies offer several advantages over standard approaches based on simplistic flattened state observations. They can handle the changes in network topology that occur at runtime when dynamic connections between hosts appear. Policies can be deployed to networks that differ in size from the ones seen during training, enabling a degree of generalisation inaccessible with alternative approaches. Furthermore, the graph neural network policies' outputs are explainable in terms of tangible network properties, providing enhanced interpretability of defensive actions. We verify that our low-level graph observations are meaningful enough to train GAT defensive policies that are able to adapt to changing topologies. We evaluate how our trained policies perform when deployed on networks of varying sizes with the same subnetwork structure, comparing them against policies specifically trained for each network configuration. Our study contributes to the development of robust cyber defence systems that can better adapt to real-world network security challenges.
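摘要中“对任意规模网络拓扑输出逐节点动作”的 GAT 策略头,可以用下面的示意表达(假设已安装 torch_geometric;层数与维度为示例值,并非论文的原始结构):

```python
# 示意:GAT 策略头,对任意大小的图输出逐节点动作 logits(结构为假设)
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GATPolicy(nn.Module):
    def __init__(self, node_dim, hidden=64, num_actions=4):
        super().__init__()
        self.g1 = GATConv(node_dim, hidden, heads=2, concat=False)
        self.g2 = GATConv(hidden, hidden, heads=2, concat=False)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, x, edge_index):  # x: (N, node_dim),图大小 N 可变
        h = torch.relu(self.g1(x, edge_index))
        h = torch.relu(self.g2(h, edge_index))
        return self.head(h)  # (N, num_actions):逐节点动作 logits
```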
[AI-1] Towards Automated Self-Supervised Learning for Truly Unsupervised Graph Anomaly Detection
链接: https://arxiv.org/abs/2501.14694
作者: Zhong Li,Yuhang Wang,Matthijs van Leeuwen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Manuscript submitted to Data Mining and Knowledge Discovery in May 2024 for possible publication. This is the revised version submitted in January 2025
点击查看摘要
Abstract:Self-supervised learning (SSL) is an emerging paradigm that exploits supervisory signals generated from the data itself, and many recent studies have leveraged SSL to conduct graph anomaly detection. However, we empirically found that three important factors can substantially impact detection performance across datasets: 1) the specific SSL strategy employed; 2) the tuning of the strategy’s hyperparameters; and 3) the allocation of combination weights when using multiple strategies. Most SSL-based graph anomaly detection methods circumvent these issues by arbitrarily or selectively (i.e., guided by label information) choosing SSL strategies, hyperparameter settings, and combination weights. While an arbitrary choice may lead to subpar performance, using label information in an unsupervised setting is label information leakage and leads to severe overestimation of a method’s performance. Leakage has been criticized as “one of the top ten data mining mistakes”, yet many recent studies on SSL-based graph anomaly detection have been using label information to select hyperparameters. To mitigate this issue, we propose to use an internal evaluation strategy (with theoretical analysis) to select hyperparameters in SSL for unsupervised anomaly detection. We perform extensive experiments using 10 recent SSL-based graph anomaly detection algorithms on various benchmark datasets, demonstrating both the prior issues with hyperparameter selection and the effectiveness of our proposed strategy.
[AI-2] Decoding Generalization from Memorization in Deep Neural Networks
链接: https://arxiv.org/abs/2501.14687
作者: Simran Ketha,Venkatakrishnan Ramaswamy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Overparameterized Deep Neural Networks that generalize well have been key to the dramatic success of Deep Learning in recent years. The reasons for their remarkable ability to generalize are not well understood yet. It has also been known that deep networks possess the ability to memorize training data, as evidenced by perfect or high training accuracies on models trained with corrupted data that have class labels shuffled to varying degrees. Concomitantly, such models are known to generalize poorly, i.e. they suffer from poor test accuracies, due to which it is thought that the act of memorizing substantially degrades the ability to generalize. It has, however, been unclear why the poor generalization that accompanies such memorization comes about. One possibility is that in the process of training with corrupted data, the layers of the network irretrievably reorganize their representations in a manner that makes generalization difficult. The other possibility is that the network retains significant ability to generalize, but the trained network somehow chooses to readout in a manner that is detrimental to generalization. Here, we provide evidence for the latter possibility by demonstrating, empirically, that such models possess information in their representations for substantially improved generalization, even in the face of memorization. Furthermore, such generalization abilities can be easily decoded from the internals of the trained model, and we build a technique to do so from the outputs of specific layers of the network. We demonstrate results on multiple models trained with a number of standard datasets.
[AI-3] A Predictive Approach for Enhancing Accuracy in Remote Robotic Surgery Using Informer Model
链接: https://arxiv.org/abs/2501.14678
作者: Muhammad Hanif Lashari,Shakil Ahmed,Wafa Batayneh,Ashfaq Khokhar
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Precise and real-time estimation of the robotic arm’s position on the patient’s side is essential for the success of remote robotic surgery in Tactile Internet (TI) environments. This paper presents a prediction model based on the Transformer-based Informer framework for accurate and efficient position estimation. Additionally, it combines a Four-State Hidden Markov Model (4-State HMM) to simulate realistic packet loss scenarios. The proposed approach addresses challenges such as network delays, jitter, and packet loss to ensure reliable and precise operation in remote surgical applications. The method integrates the optimization problem into the Informer model by embedding constraints such as energy efficiency, smoothness, and robustness into its training process using a differentiable optimization layer. The Informer framework uses features such as ProbSparse attention, attention distilling, and a generative-style decoder to focus on position-critical features while maintaining a low computational complexity of O(L log L). The method is evaluated using the JIGSAWS dataset, achieving a prediction accuracy of over 90 percent under various network scenarios. A comparison with models such as TCN, RNN, and LSTM demonstrates the Informer framework’s superior performance in handling position prediction and meeting real-time requirements, making it suitable for Tactile Internet-enabled robotic surgery.
[AI-4] Neural-Symbolic Message Passing with Dynamic Pruning
链接: https://arxiv.org/abs/2501.14661
作者: Chongzhi Zhang,Junhao Zheng,Zhiping Peng,Qianli Ma
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages, 5 figures, 16 tables
点击查看摘要
Abstract:Complex Query Answering (CQA) over incomplete Knowledge Graphs (KGs) is a challenging task. Recently, a line of message-passing-based research has been proposed to solve CQA. However, these methods perform unsatisfactorily on negative queries and fail to address the noisy messages between variable nodes in the query graph. Moreover, they offer little interpretability and require complex query data and resource-intensive training. In this paper, we propose a Neural-Symbolic Message Passing (NSMP) framework based on pre-trained neural link predictors. By introducing symbolic reasoning and fuzzy logic, NSMP can generalize to arbitrary existential first order logic queries without requiring training while providing interpretable answers. Furthermore, we introduce a dynamic pruning strategy to filter out noisy messages between variable nodes. Experimental results show that NSMP achieves a strong performance. Additionally, through complexity analysis and empirical verification, we demonstrate the superiority of NSMP in inference time over the current state-of-the-art neural-symbolic method. Compared to this approach, NSMP demonstrates faster inference times across all query types on benchmark datasets, with speedups ranging from 2× to over 150×.
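NSMP 依靠模糊逻辑让符号推理保持可微。摘要未说明具体采用哪种 t-范数,下面以乘积 t-范数为例给出一组常见的可微模糊算子示意(对标量或张量均适用;具体选择为假设):

```python
# 示意:乘积 t-范数下的可微模糊算子(具体 t-范数选择为假设)
def fuzzy_and(a, b):  # 合取:乘积 t-范数
    return a * b

def fuzzy_or(a, b):  # 析取:对偶 t-余范数
    return a + b - a * b

def fuzzy_not(a):  # 否定:用于处理负查询(negative queries)
    return 1.0 - a
```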
[AI-5] MedAgentBench: Dataset for Benchmarking LLMs as Agents in Medical Applications
链接: https://arxiv.org/abs/2501.14654
作者: Yixing Jiang,Kameron C. Black,Gloria Geng,Danny Park,Andrew Y. Ng,Jonathan H. Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents, thereby surpassing their traditional role as chatbots. These agents can leverage their planning and tool utilization capabilities to address tasks specified at a high level. However, a standardized dataset to benchmark the agent capabilities of LLMs in medical applications is currently lacking, making the evaluation of LLMs on complex tasks in interactive healthcare environments challenging. To address this gap, we introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts. MedAgentBench encompasses 100 patient-specific clinically-derived tasks from 10 categories written by human physicians, realistic profiles of 100 patients with over 700,000 data elements, a FHIR-compliant interactive environment, and an accompanying codebase. The environment uses the standard APIs and communication infrastructure used in modern EMR systems, so it can be easily migrated into live EMR systems. MedAgentBench presents an unsaturated agent-oriented benchmark that current state-of-the-art LLMs exhibit some ability to succeed at. The best model (GPT-4o) achieves a success rate of 72%. However, there is still substantial space for improvement to give the community a next direction to optimize. Furthermore, there is significant variation in performance across task categories. MedAgentBench establishes this and is publicly available at this https URL, offering a valuable framework for model developers to track progress and drive continuous improvements in the agent capabilities of large language models within the medical domain.
[AI-6] Federated Domain Generalization with Data-free On-server Gradient Matching ICLR
链接: https://arxiv.org/abs/2501.14653
作者: Trong-Binh Nguyen,Minh-Duong Nguyen,Jinsun Park,Quoc-Viet Pham,Won Joo Hwang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
*备注: 26 pages, 15 figures, ICLR
点击查看摘要
Abstract:Domain Generalization (DG) aims to learn from multiple known source domains a model that can generalize well to unknown target domains. One of the key approaches in DG is training an encoder which generates domain-invariant representations. However, this approach is not applicable in Federated Domain Generalization (FDG), where data from various domains are distributed across different clients. In this paper, we introduce a novel approach, dubbed Federated Learning via On-server Matching Gradient (FedOMG), which can efficiently leverage domain information from distributed domains. Specifically, we utilize the local gradients as information about the distributed models to find an invariant gradient direction across all domains through gradient inner product maximization. The advantages are two-fold: 1) FedOMG can aggregate the characteristics of distributed models on the centralized server without incurring any additional communication cost, and 2) FedOMG is orthogonal to many existing FL/FDG methods, allowing for additional performance improvements by being seamlessly integrated with them. Extensive experimental evaluations on various settings demonstrate the robustness of FedOMG compared to other FL/FDG baselines. Our method outperforms recent SOTA baselines on four FL benchmark datasets (MNIST, EMNIST, CIFAR-10, and CIFAR-100), and three FDG benchmark datasets (PACS, VLCS, and OfficeHome).
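FedOMG 的核心是在服务器端只利用各客户端上传的梯度,通过内积最大化找到对所有域都“不变”的更新方向。下面是该思路的一个投影梯度示意(此处最大化最坏情形内积,求解方式为假设,未必与论文一致):

```python
# 示意:寻找与所有客户端梯度内积尽量大的单位方向(求解器为假设)
import torch

def invariant_direction(client_grads, steps=100, lr=0.1):
    G = torch.stack(client_grads)  # (num_clients, dim)
    d = G.mean(dim=0).clone().requires_grad_(True)
    opt = torch.optim.SGD([d], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -(G @ d).min()  # 抬高与最差客户端梯度的内积
        loss.backward()
        opt.step()
        with torch.no_grad():
            d /= d.norm() + 1e-12  # 投影回单位球,只保留方向
    return d.detach()
```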
[AI-7] Whisper D-SGD: Correlated Noise Across Agents for Differentially Private Decentralized Learning
链接: https://arxiv.org/abs/2501.14644
作者: Angelo Rodio,Zheng Chen,Erik G. Larsson
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 6 pages, 3 figures, preprint
点击查看摘要
Abstract:Decentralized learning enables distributed agents to train a shared machine learning model through local computation and peer-to-peer communication. Although each agent retains its dataset locally, the communication of local models can still expose private information to adversaries. To mitigate these threats, local differential privacy (LDP) injects independent noise per agent, but it suffers a larger utility gap than central differential privacy (CDP). We introduce Whisper D-SGD, a novel covariance-based approach that generates correlated privacy noise across agents, unifying several state-of-the-art methods as special cases. By leveraging network topology and mixing weights, Whisper D-SGD optimizes the noise covariance to achieve network-wide noise cancellation. Experimental results show that Whisper D-SGD cancels more noise than existing pairwise-correlation schemes, substantially narrowing the CDP-LDP gap and improving model performance under the same privacy guarantees.
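Whisper D-SGD 的要点是按网络拓扑设计噪声协方差,使相邻节点的隐私噪声相关、在全网平均后部分抵消。下面示意如何用一个混合矩阵 L 把独立高斯噪声变成跨节点相关的噪声;如何由拓扑与混合权重选取 L 正是论文的贡献,此处 L 仅为占位假设:

```python
# 示意:用混合矩阵 L 生成跨节点相关的隐私噪声(L 的构造为假设)
import numpy as np

def correlated_noise(L, sigma=1.0, rng=None):
    rng = rng or np.random.default_rng()
    base = rng.normal(scale=sigma, size=L.shape[1])  # 独立高斯基噪声
    return L @ base  # 第 i 个分量是节点 i 注入的噪声,跨节点相关
```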
[AI-8] Recommending Actionable Strategies: A Semantic Approach to Integrating Analytical Frameworks with Decision Heuristics
链接: https://arxiv.org/abs/2501.14634
作者: Renato Ghisellini,Remo Pareschi,Marco Pedroni,Giovanni Battista Raggi
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We present a novel approach for recommending actionable strategies by integrating strategic frameworks with decision heuristics through semantic analysis. While strategy frameworks provide systematic models for assessment and planning, and decision heuristics encode experiential knowledge, these traditions have historically remained separate. Our methodology bridges this gap using advanced natural language processing (NLP), demonstrated through integrating frameworks like the 6C model with the Thirty-Six Stratagems. The approach employs vector space representations and semantic similarity calculations to map framework parameters to heuristic patterns, supported by a computational architecture that combines deep semantic processing with constrained use of Large Language Models. By processing both primary content and secondary elements (diagrams, matrices) as complementary linguistic representations, we demonstrate effectiveness through corporate strategy case studies. The methodology generalizes to various analytical frameworks and heuristic sets, culminating in a plug-and-play architecture for generating recommender systems that enable cohesive integration of strategic frameworks and decision heuristics into actionable guidance.
[AI-9] Extracting Problem Structure with LLMs for Optimized SAT Local Search
链接: https://arxiv.org/abs/2501.14630
作者: André Schilder,Stefan Szeider
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Local search preprocessing makes Conflict-Driven Clause Learning (CDCL) solvers faster by providing high-quality starting points and modern SAT solvers have incorporated this technique into their preprocessing steps. However, these tools rely on basic strategies that miss the structural patterns in problems. We present a method that applies Large Language Models (LLMs) to analyze Python-based encoding code. This reveals hidden structural patterns in how problems convert into SAT. Our method automatically generates specialized local search algorithms that find these patterns and use them to create strong initial assignments. This works for any problem instance from the same encoding type. Our tests show encouraging results, achieving faster solving times compared to baseline preprocessing systems.
[AI-10] ACT-JEPA: Joint-Embedding Predictive Architecture Improves Policy Representation Learning
链接: https://arxiv.org/abs/2501.14622
作者: Aleksandar Vujinovic,Aleksandar Kovacevic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Learning efficient representations for decision-making policies is a challenge in imitation learning (IL). Current IL methods require expert demonstrations, which are expensive to collect. Consequently, they often have underdeveloped world models. Self-supervised learning (SSL) offers an alternative by allowing models to learn from diverse, unlabeled data, including failures. However, SSL methods often operate in raw input space, making them inefficient. In this work, we propose ACT-JEPA, a novel architecture that integrates IL and SSL to enhance policy representations. We train a policy to predict (1) action sequences and (2) abstract observation sequences. The first objective uses action chunking to improve action prediction and reduce compounding errors. The second objective extends this idea of chunking by predicting abstract observation sequences. We utilize Joint-Embedding Predictive Architecture to predict in abstract representation space, allowing the model to filter out irrelevant details, improve efficiency, and develop a robust world model. Our experiments show that ACT-JEPA improves the quality of representations by learning temporal environment dynamics. Additionally, the model’s ability to predict abstract observation sequences results in representations that effectively generalize to action sequence prediction. ACT-JEPA performs on par with established baselines across a range of decision-making tasks.
[AI-11] Leveraging Spatial Cues from Cochlear Implant Microphones to Efficiently Enhance Speech Separation in Real-World Listening Scenes
链接: https://arxiv.org/abs/2501.14610
作者: Feyisayo Olalere,Kiki van der Heijden,Christiaan H. Stronks,Jeroen Briaire,Johan HM Frijns,Marcel van Gerven
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: 10 pages, 5 figures
点击查看摘要
Abstract:Speech separation approaches for single-channel, dry speech mixtures have significantly improved. However, real-world spatial and reverberant acoustic environments remain challenging, limiting the effectiveness of these approaches for assistive hearing devices like cochlear implants (CIs). To address this, we quantify the impact of real-world acoustic scenes on speech separation and explore how spatial cues can enhance separation quality efficiently. We analyze performance based on implicit spatial cues (inherent in the acoustic input and learned by the model) and explicit spatial cues (manually calculated spatial features added as auxiliary inputs). Our findings show that spatial cues (both implicit and explicit) improve separation for mixtures with spatially separated and nearby talkers. Furthermore, spatial cues enhance separation when spectral cues are ambiguous, such as when voices are similar. Explicit spatial cues are particularly beneficial when implicit spatial cues are weak. For instance, single CI microphone recordings provide weaker implicit spatial cues than bilateral CIs, but even single CIs benefit from explicit cues. These results emphasize the importance of training models on real-world data to improve generalizability in everyday listening scenarios. Additionally, our statistical analyses offer insights into how data properties influence model performance, supporting the development of efficient speech separation approaches for CIs and other assistive devices in real-world settings.
[AI-12] Age and Power Minimization via Meta-Deep Reinforcement Learning in UAV Networks
链接: https://arxiv.org/abs/2501.14603
作者: Sankani Sarathchandra,Eslam Eldeeb,Mohammad Shehab,Hirley Alves,Konstantin Mikhaylov,Mohamed-Slim Alouini
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 8 figures
点击查看摘要
Abstract:Age-of-information (AoI) and transmission power are crucial performance metrics in low energy wireless networks, where information freshness is of paramount importance. This study examines a power-limited internet of things (IoT) network supported by a flying unmanned aerial vehicle (UAV) that collects data. Our aim is to optimize the UAV flight trajectory and scheduling policy to minimize a varying AoI and transmission power combination. To tackle this variation, this paper proposes a meta-deep reinforcement learning (RL) approach that integrates deep Q-networks (DQNs) with model-agnostic meta-learning (MAML). DQNs determine optimal UAV decisions, while MAML enables scalability across varying objective functions. Numerical results indicate that the proposed algorithm converges faster and adapts to new objectives more effectively than traditional deep RL methods, achieving minimal AoI and transmission power overall.
[AI-13] ZETA: Leveraging Z-order Curves for Efficient Top-k Attention ICLR
链接: https://arxiv.org/abs/2501.14577
作者: Qiuhao Zeng,Jerry Huang,Peng Lu,Gezheng Xu,Boxing Chen,Charles Ling,Boyu Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 25 pages, 4 figures, accepted in International Conference on Learning Representations (ICLR) 2025
点击查看摘要
Abstract:Over recent years, the Transformer has become a fundamental building block for sequence modeling architectures. Yet at its core is the use of self-attention, whose memory and computational cost grow quadratically with the sequence length N, rendering it prohibitively expensive for long sequences. A promising approach is top-k attention, which selects only the k most relevant tokens and achieves performance comparable to vanilla self-attention while significantly reducing space and computational demands. However, causal masks require the current query token to only attend to past tokens, preventing the existing top-k attention method from efficiently searching for the most relevant tokens in parallel, thereby limiting training efficiency. In this work, we propose ZETA, leveraging Z-Order Curves for Efficient Top-k Attention, to enable parallel querying of past tokens for entire sequences. We first theoretically show that the choice of key and query dimensions involves a trade-off between the curse of dimensionality and the preservation of relative distances after projection. In light of this insight, we propose reducing the dimensionality of keys and queries in contrast to values and further leverage Z-order curves to map low-dimensional keys and queries into one-dimensional space, which permits parallel sorting, thereby largely improving the efficiency for top-k token selection. Experimental results demonstrate that ZETA matches the performance of standard attention on the synthetic Multi-Query Associative Recall task and outperforms attention and its variants on Long Range Arena and WikiText-103 language modeling.
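ZETA 借助 Z 序(Morton)曲线把降维后的 key/query 坐标映射为一维可排序编码,排序后近邻关系大致保留,从而支持并行的 top-k 候选检索。下面是经典 Morton 编码的最小示意(坐标需先量化为非负整数,位宽为假设值):

```python
# 示意:按位交织各维坐标的经典 Morton(Z 序)编码
def morton_encode(coords, bits=10):
    code = 0
    for b in range(bits):
        for i, c in enumerate(coords):  # 逐位交织每个维度的第 b 位
            code |= ((c >> b) & 1) << (b * len(coords) + i)
    return code

# 例:morton_encode([3, 5]) 把二维点 (3, 5) 映射为一个可排序整数
```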
[AI-14] Hybrid Quantum-Classical Multi-Agent Pathfinding
链接: https://arxiv.org/abs/2501.14568
作者: Thore Gerlach,Loong Kuan Lee,Frédéric Barbaresco,Nico Piatkowski
类目: Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注:
点击查看摘要
Abstract:Multi-Agent Path Finding (MAPF) focuses on determining conflict-free paths for multiple agents navigating through a shared space to reach specified goal locations. This problem becomes computationally challenging, particularly when handling large numbers of agents, as frequently encountered in practical applications like coordinating autonomous vehicles. Quantum computing (QC) is a promising candidate in overcoming such limits. However, current quantum hardware is still in its infancy and thus limited in terms of computing power and error robustness. In this work, we present the first optimal hybrid quantum-classical MAPF algorithm which is based on branch-and-cut-and-price. QC is integrated by iteratively solving QUBO problems, based on conflict graphs. Experiments on actual quantum hardware and results on benchmark data suggest that our approach dominates previous QUBO formulations and baseline MAPF solvers.
[AI-15] Distributed Conformal Prediction via Message Passing
链接: https://arxiv.org/abs/2501.14544
作者: Haifeng Wen,Hong Xing,Osvaldo Simeone
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 16 pages, 11 figures, submitted for possible publication
点击查看摘要
Abstract:Post-hoc calibration of pre-trained models is critical for ensuring reliable inference, especially in safety-critical domains such as healthcare. Conformal Prediction (CP) offers a robust post-hoc calibration framework, providing distribution-free statistical coverage guarantees for prediction sets by leveraging held-out datasets. In this work, we address a decentralized setting where each device has limited calibration data and can communicate only with its neighbors over an arbitrary graph topology. We propose two message-passing-based approaches for achieving reliable inference via CP: quantile-based distributed conformal prediction (Q-DCP) and histogram-based distributed conformal prediction (H-DCP). Q-DCP employs distributed quantile regression enhanced with tailored smoothing and regularization terms to accelerate convergence, while H-DCP uses a consensus-based histogram estimation approach. Through extensive experiments, we investigate the trade-offs between hyperparameter tuning requirements, communication overhead, coverage guarantees, and prediction set sizes across different network topologies.
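Q-DCP 与 H-DCP 分布式估计的对象,本质是分裂共形预测中的 (1−α) 分位数阈值。下面给出集中式版本的最小示意,说明阈值与预测集如何构造(分布式分位数/直方图估计才是论文贡献,此处不涉及):

```python
# 示意:集中式分裂共形预测的阈值与预测集构造(分布式估计见论文)
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # 有限样本修正
    return np.quantile(cal_scores, q, method="higher")

def prediction_set(probs, threshold):
    # 将非一致性分数 (1 - prob) 不超过阈值的所有类别放入预测集
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```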
[AI-16] VERUS-LM: a Versatile Framework for Combining LLMs with Symbolic Reasoning
链接: https://arxiv.org/abs/2501.14540
作者: Benjamin Callewaert,Simon Vandevelde,Joost Vennekens
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:A recent approach to neurosymbolic reasoning is to explicitly combine the strengths of large language models (LLMs) and symbolic solvers to tackle complex reasoning tasks. However, current approaches face significant limitations, including poor generalizability due to task-specific prompts, inefficiencies caused by the lack of separation between knowledge and queries, and restricted inferential capabilities. These shortcomings hinder their scalability and applicability across diverse domains. In this paper, we introduce VERUS-LM, a novel framework designed to address these challenges. VERUS-LM employs a generic prompting mechanism, clearly separates domain knowledge from queries, and supports a wide range of different logical reasoning tasks. This framework enhances adaptability, reduces computational cost, and allows for richer forms of reasoning, such as optimization and constraint satisfaction. We show that our approach succeeds in diverse reasoning on a novel dataset, markedly outperforming LLMs. Additionally, our system achieves competitive results on common reasoning benchmarks when compared to other state-of-the-art approaches, and significantly surpasses them on the difficult AR-LSAT dataset. By pushing the boundaries of hybrid reasoning, VERUS-LM represents a significant step towards more versatile neurosymbolic AI systems
[AI-17] ABPT: Amended Backpropagation through Time with Partially Differentiable Rewards
链接: https://arxiv.org/abs/2501.14513
作者: Fanxing Li,Fangyu Sun,Tianbao Zhang,Danping Zou
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Using the exact gradients of the rewards to directly optimize policy parameters via backpropagation-through-time (BPTT) enables high training performance for quadrotor tasks. However, designing a fully differentiable reward architecture is often challenging. Partially differentiable rewards will result in biased gradient propagation that degrades training performance. To overcome this limitation, we propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT. ABPT combines 0-step and N-step returns, effectively reducing the bias by leveraging value gradients from the learned Q-value function. Additionally, it adopts entropy regularization and state initialization mechanisms to encourage exploration during training. We evaluate ABPT on four representative quadrotor flight tasks. Experimental results demonstrate that ABPT converges significantly faster and achieves higher ultimate rewards than existing learning algorithms, particularly in tasks involving partially differentiable rewards.
[AI-18] The Pseudo-Dimension of Contracts
链接: https://arxiv.org/abs/2501.14474
作者: Paul Duetting,Michal Feldman,Tomasz Ponitka,Ermis Soumalias
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
*备注:
点击查看摘要
Abstract:Algorithmic contract design studies scenarios where a principal incentivizes an agent to exert effort on her behalf. In this work, we focus on settings where the agent's type is drawn from an unknown distribution, and formalize an offline learning framework for learning near-optimal contracts from sample agent types. A central tool in our analysis is the notion of pseudo-dimension from statistical learning theory. Beyond its role in establishing upper bounds on the sample complexity, pseudo-dimension measures the intrinsic complexity of a class of contracts, offering a new perspective on the tradeoffs between simplicity and optimality in contract design. Our main results provide essentially optimal tradeoffs between pseudo-dimension and representation error (defined as the loss in principal's utility) with respect to linear and bounded contracts. Using these tradeoffs, we derive sample- and time-efficient learning algorithms, and demonstrate their near-optimality by providing almost matching lower bounds on the sample complexity. Conversely, for unbounded contracts, we prove an impossibility result showing that no learning algorithm exists. Finally, we extend our techniques in three important ways. First, we provide refined pseudo-dimension and sample complexity guarantees for the combinatorial actions model, revealing a novel connection between the number of critical values and sample complexity. Second, we extend our results to menus of contracts, showing that their pseudo-dimension scales linearly with the menu size. Third, we adapt our algorithms to the online learning setting, where we show that, a polynomial number of type samples suffice to learn near-optimal bounded contracts. Combined with prior work, this establishes a formal separation between expert advice and bandit feedback for this setting.
[AI-19] Pesti-Gen: Unleashing a Generative Molecule Approach for Toxicity Aware Pesticide Design
链接: https://arxiv.org/abs/2501.14469
作者: Taehan Kim,Wonduk Seo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Molecular Networks (q-bio.MN)
*备注: 9 pages, 2 figures, 5 tables
点击查看摘要
Abstract:Global climate change has reduced crop resilience and pesticide efficacy, making reliance on synthetic pesticides inevitable, even though their widespread use poses significant health and environmental risks. While these pesticides remain a key tool in pest management, previous machine-learning applications in pesticide and agriculture have focused on classification or regression, leaving the fundamental challenge of generating new molecular structures or designing novel candidates unaddressed. In this paper, we propose Pesti-Gen, a novel generative model based on variational auto-encoders, designed to create pesticide candidates with optimized properties for the first time. Specifically, Pesti-Gen leverages a two-stage learning process: an initial pre-training phase that captures a generalized chemical structure representation, followed by a fine-tuning stage that incorporates toxicity-specific information. The model simultaneously optimizes over multiple toxicity metrics, such as (1) livestock toxicity and (2) aqua toxicity to generate environmentally friendly pesticide candidates. Notably, Pesti-Gen achieves approximately 68% structural validity in generating new molecular structures, demonstrating the model’s effectiveness in producing optimized and feasible pesticide candidates, thereby providing a new way for safer and more sustainable pest management solutions.
[AI-20] Interpretability Analysis of Domain Adapted Dense Retrievers
链接: https://arxiv.org/abs/2501.14459
作者: Goksenin Yuksel,Jaap Kamps
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Dense retrievers have demonstrated significant potential for neural information retrieval; however, they exhibit a lack of robustness to domain shifts, thereby limiting their efficacy in zero-shot settings across diverse domains. Previous research has investigated unsupervised domain adaptation techniques to adapt dense retrievers to target domains. However, these studies have not focused on explainability analysis to understand how such adaptations alter the model’s behavior. In this paper, we propose utilizing the integrated gradients framework to develop an interpretability method that provides both instance-based and ranking-based explanations for dense retrievers. To generate these explanations, we introduce a novel baseline that reveals both query and document attributions. This method is used to analyze the effects of domain adaptation on input attributions for query and document tokens across two datasets: the financial question answering dataset (FIQA) and the biomedical information retrieval dataset (TREC-COVID). Our visualizations reveal that domain-adapted models focus more on in-domain terminology compared to non-adapted models, exemplified by terms such as “hedge,” “gold,” “corona,” and “disease.” This research addresses how unsupervised domain adaptation techniques influence the behavior of dense retrievers when adapted to new domains. Additionally, we demonstrate that integrated gradients are a viable choice for explaining and analyzing the internal mechanisms of these opaque neural models.
[AI-21] Learning more with the same effort: how randomization improves the robustness of a robotic deep reinforcement learning agent
链接: https://arxiv.org/abs/2501.14443
作者: Lucía Güitta-López,Jaime Boal,Álvaro J. López-López
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: This article was accepted and published in Applied Intelligence (https://doi.org/10.1007/s10489-022-04227-3)
点击查看摘要
Abstract:The industrial application of Deep Reinforcement Learning (DRL) is frequently slowed down because of the inability to generate the experience required to train the models. Collecting data often involves considerable time and economic effort that is unaffordable in most cases. Fortunately, devices like robots can be trained with synthetic experience thanks to virtual environments. With this approach, the sample efficiency problems of artificial agents are mitigated, but another issue arises: the need for efficiently transferring the synthetic experience into the real world (sim-to-real). This paper analyzes the robustness of a state-of-the-art sim-to-real technique known as progressive neural networks (PNNs) and studies how adding diversity to the synthetic experience can complement it. To better understand the drivers that lead to a lack of robustness, the robotic agent is still tested in a virtual environment to ensure total control on the divergence between the simulated and real models. The results show that a PNN-like agent exhibits a substantial decrease in its robustness at the beginning of the real training phase. Randomizing certain variables during simulation-based training significantly mitigates this issue. On average, the increase in the model's accuracy is around 25% when diversity is introduced in the training process. This improvement can be translated into a decrease in the required real experience for the same final robustness performance. Notwithstanding, adding real experience to agents should still be beneficial regardless of the quality of the virtual experience fed into the agent.
[AI-22] Adaptive Rank Allocation for Federated Parameter-Efficient Fine-Tuning of Language Models
链接: https://arxiv.org/abs/2501.14406
作者: Fei Wu,Jia Hu,Geyong Min,Shiqiang Wang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:Pre-trained Language Models (PLMs) have demonstrated their superiority and versatility in modern Natural Language Processing (NLP), effectively adapting to various downstream tasks through further fine-tuning. Federated Parameter-Efficient Fine-Tuning (FedPEFT) has emerged as a promising solution to address privacy and efficiency challenges in distributed training for PLMs on mobile devices. However, our measurements reveal two key limitations of FedPEFT: heterogeneous data leads to significant performance degradation, and a fixed parameter configuration results in communication inefficiency. To overcome these limitations, we propose FedARA, a novel Federated Adaptive Rank Allocation for parameter-efficient fine-tuning of language models. Specifically, FedARA employs truncated singular value decomposition (SVD) adaptation to enhance flexibility and expressiveness, significantly mitigating the adverse effects of data heterogeneity. Subsequently, it utilizes dynamic rank allocation to progressively identify critical ranks, effectively improving communication efficiency. Lastly, it leverages rank-based module pruning to remove inactive modules, steadily reducing local training time and peak memory usage in each round. Extensive experiments show that FedARA consistently outperforms weak baselines by an average of 8.49% and strong baselines by 6.95% across various datasets under data heterogeneity while significantly improving communication efficiency by 2.40×. Moreover, experiments on AGX Orin, Orin Nano and Raspberry Pi 5 devices demonstrate substantial decreases in total training time and energy consumption by up to 48.90% and 46.95%, respectively.
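FedARA 的实现细节未在摘要中展开,下面用截断 SVD 说明“秩分配”这一步的基本操作(示意;秩 r 在论文中是动态确定的):

```python
import torch

def truncate_rank(delta_w: torch.Tensor, r: int) -> torch.Tensor:
    # 对权重更新做截断 SVD,仅保留前 r 个奇异方向;r 可随训练进程自适应调整(示意)
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    return U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
```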
[AI-23] SKIL: Semantic Keypoint Imitation Learning for Generalizable Data-efficient Manipulation
链接: https://arxiv.org/abs/2501.14400
作者: Shengjie Wang,Jiacheng You,Yihang Hu,Jiongye Li,Yang Gao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 22 pages, 22 figures
点击查看摘要
Abstract:Real-world tasks such as garment manipulation and table rearrangement require robots to perform generalizable, highly precise, and long-horizon actions. Although imitation learning has proven to be an effective approach for teaching robots new skills, large amounts of expert demonstration data are still indispensable for these complex tasks, resulting in high sample complexity and costly data collection. To address this, we propose Semantic Keypoint Imitation Learning (SKIL), a framework that automatically obtains semantic keypoints with the help of vision foundation models and forms semantic keypoint descriptors, enabling efficient imitation learning of complex robotic tasks with significantly lower sample complexity. In real-world experiments, SKIL doubles the performance of baseline methods in tasks such as picking a cup or mouse, while demonstrating exceptional robustness to variations in objects, environmental changes, and distractors. For long-horizon tasks like hanging a towel on a rack, where previous methods fail completely, SKIL achieves a mean success rate of 70% with as few as 30 demonstrations. Furthermore, SKIL naturally supports cross-embodiment learning due to its semantic keypoint abstraction; our experiments demonstrate that even human videos bring considerable improvement to the learning performance. All these results demonstrate the great success of SKIL in achieving data-efficient generalizable robotic learning. Visualizations and code are available at: this https URL.
[AI-24] Handling Heterophily in Recommender Systems with Wavelet Hypergraph Diffusion
链接: https://arxiv.org/abs/2501.14399
作者: Darnbi Sakong,Thanh Tam Nguyen
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Recommender systems are pivotal in delivering personalised user experiences across various domains. However, capturing the heterophily patterns and the multi-dimensional nature of user-item interactions poses significant challenges. To address this, we introduce FWHDNN (Fusion-based Wavelet Hypergraph Diffusion Neural Networks), an innovative framework aimed at advancing representation learning in hypergraph-based recommendation tasks. The model incorporates three key components: (1) a cross-difference relation encoder leveraging heterophily-aware hypergraph diffusion to adapt message-passing for diverse class labels, (2) a multi-level cluster-wise encoder employing wavelet transform-based hypergraph neural network layers to capture multi-scale topological relationships, and (3) an integrated multi-modal fusion mechanism that combines structural and textual information through intermediate and late-fusion strategies. Extensive experiments on real-world datasets demonstrate that FWHDNN surpasses state-of-the-art methods in accuracy, robustness, and scalability in capturing high-order interconnections between users and items.
[AI-25] In System Alignments we Trust! Explainable Alignments via Projections
链接: https://arxiv.org/abs/2501.14360
作者: Dominique Sommers,Natalia Sidorova,Boudewijn van Dongen
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
*备注:
点击查看摘要
Abstract:Alignments are a well-known process mining technique for reconciling system logs and normative process models. Evidence of certain behaviors in a real system may only be present in one representation - either a log or a model - but not in the other. Since for processes in which multiple entities, like objects and resources, are involved in the activities, their interactions affect the behavior and are therefore essential to take into account in the alignments. Additionally, both logged and modeled representations of reality may be imprecise and only partially represent some of these entities, but not all. In this paper, we introduce the concept of “relaxations” through projections for alignments to deal with partially correct models and logs. Relaxed alignments help to distinguish between trustworthy and untrustworthy content of the two representations (the log and the model) to achieve a better understanding of the underlying process and expose quality issues.
[AI-26] HorNets: Learning from Discrete and Continuous Signals with Routing Neural Networks ACML
链接: https://arxiv.org/abs/2501.14346
作者: Boshko koloski,Nada Lavrač,Blaž Škrlj
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to the ACML conference journal track with the Machine Learning journal. The first and the last authors share an equal contribution
点击查看摘要
Abstract:Construction of neural network architectures suitable for learning from both continuous and discrete tabular data is a challenging research endeavor. Contemporary high-dimensional tabular data sets are often characterized by a relatively small instance count, requiring data-efficient learning. We propose HorNets (Horn Networks), a neural network architecture with state-of-the-art performance on synthetic and real-life data sets from scarce-data tabular domains. HorNets are based on a clipped polynomial-like activation function, extended by a custom discrete-continuous routing mechanism that decides which part of the neural network to optimize based on the input’s cardinality. By explicitly modeling parts of the feature combination space or combining the whole space in a linear attention-like manner, HorNets dynamically decide which mode of operation is the most suitable for a given piece of data with no explicit supervision. This architecture is one of the few approaches that reliably retrieves logical clauses (including noisy XNOR) and achieves state-of-the-art classification performance on 14 real-life biomedical high-dimensional data sets. HorNets are made freely available under a permissive license alongside a synthetic generator of categorical benchmarks.
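下面用一个可学习的截断多项式激活说明 HorNets 的节点级思路(最小示意;具体多项式形式与截断阈值均为假设):

```python
import torch

class ClippedPoly(torch.nn.Module):
    # 截断多项式激活(示意):每个神经元持有自己的多项式系数,输出被限制在 [-lim, lim]
    def __init__(self, n_units: int, lim: float = 3.0):
        super().__init__()
        self.a = torch.nn.Parameter(torch.ones(n_units))
        self.b = torch.nn.Parameter(torch.zeros(n_units))
        self.lim = lim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.clamp(self.a * x * x + self.b * x, -self.lim, self.lim)
```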
[AI-27] Exploring the sustainable scaling of AI dilemma: A projective study of corporations’ AI environmental impacts
链接: https://arxiv.org/abs/2501.14334
作者: Clément Desroches,Martin Chauvin,Louis Ladan,Caroline Vateau,Simon Gosset,Philippe Cordier
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The rapid growth of artificial intelligence (AI), particularly Large Language Models (LLMs), has raised concerns regarding its global environmental impact, which extends beyond greenhouse gas emissions to include hardware fabrication and end-of-life processes. The opacity from major providers hinders companies’ abilities to evaluate their AI-related environmental impacts and achieve net-zero targets. In this paper, we propose a methodology to estimate the environmental impact of a company’s AI portfolio, providing actionable insights without necessitating extensive AI and Life-Cycle Assessment (LCA) expertise. Results confirm that large generative AI models consume up to 4600x more energy than traditional models. Our modelling approach, which accounts for increased AI usage, hardware computing efficiency, and changes in electricity mix in line with IPCC scenarios, forecasts AI electricity use up to 2030. Under a high adoption scenario, driven by widespread adoption of Generative AI and agents together with increasingly complex models and frameworks, AI electricity use is projected to rise sharply. Limiting the environmental impact of Generative AI by 2030 requires coordinated efforts across the AI value chain. Isolated measures in hardware efficiency, model efficiency, or grid improvements alone are insufficient. We advocate for standardized environmental assessment frameworks, greater transparency from all actors of the value chain, and the introduction of a “Return on Environment” metric to align AI development with net-zero goals.
[AI-28] Relative Layer-Wise Relevance Propagation: a more Robust Neural Networks eXplaination
链接: https://arxiv.org/abs/2501.14322
作者: Eric Nyiri,Olivier Gibaru
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2012.14501 , arXiv:1605.01713 by other authors
点击查看摘要
Abstract:Machine learning methods solve a plethora of tasks very successfully, but they have the disadvantage of not providing any information about their decisions. Consequently, estimating the reasoning of the system provides additional information. For this, Layer-Wise Relevance Propagation (LRP) is one of the methods in eXplainable Machine Learning (XML). Its purpose is to provide contributions of any neural network output in the domain of its input. The main drawback of current methods stems from division by small values. To overcome this problem, we provide a new definition called Relative LRP, where the classical conservation law is satisfied up to a multiplicative factor, but without divisions by small values except for ResNet skip connections. In this article, we focus on image classification. This allows us to visualize the contributions of a pixel to the predictions of a multi-layer neural network. Pixel contributions provide a focus for further analysis on regions of potential interest. R-LRP can be applied to any dense, CNN, or residual neural network. Moreover, contrary to other LRP methods, R-LRP does not need any hyperparameters to tune. We then compare the R-LRP method on different datasets with simple CNN, VGG16, VGG19 and Resnet50 networks.
[AI-29] Permutation-based multi-objective evolutionary feature selection for high-dimensional data
链接: https://arxiv.org/abs/2501.14310
作者: Raquel Espinosa,Gracia Sánchez,José Palma,Fernando Jiménez
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Feature selection is a critical step in the analysis of high-dimensional data, where the number of features often vastly exceeds the number of samples. Effective feature selection not only improves model performance and interpretability but also reduces computational costs and mitigates the risk of overfitting. In this context, we propose a novel feature selection method for high-dimensional data, based on the well-known permutation feature importance approach, but extending it to evaluate subsets of attributes rather than individual features. This extension more effectively captures how interactions among features influence model performance. The proposed method employs a multi-objective evolutionary algorithm to search for candidate feature subsets, with the objectives of maximizing the degradation in model performance when the selected features are shuffled, and minimizing the cardinality of the feature subset. The effectiveness of our method has been validated on a set of 24 publicly available high-dimensional datasets for classification and regression tasks, and compared against 9 well-established feature selection methods designed for high-dimensional problems, including the conventional permutation feature importance method. The results demonstrate the ability of our approach in balancing accuracy and computational efficiency, providing a powerful tool for feature selection in complex, high-dimensional datasets.
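与逐个特征置换不同,该方法对整个特征子集同时置换并观察性能退化。一个最小示意如下(model 假设为已拟合的 sklearn 估计器):

```python
import numpy as np

def subset_permutation_importance(model, X, y, subset, n_repeats=5, seed=0):
    # 同时置换 subset 中的所有列,以性能下降幅度衡量该子集的联合重要性(示意)
    rng = np.random.default_rng(seed)
    base = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        Xp = X.copy()
        for j in subset:
            Xp[:, j] = rng.permutation(Xp[:, j])
        drops.append(base - model.score(Xp, y))
    return float(np.mean(drops))  # 多目标进化算法以最大化该值、最小化 |subset| 为目标
```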
[AI-30] A Zero-Shot LLM Framework for Automatic Assignment Grading in Higher Education
链接: https://arxiv.org/abs/2501.14305
作者: Calvin Yeung,Jeff Yu,King Chau Cheung,Tat Wing Wong,Chun Man Chan,Kin Chi Wong,Keisuke Fujii
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Automated grading has become an essential tool in education technology due to its ability to efficiently assess large volumes of student work, provide consistent and unbiased evaluations, and deliver immediate feedback to enhance learning. However, current systems face significant limitations, including the need for large datasets in few-shot learning methods, a lack of personalized and actionable feedback, and an overemphasis on benchmark performance rather than student experience. To address these challenges, we propose a Zero-Shot Large Language Model (LLM)-Based Automated Assignment Grading (AAG) system. This framework leverages prompt engineering to evaluate both computational and explanatory student responses without requiring additional training or fine-tuning. The AAG system delivers tailored feedback that highlights individual strengths and areas for improvement, thereby enhancing student learning outcomes. Our study demonstrates the system’s effectiveness through comprehensive evaluations, including survey responses from higher education students that indicate significant improvements in motivation, understanding, and preparedness compared to traditional grading methods. The results validate the AAG system’s potential to transform educational assessment by prioritizing learning experiences and providing scalable, high-quality feedback.
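该系统依赖提示工程而非训练或微调。下面是一个假设的评分提示模板示意(字段与格式均为假设,并非论文原文):

```python
# 假设的零样本作业评分提示模板(非论文原文)
GRADING_PROMPT = """You are a grader for a university course.
Question: {question}
Reference solution: {reference}
Student answer: {answer}
Grade the answer from 0 to 10, then list the student's strengths
and concrete areas for improvement. Respond in JSON:
{{"score": <0-10>, "strengths": "...", "improvements": "..."}}"""

def build_grading_prompt(question: str, reference: str, answer: str) -> str:
    return GRADING_PROMPT.format(question=question, reference=reference, answer=answer)
```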
[AI-31] MASTER: A Multi-Agent System with LLM Specialized MCTS NAACL2025
链接: https://arxiv.org/abs/2501.14304
作者: Bingzheng Gan,Yufan Zhao,Tianyi Zhang,Jing Huang,Yusu Li,Shu Xian Teo,Changwang Zhang,Wei Shi
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by main NAACL 2025
点击查看摘要
Abstract:Large Language Models (LLM) are increasingly being explored for problem-solving tasks. However, their strategic planning capability is often viewed with skepticism. Recent studies have incorporated the Monte Carlo Tree Search (MCTS) algorithm to augment the planning capacity of LLM. Despite its potential, MCTS relies on extensive sampling simulations to approximate the true reward distribution, leading to two primary issues. Firstly, MCTS is effective for tasks like the Game of Go, where simulation results can yield objective rewards (e.g., 1 for a win and 0 for a loss). However, for tasks such as question answering, the result of a simulation is the answer to the question, which cannot obtain an objective reward without the ground truth. Secondly, obtaining statistically significant reward estimations typically requires a sample size exceeding 30 simulations, resulting in excessive token usage and time consumption. To address these challenges, we present Multi-Agent System with Tactical Execution and Reasoning using LLM Specialized MCTS (MASTER), a novel framework that coordinates agent recruitment and communication using LLM specialized MCTS. This system autonomously adjusts the number of agents based on task complexity and ensures focused communication among them. Comprehensive experiments across various tasks demonstrate the effectiveness of our proposed framework. It achieves 76% accuracy on HotpotQA and 80% on WebShop, setting new state-of-the-art performance on these datasets.
[AI-32] Active Learning for Continual Learning: Keeping the Past Alive in the Present
链接: https://arxiv.org/abs/2501.14278
作者: Jaehyun Park,Dongmin Park,Jae-Gil Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Continual learning (CL) enables deep neural networks to adapt to ever-changing data distributions. In practice, there may be scenarios where annotation is costly, leading to active continual learning (ACL), which performs active learning (AL) for the CL scenarios when reducing the labeling cost by selecting the most informative subset is preferable. However, conventional AL strategies are not suitable for ACL, as they focus solely on learning the new knowledge, leading to catastrophic forgetting of previously learned tasks. Therefore, ACL requires a new AL strategy that can balance the prevention of catastrophic forgetting and the ability to quickly learn new tasks. In this paper, we propose AccuACL, Accumulated informativeness-based Active Continual Learning, by the novel use of the Fisher information matrix as a criterion for sample selection, derived from a theoretical analysis of the Fisher-optimality preservation properties within the framework of ACL, while also addressing the scalability issue of Fisher information-based AL. Extensive experiments demonstrate that AccuACL significantly outperforms AL baselines across various CL algorithms, improving average accuracy and forgetting by 23.8% and 17.0%, respectively.
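AccuACL 以 Fisher 信息作为选样准则。下面是经验 Fisher 对角近似的最小示意(用整批梯度的平方做粗略近似,非论文官方实现):

```python
import torch

def diag_fisher(model, loss_fn, batch):
    # 经验 Fisher 信息的对角近似:对数似然梯度的平方(示意;严格做法应逐样本累加)
    model.zero_grad()
    loss_fn(model, batch).backward()
    return {name: p.grad.detach() ** 2
            for name, p in model.named_parameters() if p.grad is not None}
```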
[AI-33] Hierarchical Time-Aware Mixture of Experts for Multi-Modal Sequential Recommendation WWW2025
链接: https://arxiv.org/abs/2501.14269
作者: Shengzhe Zhang,Liyi Chen,Dazhong Shen,Chao Wang,Hui Xiong
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted to WWW 2025
点击查看摘要
Abstract:Multi-modal sequential recommendation (SR) leverages multi-modal data to learn more comprehensive item features and user preferences than traditional SR methods, which has become a critical topic in both academia and industry. Existing methods typically focus on enhancing multi-modal information utility through adaptive modality fusion to capture the evolving of user preference from user-item interaction sequences. However, most of them overlook the interference caused by redundant interest-irrelevant information contained in rich multi-modal data. Additionally, they primarily rely on implicit temporal information based solely on chronological ordering, neglecting explicit temporal signals that could more effectively represent dynamic user interest over time. To address these limitations, we propose a Hierarchical time-aware Mixture of experts for multi-modal Sequential Recommendation (HM4SR) with a two-level Mixture of Experts (MoE) and a multi-task learning strategy. Specifically, the first MoE, named Interactive MoE, extracts essential user interest-related information from the multi-modal data of each item. Then, the second MoE, termed Temporal MoE, captures user dynamic interests by introducing explicit temporal embeddings from timestamps in modality encoding. To further address data sparsity, we propose three auxiliary supervision tasks: sequence-level category prediction (CP) for item feature understanding, contrastive learning on ID (IDCL) to align sequence context with user interests, and placeholder contrastive learning (PCL) to integrate temporal information with modalities for dynamic interest modeling. Extensive experiments on four public datasets verify the effectiveness of HM4SR compared to several state-of-the-art approaches.
[AI-34] Pre-train and Fine-tune: Recommenders as Large Models WWW2025
链接: https://arxiv.org/abs/2501.14268
作者: Zhenhao Jiang,Chenghao Chen,Hao Feng,Yu Yang,Jin Liu,Jie Zhang,Jia Jia,Ning Hu
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: Accepted by WWW2025
点击查看摘要
Abstract:In reality, users have different interests in different periods, regions, scenes, etc. Such changes in interest are so drastic that they are difficult for recommenders to capture. Existing multi-domain learning can alleviate this problem. However, the structure of an industrial recommendation system is complex, the amount of data is huge, and the training cost is extremely high, so it is difficult to modify the structure of the industrial recommender and re-train it. To fill this gap, we consider recommenders as large pre-trained models and fine-tune them. We first propose the theory of the information bottleneck for fine-tuning and present an explanation for the fine-tuning technique in recommenders. To tailor the approach for recommendation, we design an information-aware adaptive kernel (IAK) technique to fine-tune the pre-trained recommender. Specifically, we define fine-tuning as two phases, knowledge compression and knowledge matching, and let the training stage of IAK explicitly approximate these two phases. Our proposed approach, designed from the essence of fine-tuning, is well interpretable. Extensive online and offline experiments show the superiority of our proposed method. Besides, we also share unique and important lessons we learned when deploying the method on a large-scale online platform. We also present the potential issues of fine-tuning techniques in recommendation systems and the corresponding solutions. The recommender with the IAK technique has been deployed on the homepage of a billion-scale online food platform for several months and has yielded considerable profits in our business.
[AI-35] Top Ten Challenges Towards Agentic Neural Graph Databases
链接: https://arxiv.org/abs/2501.14224
作者: Jiaxin Bai,Zihao Wang,Yukun Zhou,Hang Yin,Weizhi Fei,Qi Hu,Zheye Deng,Jiayang Cheng,Tianshi Zheng,Hong Ting Tsang,Yisen Gao,Zhongwei Xie,Yufei Li,Lixin Fan,Binhang Yuan,Wei Wang,Lei Chen,Xiaofang Zhou,Yangqiu Song
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
*备注: 12 Pages
点击查看摘要
Abstract:Graph databases (GDBs) like Neo4j and TigerGraph excel at handling interconnected data but lack advanced inference capabilities. Neural Graph Databases (NGDBs) address this by integrating Graph Neural Networks (GNNs) for predictive analysis and reasoning over incomplete or noisy data. However, NGDBs rely on predefined queries and lack autonomy and adaptability. This paper introduces Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs with three core functionalities: autonomous query construction, neural query execution, and continuous learning. We identify ten key challenges in realizing Agentic NGDBs, including semantic unit representation, abductive reasoning, scalable query execution, and integration with foundation models like large language models (LLMs). By addressing these challenges, Agentic NGDBs can enable intelligent, self-improving systems for modern data-driven applications, paving the way for adaptable and autonomous data management solutions.
[AI-36] TFG-Flow: Training-free Guidance in Multimodal Generative Flow
链接: https://arxiv.org/abs/2501.14216
作者: Haowei Lin,Shanda Li,Haotian Ye,Yiming Yang,Stefano Ermon,Yitao Liang,Jianzhu Ma
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:
点击查看摘要
Abstract:Given an unconditional generative model and a predictor for a target property (e.g., a classifier), the goal of training-free guidance is to generate samples with desirable target properties without additional training. As a highly efficient technique for steering generative models toward flexible outcomes, training-free guidance has gained increasing attention in diffusion models. However, existing methods only handle data in continuous spaces, while many scientific applications involve both continuous and discrete data (referred to as multimodality). Another emerging trend is the growing use of the simple and general flow matching framework in building generative foundation models, where guided generation remains under-explored. To address this, we introduce TFG-Flow, a novel training-free guidance method for multimodal generative flow. TFG-Flow addresses the curse-of-dimensionality while maintaining the property of unbiased sampling in guiding discrete variables. We validate TFG-Flow on four molecular design tasks and show that TFG-Flow has great potential in drug design by generating molecules with desired properties.
[AI-37] Coordinating Ride-Pooling with Public Transit using Reward-Guided Conservative Q-Learning: An Offline Training and Online Fine-Tuning Reinforcement Learning Framework
链接: https://arxiv.org/abs/2501.14199
作者: Yulong Hu,Tingting Dong,Sen Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:
点击查看摘要
Abstract:This paper introduces a novel reinforcement learning (RL) framework, termed Reward-Guided Conservative Q-learning (RG-CQL), to enhance coordination between ride-pooling and public transit within a multimodal transportation network. We model each ride-pooling vehicle as an agent governed by a Markov Decision Process (MDP) and propose an offline training and online fine-tuning RL framework to learn the optimal operational decisions of the multimodal transportation systems, including rider-vehicle matching, selection of drop-off locations for passengers, and vehicle routing decisions, with improved data efficiency. During the offline training phase, we develop a Conservative Double Deep Q Network (CDDQN) as the action executor and a supervised learning-based reward estimator, termed the Guider Network, to extract valuable insights into action-reward relationships from data batches. In the online fine-tuning phase, the Guider Network serves as an exploration guide, aiding CDDQN in effectively and conservatively exploring unknown state-action pairs. The efficacy of our algorithm is demonstrated through a realistic case study using real-world data from Manhattan. We show that integrating ride-pooling with public transit outperforms two benchmark cases, solo rides coordinated with transit and ride-pooling without transit coordination, by 17% and 22% in the achieved system rewards, respectively. Furthermore, our innovative offline training and online fine-tuning framework offers a remarkable 81.3% improvement in data efficiency compared to traditional online RL methods with adequate exploration budgets, with a 4.3% increase in total rewards and a 5.6% reduction in overestimation errors. Experimental results further demonstrate that RG-CQL effectively addresses the challenges of transitioning from offline to online RL in large-scale ride-pooling systems integrated with transit.
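RG-CQL 构建在保守 Q 学习(CQL)之上,其保守正则项的简化示意如下(离散动作空间;非论文官方实现):

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, states, actions, target_q, alpha=1.0):
    # TD 误差 + 保守项:用 logsumexp 压低分布外动作被高估的 Q 值(示意)
    q_all = q_net(states)                                     # [B, A]
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    td = F.mse_loss(q_taken, target_q)
    conservative = (torch.logsumexp(q_all, dim=1) - q_taken).mean()
    return td + alpha * conservative
```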
[AI-38] Distributed Multi-Agent Coordination Using Multi-Modal Foundation Models
链接: https://arxiv.org/abs/2501.14189
作者: Saaduddin Mahmud,Dorian Benhamou Goldfajn,Shlomo Zilberstein
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Distributed Constraint Optimization Problems (DCOPs) offer a powerful framework for multi-agent coordination but often rely on labor-intensive, manual problem construction. To address this, we introduce VL-DCOPs, a framework that takes advantage of large multimodal foundation models (LFMs) to automatically generate constraints from both visual and linguistic instructions. We then introduce a spectrum of agent archetypes for solving VL-DCOPs: from a neuro-symbolic agent that delegates some of the algorithmic decisions to an LFM, to a fully neural agent that depends entirely on an LFM for coordination. We evaluate these agent archetypes using state-of-the-art LLMs (large language models) and VLMs (vision language models) on three novel VL-DCOP tasks and compare their respective advantages and drawbacks. Lastly, we discuss how this work extends to broader frontier challenges in the DCOP literature.
[AI-39] VarDrop: Enhancing Training Efficiency by Reducing Variate Redundancy in Periodic Time Series Forecasting
链接: https://arxiv.org/abs/2501.14183
作者: Junhyeok Kang,Yooju Shin,Jae-Gil Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Variate tokenization, which independently embeds each variate as separate tokens, has achieved remarkable improvements in multivariate time series forecasting. However, employing self-attention with variate tokens incurs a quadratic computational cost with respect to the number of variates, thus limiting its training efficiency for large-scale applications. To address this issue, we propose VarDrop, a simple yet efficient strategy that reduces the token usage by omitting redundant variate tokens during training. VarDrop adaptively excludes redundant tokens within a given batch, thereby reducing the number of tokens used for dot-product attention while preserving essential information. Specifically, we introduce k-dominant frequency hashing (k-DFH), which utilizes the ranked dominant frequencies in the frequency domain as a hash value to efficiently group variate tokens exhibiting similar periodic behaviors. Then, only representative tokens in each group are sampled through stratified sampling. By performing sparse attention with these selected tokens, the computational cost of scaled dot-product attention is significantly alleviated. Experiments conducted on public benchmark datasets demonstrate that VarDrop outperforms existing efficient baselines.
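k-DFH 的思路是把每个变量幅值最大的 k 个频率(按序)作为哈希键,将周期行为相近的变量归入同组。最小示意如下(分组后组内再做分层抽样,抽样细节为假设):

```python
import numpy as np
from collections import defaultdict

def k_dfh_groups(series: np.ndarray, k: int = 3):
    # series: [V, T],V 个变量的时间序列;返回 哈希键 -> 变量下标列表
    spec = np.abs(np.fft.rfft(series, axis=-1))         # 幅度谱 [V, F]
    topk = np.argsort(spec, axis=-1)[:, -k:][:, ::-1]   # 每个变量的前 k 个主频(降序)
    groups = defaultdict(list)
    for v, key in enumerate(map(tuple, topk)):
        groups[key].append(v)
    return groups  # 每组仅抽取少量代表性变量参与注意力计算
```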
[AI-40] RL + Transformer = A General-Purpose Problem Solver
链接: https://arxiv.org/abs/2501.14176
作者: Micah Rentschler,Jesse Roberts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:What if artificial intelligence could not only solve problems for which it was trained but also learn to teach itself to solve new problems (i.e., meta-learn)? In this study, we demonstrate that a pre-trained transformer fine-tuned with reinforcement learning over multiple episodes develops the ability to solve problems that it has never encountered before - an emergent ability called In-Context Reinforcement Learning (ICRL). This powerful meta-learner not only excels in solving unseen in-distribution environments with remarkable sample efficiency, but also shows strong performance in out-of-distribution environments. In addition, we show that it exhibits robustness to the quality of its training data, seamlessly stitches together behaviors from its context, and adapts to non-stationary environments. These behaviors demonstrate that an RL-trained transformer can iteratively improve upon its own solutions, making it an excellent general-purpose problem solver.
[AI-41] LoCoML: A Framework for Real-World ML Inference Pipelines ICSE
链接: https://arxiv.org/abs/2501.14165
作者: Kritin Maddireddy,Santhosh Kotekal Methukula,Chandrasekar Sridhar,Karthik Vaidhyanathan
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: The paper has been accepted for presentation at the 4th International Conference on AI Engineering (CAIN) 2025 co-located with 47th IEEE/ACM International Conference on Software Engineering (ICSE) 2025
点击查看摘要
Abstract:The widespread adoption of machine learning (ML) has brought forth diverse models with varying architectures and data requirements, introducing new challenges in integrating these systems into real-world applications. Traditional solutions often struggle to manage the complexities of connecting heterogeneous models, especially when dealing with varied technical specifications. These limitations are amplified in large-scale, collaborative projects where stakeholders contribute models with different technical specifications. To address these challenges, we developed LoCoML, a low-code framework designed to simplify the integration of diverse ML models within the context of the Bhashini Project - a large-scale initiative aimed at integrating AI-driven language technologies such as automatic speech recognition, machine translation, text-to-speech, and optical character recognition to support seamless communication across more than 20 languages. Initial evaluations show that LoCoML adds only a small amount of computational load, making it efficient and effective for large-scale ML integration. Our practical insights show that a low-code approach can be a practical solution for connecting multiple ML models in a collaborative environment.
[AI-42] The Role of Generative AI in Software Student CollaborAItion
链接: https://arxiv.org/abs/2501.14084
作者: Natalie Kiesler,Jacqueline Smith,Juho Leinonen,Armando Fox,Stephen MacNeil,Petri Ihantola
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: 7 pages, 1 figure
点击查看摘要
Abstract:Collaboration is a crucial part of computing education. The increase in AI capabilities over the last couple of years is bound to profoundly affect all aspects of systems and software engineering, including collaboration. In this position paper, we consider a scenario where AI agents would be able to take on any role in collaborative processes in computing education. We outline these roles, the activities and group dynamics that software development currently include, and discuss if and in what way AI could facilitate these roles and activities. The goal of our work is to envision and critically examine potential futures. We present scenarios suggesting how AI can be integrated into existing collaborations. These are contrasted by design fictions that help demonstrate the new possibilities and challenges for computing education in the AI era.
[AI-43] GraphRAG under Fire
链接: https://arxiv.org/abs/2501.14050
作者: Jiacheng Liang,Yuhui Wang,Changjiang Li,Rongyi Zhu,Tanqiu Jiang,Neil Gong,Ting Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: 13 pages
点击查看摘要
Abstract:GraphRAG advances retrieval-augmented generation (RAG) by structuring external knowledge as multi-scale knowledge graphs, enabling language models to integrate both broad context and granular details in their reasoning. While GraphRAG has demonstrated success across domains, its security implications remain largely unexplored. To bridge this gap, this work examines GraphRAG’s vulnerability to poisoning attacks, uncovering an intriguing security paradox: compared to conventional RAG, GraphRAG’s graph-based indexing and retrieval enhance resilience against simple poisoning attacks; meanwhile, the same features also create new attack surfaces. We present GRAGPoison, a novel attack that exploits shared relations in the knowledge graph to craft poisoning text capable of compromising multiple queries simultaneously. GRAGPoison employs three key strategies: i) relation injection to introduce false knowledge, ii) relation enhancement to amplify poisoning influence, and iii) narrative generation to embed malicious content within coherent text. Empirical evaluation across diverse datasets and models shows that GRAGPoison substantially outperforms existing attacks in terms of effectiveness (up to 98% success rate) and scalability (using less than 68% poisoning text). We also explore potential defensive measures and their limitations, identifying promising directions for future research.
[AI-44] Human-Alignment Influences the Utility of AI-assisted Decision Making
链接: https://arxiv.org/abs/2501.14035
作者: Nina L. Corvelo Benz,Manuel Gomez Rodriguez
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Whenever an AI model is used to predict a relevant (binary) outcome in AI-assisted decision making, it is widely agreed that, together with each prediction, the model should provide an AI confidence value. However, it has been unclear why decision makers often have difficulty developing a good sense of when to trust a prediction using AI confidence values. Very recently, Corvelo Benz and Gomez Rodriguez have argued that, for rational decision makers, the utility of AI-assisted decision making is inherently bounded by the degree of alignment between the AI confidence values and the decision maker’s confidence on their own predictions. In this work, we empirically investigate to what extent the degree of alignment actually influences the utility of AI-assisted decision making. To this end, we design and run a large-scale human subject study (n=703) where participants solve a simple decision making task - an online card game - assisted by an AI model with a steerable degree of alignment. Our results show a positive association between the degree of alignment and the utility of AI-assisted decision making. In addition, our results also show that post-processing the AI confidence values to achieve multicalibration with respect to the participants’ confidence on their own predictions increases both the degree of alignment and the utility of AI-assisted decision making.
[AI-45] Transfer Learning of Surrogate Models via Domain Affine Transformation Across Synthetic and Real-World Benchmarks
链接: https://arxiv.org/abs/2501.14012
作者: Shuaiqun Pan,Diederick Vermetten,Manuel López-Ibáñez,Thomas Bäck,Hao Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Surrogate models are frequently employed as efficient substitutes for the costly execution of real-world processes. However, constructing a high-quality surrogate model often demands extensive data acquisition. A solution to this issue is to transfer pre-trained surrogate models for new tasks, provided that certain invariances exist between tasks. This study focuses on transferring non-differentiable surrogate models (e.g., random forest) from a source function to a target function, where we assume their domains are related by an unknown affine transformation, using only a limited amount of transfer data points evaluated on the target. Previous research attempts to tackle this challenge for differentiable models, e.g., Gaussian process regression, which minimizes the empirical loss on the transfer data by tuning the affine transformations. In this paper, we extend the previous work to the random forest model and assess its effectiveness on a widely-used artificial problem set - Black-Box Optimization Benchmark (BBOB) testbed, and on four real-world transfer learning problems. The results highlight the significant practical advantages of the proposed method, particularly in reducing both the data requirements and computational costs of training surrogate models for complex real-world scenarios.
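由于随机森林等模型不可微,仿射变换只能用无导数优化器直接搜索。最小示意如下(surrogate 假设为已在源函数上训练好的模型):

```python
import numpy as np
from scipy.optimize import minimize

def fit_affine_transfer(surrogate, X_t, y_t, dim):
    # 搜索仿射变换 x -> A x + b,使源代理模型在目标域迁移点上的误差最小(示意)
    def loss(theta):
        A = theta[:dim * dim].reshape(dim, dim)
        b = theta[dim * dim:]
        pred = surrogate.predict(X_t @ A.T + b)
        return float(np.mean((pred - y_t) ** 2))
    theta0 = np.concatenate([np.eye(dim).ravel(), np.zeros(dim)])
    return minimize(loss, theta0, method="Nelder-Mead")  # 无导数方法,适配不可微模型
```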
[AI-46] Scalable and Explainable Verification of Image-based Neural Network Controllers for Autonomous Vehicles
链接: https://arxiv.org/abs/2501.14009
作者: Aditya Parameshwaran,Yue Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 11 pages, 5 figures
点击查看摘要
Abstract:Existing formal verification methods for image-based neural network controllers in autonomous vehicles often struggle with high-dimensional inputs, computational inefficiency, and a lack of explainability. These challenges make it difficult to ensure safety and reliability, as processing high-dimensional image data is computationally intensive and neural networks are typically treated as black boxes. To address these issues, we propose SEVIN (Scalable and Explainable Verification of Image-Based Neural Network Controllers), a framework that leverages a Variational Autoencoder (VAE) to encode high-dimensional images into a lower-dimensional, explainable latent space. By annotating latent variables with corresponding control actions, we generate convex polytopes that serve as structured input spaces for verification, significantly reducing computational complexity and enhancing scalability. Integrating the VAE’s decoder with the neural network controller allows for formal and robustness verification using these explainable polytopes. Our approach also incorporates robustness verification under real-world perturbations by augmenting the dataset and retraining the VAE to capture environmental variations. Experimental results demonstrate that SEVIN achieves efficient and scalable verification while providing explainable insights into controller behavior, bridging the gap between formal verification techniques and practical applications in safety-critical systems.
[AI-47] Asymmetrical Latent Representation for Individual Treatment Effect Modeling
链接: https://arxiv.org/abs/2501.14006
作者: Armand Lacombe,Michèle Sebag
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Conditional Average Treatment Effect (CATE) estimation, at the heart of counterfactual reasoning, is a crucial challenge for causal modeling both theoretically and in applications, in domains such as healthcare, sociology, or advertising. Borrowing domain adaptation principles, a popular design maps the sample representation to a latent space that balances control and treated populations while enabling the prediction of the potential outcomes. This paper presents a new CATE estimation approach based on the asymmetrical search for two latent spaces, called Asymmetrical Latent Representation for Individual Treatment Effect (ALRITE), where the two latent spaces are respectively intended to optimize the counterfactual prediction accuracy on the control and the treated samples. Under moderate assumptions, ALRITE admits an upper bound on the precision of the estimation of heterogeneous effects (PEHE), and the approach is empirically validated with success against the state-of-the-art.
[AI-48] Local Control Networks (LCNs): Optimizing Flexibility in Neural Network Data Pattern Capture
链接: https://arxiv.org/abs/2501.14000
作者: Hy Nguyen,Duy Khoa Pham,Srikanth Thudumu,Hung Du,Rajesh Vasa,Kon Mouzakis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The widespread use of Multi-layer perceptrons (MLPs) often relies on a fixed activation function (e.g., ReLU, Sigmoid, Tanh) for all nodes within the hidden layers. While effective in many scenarios, this uniformity may limit the network’s ability to capture complex data patterns. We argue that employing the same activation function at every node is suboptimal and propose leveraging different activation functions at each node to increase flexibility and adaptability. To achieve this, we introduce Local Control Networks (LCNs), which leverage B-spline functions to enable distinct activation curves at each node. Our mathematical analysis demonstrates the properties and benefits of LCNs over conventional MLPs. In addition, we demonstrate that more complex architectures, such as Kolmogorov-Arnold Networks (KANs), are unnecessary in certain scenarios, and LCNs can be a more efficient alternative. Empirical experiments on various benchmarks and datasets validate our theoretical findings. In computer vision tasks, LCNs achieve marginal improvements over MLPs and outperform KANs by approximately 5%, while also being more computationally efficient than KANs. In basic machine learning tasks, LCNs show a 1% improvement over MLPs and a 0.6% improvement over KANs. For symbolic formula representation tasks, LCNs perform on par with KANs, with both architectures outperforming MLPs. Our findings suggest that diverse activations at the node level can lead to improved performance and efficiency.
[AI-49] Predictive Learning in Energy-based Models with Attractor Structures
链接: https://arxiv.org/abs/2501.13997
作者: Xingsi Dong,Pengxiang Yuan,Si Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Predictive models are highly advanced in understanding the mechanisms of brain function. Recent advances in machine learning further underscore the power of prediction for optimal representation in learning. However, there remains a gap in creating a biologically plausible model that explains how the neural system achieves prediction. In this paper, we introduce a framework that employs an energy-based model (EBM) to capture the nuanced processes of predicting observation after action within the neural system, encompassing prediction, learning, and inference. We implement the EBM with a hierarchical structure and integrate a continuous attractor neural network for memory, constructing a biologically plausible model. In experimental evaluations, our model demonstrates efficacy across diverse scenarios. The range of actions includes eye movement, motion in environments, head turning, and static observation while the environment changes. Our model not only makes accurate predictions for environments it was trained on, but also provides reasonable predictions for unseen environments, matching the performances of machine learning methods in multiple tasks. We hope that this study contributes to a deep understanding of how the neural system performs prediction.
[AI-50] Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization
链接: https://arxiv.org/abs/2501.13992
作者: Hy Nguyen,Nguyen Hung Nguyen,Nguyen Linh Bao Nguyen,Srikanth Thudumu,Hung Du,Rajesh Vasa,Kon Mouzakis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The Hierarchical Navigable Small World (HNSW) algorithm is widely used for approximate nearest neighbor (ANN) search, leveraging the principles of navigable small-world graphs. However, it faces some limitations. The first is the local optima problem, which arises from the algorithm’s greedy search strategy, selecting neighbors based solely on proximity at each step. This often leads to cluster disconnections. The second limitation is that HNSW frequently fails to achieve logarithmic complexity, particularly in high-dimensional datasets, due to the exhaustive traversal through each layer. To address these limitations, we propose a novel algorithm that mitigates local optima and cluster disconnections while improving construction speed and maintaining inference speed. The first component is a dual-branch HNSW structure with LID-based insertion mechanisms, enabling traversal from multiple directions. This improves outlier node capture, enhances cluster connectivity, accelerates construction, and reduces the risk of local minima. The second component incorporates a bridge-building technique that bypasses redundant intermediate layers, maintaining inference speed while offsetting the additional computational overhead introduced by the dual-branch structure. Experiments on various benchmarks and datasets showed that our algorithm outperforms the original HNSW in both accuracy and speed. We evaluated six datasets across Computer Vision (CV) and Natural Language Processing (NLP), showing recall improvements of 18% in NLP and up to 30% in CV tasks, while reducing construction time by up to 20% and maintaining inference speed. We did not observe any trade-offs in our algorithm. Ablation studies revealed that LID-based insertion had the greatest impact on performance, followed by the dual-branch structure and bridge-building components.
[AI-51] FreEformer: Frequency Enhanced Transformer for Multivariate Time Series Forecasting
链接: https://arxiv.org/abs/2501.13989
作者: Wenzhen Yue,Yong Liu,Xianghua Ying,Bowei Xing,Ruohao Guo,Ji Shi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper presents FreEformer, a simple yet effective model that leverages a Frequency Enhanced Transformer for multivariate time series forecasting. Our work is based on the assumption that the frequency spectrum provides a global perspective on the composition of series across various frequencies and is highly suitable for robust representation learning. Specifically, we first convert time series into the complex frequency domain using the Discrete Fourier Transform (DFT). The Transformer architecture is then applied to the frequency spectra to capture cross-variate dependencies, with the real and imaginary parts processed independently. However, we observe that the vanilla attention matrix exhibits a low-rank characteristic, thus limiting representation diversity. This could be attributed to the inherent sparsity of the frequency domain and the strong-value-focused nature of Softmax in vanilla attention. To address this, we enhance the vanilla attention mechanism by introducing an additional learnable matrix to the original attention matrix, followed by row-wise L1 normalization. Theoretical analysis demonstrates that this enhanced attention mechanism improves both feature diversity and gradient flow. Extensive experiments demonstrate that FreEformer consistently outperforms state-of-the-art models on eighteen real-world benchmarks covering electricity, traffic, weather, healthcare and finance. Notably, the enhanced attention mechanism also consistently improves the performance of state-of-the-art Transformer-based forecasters.
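其增强注意力只有两步:在 Softmax 注意力矩阵上叠加一个可学习矩阵,再做逐行 L1 归一化。最小示意如下(矩阵形状与初始化为假设):

```python
import torch

def enhanced_attention(q, k, v, learnable_m):
    # learnable_m: [N, N] 可学习参数,缓解注意力矩阵的低秩问题(示意)
    scale = q.shape[-1] ** 0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) / scale, dim=-1)
    attn = attn + learnable_m
    attn = attn / attn.abs().sum(dim=-1, keepdim=True)   # 逐行 L1 归一化
    return attn @ v
```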
[AI-52] OstQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting
链接: https://arxiv.org/abs/2501.13987
作者: Xing Hu,Yuan Cheng,Dawei Yang,Zukang Xu,Zhihang Yuan,Jiangyong Yu,Chen Xu,Zhe Jiang,Sifan Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 Pages
点击查看摘要
Abstract:Post-training quantization (PTQ) has emerged as a widely adopted technique for compressing and accelerating Large Language Models (LLMs). The major challenge in LLM quantization is that uneven and heavy-tailed data distributions can expand the quantization range, thereby reducing bit precision for most values. Recent methods attempt to eliminate outliers and balance inter-channel differences by employing linear transformations; however, they remain heuristic and often overlook optimizing the data distribution across the entire quantization space. In this paper, we introduce Quantization Space Utilization Rate (QSUR), a novel metric that effectively assesses the quantizability of transformed data by measuring the space utilization of the data in the quantization space. We complement QSUR with mathematical derivations that examine the effects and limitations of various transformations, guiding our development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs a learnable equivalent transformation, consisting of an orthogonal transformation and a scaling transformation, to optimize the distributions of weights and activations across the entire quantization space. Furthermore, we propose the KL-Top loss function, designed to mitigate noise during optimization while retaining richer semantic information within the limited calibration data imposed by PTQ. OSTQuant outperforms existing work on various LLMs and benchmarks. In the W4-only setting, it retains 99.5% of the floating-point accuracy. In the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32% on the LLaMA-3-8B model compared to state-of-the-art methods. The code is available at this https URL.
[AI-53] An Efficient Sparse Kernel Generator for O(3)-Equivariant Deep Networks
链接: https://arxiv.org/abs/2501.13986
作者: Vivek Bharadwaj,Austin Scott Glover,Aydin Buluc,James Demmel
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 9 figures, 3 tables
点击查看摘要
Abstract:Rotation equivariant graph neural networks, i.e., networks designed to guarantee certain geometric relations between their inputs and outputs, yield state-of-the-art performance on spatial deep learning tasks. They exhibit high data efficiency during training and significantly reduced inference time for interatomic potential calculations compared to classical approaches. Key to these models is the Clebsch-Gordon (CG) tensor product, a kernel that contracts two dense feature vectors with a highly structured sparse tensor to produce a dense output vector. The operation, which may be repeated millions of times for typical equivariant models, is a costly and inefficient bottleneck. We introduce a GPU sparse kernel generator for the CG tensor product that provides significant speedup over the best existing open and closed-source implementations. Our implementation achieves high performance by carefully managing GPU shared memory through static analysis at model compile-time, minimizing reads and writes to global memory. We break the tensor product into a series of kernels with operands that fit entirely into registers, enabling us to emit long arithmetic instruction streams that maximize instruction-level parallelism. By fusing the CG tensor product with a subsequent graph convolution, we reduce both intermediate storage and global memory traffic over naive approaches that duplicate input data. We also provide optimized kernels for the gradient of the CG tensor product and a novel identity for the higher partial derivatives required to predict interatomic forces. Our fused kernels offer up to 4.5x speedup for the forward pass and 3x for the backward pass over NVIDIA cuEquivariance, as well as 10x speedup over the widely-used e3nn package. We offer up to 5.3x inference-time speedup for the MACE chemistry foundation model over the original unoptimized version.
[AI-54] ZKLoRA: Efficient Zero-Knowledge Proofs for LoRA Verification
链接: https://arxiv.org/abs/2501.13965
作者: Bidhan Roy,Peter Potash,Marcos Villagra
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 7 pages, 3 figures
点击查看摘要
Abstract:Low-Rank Adaptation (LoRA) is a widely adopted method for customizing large-scale language models. In distributed, untrusted training environments, an open source base model user may want to use LoRA weights created by an external contributor, leading to two requirements: (1) the base model user must confirm that the LoRA weights are effective when paired with the intended base model, and (2) the LoRA contributor must keep their proprietary weights private until compensation is assured. We present ZKLoRA, a zero-knowledge verification protocol that relies on succinct proofs and our novel Multi-Party Inference procedure to verify LoRA-base model compatibility without exposing LoRA weights. ZKLoRA produces deterministic correctness guarantees and validates each LoRA module in only 1-2 seconds on state-of-the-art large language models. This low-latency approach enables nearly real-time verification and promotes secure collaboration among geographically decentralized teams and contract-based training pipelines. The protocol ensures that the delivered LoRA module works as claimed, safeguarding the contributor’s intellectual property while providing the base model user with verification of compatibility and lineage.
[AI-55] Adaptive Cyber-Attack Detection in IIoT Using Attention-Based LSTM-CNN Models
链接: https://arxiv.org/abs/2501.13962
作者: Afrah Gueriani,Hamza Kheddar,Ahmed Cherif Mazari
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:The rapid expansion of the industrial Internet of things (IIoT) has introduced new challenges in securing critical infrastructures against sophisticated cyberthreats. This study presents the development and evaluation of an advanced intrusion detection system (IDS) based on a hybrid LSTM-convolutional neural network (CNN)-Attention architecture, specifically designed to detect and classify cyberattacks in IIoT environments. The research focuses on two key classification tasks: binary and multi-class classification. The proposed model was rigorously tested using the Edge-IIoTset dataset. To mitigate the class imbalance in the dataset, the synthetic minority over-sampling technique (SMOTE) was employed to generate synthetic samples for the underrepresented classes. This ensured that the model could learn effectively from all classes, thereby improving the overall classification performance. Through systematic experimentation, various deep learning (DL) models were compared, ultimately demonstrating that the LSTM-CNN-Attention model consistently outperformed the others across key performance metrics. In binary classification, the model achieved near-perfect accuracy, while in multi-class classification it maintained a high accuracy level (99.04%), effectively categorizing different attack types with a loss value of 0.0220%.
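其中 SMOTE 过采样可用 imbalanced-learn 直接完成,最小示意如下(数据为人工构造):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# 构造一个类别不平衡的二分类数据集,并对少数类做 SMOTE 过采样(示意)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))  # 过采样后各类样本数趋于平衡
```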
[AI-56] Prompt-Based Monte Carlo Tree Search for Mitigating Hallucinations in Large Models
Link: https://arxiv.org/abs/2501.13942
Authors: Zhihua Duan, Jialin Wang
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:With the rapid development of large models in the field of artificial intelligence, enhancing their ability to handle complex problems in scientific research remains a challenging open problem. This study proposes an improved prompt-based Monte Carlo Tree Search (MCTS) method. In the simulation and search stage, it introduces dynamic adjustment of exploration parameters and adaptive selection strategies, which better balance exploration and exploitation and thereby reduce hallucination. Taking four subsets of the SciEval dataset as test objects, this paper compares the Glm-4-flash + Improved MCTS method with several existing models. The results show that the Improved MCTS method performs better, providing new ideas and methods for the application of large models in scientific research.
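The "dynamic adjustment of exploration parameters" can be pictured as a UCT selection rule whose exploration constant decays as the search matures; a minimal sketch follows, where the decay schedule and the node interface (children, visits, value) are illustrative assumptions rather than the authors' design.

```python
import math

def uct_select(node, sim_count, c0=1.4, decay=1e-3):
    # Exploration weight shrinks as the number of simulations grows.
    c = c0 / (1.0 + decay * sim_count)
    def score(child):
        if child.visits == 0:
            return float("inf")   # always try unvisited children first
        exploit = child.value / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=score)
```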
[AI-57] On the Transfer of Knowledge in Quantum Algorithms
Link: https://arxiv.org/abs/2501.14120
Authors: Esther Villar-Rodriguez, Eneko Osaba, Izaskun Oregi, Sebastián V. Romero, Julián Ferreiro-Vélez
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures, 4 tables. Paper submitted for review in Expert Systems journal
Abstract:The field of quantum computing is generating significant anticipation within the scientific and industrial communities due to its potential to revolutionize computing paradigms. Recognizing this potential, this paper explores the integration of transfer of knowledge techniques, traditionally used in classical artificial intelligence, into quantum computing. We present a comprehensive classification of the transfer models, focusing on Transfer Learning and Transfer Optimization. Additionally, we analyze relevant schemes in quantum computing that can benefit from knowledge sharing, and we delve into the potential synergies, supported by theoretical insights and initial experimental results. Our findings suggest that leveraging the transfer of knowledge can enhance the efficiency and effectiveness of quantum algorithms, particularly in the context of hybrid solvers. This approach not only accelerates the optimization process but also reduces the computational burden on quantum processors, making it a valuable tool for advancing quantum computing technologies.
[AI-58] Adaptive Genetic Algorithms for Pulse-Level Quantum Error Mitigation
Link: https://arxiv.org/abs/2501.14007
Authors: William Aguilar-Calvo, Santiago Núñez-Corrales
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 21 pages, 11 figures
Abstract:Noise remains a fundamental challenge in quantum computing, significantly affecting pulse fidelity and overall circuit performance. This paper introduces an adaptive algorithm for pulse-level quantum error mitigation, designed to enhance fidelity by dynamically responding to noise conditions without modifying circuit gates. By targeting pulse parameters directly, this method reduces the impact of various noise sources, improving algorithm resilience in quantum circuits. We show the latter by applying our protocol to Grover’s and Deutsch-Jozsa algorithms. Experimental results show that this pulse-level strategy provides a flexible and efficient solution for increasing fidelity during the noisy execution of quantum circuits. Our work contributes to advancements in error mitigation techniques, essential for robust quantum computing.
[AI-59] PaMMA-Net: Plasmas magnetic measurement evolution based on data-driven incremental accumulative prediction
Link: https://arxiv.org/abs/2501.14003
Authors: Yunfei Ling, Zijie Liu, Jun Du, Yao Huang, Yuehang Wang, Bingjia Xiao, Xin Fang
Subjects: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI)
Comments: 20 pages, 8 figures
Abstract:An accurate evolution model is crucial for effective control and in-depth study of fusion plasmas. Evolution methods based on physical models often encounter challenges such as insufficient robustness or excessive computational costs. Given the proven strong fitting capabilities of deep learning methods across various fields, including plasma research, this paper introduces a deep learning-based magnetic measurement evolution method named PaMMA-Net (Plasma Magnetic Measurements Incremental Accumulative Prediction Network). This network is capable of evolving magnetic measurements in tokamak discharge experiments over extended periods or, in conjunction with equilibrium reconstruction algorithms, evolving macroscopic parameters such as plasma shape. Leveraging an incremental prediction approach and data augmentation techniques tailored for magnetic measurements, PaMMA-Net achieves superior evolution results compared to existing studies. Tests conducted on real experimental data from EAST validate the high generalization capability of the proposed method.
Machine Learning
[LG-0] MLPs at the EOC: Concentration of the NTK
Link: https://arxiv.org/abs/2501.14724
Authors: Dávid Terjék, Diego González-Sánchez
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 36 pages, 1 figure
Abstract:We study the concentration of the Neural Tangent Kernel (NTK) $K_\theta : \mathbb{R}^{m_0} \times \mathbb{R}^{m_0} \to \mathbb{R}^{m_l \times m_l}$ of $l$-layer Multilayer Perceptrons (MLPs) $N : \mathbb{R}^{m_0} \times \Theta \to \mathbb{R}^{m_l}$ equipped with activation functions $\phi(s) = a s + b \vert s \vert$ for some $a, b \in \mathbb{R}$, with the parameter $\theta \in \Theta$ being initialized at the Edge Of Chaos (EOC). Without relying on the gradient independence assumption, which has only been shown to hold asymptotically in the infinitely wide limit, we prove that an approximate version of gradient independence holds at finite width. Showing that the NTK entries $K_\theta(x_{i_1}, x_{i_2})$ for $i_1, i_2 \in [1:n]$ over a dataset $\{x_1, \cdots, x_n\} \subset \mathbb{R}^{m_0}$ concentrate simultaneously via maximal inequalities, we prove that the NTK matrix $K(\theta) = [\frac{1}{n} K_\theta(x_{i_1}, x_{i_2}) : i_1, i_2 \in [1:n]] \in \mathbb{R}^{n m_l \times n m_l}$ concentrates around its infinitely wide limit $\overset{\scriptscriptstyle\infty}{K} \in \mathbb{R}^{n m_l \times n m_l}$ without the need for linear overparameterization. Our results imply that in order to accurately approximate the limit, hidden layer widths have to grow quadratically as $m_k = k^2 m$ for some $m \in \mathbb{N}+1$ for sufficient concentration. For such MLPs, we obtain the concentration bound $\mathbb{P}(\Vert K(\theta) - \overset{\scriptscriptstyle\infty}{K} \Vert \leq O((\Delta_\phi^{-2} + m_l^{\frac{1}{2}} l) \kappa_\phi^2 m^{-\frac{1}{2}})) \geq 1 - O(m^{-1})$ modulo logarithmic terms, where we denote $\Delta_\phi = \frac{b^2}{a^2+b^2}$ and $\kappa_\phi = \frac{\vert a \vert + \vert b \vert}{\sqrt{a^2 + b^2}}$. This reveals in particular that the absolute value ($\Delta_\phi = 1$, $\kappa_\phi = 1$) beats the ReLU ($\Delta_\phi = \frac{1}{2}$, $\kappa_\phi = \sqrt{2}$) in terms of the concentration of the NTK.
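A quick sanity check of the two constants in the bound, for the activations the abstract compares (the absolute value corresponds to $a = 0$, $b = 1$; ReLU corresponds to $a = b = \frac{1}{2}$, since $\max(s, 0) = \frac{s + \vert s \vert}{2}$):

```python
# Compute the paper's constants for a given activation phi(s) = a*s + b*|s|.
def delta_kappa(a, b):
    delta = b**2 / (a**2 + b**2)
    kappa = (abs(a) + abs(b)) / (a**2 + b**2) ** 0.5
    return delta, kappa

print(delta_kappa(0.0, 1.0))  # absolute value: (1.0, 1.0)
print(delta_kappa(0.5, 0.5))  # ReLU: (0.5, ~1.414)
```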
[LG-1] CodeMonkeys: Scaling Test-Time Compute for Software Engineering
Link: https://arxiv.org/abs/2501.14723
Authors: Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale “serial” test-time compute by increasing the number of iterations per trajectory and “parallel” test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at this https URL.
[LG-2] Decision-Focused Learning for Complex System Identification: HVAC Management System Application
Link: https://arxiv.org/abs/2501.14708
Authors: Pietro Favaro, Jean-François Toubeau, François Vallée, Yury Dvorkin
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 12 pages, 9 figures, submitted to ACM e-Energy 2025
Abstract:As opposed to conventional training methods tailored to minimize a given statistical metric or task-agnostic loss (e.g., mean squared error), Decision-Focused Learning (DFL) trains machine learning models for optimal performance in downstream decision-making tools. We argue that DFL can be leveraged to learn the parameters of system dynamics, expressed as constraint of the convex optimization control policy, while the system control signal is being optimized, thus creating an end-to-end learning framework. This is particularly relevant for systems in which behavior changes once the control policy is applied, hence rendering historical data less applicable. The proposed approach can perform system identification - i.e., determine appropriate parameters for the system analytical model - and control simultaneously to ensure that the model’s accuracy is focused on areas most relevant to control. Furthermore, because black-box systems are non-differentiable, we design a loss function that requires solely to measure the system response. We propose pre-training on historical data and constraint relaxation to stabilize the DFL and deal with potential infeasibilities in learning. We demonstrate the usefulness of the method on a building Heating, Ventilation, and Air Conditioning day-ahead management system for a realistic 15-zone building located in Denver, US. The results show that the conventional RC building model, with the parameters obtained from historical data using supervised learning, underestimates HVAC electrical power consumption. For our case study, the ex-post cost is on average six times higher than the expected one. Meanwhile, the same RC model with parameters obtained via DFL underestimates the ex-post cost only by 3%.
[LG-3] Decoupled SGDA for Games with Intermittent Strategy Communication
Link: https://arxiv.org/abs/2501.14652
Authors: Ali Zindari, Parham Yazdkhasti, Anton Rodomanov, Tatjana Chavdarova, Sebastian U. Stich
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We focus on reducing communication overhead in multiplayer games, where frequently exchanging strategies between players is not feasible and players have noisy or outdated strategies of the other players. We introduce Decoupled SGDA, a novel adaptation of Stochastic Gradient Descent Ascent (SGDA). In this approach, players independently update their strategies based on outdated opponent strategies, with periodic synchronization to align strategies. For Strongly-Convex-Strongly-Concave (SCSC) games, we demonstrate that Decoupled SGDA achieves near-optimal communication complexity comparable to the best-known GDA rates. For weakly coupled games where the interaction between players is lower relative to the non-interactive part of the game, Decoupled SGDA significantly reduces communication costs compared to standard SGDA. Our findings extend to multi-player games. To provide insights into the effect of communication frequency and convergence, we extensively study the convergence of Decoupled SGDA for quadratic minimax problems. Lastly, in settings where the noise over the players is imbalanced, Decoupled SGDA significantly outperforms federated minimax methods.
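A toy sketch of the decoupled update on a strongly-convex-strongly-concave quadratic game $f(x, y) = \frac{1}{2}x^\top A x - \frac{1}{2}y^\top B y + x^\top C y$: each player steps against a stale copy of the opponent, and copies are synchronized every $K$ steps. Problem sizes, step size, and synchronization period are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lr, K = 5, 0.1, 10
A, B = np.eye(n), np.eye(n)             # strongly convex / strongly concave parts
C = 0.1 * rng.standard_normal((n, n))   # weak coupling between the players
x, y = rng.standard_normal(n), rng.standard_normal(n)
x_stale, y_stale = x.copy(), y.copy()   # each player's outdated view of the other

for t in range(500):
    x = x - lr * (A @ x + C @ y_stale)      # descent step against stale y
    y = y + lr * (-B @ y + C.T @ x_stale)   # ascent step against stale x
    if (t + 1) % K == 0:                    # periodic strategy synchronization
        x_stale, y_stale = x.copy(), y.copy()

print(np.linalg.norm(x), np.linalg.norm(y))  # both approach the equilibrium at 0
```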
[LG-4] Towards Scalable Topological Regularizers ICLR2025
Link: https://arxiv.org/abs/2501.14641
Authors: Hiu-Tung Wong, Darrick Lee, Hong Yan
Subjects: Machine Learning (cs.LG); Algebraic Topology (math.AT)
Comments: 31 pages, accepted to ICLR 2025
Abstract:Latent space matching, which consists of matching distributions of features in latent space, is a crucial component for tasks such as adversarial attacks and defenses, domain adaptation, and generative modelling. Metrics for probability measures, such as Wasserstein and maximum mean discrepancy, are commonly used to quantify the differences between such distributions. However, these are often costly to compute, or do not appropriately take the geometric and topological features of the distributions into consideration. Persistent homology is a tool from topological data analysis which quantifies the multi-scale topological structure of point clouds, and has recently been used as a topological regularizer in learning tasks. However, computation costs preclude larger scale computations, and discontinuities in the gradient lead to unstable training behavior such as in adversarial tasks. We propose the use of principal persistence measures, based on computing the persistent homology of a large number of small subsamples, as a topological regularizer. We provide a parallelized GPU implementation of this regularizer, and prove that gradients are continuous for smooth densities. Furthermore, we demonstrate the efficacy of this regularizer on shape matching, image generation, and semi-supervised learning tasks, opening the door towards a scalable regularizer for topological features.
[LG-5] A Paired Autoencoder Framework for Inverse Problems via Bayes Risk Minimization
Link: https://arxiv.org/abs/2501.14636
Authors: Emma Hart, Julianne Chung, Matthias Chung
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 22 pages, 9 figures
Abstract:In this work, we describe a new data-driven approach for inverse problems that exploits technologies from machine learning, in particular autoencoder network structures. We consider a paired autoencoder framework, where two autoencoders are used to efficiently represent the input and target spaces separately and optimal mappings are learned between latent spaces, thus enabling forward and inverse surrogate mappings. We focus on interpretations using Bayes risk and empirical Bayes risk minimization, and we provide various theoretical results and connections to existing works on low-rank matrix approximations. Similar to end-to-end approaches, our paired approach creates a surrogate model for forward propagation and regularized inversion. However, our approach outperforms existing approaches in scenarios where training data for unsupervised learning are readily available but training pairs for supervised learning are scarce. Furthermore, we show that cheaply computable evaluation metrics are available through this framework and can be used to predict whether the solution for a new sample should be predicted well.
[LG-6] Accelerated Preference Elicitation with LLM-Based Proxies
Link: https://arxiv.org/abs/2501.14625
Authors: David Huang, Francisco Marmolejo-Cossío, Edwin Lock, David Parkes
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:
Abstract:Bidders in combinatorial auctions face significant challenges when describing their preferences to an auctioneer. Classical work on preference elicitation focuses on query-based techniques inspired from proper learning–often via proxies that interface between bidders and an auction mechanism–to incrementally learn bidder preferences as needed to compute efficient allocations. Although such elicitation mechanisms enjoy theoretical query efficiency, the amount of communication required may still be too cognitively taxing in practice. We propose a family of efficient LLM-based proxy designs for eliciting preferences from bidders using natural language. Our proposed mechanism combines LLM pipelines and DNF-proper-learning techniques to quickly approximate preferences when communication is limited. To validate our approach, we create a testing sandbox for elicitation mechanisms that communicate in natural language. In our experiments, our most promising LLM proxy design reaches approximately efficient outcomes with five times fewer queries than classical proper learning based elicitation mechanisms.
[LG-7] Inverse Evolution Data Augmentation for Neural PDE Solvers
Link: https://arxiv.org/abs/2501.14604
Authors: Chaoyu Liu, Chris Budd, Carola-Bibiane Schönlieb
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Neural networks have emerged as promising tools for solving partial differential equations (PDEs), particularly through the application of neural operators. Training neural operators typically requires a large amount of training data to ensure accuracy and generalization. In this paper, we propose a novel data augmentation method specifically designed for training neural operators on evolution equations. Our approach utilizes insights from inverse processes of these equations to efficiently generate data from random initialization that are combined with original data. To further enhance the accuracy of the augmented data, we introduce high-order inverse evolution schemes. These schemes consist of only a few explicit computation steps, yet the resulting data pairs can be proven to satisfy the corresponding implicit numerical schemes. In contrast to traditional PDE solvers that require small time steps or implicit schemes to guarantee accuracy, our data augmentation method employs explicit schemes with relatively large time steps, thereby significantly reducing computational costs. Accuracy and efficacy experiments confirm the effectiveness of our approach. Additionally, we validate our approach through experiments with the Fourier Neural Operator and UNet on three common evolution equations, namely Burgers’ equation, the Allen-Cahn equation, and the Navier-Stokes equation. The results demonstrate a significant improvement in the performance and robustness of the Fourier Neural Operator when coupled with our inverse evolution data augmentation method.
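The inverse-evolution idea can be illustrated on the 1-D heat equation $u_t = \nu u_{xx}$ (a stand-in for the equations used in the paper): one explicit backward step from a state $u$ yields $u_{\text{prev}}$, and the pair $(u_{\text{prev}}, u)$ then satisfies the implicit backward-Euler forward scheme by construction. Grid size, $\nu$, and $\Delta t$ below are illustrative.

```python
import numpy as np

def laplacian(u, dx):
    return (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2

def inverse_evolution_pair(u, nu=0.01, dt=0.01, dx=1.0 / 128):
    # Explicit backward step: u_prev = u - dt * nu * u_xx evaluated at u,
    # so (u_prev, u) satisfies backward Euler: u = u_prev + dt * nu * u_xx(u).
    u_prev = u - dt * nu * laplacian(u, dx)
    return u_prev, u

u = np.random.default_rng(0).standard_normal(128)
x_in, x_out = inverse_evolution_pair(u)   # one synthetic training pair
```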
[LG-8] Data Assetization via Resources-decoupled Federated Learning
Link: https://arxiv.org/abs/2501.14588
Authors: Jianzhe Zhao, Feida Zhu, Lingyan He, Zixin Tang, Mingce Gao, Shiyu Yang, Guibing Guo
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:With the development of the digital economy, data is increasingly recognized as an essential resource for both work and life. However, due to privacy concerns, data owners tend to maximize the value of data through information flow rather than direct data transfer. Federated learning (FL) provides an effective approach to collaborative training models while preserving privacy. However, different data owners not only have variations in the quantity and quality of their data resources but also face mismatches between data and computing resources as model parameters and training data grow. These challenges hinder data owners’ willingness to participate and reduce the effectiveness of data assetization. In this work, we first identify the resource-decoupled FL environment, which includes model owners, data owners, and computing centers. We design a Tripartite Stackelberg Model and theoretically analyze the Stackelberg-Nash Equilibrium (SNE) for participants to optimize global utility. We propose the Quality-aware Dynamic Resources-decoupled FL algorithm (QD-RDFL), in which we derive and solve the optimal strategies of all parties to achieve SNE using backward induction, and a dynamic optimization mechanism is designed to improve the optimal strategy profile by evaluating the contribution of data quality from data owners to the global model during real training. Our comprehensive experiments demonstrate that our method effectively encourages the linkage of the three parties involved, maximizing global utility and data asset value.
[LG-9] Fairness of Deep Ensembles: On the interplay between per-group task difficulty and under-representation
Link: https://arxiv.org/abs/2501.14551
Authors: Estanislao Claucich, Sara Hooker, Diego H. Milone, Enzo Ferrante, Rodrigo Echeveste
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 6 figures
Abstract:Ensembling is commonly regarded as an effective way to improve the general performance of models in machine learning, while also increasing the robustness of predictions. When it comes to algorithmic fairness, heterogeneous ensembles, composed of multiple model types, have been employed to mitigate biases in terms of demographic attributes such as sex, age or ethnicity. Moreover, recent work has shown how in multi-class problems even simple homogeneous ensembles may favor performance of the worst-performing target classes. While homogeneous ensembles are simpler to implement in practice, it is not yet clear whether their benefits translate to groups defined not in terms of their target class, but in terms of demographic or protected attributes, hence improving fairness. In this work we show how this simple and straightforward method is indeed able to mitigate disparities, particularly benefiting under-performing subgroups. Interestingly, this can be achieved without sacrificing overall performance, which is a common trade-off observed in bias mitigation strategies. Moreover, we analyzed the interplay between two factors which may result in biases: sub-group under-representation and the inherent difficulty of the task for each group. These results revealed that, contrary to popular assumptions, having balanced datasets may be suboptimal if the task difficulty varies between subgroups. Indeed, we found that a perfectly balanced dataset may hurt both the overall performance and the gap between groups. This highlights the importance of considering the interaction between multiple forces at play in fairness.
[LG-10] Reducing Action Space for Deep Reinforcement Learning via Causal Effect Estimation
Link: https://arxiv.org/abs/2501.14543
Authors: Wenzhang Liu, Lianjun Jin, Lu Ren, Chaoxu Mu, Changyin Sun
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Intelligent decision-making within large and redundant action spaces remains challenging in deep reinforcement learning. Considering similar but ineffective actions at each step can lead to repetitive and unproductive trials. Existing methods attempt to improve agent exploration by reducing or penalizing redundant actions, yet they fail to provide quantitative and reliable evidence to determine redundancy. In this paper, we propose a method to improve exploration efficiency by estimating the causal effects of actions. Unlike prior methods, our approach offers quantitative results regarding the causality of actions for one-step transitions. We first pre-train an inverse dynamics model to serve as prior knowledge of the environment. Subsequently, we classify actions across the entire action space at each time step and estimate the causal effect of each action to suppress redundant actions during exploration. We provide a theoretical analysis to demonstrate the effectiveness of our method and present empirical results from simulations in environments with redundant actions to evaluate its performance. Our implementation is available at this https URL.
[LG-11] A Recurrent Spiking Network with Hierarchical Intrinsic Excitability Modulation for Schema Learning
Link: https://arxiv.org/abs/2501.14539
Authors: Yingchao Yu, Yaochu Jin, Yuchen Xiao, Yuping Yan
Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 31 pages, 9 figures
Abstract:Schema, a form of structured knowledge that promotes transfer learning, is attracting growing attention in both neuroscience and artificial intelligence (AI). Current schema research in neural computation is largely constrained to a single behavioral paradigm and relies heavily on recurrent neural networks (RNNs) which lack the neural plausibility and biological interpretability. To address these limitations, this work first constructs a generalized behavioral paradigm framework for schema learning and introduces three novel cognitive tasks, thus supporting a comprehensive schema exploration. Second, we propose a new model using recurrent spiking neural networks with hierarchical intrinsic excitability modulation (HM-RSNNs). The top level of the model selects excitability properties for task-specific demands, while the bottom level fine-tunes these properties for intra-task problems. Finally, extensive visualization analyses of HM-RSNNs are conducted to showcase their computational advantages, track the intrinsic excitability evolution during schema learning, and examine neural coordination differences across tasks. Biologically inspired lesion studies further uncover task-specific distributions of intrinsic excitability within schemas. Experimental results show that HM-RSNNs significantly outperform RSNN baselines across all tasks and exceed RNNs in three novel cognitive tasks. Additionally, HM-RSNNs offer deeper insights into neural dynamics underlying schema learning.
[LG-12] On Hardening DNNs against Noisy Computations
Link: https://arxiv.org/abs/2501.14531
Authors: Xiao Wang, Hendrik Borras, Bernhard Klein, Holger Fröning
Subjects: Machine Learning (cs.LG)
Comments: Presented at the AccML workshop, co-located with HiPEAC 2025
Abstract:The success of deep learning has sparked significant interest in designing computer hardware optimized for the high computational demands of neural network inference. As further miniaturization of digital CMOS processors becomes increasingly challenging, alternative computing paradigms, such as analog computing, are gaining consideration. Particularly for compute-intensive tasks such as matrix multiplication, analog computing presents a promising alternative due to its potential for significantly higher energy efficiency compared to conventional digital technology. However, analog computations are inherently noisy, which makes it challenging to maintain high accuracy on deep neural networks. This work investigates the effectiveness of training neural networks with quantization to increase the robustness against noise. Experimental results across various network architectures show that quantization-aware training with constant scaling factors enhances robustness. We compare these methods with noisy training, which incorporates a noise injection during training that mimics the noise encountered during inference. While both methods increase tolerance against noise, noisy training emerges as the superior approach for achieving robust neural network performance, especially in complex neural architectures.
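A minimal sketch of the noisy-training baseline discussed above: Gaussian noise injected into a layer's output during the forward pass to mimic analog inference noise. The noise scale and its placement are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer that perturbs its output with Gaussian noise in training."""
    def __init__(self, in_features, out_features, noise_std=0.05):
        super().__init__(in_features, out_features)
        self.noise_std = noise_std

    def forward(self, x):
        y = super().forward(x)
        if self.training:              # inject noise only while training
            y = y + self.noise_std * torch.randn_like(y)
        return y
```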
[LG-13] Automated Assignment Grading with Large Language Models: Insights From a Bioinformatics Course
Link: https://arxiv.org/abs/2501.14499
Authors: Pavlin G. Poličar, Martin Špendl, Tomaž Curk, Blaž Zupan
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:Providing students with individualized feedback through assignments is a cornerstone of education that supports their learning and development. Studies have shown that timely, high-quality feedback plays a critical role in improving learning outcomes. However, providing personalized feedback on a large scale in classes with large numbers of students is often impractical due to the significant time and effort required. Recent advances in natural language processing and large language models (LLMs) offer a promising solution by enabling the efficient delivery of personalized feedback. These technologies can reduce the workload of course staff while improving student satisfaction and learning outcomes. Their successful implementation, however, requires thorough evaluation and validation in real classrooms. We present the results of a practical evaluation of LLM-based graders for written assignments in the 2024/25 iteration of the Introduction to Bioinformatics course at the University of Ljubljana. Over the course of the semester, more than 100 students answered 36 text-based questions, most of which were automatically graded using LLMs. In a blind study, students received feedback from both LLMs and human teaching assistants without knowing the source, and later rated the quality of the feedback. We conducted a systematic evaluation of six commercial and open-source LLMs and compared their grading performance with human teaching assistants. Our results show that with well-designed prompts, LLMs can achieve grading accuracy and feedback quality comparable to human graders. Our results also suggest that open-source LLMs perform as well as commercial LLMs, allowing schools to implement their own grading systems while maintaining privacy.
[LG-14] MLMC: Interactive multi-label multi-classifier evaluation without confusion matrices
Link: https://arxiv.org/abs/2501.14460
Authors: Aleksandar Doknic, Torsten Möller
Subjects: Machine Learning (cs.LG)
Comments: 12 pages
Abstract:Machine learning-based classifiers are commonly evaluated by metrics like accuracy, but deeper analysis is required to understand their strengths and weaknesses. MLMC is a visual exploration tool that tackles the challenge of multi-label classifier comparison and evaluation. It offers a scalable alternative to confusion matrices which are commonly used for such tasks, but don’t scale well with a large number of classes or labels. Additionally, MLMC allows users to view classifier performance from an instance perspective, a label perspective, and a classifier perspective. Our user study shows that the techniques implemented by MLMC allow for a powerful multi-label classifier evaluation while preserving user friendliness.
[LG-15] A Survey of Optimization Methods for Training DL Models: Theoretical Perspective on Convergence and Generalization
Link: https://arxiv.org/abs/2501.14458
Authors: Jing Wang, Anna Choromanska
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)
Comments:
Abstract:As data sets grow in size and complexity, it is becoming more difficult to pull useful features from them using hand-crafted feature extractors. For this reason, deep learning (DL) frameworks are now widely popular. The Holy Grail of DL and one of the most mysterious challenges in all of modern ML is to develop a fundamental understanding of DL optimization and generalization. While numerous optimization techniques have been introduced in the literature to navigate the exploration of the highly non-convex DL optimization landscape, many survey papers reviewing them primarily focus on summarizing these methodologies, often overlooking the critical theoretical analyses of these methods. In this paper, we provide an extensive summary of the theoretical foundations of optimization methods in DL, including presenting various methodologies, their convergence analyses, and generalization abilities. This paper not only includes theoretical analysis of popular generic gradient-based first-order and second-order methods, but it also covers the analysis of the optimization techniques adapting to the properties of the DL loss landscape and explicitly encouraging the discovery of well-generalizing optimal points. Additionally, we extend our discussion to distributed optimization methods that facilitate parallel computations, including both centralized and decentralized approaches. We provide both convex and non-convex analysis for the optimization algorithms considered in this survey paper. Finally, this paper aims to serve as a comprehensive theoretical handbook on optimization methods for DL, offering insights and understanding to both novice and seasoned researchers in the field.
[LG-16] Optimal Strategies for Federated Learning Maintaining Client Privacy
Link: https://arxiv.org/abs/2501.14453
Authors: Uday Bhaskar, Varul Srivastava, Avyukta Manjunatha Vummintala, Naresh Manwani, Sujit Gujar
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Federated Learning (FL) emerged as a learning method to enable the server to train models over data distributed among various clients. These clients are protective about their data being leaked to the server, any other client, or an external adversary, and hence locally train the model and share it with the server rather than sharing the data. The introduction of sophisticated inferencing attacks enabled the leakage of information about data through access to model parameters. To tackle this challenge, privacy-preserving federated learning aims to achieve differential privacy through learning algorithms like DP-SGD. However, such methods involve adding noise to the model, data, or gradients, reducing the model’s performance. This work provides a theoretical analysis of the tradeoff between model performance and communication complexity of the FL system. We formally prove that training for one local epoch per global round of training gives optimal performance while preserving the same privacy budget. We also investigate how the utility (tied to privacy) of FL models changes with the number of clients when clients train using DP-SGD, and argue that for the same privacy budget, utility improves as the number of clients increases. We validate our findings through experiments on real-world datasets. The results from this paper aim to improve the performance of privacy-preserving federated learning systems.
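For reference, one DP-SGD step, the local update analyzed here, clips each per-sample gradient and adds calibrated Gaussian noise before averaging. The clip norm and noise multiplier below are illustrative assumptions.

```python
import numpy as np

def dp_sgd_step(w, per_sample_grads, lr=0.1, clip=1.0, sigma=1.0, rng=None):
    rng = rng or np.random.default_rng()
    clipped = []
    for g in per_sample_grads:                  # clip every sample's gradient
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip / (norm + 1e-12)))
    noise = rng.normal(0.0, sigma * clip, size=w.shape)  # Gaussian mechanism
    g_hat = (np.sum(clipped, axis=0) + noise) / len(per_sample_grads)
    return w - lr * g_hat
```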
[LG-17] Impact of Batch Normalization on Convolutional Network Representations
Link: https://arxiv.org/abs/2501.14441
Authors: Hermanus L. Potgieter, Coenraad Mouton, Marelie H. Davel
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Batch normalization (BatchNorm) is a popular layer normalization technique used when training deep neural networks. It has been shown to enhance the training speed and accuracy of deep learning models. However, the mechanics by which BatchNorm achieves these benefits is an active area of research, and different perspectives have been proposed. In this paper, we investigate the effect of BatchNorm on the resulting hidden representations, that is, the vectors of activation values formed as samples are processed at each hidden layer. Specifically, we consider the sparsity of these representations, as well as their implicit clustering – the creation of groups of representations that are similar to some extent. We contrast image classification models trained with and without batch normalization and highlight consistent differences observed. These findings highlight that BatchNorm’s effect on representational sparsity is not a significant factor affecting generalization, while the representations of models trained with BatchNorm tend to show more advantageous clustering characteristics.
[LG-18] Convergence of gradient based training for linear Graph Neural Networks
Link: https://arxiv.org/abs/2501.14440
Authors: Dhiraj Patel, Anton Savostianov, Michael T. Schaub
Subjects: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Social and Information Networks (cs.SI); Numerical Analysis (math.NA)
Comments: 27 pages, 8 figures
Abstract:Graph Neural Networks (GNNs) are powerful tools for addressing learning problems on graph structures, with a wide range of applications in molecular biology and social networks. However, the theoretical foundations underlying their empirical performance are not well understood. In this article, we examine the convergence of gradient dynamics in the training of linear GNNs. Specifically, we prove that the gradient flow training of a linear GNN with mean squared loss converges to the global minimum at an exponential rate. The convergence rate depends explicitly on the initial weights and the graph shift operator, which we validate on synthetic datasets from well-known graph models and real-world datasets. Furthermore, we discuss the gradient flow that minimizes the total weights at the global minimum. In addition to the gradient flow, we study the convergence of linear GNNs under gradient descent training, an iterative scheme viewed as a discretization of gradient flow.
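A toy NumPy version of the analyzed setting, a one-layer linear GNN $\hat{Y} = S X W$ trained by gradient descent on mean squared loss, where $S$ is a row-normalized graph shift operator; the random graph, sizes, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 30, 8, 3
A = (rng.uniform(size=(n, n)) < 0.2).astype(float)
A = np.maximum(A, A.T)                           # symmetric random adjacency
S = A / np.maximum(A.sum(1, keepdims=True), 1)   # row-normalized shift operator
X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, c))
W = 0.01 * rng.standard_normal((d, c))

for _ in range(2000):
    R = S @ X @ W - Y
    W -= 0.01 * (X.T @ S.T @ R) / n   # gradient of (1/2n) * ||S X W - Y||_F^2
print(np.linalg.norm(S @ X @ W - Y))  # loss decays at a geometric rate
```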
[LG-19] Data-efficient Performance Modeling via Pre-training
Link: https://arxiv.org/abs/2501.14438
Authors: Chunting Liu, Riyadh Baghdadi
Subjects: Programming Languages (cs.PL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Performance models are essential for automatic code optimization, enabling compilers to predict the effects of code transformations on performance and guide search for optimal transformations. Building state-of-the-art performance models with deep learning, however, requires vast labeled datasets of random programs – an expensive and time-consuming process, stretching over months. This paper introduces a self-supervised pre-training scheme with autoencoders to reduce the need for labeled data. By pre-training on a large dataset of random programs, the autoencoder learns representations of code and transformations, which are then used to embed programs for the performance model. Implemented in the Tiramisu autoscheduler, our approach improves model accuracy with less data. For example, to achieve a MAPE of 20.72%, the original model requires 18 million data points, whereas our method achieves a similar MAPE of 22.44% with only 3.6 million data points, reducing data requirements by 5x.
[LG-20] Remining Hard Negatives for Generative Pseudo Labeled Domain Adaptation
Link: https://arxiv.org/abs/2501.14434
Authors: Goksenin Yuksel, David Rau, Jaap Kamps
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Dense retrievers have demonstrated significant potential for neural information retrieval; however, they exhibit a lack of robustness to domain shifts, thereby limiting their efficacy in zero-shot settings across diverse domains. A state-of-the-art domain adaptation technique is Generative Pseudo Labeling (GPL). GPL uses synthetic query generation and initially mined hard negatives to distill knowledge from cross-encoder to dense retrievers in the target domain. In this paper, we analyze the documents retrieved by the domain-adapted model and discover that these are more relevant to the target queries than those of the non-domain-adapted model. We then propose refreshing the hard-negative index during the knowledge distillation phase to mine better hard negatives. Our remining R-GPL approach boosts ranking performance in 13/14 BEIR datasets and 9/12 LoTTE datasets. Our contributions are (i) analyzing hard negatives returned by domain-adapted and non-domain-adapted models and (ii) applying the GPL training with and without hard-negative re-mining in LoTTE and BEIR datasets.
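The remining step itself is simple to sketch: periodically re-encode queries and documents with the current dense retriever and take the top-ranked non-positives as fresh hard negatives. The encoder interface and cutoff k below are assumptions for illustration.

```python
import numpy as np

def remine_hard_negatives(query_vecs, doc_vecs, positives, k=50):
    scores = query_vecs @ doc_vecs.T           # dot-product dense retrieval
    hard_negatives = []
    for qi, pos in enumerate(positives):
        ranked = np.argsort(-scores[qi])       # best-scoring documents first
        hard_negatives.append([d for d in ranked[:k + 1] if d != pos][:k])
    return hard_negatives
```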
[LG-21] GraphBC: Improving LLMs for Better Graph Data Processing
Link: https://arxiv.org/abs/2501.14427
Authors: Xu Chu, Hanlin Xue, Zhijie Tan, Bingce Wang, Tong Mo, Weiping Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The success of Large Language Models (LLMs) in various domains has led researchers to apply them to graph-related problems by converting graph data into natural language text. However, unlike graph data, natural language inherently has sequential order. We observe that when the order of nodes or edges in the natural language description of a graph is shuffled, despite describing the same graph, model performance fluctuates between high performance and random guessing. Additionally, due to the limited input context length of LLMs, current methods typically randomly sample neighbors of target nodes as representatives of their neighborhood, which may not always be effective for accurate reasoning. To address these gaps, we introduce GraphBC. This novel model framework features an Order Selector Module to ensure proper serialization order of the graph and a Subgraph Sampling Module to sample subgraphs with better structure for better reasoning. Furthermore, we propose Graph CoT obtained through distillation, and enhance LLM’s reasoning and zero-shot learning capabilities for graph tasks through instruction tuning. Experiments on multiple datasets for node classification and graph question-answering demonstrate that GraphBC improves LLMs’ performance and generalization ability on graph tasks.
[LG-22] CENTS: Generating synthetic electricity consumption time series for rare and unseen scenarios
Link: https://arxiv.org/abs/2501.14426
Authors: Michael Fuest, Alfredo Cuesta, Kalyan Veeramachaneni
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Recent breakthroughs in large-scale generative modeling have demonstrated the potential of foundation models in domains such as natural language, computer vision, and protein structure prediction. However, their application in the energy and smart grid sector remains limited due to the scarcity and heterogeneity of high-quality data. In this work, we propose a method for creating high-fidelity electricity consumption time series data for rare and unseen context variables (e.g. location, building type, photovoltaics). Our approach, Context Encoding and Normalizing Time Series Generation, or CENTS, includes three key innovations: (i) A context normalization approach that enables inverse transformation for time series context variables unseen during training, (ii) a novel context encoder to condition any state-of-the-art time-series generator on arbitrary numbers and combinations of context variables, (iii) a framework for training this context encoder jointly with a time-series generator using an auxiliary context classification loss designed to increase expressivity of context embeddings and improve model performance. We further provide a comprehensive overview of different evaluation metrics for generative time series models. Our results highlight the efficacy of the proposed method in generating realistic household-level electricity consumption data, paving the way for training larger foundation models in the energy domain on synthetic as well as real-world data.
[LG-23] SoK: What Makes Private Learning Unfair?
Link: https://arxiv.org/abs/2501.14414
Authors: Kai Yao, Marc Juarez
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Systemization of Knowledge (SoK) paper. This work has been accepted for publication in the 3rd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML’25). The final version will be available on IEEE Xplore
Abstract:Differential privacy has emerged as the most studied framework for privacy-preserving machine learning. However, recent studies show that enforcing differential privacy guarantees can not only significantly degrade the utility of the model, but also amplify existing disparities in its predictive performance across demographic groups. Although there is extensive research on the identification of factors that contribute to this phenomenon, we still lack a complete understanding of the mechanisms through which differential privacy exacerbates disparities. The literature on this problem is muddled by varying definitions of fairness, differential privacy mechanisms, and inconsistent experimental settings, often leading to seemingly contradictory results. This survey provides the first comprehensive overview of the factors that contribute to the disparate effect of training models with differential privacy guarantees. We discuss their impact and analyze their causal role in such a disparate effect. Our analysis is guided by a taxonomy that categorizes these factors by their position within the machine learning pipeline, allowing us to draw conclusions about their interaction and the feasibility of potential mitigation strategies. We find that factors related to the training dataset and the underlying distribution play a decisive role in the occurrence of disparate impact, highlighting the need for research on these factors to address the issue.
[LG-24] Reinforcement Learning for Efficient Returns Management
Link: https://arxiv.org/abs/2501.14394
Authors: Pascal Linden, Nathalie Paul, Tim Wirtz, Stefan Wrobel
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In retail warehouses, returned products are typically placed in an intermediate storage until a decision regarding further shipment to stores is made. The longer products are held in storage, the higher the inefficiency and costs of the returns management process, since enough storage area has to be provided and maintained while the products are not placed for sale. To reduce the average product storage time, we consider an alternative solution where reallocation decisions for products can be made instantly upon their arrival in the warehouse allowing only a limited number of products to still be stored simultaneously. We transfer the problem to an online multiple knapsack problem and propose a novel reinforcement learning approach to pack the items (products) into the knapsacks (stores) such that the overall value (expected revenue) is maximized. Empirical evaluations on simulated data demonstrate that, compared to the usual offline decision procedure, our approach comes with a performance gap of only 3% while significantly reducing the average storage time of a product by 96%.
[LG-25] Distinguishing Parkinson's Patients Using Voice-Based Feature Extraction and Classification
Link: https://arxiv.org/abs/2501.14390
Authors: Burak Çelik, Ayhan Akbal
Subjects: Machine Learning (cs.LG)
Comments: Presented at the 13th International Marmara Science Congress (IMASCON 2024)
Abstract:Parkinson’s disease (PD) is a progressive neurodegenerative disorder that impacts motor functions and speech characteristics. This study focuses on differentiating individuals with Parkinson’s disease from healthy controls through the extraction and classification of speech features. Patients were further divided into two groups: Med On represents patients with medication, while Med Off represents patients without medication. The dataset consisted of patients and healthy individuals who read a predefined text using the H1N Zoom microphone in a suitable recording environment at Fırat University Neurology Department. Speech recordings from PD patients and healthy controls were analyzed, and 19 key features were extracted, including jitter, luminance, zero-crossing rate (ZCR), root mean square (RMS) energy, entropy, skewness, and kurtosis. These features were visualized in graphs and statistically evaluated to identify distinctive patterns in PD patients. Using MATLAB’s Classification Learner toolbox, several machine learning classification models were applied to classify the groups, and significant accuracy rates were achieved. The accuracy of our 3-layer artificial neural network architecture was also compared with classical machine learning algorithms. This study highlights the potential of noninvasive voice analysis combined with machine learning for early detection and monitoring of PD patients. Future research can improve diagnostic accuracy by optimizing feature selection and exploring advanced classification techniques.
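Two of the listed features are easy to reproduce from a raw waveform. A NumPy sketch follows; the frame length and sampling rate are assumptions, and this is not the authors' MATLAB pipeline.

```python
import numpy as np

def zcr(frame):   # fraction of consecutive samples whose sign flips
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def rms(frame):   # root mean square energy of the frame
    return np.sqrt(np.mean(frame**2))

signal = np.random.default_rng(0).standard_normal(16000)  # stand-in waveform
frames = signal.reshape(-1, 400)                           # 25 ms frames at 16 kHz
features = np.array([[zcr(f), rms(f)] for f in frames])    # per-frame features
```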
[LG-26] Fat-to-Thin Policy Optimization: Offline RL with Sparse Policies ICLR2025
Link: https://arxiv.org/abs/2501.14373
Authors: Lingwei Zhu, Han Wang, Yukie Nagai
Subjects: Machine Learning (cs.LG)
Comments: accepted by ICLR 2025; code available at this https URL
Abstract:Sparse continuous policies are distributions that can choose some actions at random yet keep strictly zero probability for the other actions, which are radically different from the Gaussian. They have important real-world implications, e.g. in modeling safety-critical tasks like medicine. The combination of offline reinforcement learning and sparse policies provides a novel paradigm that enables learning completely from logged datasets a safety-aware sparse policy. However, sparse policies can cause difficulty with the existing offline algorithms which require evaluating actions that fall outside of the current support. In this paper, we propose the first offline policy optimization algorithm that tackles this challenge: Fat-to-Thin Policy Optimization (FtTPO). Specifically, we maintain a fat (heavy-tailed) proposal policy that effectively learns from the dataset and injects knowledge to a thin (sparse) policy, which is responsible for interacting with the environment. We instantiate FtTPO with the general $q$-Gaussian family that encompasses both heavy-tailed and sparse policies and verify that it performs favorably in a safety-critical treatment simulation and the standard MuJoCo suite. Our code is available at this https URL.
[LG-27] Facies Classification with Copula Entropy
Link: https://arxiv.org/abs/2501.14351
Authors: Jian Ma
Subjects: Machine Learning (cs.LG); Geophysics (physics.geo-ph); Applications (stat.AP)
Comments: 12 pages, 5 figures, 3 tables. arXiv admin note: text overlap with arXiv:2310.16633
Abstract:In this paper we propose to apply copula entropy (CE) to facies classification. In our method, the correlations between geological variables and facies classes are measured with CE and then the variables associated with large negative CEs are selected for classification. We verified the proposed method on a typical facies dataset for facies classification and the experimental results show that the proposed method can select less geological variables for facies classification without sacrificing classification performance. The geological variables such selected are also interpretable to geologists with geological meanings due to the rigorous definition of CE.
[LG-28] Online Inverse Linear Optimization: Improved Regret Bound Robustness to Suboptimality and Toward Tight Regret Analysis
Link: https://arxiv.org/abs/2501.14349
Authors: Shinsaku Sakaue, Taira Tsuchiya, Han Bao, Taihei Oki
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We study an online learning problem where, over $T$ rounds, a learner observes both time-varying sets of feasible actions and an agent’s optimal actions, selected by solving linear optimization over the feasible actions. The learner sequentially makes predictions of the agent’s underlying linear objective function, and their quality is measured by the regret, the cumulative gap between optimal objective values and those achieved by following the learner’s predictions. A seminal work by Bärmann et al. (ICML 2017) showed that online learning methods can be applied to this problem to achieve regret bounds of $O(\sqrt{T})$. Recently, Besbes et al. (COLT 2021, Oper. Res. 2023) significantly improved the result by achieving an $O(n^4 \ln T)$ regret bound, where $n$ is the dimension of the ambient space of objective vectors. Their method, based on the ellipsoid method, runs in polynomial time but is inefficient for large $n$ and $T$. In this paper, we obtain an $O(n \ln T)$ regret bound, improving upon the previous bound of $O(n^4 \ln T)$ by a factor of $n^3$. Our method is simple and efficient: we apply the online Newton step (ONS) to appropriate exp-concave loss functions. Moreover, for the case where the agent’s actions are possibly suboptimal, we establish an $O(n \ln T + \sqrt{\Delta_T n \ln T})$ regret bound, where $\Delta_T$ is the cumulative suboptimality of the agent’s actions. This bound is achieved by using MetaGrad, which runs ONS with $\Theta(\ln T)$ different learning rates in parallel. We also provide a simple instance that implies an $\Omega(n)$ lower bound, showing that our $O(n \ln T)$ bound is tight up to an $O(\ln T)$ factor. This gives rise to a natural question: can the $O(\ln T)$ factor in the upper bound be removed? For the special case of $n = 2$, we show that an $O(1)$ regret bound is possible, while we delineate challenges in extending this result to higher dimensions.
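For concreteness, the ONS building block the paper applies can be sketched as follows; the exp-concavity parameter gamma is an assumption, and the projection onto the feasible set is omitted.

```python
import numpy as np

class OnlineNewtonStep:
    def __init__(self, n, gamma=0.1, eps=1.0):
        self.A = eps * np.eye(n)   # running Gram matrix of observed gradients
        self.x = np.zeros(n)
        self.gamma = gamma

    def update(self, grad):
        self.A += np.outer(grad, grad)
        self.x -= (1.0 / self.gamma) * np.linalg.solve(self.A, grad)
        return self.x              # projection onto the feasible set omitted
```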
[LG-29] Domain Expansion: Parameter-Efficient Modules as Building Blocks for Composite Domains
Link: https://arxiv.org/abs/2501.14321
Authors: Mann Patel, Divyajyoti Panda, Hilay Mehta, Parth Patel, Dhruv Parikh
Subjects: Machine Learning (cs.LG)
Comments: 6 pages, 3 figures, 2 tables
Abstract:Parameter-Efficient Fine-Tuning (PEFT) is an efficient alternative to full scale fine-tuning, gaining popularity recently. With pre-trained model sizes growing exponentially, PEFT can be effectively utilized to fine-tune compact modules, Parameter-Efficient Modules (PEMs), trained to be domain experts over diverse domains. In this project, we explore composing such individually fine-tuned PEMs for distribution generalization over the composite domain. To compose PEMs, simple composing functions are used that operate purely on the weight space of the individually fine-tuned PEMs, without requiring any additional fine-tuning. The proposed method is applied to the task of representing the 16 Myers-Briggs Type Indicator (MBTI) composite personalities via 4 building block dichotomies, comprising 8 individual traits which can be merged (composed) to yield a unique personality. We evaluate the individual trait PEMs and the composed personality PEMs via an online MBTI personality quiz questionnaire, validating the efficacy of PEFT to fine-tune PEMs and merging PEMs without further fine-tuning for domain composition.
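Composing purely in weight space can be as simple as element-wise averaging of two modules' parameters. Whether the paper uses averaging or another composing function, the sketch below only illustrates the "no further fine-tuning" flavor of the approach.

```python
def compose_pems(pem_a: dict, pem_b: dict) -> dict:
    # Element-wise average of two modules' tensors, keyed by parameter name.
    assert pem_a.keys() == pem_b.keys()
    return {name: 0.5 * (pem_a[name] + pem_b[name]) for name in pem_a}
```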
[LG-30] Graph Feedback Bandits on Similar Arms: With and Without Graph Structures
Link: https://arxiv.org/abs/2501.14314
Authors: Han Qi, Fei Guo, Li Zhu, Qiaosheng Zhang, Xuelong Li
Subjects: Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2405.11171
Abstract:In this paper, we study the stochastic multi-armed bandit problem with graph feedback. Motivated by applications in clinical trials and recommendation systems, we assume that two arms are connected if and only if they are similar (i.e., their means are close to each other). We establish a regret lower bound for this problem under the novel feedback structure and introduce two upper confidence bound (UCB)-based algorithms: Double-UCB, which has problem-independent regret upper bounds, and Conservative-UCB, which has problem-dependent upper bounds. Leveraging the similarity structure, we also explore a scenario where the number of arms increases over time (referred to as the ballooning setting). Practical applications of this scenario include Q&A platforms (e.g., Reddit, Stack Overflow, Quora) and product reviews on platforms like Amazon and Flipkart, where answers (or reviews) continuously appear, and the goal is to display the best ones at the top. We extend these two UCB-based algorithms to the ballooning setting. Under mild assumptions, we provide regret upper bounds for both algorithms and discuss their sub-linearity. Furthermore, we propose a new version of the corresponding algorithms that do not rely on prior knowledge of the graph’s structural information and provide regret upper bounds. Finally, we conduct experiments to validate the theoretical results.
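The feedback structure can be illustrated with a UCB variant in which pulling an arm also yields reward observations for its (similar) neighbors; the graph, horizon, and Gaussian reward model below are assumptions, not the paper's Double-UCB or Conservative-UCB algorithms themselves.

```python
import numpy as np

def graph_ucb(means, neighbors, T=5000, seed=0):
    rng = np.random.default_rng(seed)
    K = len(means)
    n, s = np.zeros(K), np.zeros(K)    # observation counts / reward sums
    for t in range(1, T + 1):
        ucb = s / np.maximum(n, 1) + np.sqrt(2 * np.log(t) / np.maximum(n, 1))
        ucb[n == 0] = np.inf           # force initial exploration
        a = int(np.argmax(ucb))
        for j in [a] + list(neighbors[a]):  # side observations via the graph
            n[j] += 1
            s[j] += rng.normal(means[j], 1.0)
    return int(np.argmax(s / np.maximum(n, 1)))  # empirical best arm
```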
[LG-31] Locality-aware Fair Scheduling in LLM Serving
Link: https://arxiv.org/abs/2501.14312
Authors: Shiyi Cao, Yichuan Wang, Ziming Mao, Pin-Lun Hsu, Liangsheng Yin, Tian Xia, Dacheng Li, Shu Liu, Yineng Zhang, Yang Zhou, Ying Sheng, Joseph Gonzalez, Ion Stoica
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:
Abstract:Large language model (LLM) inference workload dominates a wide variety of modern AI applications, ranging from multi-turn conversation to document analysis. Balancing fairness and efficiency is critical for managing diverse client workloads with varying prefix patterns. Unfortunately, existing fair scheduling algorithms for LLM serving, such as Virtual Token Counter (VTC), fail to take prefix locality into consideration and thus suffer from poor performance. On the other hand, locality-aware scheduling algorithms in existing LLM serving frameworks tend to maximize the prefix cache hit rate without considering fair sharing among clients. This paper introduces the first locality-aware fair scheduling algorithm, Deficit Longest Prefix Match (DLPM), which can maintain a high degree of prefix locality with a fairness guarantee. We also introduce a novel algorithm, Double Deficit LPM (D$^2$LPM), extending DLPM to the distributed setup, which can find a balance point among fairness, locality, and load-balancing. Our extensive evaluation demonstrates the superior performance of DLPM and D$^2$LPM in ensuring fairness while maintaining high throughput (up to $2.87\times$ higher than VTC) and low per-client latency (up to $7.18\times$ lower than the state-of-the-art distributed LLM serving system).
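A highly simplified toy of the DLPM idea, deficit counters for fairness combined with a longest-cached-prefix preference, is sketched below; the request format, quantum, and `cached_prefix_len` helper are hypothetical, and the real algorithm's token accounting is far more involved.

```python
def dlpm_pick(queues, deficits, cached_prefix_len, quantum=100):
    # queues: client -> list of pending requests (dicts with a "tokens" cost)
    eligible = [c for c, q in queues.items() if q and deficits[c] >= 0]
    if not eligible:                   # replenish fair-share credits
        for c in deficits:
            deficits[c] += quantum
        eligible = [c for c, q in queues.items() if q]
    # Among eligible clients, favor the head request with the longest
    # match against the prefix cache (the locality objective).
    best = max(eligible, key=lambda c: cached_prefix_len(queues[c][0]))
    req = queues[best].pop(0)
    deficits[best] -= req["tokens"]    # charge the served tokens
    return req
```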
[LG-32] An Efficient Real Time DDoS Detection Model Using Machine Learning Algorithms
链接: https://arxiv.org/abs/2501.14311
作者: Debashis Kar Suvra
类目: Machine Learning (cs.LG)
*备注: 7 pages, 14 figures
点击查看摘要
Abstract:Distributed Denial of Service (DDoS) attacks have become a significant threat to industries and governments, leading to substantial financial losses. With the growing reliance on internet services, DDoS attacks can disrupt services by overwhelming servers with false traffic, causing downtime and data breaches. Although various detection techniques exist, selecting an effective method remains challenging due to trade-offs between time efficiency and accuracy. This research focuses on developing an efficient real-time DDoS detection system using machine learning algorithms, leveraging the UNB CICDDoS2019 dataset and its various traffic features. The study aims to classify DDoS and non-DDoS traffic with several ML classifiers, including Logistic Regression, K-Nearest Neighbors, Random Forest, Support Vector Machine, and Naive Bayes. The dataset is preprocessed through data cleaning, standardization, and feature selection using Principal Component Analysis. The research explores the performance of these algorithms in terms of precision, recall, and F1-score, as well as time complexity, to create a reliable system capable of real-time detection and mitigation of DDoS attacks. The findings indicate that RF, AdaBoost, and XGBoost outperform the other algorithms in accuracy and efficiency, making them ideal candidates for real-time applications.
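A minimal sketch of the preprocessing-plus-classifier structure the abstract describes (standardize, reduce with PCA, then classify with a random forest), using synthetic stand-in data rather than the CICDDoS2019 features:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for labeled traffic records (DDoS vs. benign)
X, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                           weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = make_pipeline(StandardScaler(), PCA(n_components=10),
                    RandomForestClassifier(n_estimators=200, random_state=42))
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```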
[LG-33] Advances in Temporal Point Processes: Bayesian, Deep, and LLM Approaches
链接: https://arxiv.org/abs/2501.14291
作者: Feng Zhou,Quyu Kong,Yixuan Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Temporal point processes (TPPs) are stochastic process models used to characterize event sequences occurring in continuous time. Traditional statistical TPPs have a long-standing history, with numerous models proposed and successfully applied across diverse domains. In recent years, advances in deep learning have spurred the development of neural TPPs, enabling greater flexibility and expressiveness in capturing complex temporal dynamics. The emergence of large language models (LLMs) has further sparked excitement, offering new possibilities for modeling and analyzing event sequences by leveraging their rich contextual understanding. This survey presents a comprehensive review of recent research on TPPs from three perspectives: Bayesian, deep learning, and LLM approaches. We begin with a review of the fundamental concepts of TPPs, followed by an in-depth discussion of model design and parameter estimation techniques in these three frameworks. We also revisit classic application areas of TPPs to highlight their practical relevance. Finally, we outline challenges and promising directions for future research.
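As a concrete anchor for the survey's subject matter, below is a minimal simulation of the classical Hawkes process, whose conditional intensity is λ(t) = μ + a·Σ_{t_i<t} exp(−b(t−t_i)), using Ogata's thinning algorithm; the parameter values are illustrative.

```python
import numpy as np

MU, A, B = 0.5, 0.8, 1.2   # baseline rate, excitation jump, decay

def intensity(t, events):
    ev = np.asarray([e for e in events if e < t])
    return MU + A * np.exp(-B * (t - ev)).sum()

def simulate_hawkes(T=50.0, seed=0):
    rng, events, t = np.random.default_rng(seed), [], 0.0
    while True:
        lam_bar = intensity(t, events) + A   # upper bound on lambda just after t
        t += rng.exponential(1.0 / lam_bar)
        if t > T:
            return events
        if rng.random() < intensity(t, events) / lam_bar:  # thinning step
            events.append(t)

print(f"{len(simulate_hawkes())} events simulated on [0, 50]")
```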
[LG-34] TLXML: Task-Level Explanation of Meta-Learning via Influence Functions
链接: https://arxiv.org/abs/2501.14271
作者: Yoshihiro Mitsuka,Shadan Golestan,Zahin Sufiyan,Sheila Schoepp,Shotaro Miwa,Osmar R. Zaïane
类目: Machine Learning (cs.LG)
*备注: 22 pages
点击查看摘要
Abstract:The scheme of adaptation via meta-learning is seen as an ingredient for solving the problem of data shortage or distribution shift in real-world applications, but it also brings the new risk of inappropriate updates of the model in the user environment, which increases the demand for explainability. Among the various types of XAI methods, establishing a method of explanation based on past experience in meta-learning requires special consideration due to its bi-level structure of training, which has been left unexplored. In this work, we propose influence functions for explaining meta-learning that measure the sensitivities of training tasks to adaptation and inference. We also argue that the approximation of the Hessian using the Gauss-Newton matrix resolves computational barriers peculiar to meta-learning. We demonstrate the adequacy of the method through experiments on task distinction and task distribution distinction using image classification tasks with MAML and Prototypical Network.
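A minimal numerical sketch of the influence-function idea with a Gauss-Newton Hessian approximation: the influence of training task i on the test loss is approximated by −g_test^T (G + λI)^{-1} g_i, where G is built from per-task gradients. The gradients below are random stand-ins, and this is the classical formula rather than the paper's bi-level meta-learning derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_tasks = 5, 20
J = rng.normal(size=(n_tasks, d))   # rows: per-task gradients (stand-ins)
g_test = rng.normal(size=d)         # gradient of the test/adaptation loss
lam = 1e-2                          # damping for numerical stability

G = J.T @ J / n_tasks + lam * np.eye(d)        # Gauss-Newton Hessian approx.
influences = -(J @ np.linalg.solve(G, g_test)) # one score per training task
print("most influential task:", int(np.argmax(np.abs(influences))))
```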
[LG-35] TrajFlow: A Generative Framework for Occupancy Density Estimation Using Normalizing Flows
链接: https://arxiv.org/abs/2501.14266
作者: Mitch Kosieradzki,Seongjin Choi
类目: Machine Learning (cs.LG)
*备注: 10 pages, 6 figures, 3 tables
点击查看摘要
Abstract:In transportation systems and autonomous vehicles, intelligent agents must understand the future motion of traffic participants to effectively plan motion trajectories. At the same time, the motion of traffic participants is inherently uncertain. In this paper, we propose TrajFlow, a generative framework for estimating the occupancy density of traffic participants. Our framework utilizes a causal encoder to extract semantically meaningful embeddings of the observed trajectory, as well as a normalizing flow to decode these embeddings and determine the most likely future location of traffic participants at some time point in the future. Our formulation differs from existing approaches because we model the marginal distribution of spatial locations instead of the joint distribution of unobserved trajectories. The advantages of a marginal formulation are numerous. First, we demonstrate that the marginal formulation produces higher accuracy on challenging trajectory forecasting benchmarks. Second, the marginal formulation allows for a fully continuous sampling of future locations. Finally, marginal densities are better suited for downstream tasks as they allow for the computation of per-agent motion trajectories and occupancy grids, the two most commonly used representations for motion forecasting. We present a novel architecture based entirely on neural differential equations as an implementation of this framework and provide ablations to demonstrate the advantages of a continuous implementation over a more traditional discrete neural network based approach. The code is available at this https URL .
[LG-36] Revisiting Applicable and Comprehensive Knowledge Tracing in Large-Scale Data
链接: https://arxiv.org/abs/2501.14256
作者: Yiyun Zhou,Wenkang Han,Jingyuan Chen
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Knowledge Tracing (KT) is a fundamental component of Intelligent Tutoring Systems (ITS), enabling the modeling of students' knowledge states to predict future performance. The introduction of Deep Knowledge Tracing (DKT), the first deep learning-based KT (DLKT) model, has brought significant advantages in terms of applicability and comprehensiveness. However, recent DLKT models, such as Attentive Knowledge Tracing (AKT), have often prioritized predictive performance at the expense of these benefits. While deep sequential models like DKT have shown potential, they face challenges related to parallel computing, storage decision modification, and limited storage capacity. To address these limitations, we propose DKT2, a novel KT model that leverages the recently developed xLSTM architecture. DKT2 enhances input representation using the Rasch model and incorporates Item Response Theory (IRT) for interpretability, allowing for the decomposition of learned knowledge into familiar and unfamiliar knowledge. By integrating this knowledge with predicted questions, DKT2 generates comprehensive knowledge states. Extensive experiments conducted across three large-scale datasets demonstrate that DKT2 consistently outperforms 17 baseline models in various prediction tasks, underscoring its potential for real-world educational applications. This work bridges the gap between theoretical advancements and practical implementation in KT. The code and datasets will be available at this https URL.
[LG-37] A Data-driven Dynamic Temporal Correlation Modeling Framework for Renewable Energy Scenario Generation
链接: https://arxiv.org/abs/2501.14233
作者: Xiaochong Dong,Yilin Liu,Xuemin Zhang,Shengwei Mei
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Renewable energy power is influenced by the atmospheric system, which exhibits nonlinear and time-varying features. To address this, a dynamic temporal correlation modeling framework is proposed for renewable energy scenario generation. A novel decoupled mapping path is employed for joint probability distribution modeling, formulating regression tasks for both marginal distributions and the correlation structure using proper scoring rules to ensure the rationality of the modeling process. The scenario generation process is divided into two stages. Firstly, the dynamic correlation network models temporal correlations based on a dynamic covariance matrix, capturing the time-varying features of renewable energy while enhancing the interpretability of the black-box model. Secondly, the implicit quantile network models the marginal quantile function in a nonparametric, continuous manner, enabling scenario generation through marginal inverse sampling. Experimental results demonstrate that the proposed dynamic correlation quantile network outperforms state-of-the-art methods in quantifying uncertainty and capturing dynamic correlation for short-term renewable energy scenario generation.
[LG-38] When GNNs meet symmetry in ILPs: an orbit-based feature augmentation approach
链接: https://arxiv.org/abs/2501.14211
作者: Qian Chen,Lei Li,Qian Li,Jianghua Wu,Akang Wang,Ruoyu Sun,Xiaodong Luo,Tsung-Hui Chang,Qingjiang Shi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:A common characteristic in integer linear programs (ILPs) is symmetry, allowing variables to be permuted without altering the underlying problem structure. Recently, GNNs have emerged as a promising approach for solving ILPs. However, a significant challenge arises when applying GNNs to ILPs with symmetry: classic GNN architectures struggle to differentiate between symmetric variables, which limits their predictive accuracy. In this work, we investigate the properties of permutation equivariance and invariance in GNNs, particularly in relation to the inherent symmetry of ILP formulations. We reveal that the interaction between these two factors contributes to the difficulty of distinguishing between symmetric variables. To address this challenge, we explore the potential of feature augmentation and propose several guiding principles for constructing augmented features. Building on these principles, we develop an orbit-based augmentation scheme that first groups symmetric variables and then samples augmented features for each group from a discrete uniform distribution. Empirical results demonstrate that our proposed approach significantly enhances both training efficiency and predictive performance.
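The orbit-based augmentation can be sketched in a few lines: once symmetric variables are grouped into orbits, each orbit receives augmented features sampled from a discrete uniform distribution so that a GNN can tell its members apart. The orbit grouping, feature dimension, and normalization below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# assume a symmetry detector has grouped interchangeable ILP variables:
orbits = [[0, 1, 2], [3, 4], [5]]
n_vars = 6

aug = np.zeros(n_vars)
for orbit in orbits:
    # sample distinct values from a discrete uniform distribution over the
    # orbit, breaking the symmetry among its variables
    vals = rng.permutation(len(orbit))
    for var, v in zip(orbit, vals):
        aug[var] = v / max(len(orbit) - 1, 1)   # scale to [0, 1]
print(aug)   # extra node feature appended to each variable node
```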
[LG-39] Bi-directional Curriculum Learning for Graph Anomaly Detection: Dual Focus on Homogeneity and Heterogeneity
链接: https://arxiv.org/abs/2501.14197
作者: Yitong Hao,Enbo He,Yue Zhang,Guisheng Yin
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注: 8 pages, 5 figures
点击查看摘要
Abstract:Graph anomaly detection (GAD) aims to identify nodes from a graph that are significantly different from normal patterns. Most previous studies are model-driven, focusing on enhancing the detection effect by improving the model structure. However, these approaches often treat all nodes equally, neglecting the different contributions of various nodes to the training. Therefore, we introduce graph curriculum learning as a simple and effective plug-and-play module to optimize GAD methods. The existing graph curriculum learning mainly focuses on the homogeneity of graphs and treats nodes with high homogeneity as easy nodes. In fact, GAD models can handle not only graph homogeneity but also heterogeneity, which leads to the unsuitability of these existing methods. To address this problem, we propose an innovative Bi-directional Curriculum Learning strategy (BCL), which considers nodes with higher and lower similarity to neighbor nodes as simple nodes in the direction of focusing on homogeneity and focusing on heterogeneity, respectively, and prioritizes their training. Extensive experiments show that BCL can be quickly integrated into existing detection processes and significantly improves the performance of ten GAD anomaly detection models on seven commonly used datasets.
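A toy version of the bi-directional ordering: score each node by its mean similarity to its neighbors, then treat the highest-scoring nodes as easy in the homogeneity direction and the lowest-scoring ones as easy in the heterogeneity direction, training on those first. The cosine scoring and pacing fraction are illustrative assumptions.

```python
import numpy as np

def bcl_order(features, adj, frac=0.5):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                                  # pairwise cosine similarity
    score = (sim * adj).sum(1) / np.maximum(adj.sum(1), 1)
    order = np.argsort(score)
    k = int(frac * len(order) / 2)
    # easy-first from both directions: most homogeneous and most heterogeneous
    return np.concatenate([order[-k:], order[:k]])

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
A = (rng.random((8, 8)) < 0.4).astype(float)
np.fill_diagonal(A, 0)
print(bcl_order(X, A))   # node indices to prioritize in early epochs
```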
[LG-40] Cybersecurity Assessment of Smart Grid Exposure Using a Machine Learning Based Approach
链接: https://arxiv.org/abs/2501.14175
作者: Mofe O. Jeje
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Disturbances to the stable and normal operation of power systems have grown phenomenally, particularly unauthorized access to confidential and critical data, injection of malicious software, and exploitation of security vulnerabilities in poorly patched software, among others. Developing, as a countermeasure, assessment solutions with machine learning capabilities that keep pace in real time with the growth of these cyber-attacks is therefore not only critical to the security, reliability, and safe operation of power systems, but also germane to guaranteeing advanced monitoring and efficient threat detection. Using the Mississippi State University and Oak Ridge National Laboratory dataset, the study used an XGBoost classifier to diagnose and assess power system disturbances, classifying them as Attack Events, Natural Events, or No-Events. As the test results show, the model generally demonstrates good performance on all metrics across the three sub-datasets, accurately identifying and classifying all three types of power system events.
[LG-41] Argos: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models
链接: https://arxiv.org/abs/2501.14170
作者: Yile Gu,Yifan Xiong,Jonathan Mace,Yuting Jiang,Yigong Hu,Baris Kasikci,Peng Cheng
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Observability in cloud infrastructure is critical for service providers, driving the widespread adoption of anomaly detection systems for monitoring metrics. However, existing systems often struggle to simultaneously achieve explainability, reproducibility, and autonomy, which are three indispensable properties for production use. We introduce Argos, an agentic system for detecting time-series anomalies in cloud infrastructure by leveraging large language models (LLMs). Argos proposes to use explainable and reproducible anomaly rules as intermediate representation and employs LLMs to autonomously generate such rules. The system will efficiently train error-free and accuracy-guaranteed anomaly rules through multiple collaborative agents and deploy the trained rules for low-cost online anomaly detection. Through evaluation results, we demonstrate that Argos outperforms state-of-the-art methods, increasing F1 scores by up to 9.5% and 28.3% on public anomaly detection datasets and an internal dataset collected from Microsoft, respectively.
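The "anomaly rules as intermediate representation" idea can be illustrated with a single hand-written rule; in Argos such rules are generated and refined by LLM agents, whereas the rule, window size, and threshold below are hard-coded assumptions.

```python
import numpy as np

def rule_spike(window, k=3.0):
    # flag the last point if it deviates from the window mean by > k sigma
    mu, sigma = np.mean(window[:-1]), np.std(window[:-1]) + 1e-9
    return abs(window[-1] - mu) > k * sigma

rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 99), [8.0]])   # spike at the end
flags = [rule_spike(series[t - 20:t + 1]) for t in range(20, len(series))]
print("anomaly at final point:", flags[-1])
```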
[LG-42] Multimodal Prescriptive Deep Learning
链接: https://arxiv.org/abs/2501.14152
作者: Dimitris Bertsimas,Lisa Everest,Vasiliki Stoumpou
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We introduce a multimodal deep learning framework, Prescriptive Neural Networks (PNNs), that combines ideas from optimization and machine learning, and is, to the best of our knowledge, the first prescriptive method to handle multimodal data. The PNN is a feedforward neural network trained on embeddings to output an outcome-optimizing prescription. In two real-world multimodal datasets, we demonstrate that PNNs prescribe treatments that are able to significantly improve estimated outcomes in transcatheter aortic valve replacement (TAVR) procedures by reducing estimated postoperative complication rates by 32% and in liver trauma injuries by reducing estimated mortality rates by over 40%. In four real-world, unimodal tabular datasets, we demonstrate that PNNs outperform or perform comparably to other well-known, state-of-the-art prescriptive models; importantly, on tabular datasets, we also recover interpretability through knowledge distillation, fitting interpretable Optimal Classification Tree models onto the PNN prescriptions as classification targets, which is critical for many real-world applications. Finally, we demonstrate that our multimodal PNN models achieve stability across randomized data splits comparable to other prescriptive methods and produce realistic prescriptions across the different datasets.
[LG-43] An Extensive and Methodical Review of Smart Grids for Sustainable Energy Management: Addressing Challenges with AI, Renewable Energy Integration, and Leading-edge Technologies
链接: https://arxiv.org/abs/2501.14143
作者: Parag Biswas,Abdur Rashid,abdullah al masum,MD Abdullah Al Nasim,A.S.M Anas Ferdous,Kishor Datta Gupta,Angona Biswas
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:Energy management decreases energy expenditures and consumption while simultaneously increasing energy efficiency, reducing carbon emissions, and enhancing operational performance. Smart grids are a type of sophisticated energy infrastructure that increase the sustainability, dependability, and efficiency of electricity generation and distribution by utilizing digital communication technologies. They combine a number of cutting-edge techniques and technologies to improve energy resource management. A large amount of research on smart grids for energy management has been completed in the last several years. The authors of the present study cover a number of topics, including smart grid benefits and components, technical developments, integrating renewable energy sources, using artificial intelligence and data analytics, cybersecurity, and privacy. Smart grids for energy management are an innovative field of study aimed at tackling various difficulties and improving the efficiency, dependability, and sustainability of energy systems, including: 1) renewable power sources like solar and wind are intermittent and unpredictable; 2) defending the smart grid system from various cyber-attacks; 3) incorporating an increasing number of electric vehicles into the power grid without overwhelming it. The study also looks into how AI and data analytics can be used to optimize grid performance, enhance reliability, and improve energy management. The authors explore these significant challenges and ongoing research. Lastly, significant issues in this field are noted, and recommendations for further work are provided.
[LG-44] Saliency Maps are Ambiguous: Analysis of Logical Relations on First and Second Order Attributions
链接: https://arxiv.org/abs/2501.14136
作者: Leonid Schwenke,Martin Atzmueller
类目: Machine Learning (cs.LG)
*备注: 20 pages for the main article including references, 14 main article figures, 5 tables, 7 appendix figures
点击查看摘要
Abstract:Recent work uncovered potential flaws in, e.g., attribution or heatmap based saliency methods. A typical flaw is a confirmation bias, where the scores are compared to human expectation. Since measuring the quality of saliency methods is hard due to missing ground truth model reasoning, finding general limitations is also hard. This is further complicated because masking-based evaluation on complex data can easily introduce a bias, as most methods cannot fully ignore inputs. In this work, we extend our previous analysis on the logical dataset framework ANDOR, where we showed that all analysed saliency methods fail to grasp all needed classification information for all possible scenarios. Specifically, this paper extends our previous work with analyses on more datasets, in order to better understand in which scenarios the saliency methods fail. Further, we apply the Global Coherence Representation as an additional evaluation method in order to enable actual input omission.
[LG-45] Selecting Critical Scenarios of DER Adoption in Distribution Grids Using Bayesian Optimization
链接: https://arxiv.org/abs/2501.14118
作者: Olivier Mulkin,Miguel Heleno,Mike Ludkovski
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 10 pages, 2 tables, 11 figures
点击查看摘要
Abstract:We develop a new methodology to select scenarios of DER adoption most critical for distribution grids. Anticipating risks of future voltage and line flow violations due to additional PV adopters is central for utility investment planning but continues to rely on deterministic or ad hoc scenario selection. We propose a highly efficient search framework based on multi-objective Bayesian Optimization. We treat underlying grid stress metrics as computationally expensive black-box functions, approximated via Gaussian Process surrogates and design an acquisition function based on probability of scenarios being Pareto-critical across a collection of line- and bus-based violation objectives. Our approach provides a statistical guarantee and offers an order of magnitude speed-up relative to a conservative exhaustive search. Case studies on realistic feeders with 200-400 buses demonstrate the effectiveness and accuracy of our approach.
[LG-46] Personalized Interpolation: An Efficient Method to Tame Flexible Optimization Window Estimation
链接: https://arxiv.org/abs/2501.14103
作者: Xin Zhang,Weiliang Li,Rui Li,Zihang Fu,Tongyi Tang,Zhengyu Zhang,Wen-Yen Chen,Nima Noorshams,Nirav Jasapara,Xiaowen Ding,Ellie Wen,Xue Feng
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In the realm of online advertising, optimizing conversions is crucial for delivering relevant products to users and enhancing business outcomes. Predicting conversion events is challenging due to variable delays between user interactions, such as impressions or clicks, and the actual conversions. These delays differ significantly across various advertisers and products, necessitating distinct optimization time windows for targeted conversions. To address this, we introduce a novel approach named the Personalized Interpolation method, which innovatively builds upon existing fixed conversion window models to estimate flexible conversion windows. This method allows for the accurate estimation of conversions across a variety of delay ranges, thus meeting the diverse needs of advertisers without increasing system complexity. To validate the efficacy of our proposed method, we conducted comprehensive experiments using an ads conversion model. Our experiments demonstrate that this method not only achieves high prediction accuracy but also does so more efficiently than other existing solutions. This validation underscores the potential of our Personalized Interpolation method to significantly enhance conversion optimization in real-world online advertising systems, promising improved targeting and effectiveness in advertising strategies.
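A minimal sketch of interpolating between fixed-window models to serve a flexible optimization window: assume two models trained for 1-day and 7-day conversion windows output cumulative conversion probabilities, and a 3-day estimate is obtained by linear interpolation. The interpolation scheme and window grid are illustrative assumptions, not the paper's exact estimator.

```python
def interpolate_cvr(p_short, p_long, w_short, w_long, w_target):
    # linear interpolation between two fixed-window predictions
    alpha = (w_target - w_short) / (w_long - w_short)
    return (1 - alpha) * p_short + alpha * p_long

p_1d, p_7d = 0.020, 0.050   # outputs of the two fixed-window models
print(f"estimated 3-day CVR: {interpolate_cvr(p_1d, p_7d, 1, 7, 3):.4f}")
```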
[LG-47] 5G LDPC Linear Transformer for Channel Decoding
链接: https://arxiv.org/abs/2501.14102
作者: Mario Hernandez,Fernando Pinero
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 8 pages, 9 figures
点击查看摘要
Abstract:This work introduces a novel, fully differentiable transformer decoder with linear-time complexity, alongside a regular transformer decoder, to correct 5G New Radio (NR) LDPC codes. We propose a scalable approach to decode linear block codes with O(n) complexity rather than the O(n^2) of regular transformers. The architectures' performances are compared to Belief Propagation (BP), the production-level decoding algorithm used for 5G New Radio (NR) LDPC codes. We achieve bit error rate performance that matches a regular transformer decoder and surpasses one-iteration BP, while also achieving competitive time performance against BP, even for larger block codes. We utilize Sionna, Nvidia's 5G/6G physical layer research software, for reproducible results.
[LG-48] Datasheets for AI and medical datasets (DAIMS): a data validation and documentation framework before machine learning analysis in medical research
链接: https://arxiv.org/abs/2501.14094
作者: Ramtin Zargari Marandi(1),Anne Svane Frahm(1),Maja Milojevic(1)
类目: Machine Learning (cs.LG)
*备注: 10 pages, 1 figure, 2 tables
点击查看摘要
Abstract:Despite progress in data engineering, there are areas with limited consistency across data validation and documentation procedures, causing confusion and technical problems in research involving machine learning. Progress has been made by introducing frameworks like "Datasheets for Datasets"; however, there is room for improvement in preparing datasets so that they are ready for ML pipelines. Here, we extend the framework to "Datasheets for AI and medical datasets" (DAIMS). Our publicly available solution, DAIMS, provides a checklist including data standardization requirements, a software tool to assist the process of data preparation, an extended form for documenting data and posing research questions, a table serving as a data dictionary, and a flowchart that suggests ML analyses to address the research questions. The checklist consists of 24 common data standardization requirements, of which the tool checks and validates a subset. In addition, we provide a flowchart mapping research questions to suggested ML methods. DAIMS can serve as a reference for standardizing datasets and a roadmap for researchers aiming to apply effective ML techniques in their medical research endeavors. DAIMS is available on GitHub and as an online app to automate key aspects of dataset evaluation, facilitating efficient preparation of datasets for ML studies.
[LG-49] Making Reliable and Flexible Decisions in Long-tailed Classification
链接: https://arxiv.org/abs/2501.14090
作者: Bolian Li,Ruqi Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Long-tailed classification is challenging due to its heavy imbalance in class probabilities. While existing methods often focus on overall accuracy or accuracy for tail classes, they overlook a critical aspect: certain types of errors can carry greater risks than others in real-world long-tailed problems. For example, misclassifying patients (a tail class) as healthy individuals (a head class) entails far more serious consequences than the reverse scenario. To address this critical issue, we introduce Making Reliable and Flexible Decisions in Long-tailed Classification (RF-DLC), a novel framework aimed at reliable predictions in long-tailed problems. Leveraging Bayesian Decision Theory, we introduce an integrated gain to seamlessly combine long-tailed data distributions and the decision-making procedure. We further propose an efficient variational optimization strategy for the decision risk objective. Our method adapts readily to diverse utility matrices, which can be designed for specific tasks, ensuring its flexibility for different problem settings. In empirical evaluation, we design a new metric, False Head Rate, to quantify tail-sensitivity risk, along with comprehensive experiments on multiple real-world tasks, including large-scale image classification and uncertainty quantification, to demonstrate the reliability and flexibility of our method.
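The decision-theoretic core — replacing argmax-probability prediction with the class that maximizes expected utility under a task-specific utility matrix — fits in a few lines. The matrix below, which penalizes misclassifying a sick patient (tail class) as healthy (head class), is an illustrative assumption rather than the paper's RF-DLC setup.

```python
import numpy as np

probs = np.array([0.55, 0.45])      # model's p(healthy), p(sick)
U = np.array([[ 1.0, -1.0],         # predict healthy: correct / costly miss
              [-0.2,  1.0]])        # predict sick: mild false alarm / correct
decision = int(np.argmax(U @ probs))   # expected utility of each decision
print("decision:", ["healthy", "sick"][decision])  # "sick", despite lower probability
```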
[LG-50] Efficient Precision Control in Object Detection Models for Enhanced and Reliable Ovarian Follicle Counting
链接: https://arxiv.org/abs/2501.14036
作者: Vincent Blot,Alexandra Lorenzo de Brionne,Ines Sellami,Olivier Trassard,Isabelle Beau,Charlotte Sonigo,Nicolas J-B. Brunel
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 figures
点击查看摘要
Abstract:Image analysis is a key tool for describing the detailed mechanisms of folliculogenesis, such as evaluating the quantity of mouse Primordial ovarian Follicles (PMF) in the ovarian reserve. The development of high-resolution virtual slide scanners offers the possibility of quantifying, robustifying and accelerating the histopathological procedure. A major challenge for machine learning is to control the precision of predictions while enabling a high recall, in order to provide reproducibility. We use a multiple testing procedure that outperforms the standard way of resolving the precision-recall trade-off and gives probabilistic guarantees on the precision. In addition, we significantly improve the overall performance of the models (increase of F1-score) by selecting the decision threshold using contextual biological information or using an auxiliary model. As it is model-agnostic, this contextual selection procedure paves the way to the development of a strategy that can improve the performance of any model without the need of retraining it.
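A generic sketch of threshold selection with a probabilistic precision guarantee: scan a grid of candidate thresholds on a validation set with a Bonferroni-corrected one-sided binomial test of "precision > target", and return the most lenient threshold that passes. The paper's multiple testing procedure is more refined; the grid and correction here are assumptions.

```python
import numpy as np
from scipy.stats import binomtest

def pick_threshold(scores, labels, target=0.9, alpha=0.05):
    thresholds = np.quantile(scores, np.linspace(0.1, 0.9, 17))
    bonf = alpha / len(thresholds)        # Bonferroni correction over the grid
    passing = []
    for thr in thresholds:
        sel = scores >= thr
        tp, n = int(labels[sel].sum()), int(sel.sum())
        # one-sided binomial test of "precision > target" on selected items
        if n and binomtest(tp, n, target, alternative="greater").pvalue < bonf:
            passing.append(thr)
    return min(passing) if passing else None   # most lenient guaranteed cut

rng = np.random.default_rng(0)
labels = (rng.random(2000) < 0.3).astype(int)
scores = np.where(labels == 1, rng.beta(5, 2, 2000), rng.beta(2, 5, 2000))
print("selected threshold:", pick_threshold(scores, labels))
```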
[LG-51] Overcoming Fairness Trade-offs via Pre-processing: A Causal Perspective
链接: https://arxiv.org/abs/2501.14710
作者: Charlotte Leininger,Simon Rittel,Ludwig Bothmann
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Training machine learning models for fair decisions faces two key challenges: The fairness-accuracy trade-off results from enforcing fairness which weakens its predictive performance in contrast to an unconstrained model. The incompatibility of different fairness metrics poses another trade-off – also known as the impossibility theorem. Recent work identifies the bias within the observed data as a possible root cause and shows that fairness and predictive performance are in fact in accord when predictive performance is measured on unbiased data. We offer a causal explanation for these findings using the framework of the FiND (fictitious and normatively desired) world, a "fair" world, where protected attributes have no causal effects on the target variable. We show theoretically that (i) classical fairness metrics deemed to be incompatible are naturally satisfied in the FiND world, while (ii) fairness aligns with high predictive performance. We extend our analysis by suggesting how one can benefit from these theoretical insights in practice, using causal pre-processing methods that approximate the FiND world. Additionally, we propose a method for evaluating the approximation of the FiND world via pre-processing in practical use cases where we do not have access to the FiND world. In simulations and empirical studies, we demonstrate that these pre-processing methods are successful in approximating the FiND world and resolve both trade-offs. Our results provide actionable solutions for practitioners to achieve fairness and high predictive performance simultaneously.
[LG-52] End-to-end workflow for machine learning-based qubit readout with QICK and hls4ml
链接: https://arxiv.org/abs/2501.14663
作者: Giuseppe Di Guglielmo,Botao Du,Javier Campos,Alexandra Boltasseva,Akash V. Dixit,Farah Fahim,Zhaxylyk Kudyshev,Santiago Lopez,Ruichao Ma,Gabriel N. Perdue,Nhan Tran,Omer Yesilyurt,Daniel Bowring
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We present an end-to-end workflow for superconducting qubit readout that embeds co-designed Neural Networks (NNs) into the Quantum Instrumentation Control Kit (QICK). Capitalizing on the custom firmware and software of the QICK platform, which is built on Xilinx RFSoC FPGAs, we aim to leverage machine learning (ML) to address critical challenges in qubit readout accuracy and scalability. The workflow utilizes the hls4ml package and employs quantization-aware training to translate ML models into hardware-efficient FPGA implementations via user-friendly Python APIs. We experimentally demonstrate the design, optimization, and integration of an ML algorithm for single transmon qubit readout, achieving 96% single-shot fidelity with a latency of 32ns and less than 16% FPGA look-up table resource utilization. Our results offer the community an accessible workflow to advance ML-driven readout and adaptive control in quantum information processing applications.
[LG-53] Mean-field limit from general mixtures of experts to quantum neural networks
链接: https://arxiv.org/abs/2501.14660
作者: Anderson Melchor Hernandez,Davide Pastorello,Giacomo De Palma
类目: Mathematical Physics (math-ph); Machine Learning (cs.LG); Probability (math.PR)
*备注:
点击查看摘要
Abstract:In this work, we study the asymptotic behavior of Mixture of Experts (MoE) trained via gradient flow on supervised learning problems. Our main result establishes the propagation of chaos for a MoE as the number of experts diverges. We demonstrate that the corresponding empirical measure of their parameters is close to a probability measure that solves a nonlinear continuity equation, and we provide an explicit convergence rate that depends solely on the number of experts. We apply our results to a MoE generated by a quantum neural network.
[LG-54] Optimal Transport Barycenter via Nonconvex-Concave Minimax Optimization
链接: https://arxiv.org/abs/2501.14635
作者: Kaheon Kim,Rentian Yao,Changbo Zhu,Xiaohui Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The optimal transport barycenter (a.k.a. Wasserstein barycenter) is a fundamental notion of averaging that extends from the Euclidean space to the Wasserstein space of probability distributions. Computation of the unregularized barycenter for discretized probability distributions on point clouds is a challenging task when the domain dimension d > 1. Most practical algorithms for approximating the barycenter problem are based on entropic regularization. In this paper, we introduce a primal-dual algorithm with nearly linear time complexity O(m log m) and linear space complexity O(m), the Wasserstein-Descent Ḣ^1-Ascent (WDHA) algorithm, for computing the exact barycenter when the input probability density functions are discretized on an m-point grid. The key success of the WDHA algorithm hinges on alternating between two different yet closely related Wasserstein and Sobolev optimization geometries for the primal barycenter and dual Kantorovich potential subproblems. Under reasonable assumptions, we establish the convergence rate and iteration complexity of WDHA to its stationary point when the step size is appropriately chosen. Superior computational efficacy, scalability, and accuracy over the existing Sinkhorn-type algorithms are demonstrated on high-resolution (e.g., 1024×1024 images) 2D synthetic and real data.
[LG-55] Single-neuron deep generative model uncovers underlying physics of neuronal activity in Ca imaging data
链接: https://arxiv.org/abs/2501.14615
作者: Jordi Abante,Angelo Piga,Berta Ros,Clara F López-León,Josep M Canals,Jordi Soriano
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, ECCB 2025
点击查看摘要
Abstract:Calcium imaging has become a powerful alternative to electrophysiology for studying neuronal activity, offering spatial resolution and the ability to measure large populations of neurons in a minimally invasive manner. This technique has broad applications in neuroscience, neuroengineering, and medicine, enabling researchers to explore the relationship between neuron location and activity. Recent advancements in deep generative models (DGMs) have facilitated the modeling of neuronal population dynamics, uncovering latent representations that provide insights into behavior prediction and neuronal variance. However, these models often rely on spike inference algorithms and primarily focus on population-level dynamics, limiting their applicability for single-neuron analyses. To address this gap, we propose a novel framework for single-neuron representation learning using autoregressive variational autoencoders (AVAEs). Our approach embeds individual neurons' spatiotemporal signals into a reduced-dimensional space without the need for spike inference algorithms. The AVAE excels over traditional linear methods by generating more informative and discriminative latent representations, improving tasks such as visualization, clustering, and the understanding of neuronal activity. Additionally, the reconstruction performance of the AVAE outperforms the state of the art, demonstrating its ability to accurately recover the original fluorescence signal from the learned representation. Using realistic simulations, we show that our model captures underlying physical properties and connectivity patterns, enabling it to distinguish between different firing and connectivity types. These findings position the AVAE as a versatile and powerful tool for advancing single-neuron analysis and lay the groundwork for future integration of multimodal single-cell datasets in neuroscience.
[LG-56] coverforest: Conformal Predictions with Random Forest in Python
链接: https://arxiv.org/abs/2501.14570
作者: Panisara Meehinkong,Donlapark Ponnoprat
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: In peer review
点击查看摘要
Abstract:Conformal prediction provides a framework for uncertainty quantification, specifically in the forms of prediction intervals and sets with distribution-free guaranteed coverage. While recent cross-conformal techniques such as CV+ and Jackknife+-after-bootstrap achieve better data efficiency than traditional split conformal methods, they incur substantial computational costs due to required pairwise comparisons between training and test samples' out-of-bag scores. Observing that these methods naturally extend from ensemble models, particularly random forests, we leverage existing optimized random forest implementations to enable efficient cross-conformal predictions. We present coverforest, a Python package that implements efficient conformal prediction methods specifically optimized for random forests. coverforest supports both regression and classification tasks through various conformal prediction methods, including split conformal, CV+, Jackknife+-after-bootstrap, and adaptive prediction sets. Our package leverages parallel computing and Cython optimizations to speed up out-of-bag calculations. Our experiments demonstrate that coverforest's predictions achieve the desired level of coverage. In addition, its training and prediction times can be faster than an existing implementation by 2–9 times. The source code for coverforest is hosted on GitHub at this https URL.
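For readers unfamiliar with CV+, here is a from-scratch sketch of a CV+ prediction interval around a random forest. It illustrates the method the package implements; it is not coverforest's own API, and the quantile computation is simplified relative to the exact CV+ construction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 300)
x_test, alpha = np.array([[1.0]]), 0.1

lo_scores, hi_scores = [], []
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[tr], y[tr])
    resid = np.abs(y[va] - m.predict(X[va]))   # out-of-fold residuals
    pred = m.predict(x_test)[0]                # fold model's test prediction
    lo_scores.extend(pred - resid)
    hi_scores.extend(pred + resid)

lo, hi = np.quantile(lo_scores, alpha), np.quantile(hi_scores, 1 - alpha)
print(f"~90% CV+ interval at x=1.0: [{lo:.2f}, {hi:.2f}]")
```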
[LG-57] Statistical Verification of Linear Classifiers
链接: https://arxiv.org/abs/2501.14430
作者: Anton Zhiyanov,Alexander Shklyaev,Alexey Galatenko,Vladimir Galatenko,Alexander Tonevitsky
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Applications (stat.AP)
*备注: 16 pages, 3 figures
点击查看摘要
Abstract:We propose a homogeneity test closely related to the concept of linear separability between two samples. Using the test one can answer the question whether a linear classifier is merely "random" or effectively captures differences between two classes. We focus on establishing upper bounds for the test's p-value when applied to two-dimensional samples. Specifically, for normally distributed samples we experimentally demonstrate that the upper bound is highly accurate. Using this bound, we evaluate classifiers designed to detect ER-positive breast cancer recurrence based on gene pair expression. Our findings confirm significance of IGFBP6 and ELOVL5 genes in this process.
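A generic permutation-test stand-in for the idea (the paper derives analytic upper bounds on the p-value; the sketch below merely estimates whether a linear classifier separates two samples better than chance):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1, (40, 2)), rng.normal(0.8, 1, (40, 2))])
y = np.repeat([0, 1], 40)

def sep_score(X, y):
    # training accuracy of a linear classifier as a separability measure
    return LogisticRegression().fit(X, y).score(X, y)

obs = sep_score(X, y)
null = [sep_score(X, rng.permutation(y)) for _ in range(200)]
p_value = (1 + sum(s >= obs for s in null)) / (len(null) + 1)
print(f"observed separability {obs:.2f}, permutation p-value ~ {p_value:.3f}")
```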
[LG-58] Distributionally Robust Coreset Selection under Covariate Shift
链接: https://arxiv.org/abs/2501.14253
作者: Tomonari Tanaka,Hiroyuki Hanada,Hanting Yang,Tatsuya Aoyama,Yu Inatsu,Satoshi Akahane,Yoshito Okura,Noriaki Hashimoto,Taro Murayama,Hanju Lee,Shinya Kojima,Ichiro Takeuchi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Coreset selection, which involves selecting a small subset from an existing training dataset, is an approach to reducing training data, and various approaches have been proposed for this method. In practical situations where these methods are employed, it is often the case that the data distributions differ between the development phase and the deployment phase, with the latter being unknown. Thus, it is challenging to select an effective subset of training data that performs well across all deployment scenarios. We therefore propose Distributionally Robust Coreset Selection (DRCS). DRCS theoretically derives an estimate of the upper bound for the worst-case test error, assuming that the future covariate distribution may deviate within a defined range from the training distribution. Furthermore, by selecting instances in a way that suppresses the estimate of the upper bound for the worst-case test error, DRCS achieves distributionally robust training instance selection. This study is primarily applicable to convex training computation, but we demonstrate that it can also be applied to deep learning under appropriate approximations. In this paper, we focus on covariate shift, a type of data distribution shift, and demonstrate the effectiveness of DRCS through experiments.
[LG-59] Adaptive Progressive Attention Graph Neural Network for EEG Emotion Recognition
链接: https://arxiv.org/abs/2501.14246
作者: Tianzhi Feng,Chennan Wu,Yi Niu,Fu Li,Boxun Fu,Zhifu Zhao,Xiaotian Wang,Guangming Shi
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In recent years, numerous neuroscientific studies have shown that human emotions are closely linked to specific brain regions, with these regions exhibiting variability across individuals and emotional states. To fully leverage these neural patterns, we propose an Adaptive Progressive Attention Graph Neural Network (APAGNN), which dynamically captures the spatial relationships among brain regions during emotional processing. The APAGNN employs three specialized experts that progressively analyze brain topology. The first expert captures global brain patterns, the second focuses on region-specific features, and the third examines emotion-related channels. This hierarchical approach enables increasingly refined analysis of neural activity. Additionally, a weight generator integrates the outputs of all three experts, balancing their contributions to produce the final predictive label. Extensive experiments on three publicly available datasets (SEED, SEED-IV and MPED) demonstrate that the proposed method enhances EEG emotion recognition performance, achieving superior results compared to baseline methods.
[LG-60] Learning to Price with Resource Constraints: From Full Information to Machine-Learned Prices
链接: https://arxiv.org/abs/2501.14155
作者: Ruicheng Ao,Jiashuo Jiang,David Simchi-Levi
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 28 pages, 4 figures
点击查看摘要
Abstract:We study the dynamic pricing problem with knapsack, addressing the challenge of balancing exploration and exploitation under resource constraints. We introduce three algorithms tailored to different informational settings: a Boundary Attracted Re-solve Method for full information, an online learning algorithm for scenarios with no prior information, and an estimate-then-select re-solve algorithm that leverages machine-learned informed prices with known upper bound of estimation errors. The Boundary Attracted Re-solve Method achieves logarithmic regret without requiring the non-degeneracy condition, while the online learning algorithm attains an optimal O(√T) regret. Our estimate-then-select approach bridges the gap between these settings, providing improved regret bounds when reliable offline data is available. Numerical experiments validate the effectiveness and robustness of our algorithms across various scenarios. This work advances the understanding of online resource allocation and dynamic pricing, offering practical solutions adaptable to different informational structures.
[LG-61] EFiGP: Eigen-Fourier Physics-Informed Gaussian Process for Inference of Dynamic Systems
链接: https://arxiv.org/abs/2501.14107
作者: Jianhong Chen,Shihao Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Parameter estimation and trajectory reconstruction for data-driven dynamical systems governed by ordinary differential equations (ODEs) are essential tasks in fields such as biology, engineering, and physics. These inverse problems – estimating ODE parameters from observational data – are particularly challenging when the data are noisy, sparse, and the dynamics are nonlinear. We propose the Eigen-Fourier Physics-Informed Gaussian Process (EFiGP), an algorithm that integrates Fourier transformation and eigen-decomposition into a physics-informed Gaussian Process framework. This approach eliminates the need for numerical integration, significantly enhancing computational efficiency and accuracy. Built on a principled Bayesian framework, EFiGP incorporates the ODE system through probabilistic conditioning, enforcing governing equations in the Fourier domain while truncating high-frequency terms to achieve denoising and computational savings. The use of eigen-decomposition further simplifies Gaussian Process covariance operations, enabling efficient recovery of trajectories and parameters even in dense-grid settings. We validate the practical effectiveness of EFiGP on three benchmark examples, demonstrating its potential for reliable and interpretable modeling of complex dynamical systems while addressing key challenges in trajectory recovery and computational cost.
[LG-62] Improved subsample-and-aggregate via the private modified winsorized mean
链接: https://arxiv.org/abs/2501.14095
作者: Kelly Ramsay,Dylan Spicker
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 40 pages, 2 figures
点击查看摘要
Abstract:We develop a univariate, differentially private mean estimator, called the private modified winsorized mean, designed to be used as the aggregator in subsample-and-aggregate. We demonstrate, via real data analysis, that common differentially private multivariate mean estimators may not perform well as the aggregator, even with a dataset with 8000 observations, motivating our developments. We show that the modified winsorized mean is minimax optimal for several large classes of distributions, even under adversarial contamination. We also demonstrate that, empirically, the modified winsorized mean performs well compared to other private mean estimates. We consider the modified winsorized mean as the aggregator in subsample-and-aggregate, deriving a finite-sample deviation bound for a subsample-and-aggregate estimate generated with the new aggregator. This result yields two important insights: (i) the optimal choice of subsamples depends on the bias of the estimator computed on the subsamples, and (ii) the rate of convergence of the subsample-and-aggregate estimator depends on the robustness of the estimator computed on the subsamples.
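A generic subsample-and-aggregate sketch with a clipped (winsorized) private aggregator: split the data into k subsamples, compute a mean on each, clip the subsample means, and release their average with Laplace noise calibrated to the clipping range. The paper's modified winsorized mean is a more refined aggregator; the bounds and noise scale here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_t(df=3, size=10_000)       # heavy-tailed data, true mean 0
k, eps, lo, hi = 100, 1.0, -5.0, 5.0

sub_means = [s.mean() for s in np.array_split(rng.permutation(data), k)]
clipped = np.clip(sub_means, lo, hi)
# changing one record changes one clipped mean -> sensitivity (hi - lo) / k
private_est = clipped.mean() + rng.laplace(scale=(hi - lo) / (k * eps))
print(f"eps={eps} private estimate: {private_est:.3f}")
```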
[LG-63] Low rank matrix completion and realization of graphs: results and problems
链接: https://arxiv.org/abs/2501.13935
作者: S. Dzhenzher,T. Garaev,O. Nikitenko,A. Petukhov,A. Skopenkov,A. Voropaev
类目: History and Overview (math.HO); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO); Geometric Topology (math.GT)
*备注: 21 pages, 6 figures
点击查看摘要
Abstract:The Netflix problem (from machine learning) asks the following. Given a ratings matrix in which each entry (i,j) represents the rating of movie j by customer i, if customer i has watched movie j, and is otherwise missing, we would like to predict the remaining entries in order to make good recommendations to customers on what to watch next. The remaining entries are predicted so as to minimize the rank of the completed matrix. In this survey we study a more general problem, in which instead of knowing specific matrix elements, we know linear relations on such elements. We describe applications of these results to embeddings of graphs in surfaces (more precisely, embeddings with rotation systems, and embeddings modulo 2).
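A standard computational approach to the Netflix-style rank-minimization problem is soft-impute (iterative singular-value shrinkage); the survey's generalization to linear relations on entries is not implemented here, and the rank, sampling rate, and shrinkage parameter are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))   # rank-3 ground truth
mask = rng.random(A.shape) < 0.5                          # observed entries
X, lam = np.where(mask, A, 0.0), 1.0

for _ in range(200):
    # refill observed entries, then shrink singular values (soft-impute step)
    U, s, Vt = np.linalg.svd(np.where(mask, A, X), full_matrices=False)
    X = (U * np.maximum(s - lam, 0.0)) @ Vt

err = np.linalg.norm((X - A)[~mask]) / np.linalg.norm(A[~mask])
print(f"relative error on missing entries: {err:.3f}")
```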
信息检索
[IR-0] Knowledge Graphs Construction from Criminal Court Appeals: Insights from the French Cassation Court
链接: https://arxiv.org/abs/2501.14579
作者: Alexander V. Belikov,Sacha Raoult
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Despite growing interest, accurately and reliably representing unstructured data, such as court decisions, in a structured form, remains a challenge. Recent advancements in generative AI applied to language modeling enabled the transformation of text into knowledge graphs, unlocking new opportunities for analysis and modeling. This paper presents a framework for constructing knowledge graphs from appeals to the French Cassation Court. The framework includes a domain-specific ontology and a derived dataset, offering a foundation for structured legal data representation and analysis.
[IR-1] On Correlating Factors for Domain Adaptation Performance
链接: https://arxiv.org/abs/2501.14466
作者: Goksenin Yuksel,Jaap Kamps
类目: Information Retrieval (cs.IR); Applications (stat.AP)
*备注:
点击查看摘要
Abstract:Dense retrievers have demonstrated significant potential for neural information retrieval; however, they lack robustness to domain shifts, limiting their efficacy in zero-shot settings across diverse domains. In this paper, we set out to analyze the possible factors that lead to successful domain adaptation of dense retrievers. We include proxies for the domain similarity between generated queries and the test and source domains. Furthermore, we conduct a case study comparing two powerful domain adaptation techniques. We find that the distribution of generated query types is an important factor, and generating queries that share a similar domain to the test documents improves the performance of domain adaptation methods. This study further emphasizes the importance of domain-tailored generated queries.
[IR-2] CAMEO: Autocorrelation-Preserving Line Simplification for Lossy Time Series Compression
链接: https://arxiv.org/abs/2501.14432
作者: Carlos Enrique Muñiz-Cuza,Matthias Boehm,Torben Bach Pedersen
类目: Databases (cs.DB); Information Retrieval (cs.IR); Information Theory (cs.IT)
*备注: 14 pages, 13 figures
点击查看摘要
Abstract:Time series data from a variety of sensors and IoT devices need effective compression to reduce storage and I/O bandwidth requirements. While most time series databases and systems rely on lossless compression, lossy techniques offer even greater space-saving with a small loss in precision. However, the unknown impact on downstream analytics applications requires a semi-manual trial-and-error exploration. We initiate work on lossy compression that provides guarantees on complex statistical features (which are strongly correlated with the accuracy of the downstream analytics). Specifically, we propose a new lossy compression method that provides guarantees on the autocorrelation and partial-autocorrelation functions (ACF/PACF) of a time series. Our method leverages line simplification techniques as well as incremental maintenance of aggregates, blocking, and parallelization strategies for effective and efficient compression. The results show that our method improves compression ratios by 2x on average and up to 54x on selected datasets, compared to previous lossy and lossless compression methods. Moreover, we maintain – and sometimes even improve – the forecasting accuracy by preserving the autocorrelation properties of the time series. Our framework is extensible to multivariate time series and other statistical features of the time series.
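A minimal check of autocorrelation preservation under a crude simplify-and-reconstruct step (keep every fourth point, linearly re-interpolate). CAMEO itself uses error-bounded line simplification with explicit ACF/PACF guarantees; this only illustrates how preservation would be measured.

```python
import numpy as np

def acf(x, nlags=20):
    x = x - x.mean()
    c = np.correlate(x, x, mode="full")[len(x) - 1:]
    return c[:nlags + 1] / c[0]      # normalized autocorrelation, lags 0..nlags

rng = np.random.default_rng(0)
t = np.arange(2000)
series = np.sin(2 * np.pi * t / 50) + 0.3 * rng.normal(size=t.size)

kept = t[::4]                        # crude stand-in for line simplification
recon = np.interp(t, kept, series[kept])

dev = np.max(np.abs(acf(series) - acf(recon)))
print(f"max ACF deviation over 20 lags: {dev:.4f}")
```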
[IR-3] Multi-stage Large Language Model Pipelines Can Outperform GPT-4o in Relevance Assessment WWW'25
链接: https://arxiv.org/abs/2501.14296
作者: Julian A. Schnabel,Johanne R. Trippas,Falk Scholer,Danula Hettiachchi
类目: Information Retrieval (cs.IR)
*备注: WebConf’25, WWW’25
点击查看摘要
Abstract:The effectiveness of search systems is evaluated using relevance labels that indicate the usefulness of documents for specific queries and users. While obtaining these relevance labels from real users is ideal, scaling such data collection is challenging. Consequently, third-party annotators are employed, but their inconsistent accuracy demands costly auditing, training, and monitoring. We propose an LLM-based modular classification pipeline that divides the relevance assessment task into multiple stages, each utilising different prompts and models of varying sizes and capabilities. Applied to TREC Deep Learning (TREC-DL), one of our approaches showed an 18.4% increase in Krippendorff's α accuracy over OpenAI's GPT-4o mini while maintaining a cost of about 0.2 USD per million input tokens, offering a more efficient and scalable solution for relevance assessment. This approach beats the baseline performance of GPT-4o (at 5 USD per million input tokens). With a pipeline approach, even the accuracy of the GPT-4o flagship model, measured in α, could be improved by 9.7%.
附件下载
点击下载今日全部论文列表